* Adjust beacon node timeouts for validator client HTTP requests (#2352)
Resolves#2313
Provide `BeaconNodeHttpClient` with a dedicated `Timeouts` struct.
This will allow granular adjustment of the timeout duration for different calls made from the VC to the BN. These can either be a constant value, or as a ratio of the slot duration.
Improve timeout performance by using these adjusted timeout duration's only whenever a fallback endpoint is available.
Add a CLI flag called `use-long-timeouts` to revert to the old behavior.
Additionally set the default `BeaconNodeHttpClient` timeouts to the be the slot duration of the network, rather than a constant 12 seconds. This will allow it to adjust to different network specifications.
Co-authored-by: Paul Hauner <paul@paulhauner.com>
* Use read_recursive locks in database (#2417)
Closes#2245
Replace all calls to `RwLock::read` in the `store` crate with `RwLock::read_recursive`.
* Unfortunately we can't run the deadlock detector on CI because it's pinned to an old Rust 1.51.0 nightly which cannot compile Lighthouse (one of our deps uses `ptr::addr_of!` which is too new). A fun side-project at some point might be to update the deadlock detector.
* The reason I think we haven't seen this deadlock (at all?) in practice is that _writes_ to the database's split point are quite infrequent, and a concurrent write is required to trigger the deadlock. The split point is only written when finalization advances, which is once per epoch (every ~6 minutes), and state reads are also quite sporadic. Perhaps we've just been incredibly lucky, or there's something about the timing of state reads vs database migration that protects us.
* I wrote a few small programs to demo the deadlock, and the effectiveness of the `read_recursive` fix: https://github.com/michaelsproul/relock_deadlock_mvp
* [The docs for `read_recursive`](https://docs.rs/lock_api/0.4.2/lock_api/struct.RwLock.html#method.read_recursive) warn of starvation for writers. I think in order for starvation to occur the database would have to be spammed with so many state reads that it's unable to ever clear them all and find time for a write, in which case migration of states to the freezer would cease. If an attack could be performed to trigger this starvation then it would likely trigger a deadlock in the current code, and I think ceasing migration is preferable to deadlocking in this extreme situation. In practice neither should occur due to protection from spammy peers at the network layer. Nevertheless, it would be prudent to run this change on the testnet nodes to check that it doesn't cause accidental starvation.
* Return more detail when invalid data is found in the DB during startup (#2445)
- Resolves#2444
Adds some more detail to the error message returned when the `BeaconChainBuilder` is unable to access or decode block/state objects during startup.
NA
* Use hardware acceleration for SHA256 (#2426)
Modify the SHA256 implementation in `eth2_hashing` so that it switches between `ring` and `sha2` to take advantage of [x86_64 SHA extensions](https://en.wikipedia.org/wiki/Intel_SHA_extensions). The extensions are available on modern Intel and AMD CPUs, and seem to provide a considerable speed-up: on my Ryzen 5950X it dropped state tree hashing times by about 30% from 35ms to 25ms (on Prater).
The extensions became available in the `sha2` crate [last year](https://www.reddit.com/r/rust/comments/hf2vcx/ann_rustcryptos_sha1_and_sha2_now_support/), and are not available in Ring, which uses a [pure Rust implementation of sha2](https://github.com/briansmith/ring/blob/main/src/digest/sha2.rs). Ring is faster on CPUs that lack the extensions so I've implemented a runtime switch to use `sha2` only when the extensions are available. The runtime switching seems to impose a miniscule penalty (see the benchmarks linked below).
* Start a release checklist (#2270)
NA
Add a checklist to the release draft created by CI. I know @michaelsproul was also working on this and I suspect @realbigsean also might have useful input.
NA
* Serious banning
* fmt
Co-authored-by: Mac L <mjladson@pm.me>
Co-authored-by: Paul Hauner <paul@paulhauner.com>
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
## Issue Addressed
- Resolves#2452
## Proposed Changes
I've seen a few people confused by this and I don't think the message is really worth it.
## Additional Info
NA
## Issue Addressed
#635
## Proposed Changes
- Keep attestations that reference a block we have not seen for 30secs before being re processed
- If we do import the block before that time elapses, it is reprocessed in that moment
- The first time it fails, do nothing wrt to gossipsub propagation or peer downscoring. If after being re processed it fails, downscore with a `LowToleranceError` and ignore the message.
## Issue Addressed
NA
## Proposed Changes
Adds a metric to see how many set bits are in the sync aggregate for each beacon block being imported.
## Additional Info
NA
## Issue Addressed
NA
## Proposed Changes
Adds more detail to the log when an attestation is ignored due to a prior one being known. This will help identify which validators are causing the issue.
## Additional Info
NA
## Issue Addressed
NA
## Proposed Changes
Add a checklist to the release draft created by CI. I know @michaelsproul was also working on this and I suspect @realbigsean also might have useful input.
## Additional Info
NA
## Proposed Changes
Modify the SHA256 implementation in `eth2_hashing` so that it switches between `ring` and `sha2` to take advantage of [x86_64 SHA extensions](https://en.wikipedia.org/wiki/Intel_SHA_extensions). The extensions are available on modern Intel and AMD CPUs, and seem to provide a considerable speed-up: on my Ryzen 5950X it dropped state tree hashing times by about 30% from 35ms to 25ms (on Prater).
## Additional Info
The extensions became available in the `sha2` crate [last year](https://www.reddit.com/r/rust/comments/hf2vcx/ann_rustcryptos_sha1_and_sha2_now_support/), and are not available in Ring, which uses a [pure Rust implementation of sha2](https://github.com/briansmith/ring/blob/main/src/digest/sha2.rs). Ring is faster on CPUs that lack the extensions so I've implemented a runtime switch to use `sha2` only when the extensions are available. The runtime switching seems to impose a miniscule penalty (see the benchmarks linked below).
## Issue Addressed
- Resolves#2444
## Proposed Changes
Adds some more detail to the error message returned when the `BeaconChainBuilder` is unable to access or decode block/state objects during startup.
## Additional Info
NA
## Issue Addressed
Closes#2245
## Proposed Changes
Replace all calls to `RwLock::read` in the `store` crate with `RwLock::read_recursive`.
## Additional Info
* Unfortunately we can't run the deadlock detector on CI because it's pinned to an old Rust 1.51.0 nightly which cannot compile Lighthouse (one of our deps uses `ptr::addr_of!` which is too new). A fun side-project at some point might be to update the deadlock detector.
* The reason I think we haven't seen this deadlock (at all?) in practice is that _writes_ to the database's split point are quite infrequent, and a concurrent write is required to trigger the deadlock. The split point is only written when finalization advances, which is once per epoch (every ~6 minutes), and state reads are also quite sporadic. Perhaps we've just been incredibly lucky, or there's something about the timing of state reads vs database migration that protects us.
* I wrote a few small programs to demo the deadlock, and the effectiveness of the `read_recursive` fix: https://github.com/michaelsproul/relock_deadlock_mvp
* [The docs for `read_recursive`](https://docs.rs/lock_api/0.4.2/lock_api/struct.RwLock.html#method.read_recursive) warn of starvation for writers. I think in order for starvation to occur the database would have to be spammed with so many state reads that it's unable to ever clear them all and find time for a write, in which case migration of states to the freezer would cease. If an attack could be performed to trigger this starvation then it would likely trigger a deadlock in the current code, and I think ceasing migration is preferable to deadlocking in this extreme situation. In practice neither should occur due to protection from spammy peers at the network layer. Nevertheless, it would be prudent to run this change on the testnet nodes to check that it doesn't cause accidental starvation.
## Issue Addressed
Resolves#2313
## Proposed Changes
Provide `BeaconNodeHttpClient` with a dedicated `Timeouts` struct.
This will allow granular adjustment of the timeout duration for different calls made from the VC to the BN. These can either be a constant value, or as a ratio of the slot duration.
Improve timeout performance by using these adjusted timeout duration's only whenever a fallback endpoint is available.
Add a CLI flag called `use-long-timeouts` to revert to the old behavior.
## Additional Info
Additionally set the default `BeaconNodeHttpClient` timeouts to the be the slot duration of the network, rather than a constant 12 seconds. This will allow it to adjust to different network specifications.
Co-authored-by: Paul Hauner <paul@paulhauner.com>
## Proposed Changes
Implement the consensus changes necessary for the upcoming Altair hard fork.
## Additional Info
This is quite a heavy refactor, with pivotal types like the `BeaconState` and `BeaconBlock` changing from structs to enums. This ripples through the whole codebase with field accesses changing to methods, e.g. `state.slot` => `state.slot()`.
Co-authored-by: realbigsean <seananderson33@gmail.com>
## Issue Addressed
#2377
## Proposed Changes
Implement the same code used for block root lookups (from #2376) to state root lookups in order to improve performance and reduce associated memory spikes (e.g. from certain HTTP API requests).
## Additional Changes
- Tests using `rev_iter_state_roots` and `rev_iter_block_roots` have been refactored to use their `forwards` versions instead.
- The `rev_iter_state_roots` and `rev_iter_block_roots` functions are now unused and have been removed.
- The `state_at_slot` function has been changed to use the `forwards` iterator.
## Additional Info
- Some tests still need to be refactored to use their `forwards_iter` versions. These tests start their iteration from a specific beacon state and thus use the `rev_iter_state_roots_from` and `rev_iter_block_roots_from` functions. If they can be refactored, those functions can also be removed.
This updates some older dependencies to address a few cargo audit warnings.
The majority of warnings come from network dependencies which will be addressed in #2389.
This PR contains some minor dep updates that are not network related.
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
## Issue Addressed
#2225
## Proposed Changes
Exposes the version given from the `lighthouse_version` crate to the Prometheus metrics server.
## Additional Info
- This metric appears in both the Beacon Node and Validator Client metrics servers.
- This is the simplest solution. It might be better to include the version and commit hash as separate labels rather than combined, however this would be more involved. Happy to do it that way if this is too cumbersome to use.
- The metric appears as:
```
# HELP lighthouse_info The build of Lighthouse running on the server
# TYPE lighthouse_info gauge
lighthouse_info{version="Lighthouse/v1.4.0-379664a+"} 1
```
## Issue Addressed
Closes#1661
## Proposed Changes
Add a dummy package called `target_check` which gets compiled early in the build and fails if the target is 32-bit
## Additional Info
You can test the efficacy of this check with:
```
cross build --release --manifest-path lighthouse/Cargo.toml --target i686-unknown-linux-gnu
```
In which case this compilation error is shown:
```
error: Lighthouse requires a 64-bit CPU and operating system
--> common/target_check/src/lib.rs:8:1
|
8 | / assert_cfg!(
9 | | target_pointer_width = "64",
10 | | "Lighthouse requires a 64-bit CPU and operating system",
11 | | );
| |__^
```
## Issue Addressed
None.
## Proposed Changes
Run apt-get upgrade to install latest security updates.
## Additional Info
Images often take a long time to get the latest security updates, while running apt-get upgrade will pull the latest updates.
Co-authored-by: Age Manning <Age@AgeManning.com>
## Issue Addressed
Closes#2354
## Proposed Changes
Add a `minify` method to `slashing_protection::Interchange` that keeps only the maximum-epoch attestation and maximum-slot block for each validator. Specifically, `minify` constructs "synthetic" attestations (with no `signing_root`) containing the maximum source epoch _and_ the maximum target epoch from the input. This is equivalent to the `minify_synth` algorithm that I've formally verified in this repository:
https://github.com/michaelsproul/slashing-proofs
## Additional Info
Includes the JSON loading optimisation from #2347
## Issue Addressed
`make lint` failing on rust 1.53.0.
## Proposed Changes
1.53.0 updates
## Additional Info
I haven't figure out why yet, we were now hitting the recursion limit in a few crates. So I had to add `#![recursion_limit = "256"]` in a few places
Co-authored-by: realbigsean <seananderson33@gmail.com>
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
## Proposed Changes
A user on Discord (`@ChewsMacRibs`) reported that the validator monitor was logging `WARN Attested to an incorrect head` for their validator while it was awaiting activation.
This PR modifies the monitor so that it ignores inactive validators, by the logic that they are either awaiting activation, or have already exited. Either way, there's no way for an inactive validator to have their attestations included on chain, so no need for the monitor to report on them.
## Additional Info
To reproduce the bug requires registering validator keys manually with `--validator-monitor-pubkeys`. I don't think the bug will present itself with `--validator-monitor-auto`.
## Issue Addressed
#2293
## Proposed Changes
- Modify the handler for the `eth_chainId` RPC (i.e., `get_chain_id`) to explicitly match against the Geth error string returned for pre-EIP-155 synced Geth nodes
- ~~Add a new helper function, `rpc_error_msg`, to aid in the above point~~
- Refactor `response_result` into `response_result_or_error` and patch reliant RPC handlers accordingly (thanks to @pawanjay176)
## Additional Info
Geth, as of Pangaea Expanse (v1.10.0), returns an explicit error when it is not synced past the EIP-155 block (2675000). Previously, Geth simply returned a chain ID of 0 (which was obviously much easier to handle on Lighthouse's part).
Co-authored-by: Paul Hauner <paul@paulhauner.com>
## Proposed Changes
Unescape text for json comparison in:
3a24ca5f14/remote_signer/tests/sign.rs (L282-L285)
Which causes this error:
```
---- sign::invalid_field_fork stdout ----
thread 'sign::invalid_field_fork' panicked at 'assertion failed: `(left == right)`
left: `"Unable to parse body message from JSON: Error(\"invalid hex (InvalidHexCharacter { c: 'I', index: 0 })\", line: 1, column: 237097)"`,
right: `"Unable to parse body message from JSON: Error(\"invalid hex (InvalidHexCharacter { c: \\'I\\', index: 0 })\", line: 1, column: 237097)"`', testing/remote_signer_test/src/consumer.rs:144:5
```
This is my first contribution and happy to receive feedback if you have any. Thanks
## Issue Addressed
Resolves#2322
## Proposed Changes
Adds a `modify` command to `lighthouse account validator` with subcommands to enable and disable specific or all pubkeys.
## Issue Addressed
NA
## Proposed Changes
I've noticed some of the SigP Prater nodes struggling on v1.4.0-rc.0. I suspect this is due to the changes in #2296. Specifically, the trade-off which lowered the memory footprint whilst increasing runtime on some functions.
Presently, this PR is documenting my testing on Prater.
## Additional Info
NA
## Issue Addressed
NA
## Proposed Changes
Only run `configure_memory_alllocator` for the BN process.
I noticed that VC memory usage increases significantly with the new malloc tuning parameters. This was also raised by a user on [r/ethstaker](https://www.reddit.com/r/ethstaker/comments/nr8998/lighthouse_prerelease_v140rc0/h0fnt9l?utm_source=share&utm_medium=web2x&context=3).
There wasn't any issue with memory usage by the VC before we implemented #2296, so I think we were a bit overzealous when we allowed these changes to affect it. This PR allows things that weren't broken to remain unfixed.
## Additional Info
NA
## Issue Addressed
NA
## Proposed Changes
I am starting to see a lot of slog-async overflows (i.e., too many logs) on Prater whenever we see attestations for an unknown block. Since these logs are identical (except for peer id) and we expose volume/count of these errors via `metrics::GOSSIP_ATTESTATION_ERRORS_PER_TYPE`, I took the following actions to remove them from `DEBUG` logs:
- Push the "Attestation for unknown block" log to trace.
- Add a debug log in `search_for_block`. In effect, this should serve as a de-duped version of the previous, downgraded log.
## Additional Info
TBC
## Issue Addressed
NA
## Proposed Changes
Bump versions.
## Additional Info
This is not exactly the v1.4.0 release described in [Lighthouse Update #36](https://lighthouse.sigmaprime.io/update-36.html).
Whilst it contains:
- Beta Windows support
- A reduction in Eth1 queries
- A reduction in memory footprint
It does not contain:
- Altair
- Doppelganger Protection
- The remote signer
We have decided to release some features early. This is primarily due to the desire to allow users to benefit from the memory saving improvements as soon as possible.
## TODO
- [x] Wait for #2340, #2356 and #2376 to merge and then rebase on `unstable`.
- [x] Ensure discovery issues are fixed (see #2388)
- [x] Ensure https://github.com/sigp/lighthouse/pull/2382 is merged/removed.
- [x] Ensure https://github.com/sigp/lighthouse/pull/2383 is merged/removed.
- [x] Ensure https://github.com/sigp/lighthouse/pull/2384 is merged/removed.
- [ ] Double-check eth1 cache is carried between boots
## Issue Addressed
NA
## Proposed Changes
Reverts #2345 in the interests of getting v1.4.0 out this week. Once we have released that, we can go back to testing this again.
## Additional Info
NA
## Issue Addressed
NA
## Proposed Changes
When observing `jemallocator` heap profiles and Grafana, it became clear that Lighthouse is spending significant RAM/CPU on processing blocks from the RPC. On investigation, it seems that we are loading the parent of the block *before* we check to see if the block is already known. This is a big waste of resources.
This PR adds an additional `check_block_relevancy` call as the first thing we do when we try to process a `SignedBeaconBlock` via the RPC (or other similar methods). Ultimately, `check_block_relevancy` will be called again later in the block processing flow. It's a very light function and I don't think trying to optimize it out is worth the risk of a bad block slipping through.
Also adds a `New RPC block received` info log when we process a new RPC block. This seems like interesting and infrequent info.
## Additional Info
NA
## Issue Addressed
NA
## Proposed Changes
Return a very specific error when at attestation reads shuffling from a frozen `BeaconState`. Previously, this was returning `MissingBeaconState` which indicates a much more serious issue.
## Additional Info
Since `get_inconsistent_state_for_attestation_verification_only` is only called once in `BeaconChain::with_committee_cache`, it is quite easy to reason about the impact of this change.
## Issue Addressed
NA
## Proposed Changes
Whilst investigating #2372, I [learned](https://github.com/sigp/lighthouse/issues/2372#issuecomment-851725049) that the error message returned from some failed Eth1 requests are always `NotReachable`. This makes debugging quite painful.
This PR adds more detail to these errors. For example:
- Bad infura key: `ERRO Failed to update eth1 cache error: Failed to update Eth1 service: "All fallback errored: https://mainnet.infura.io/ => EndpointError(RequestFailed(\"Response HTTP status was not 200 OK: 401 Unauthorized.\"))", retry_millis: 60000, service: eth1_rpc`
- Unreachable server: `ERRO Failed to update eth1 cache error: Failed to update Eth1 service: "All fallback errored: http://127.0.0.1:8545/ => EndpointError(RequestFailed(\"Request failed: reqwest::Error { kind: Request, url: Url { scheme: \\\"http\\\", cannot_be_a_base: false, username: \\\"\\\", password: None, host: Some(Ipv4(127.0.0.1)), port: Some(8545), path: \\\"/\\\", query: None, fragment: None }, source: hyper::Error(Connect, ConnectError(\\\"tcp connect error\\\", Os { code: 111, kind: ConnectionRefused, message: \\\"Connection refused\\\" })) }\"))", retry_millis: 60000, service: eth1_rpc`
- Bad server: `ERRO Failed to update eth1 cache error: Failed to update Eth1 service: "All fallback errored: http://127.0.0.1:8545/ => EndpointError(RequestFailed(\"Response HTTP status was not 200 OK: 501 Not Implemented.\"))", retry_millis: 60000, service: eth1_rpc`
## Additional Info
NA
## Issue Addressed
NA
## Primary Change
When investigating memory usage, I noticed that retrieving a block from an early slot (e.g., slot 900) would cause a sharp increase in the memory footprint (from 400mb to 800mb+) which seemed to be ever-lasting.
After some investigation, I found that the reverse iteration from the head back to that slot was the likely culprit. To counter this, I've switched the `BeaconChain::block_root_at_slot` to use the forwards iterator, instead of the reverse one.
I also noticed that the networking stack is using `BeaconChain::root_at_slot` to check if a peer is relevant (`check_peer_relevance`). Perhaps the steep, seemingly-random-but-consistent increases in memory usage are caused by the use of this function.
Using the forwards iterator with the HTTP API alleviated the sharp increases in memory usage. It also made the response much faster (before it felt like to took 1-2s, now it feels instant).
## Additional Changes
In the process I also noticed that we have two functions for getting block roots:
- `BeaconChain::block_root_at_slot`: returns `None` for a skip slot.
- `BeaconChain::root_at_slot`: returns the previous root for a skip slot.
I unified these two functions into `block_root_at_slot` and added the `WhenSlotSkipped` enum. Now, the caller must be explicit about the skip-slot behaviour when requesting a root.
Additionally, I replaced `vec![]` with `Vec::with_capacity` in `store::chunked_vector::range_query`. I stumbled across this whilst debugging and made this modification to see what effect it would have (not much). It seems like a decent change to keep around, but I'm not concerned either way.
Also, `BeaconChain::get_ancestor_block_root` is unused, so I got rid of it 🗑️.
## Additional Info
I haven't also done the same for state roots here. Whilst it's possible and a good idea, it's more work since the fwds iterators are presently block-roots-specific.
Whilst there's a few places a reverse iteration of state roots could be triggered (e.g., attestation production, HTTP API), they're no where near as common as the `check_peer_relevance` call. As such, I think we should get this PR merged first, then come back for the state root iters. I made an issue here https://github.com/sigp/lighthouse/issues/2377.
## Issue Addressed
#2325
## Proposed Changes
This pull request changes the behavior of the Peer Manager by including a minimum outbound-only peers requirement. The peer manager will continue querying for peers if this outbound-only target number hasn't been met. Additionally, when peers are being removed, an outbound-only peer will not be disconnected if doing so brings us below the minimum.
## Additional Info
Unit test for heartbeat function tests that disconnection behavior is correct. Continual querying for peers if outbound-only hasn't been met is not directly tested, but indirectly through unit testing of the helper function that counts the number of outbound-only peers.
EDIT: Am concerned about the behavior of ```update_peer_scores```. If we have connected to a peer with a score below the disconnection threshold (-20), then its connection status will remain connected, while its score state will change to disconnected.
```rust
let previous_state = info.score_state();
// Update scores
info.score_update();
Self::handle_score_transitions(
previous_state,
peer_id,
info,
&mut to_ban_peers,
&mut to_unban_peers,
&mut self.events,
&self.log,
);
```
```previous_state``` will be set to Disconnected, and then because ```handle_score_transitions``` only changes connection status for a peer if the state changed, the peer remains connected. Then in the heartbeat code, because we only disconnect healthy peers if we have too many peers, these peers don't get disconnected. I'm not sure realistically how often this scenario would occur, but it might be better to adjust the logic to account for scenarios where the score state implies a connection status different from the current connection status.
Co-authored-by: Kevin Lu <kevlu93@gmail.com>