lighthouse

Author	SHA1	Message	Date
Michael Sproul	c27f2bf9c6	Avoid excessive logging of BN online status (#4315 ) ## Issue Addressed https://github.com/sigp/lighthouse/pull/4309#issuecomment-1556052261 ## Proposed Changes Log the `Connected to beacon node` message only if the node was previously offline. This avoids a regression in logging after #4295, whereby the `Connected to beacon node` message would be logged every slot. The new reduced logging is _slightly different_ from what we had prior to my changes in #4295. The main difference is that we used to log the `Connected` message whenever a node was online and subject to a health check (for being unhealthy in some other way). I think the new behaviour is reasonable, as the `Connected` message isn't particularly helpful if the BN is unhealthy, and the specific reason for unhealthiness will be logged by the warnings for `is_compatible`/`is_synced`.	2023-05-22 02:36:43 +00:00
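The "log only if the node was previously offline" rule in #4315 can be illustrated with a minimal sketch. `Status`, `Candidate`, and `mark_online` are hypothetical names invented for this illustration, and the returned string stands in for a real log call; this is not the Lighthouse implementation.

```rust
/// Hypothetical connectivity state of one beacon node (illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Online,
    Offline,
}

/// Hypothetical beacon node candidate as seen by the VC.
struct Candidate {
    endpoint: String,
    status: Status,
}

impl Candidate {
    /// Record a successful health check.
    ///
    /// Returns a log line only on an offline -> online transition, so a
    /// node that stays online produces no message on later checks.
    fn mark_online(&mut self) -> Option<String> {
        let was_offline = self.status == Status::Offline;
        self.status = Status::Online;
        if was_offline {
            Some(format!("Connected to beacon node, endpoint: {}", self.endpoint))
        } else {
            None
        }
    }
}
```

Calling `mark_online` once per slot on a healthy node yields a message on the first call after an outage and `None` afterwards.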
Michael Sproul	3052db29fe	Implement `el_offline` and use it in the VC (#4295 ) ## Issue Addressed Closes https://github.com/sigp/lighthouse/issues/4291, part of #3613. ## Proposed Changes - Implement the `el_offline` field on `/eth/v1/node/syncing`. We set `el_offline=true` if: - The EL's internal status is `Offline` or `AuthFailed`, _or_ - The most recent call to `newPayload` resulted in an error (more on this in a moment). - Use the `el_offline` field in the VC to mark nodes with offline ELs as _unsynced_. These nodes will still be used, but only after synced nodes. - Overhaul the usage of `RequireSynced` so that `::No` is used almost everywhere. The `--allow-unsynced` flag was broken and had the opposite effect to intended, so it has been deprecated. - Add tests for the EL being offline on the upcheck call, and being offline due to the newPayload check. ## Why track `newPayload` errors? Tracking the EL's online/offline status is too coarse-grained to be useful in practice, because: - If the EL is timing out on some calls, it's unlikely to time out on the `upcheck` call, which is _just_ `eth_syncing`. Every failed call is followed by an upcheck [here](`693886b941/beacon_node/execution_layer/src/engines.rs (L372-L380)`), which would have the effect of masking the failure and keeping the status _online_. - The `newPayload` call is the most likely to time out. It's the call in which ELs tend to do most of their work (often 1-2 seconds), with `forkchoiceUpdated` usually returning much faster (<50ms). - If `newPayload` is failing consistently (e.g. timing out) then this is a good indication that either the node's EL is in trouble, or the network as a whole is. In the first case validator clients _should_ prefer other BNs if they have one available. In the second case, all of their BNs will likely report `el_offline` and they'll just have to proceed with trying to use them. 
## Additional Changes - Add utility method `ForkName::latest` which is quite convenient for test writing, but probably other things too. - Delete some stale comments from when we used to support multiple execution nodes.	2023-05-17 05:51:56 +00:00
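The `el_offline` rule described in #4295 can be sketched as a small predicate. The `Offline` and `AuthFailed` variants come from the commit message; the other variants, the function names, and the `last_new_payload_errored` flag are assumptions made for illustration rather than the actual Lighthouse types.

```rust
/// Simplified stand-in for the EL's internal status. Only `Offline` and
/// `AuthFailed` are named in the commit message; the rest are assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EngineStatus {
    Synced,
    Syncing,
    Offline,
    AuthFailed,
}

/// `el_offline` is true if the EL status is `Offline` or `AuthFailed`,
/// or if the most recent `newPayload` call resulted in an error.
fn el_offline(status: EngineStatus, last_new_payload_errored: bool) -> bool {
    matches!(status, EngineStatus::Offline | EngineStatus::AuthFailed)
        || last_new_payload_errored
}

/// VC side (hypothetical helper): a node whose EL is offline is treated
/// as unsynced, so it is still used, but only after synced nodes.
fn treat_as_synced(is_syncing: bool, el_offline: bool) -> bool {
    !is_syncing && !el_offline
}
```

Folding the `newPayload` result into the flag is what lets a timing-out EL be demoted even while its upcheck call keeps succeeding.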
Paul Hauner	f775404c10	Log a `WARN` in the VC for a mismatched Capella fork epoch (#4050 ) ## Issue Addressed NA ## Proposed Changes - Adds a `WARN` statement for Capella, just like the previous forks. - Adds a hint message to all those WARNs to suggest the user update the BN or VC. ## Additional Info NA	2023-03-06 04:08:48 +00:00
Paul Hauner	6e15533b54	Add latency measurement service to VC (#4024 ) ## Issue Addressed NA ## Proposed Changes Adds a service which periodically polls (11s into each mainnet slot) the `node/version` endpoint on each BN and roughly measures the round-trip latency. The latency is exposed as a `DEBG` log and a Prometheus metric. The `--latency-measurement-service` has been added to the VC, with the following options: - `--latency-measurement-service true`: enable the service (default). - `--latency-measurement-service`: (without a value) has the same effect. - `--latency-measurement-service false`: disable the service. ## Additional Info Whilst looking at our staking setup, I think the BN+VC latency is contributing to late blocks. Now that we have to wait for the builders to respond it's nice to try and do everything we can to reduce that latency. Having visibility is the first step.	2023-03-05 23:43:29 +00:00
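The "11s into each mainnet slot" schedule from #4024 amounts to simple modular arithmetic. The sketch below assumes the 12-second mainnet slot duration; `time_until_next_poll` and both constants are hypothetical names, and the choice to wait a full slot when called exactly at the offset is an arbitrary one made for this illustration.

```rust
use std::time::Duration;

/// Mainnet slot duration (12 seconds).
const SLOT_DURATION: Duration = Duration::from_secs(12);
/// Offset into each slot at which the latency poll fires (from the commit message).
const POLL_OFFSET: Duration = Duration::from_secs(11);

/// Given the time elapsed since genesis, return how long to wait until the
/// next poll instant (11s into the current slot, or into the next slot if
/// that instant has already been reached).
fn time_until_next_poll(since_genesis: Duration) -> Duration {
    let slot_ms = SLOT_DURATION.as_millis() as u64;
    let offset_ms = POLL_OFFSET.as_millis() as u64;
    let into_slot_ms = since_genesis.as_millis() as u64 % slot_ms;
    let wait_ms = if into_slot_ms < offset_ms {
        offset_ms - into_slot_ms
    } else {
        slot_ms - into_slot_ms + offset_ms
    };
    Duration::from_millis(wait_ms)
}
```

A service loop would sleep for this duration, time a request to the `node/version` endpoint on each BN, and record the elapsed round trip.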
Atanas Minkov	2b6348781a	Log a debug message when a request fails for a beacon node candidate (#4036 ) ## Issue Addressed #3985 ## Proposed Changes Log a debug message when a BN candidate returns an error. `Mar 01 16:40:24.011 DEBG Request to beacon node failed error: ServerMessage(ErrorMessage { code: 503, message: "SERVICE_UNAVAILABLE: beacon node is syncing: head slot is 8416, current slot is 5098402", stacktraces: [] }), node: http://localhost:5052/`	2023-03-02 05:26:14 +00:00
Pawan Dhananjay	6779912fe4	Publish subscriptions to all beacon nodes (#3529 ) ## Issue Addressed Resolves #3516 ## Proposed Changes Adds a beacon fallback function for running a beacon node http query on all available fallbacks instead of returning on a first successful result. Uses the new `run_on_all` method for attestation and sync committee subscriptions.	2022-09-28 19:53:35 +00:00
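The difference between returning on the first success and running on every fallback (#3529) can be sketched with two free functions. They borrow the names `first_success` and `run_on_all` from the commit messages, but the synchronous signatures, the `&str` node handles, and the error-list return types are simplifying assumptions, not the Lighthouse API.

```rust
/// Try each node in order and stop at the first success.
/// On total failure, return every (node, error) pair.
fn first_success<T, E>(
    nodes: &[&str],
    request: impl Fn(&str) -> Result<T, E>,
) -> Result<T, Vec<(String, E)>> {
    let mut errors = Vec::new();
    for &node in nodes {
        match request(node) {
            Ok(value) => return Ok(value),
            Err(e) => errors.push((node.to_string(), e)),
        }
    }
    Err(errors)
}

/// Send the request to every node regardless of earlier successes and
/// collect the failures. Suited to subscriptions, which each beacon
/// node needs to receive.
fn run_on_all<E>(
    nodes: &[&str],
    request: impl Fn(&str) -> Result<(), E>,
) -> Vec<(String, E)> {
    nodes
        .iter()
        .filter_map(|&node| request(node).err().map(|e| (node.to_string(), e)))
        .collect()
}
```

With three nodes of which only the first is down, `first_success` makes two requests while `run_on_all` makes three and reports one failure.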
realbigsean	2ce86a0830	Validator registration request failures do not cause us to mark BNs offline (#3488 ) ## Issue Addressed Relates to https://github.com/sigp/lighthouse/issues/3416 ## Proposed Changes - Add an `OfflineOnFailure` enum to the `first_success` method for querying beacon nodes so that a validator registration request failure from the BN -> builder does not result in the BN being marked offline. This seems important because these failures could be coming directly from a connected relay and actually have no bearing on BN health. Other messages that are sent to a relay have a local fallback so shouldn't result in errors. - Downgrade the following log to a `WARN` ``` ERRO Unable to publish validator registrations to the builder network, error: All endpoints failed https://BN_B => RequestFailed(ServerMessage(ErrorMessage { code: 500, message: "UNHANDLED_ERROR: BuilderMissing", stacktraces: [] })), https://XXXX/ => Unavailable(Offline), [omitted] ``` ## Additional Info I think this change at least improves the UX of having a VC connected to some builder and some non-builder beacon nodes. I think we need to balance potentially alerting users that there is a BN <> VC misconfiguration and also allowing this type of fallback to work. If we want to fully support this type of configuration we may want to consider adding a flag `--builder-beacon-nodes` and track whether a VC should be making builder queries on a per-beacon node basis. But I think the changes in this PR are independent of that type of extension. PS: Sorry for the big diff here, it's mostly formatting changes after I added a new arg to a bunch of method calls. Co-authored-by: realbigsean <sean@sigmaprime.io>	2022-08-29 11:35:59 +00:00
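The effect of the `OfflineOnFailure` argument in #3488 can be sketched as follows. The `Yes`/`No` variant names are an assumption modelled on the `RequireSynced::No` usage mentioned elsewhere in this log, and `Candidate` plus this synchronous `first_success` are hypothetical simplifications.

```rust
/// Whether a failed request should mark the beacon node offline.
/// Variant names are assumed for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum OfflineOnFailure {
    Yes,
    No,
}

/// Hypothetical beacon node candidate.
struct Candidate {
    endpoint: &'static str,
    online: bool,
}

/// Run `request` on each online candidate until one succeeds. A failure
/// only demotes the node when the caller asks for it.
fn first_success<T, E>(
    candidates: &mut [Candidate],
    offline_on_failure: OfflineOnFailure,
    request: impl Fn(&str) -> Result<T, E>,
) -> Option<T> {
    for candidate in candidates.iter_mut().filter(|c| c.online) {
        match request(candidate.endpoint) {
            Ok(value) => return Some(value),
            Err(_) => {
                if offline_on_failure == OfflineOnFailure::Yes {
                    candidate.online = false;
                }
            }
        }
    }
    None
}
```

A validator registration call would pass `OfflineOnFailure::No`, so an error such as `BuilderMissing` coming from a relay leaves the BN available for duties and block proposals.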
Michael Sproul	4e05f19fb5	Serve Bellatrix preset in BN API (#3425 ) ## Issue Addressed Resolves #3388 Resolves #2638 ## Proposed Changes - Return the `BellatrixPreset` on `/eth/v1/config/spec` by default. - Allow users to opt out of this by providing `--http-spec-fork=altair` (unless there's a Bellatrix fork epoch set). - Add the Altair constants from #2638 and make serving the constants non-optional (the `http-disable-legacy-spec` flag is deprecated). - Modify the VC to only read the `Config` and not to log extra fields. This prevents it from having to muck around parsing the `ConfigAndPreset` fields it doesn't need. ## Additional Info This change is backwards-compatible for the VC and the BN, but is marked as a breaking change for the removal of `--http-disable-legacy-spec`. I tried making `Config` a `superstruct` too, but getting the automatic decoding to work was a huge pain and was going to require a lot of hacks, so I gave up in favour of keeping the default-based approach we have now.	2022-08-10 07:52:59 +00:00
Paul Hauner	5a0b049049	Avoid hogging the fallback `status` lock in the VC (#3022 ) ## Issue Addressed Addresses https://github.com/sigp/lighthouse/issues/2926 ## Proposed Changes Appropriated from https://github.com/sigp/lighthouse/issues/2926#issuecomment-1039676768: When a node returns any error we call [`CandidateBeaconNode::set_offline`](`c3a793fd73/validator_client/src/beacon_node_fallback.rs (L424)`) which sets its `status` to `CandidateError::Offline`. That node will then be ignored until the routine [`fallback_updater_service`](`c3a793fd73/validator_client/src/beacon_node_fallback.rs (L44)`) manages to reconnect to it. However, I believe there was an issue in the [`CandidateBeaconNode::refresh_status`](`c3a793fd73/validator_client/src/beacon_node_fallback.rs (L157-L178)`) method, which is used by the updater service to see if the node has come good again. It was holding a [write lock on the `status` field](`c3a793fd73/validator_client/src/beacon_node_fallback.rs (L165)`) whilst it polled the node status. This means a long timeout would hog the write lock and starve other processes. When a VC is trying to access a beacon node for whatever purpose (getting duties, posting blocks, etc), it performs [three passes](`c3a793fd73/validator_client/src/beacon_node_fallback.rs (L432-L482)`) through the lists of nodes, trying to run some generic `function` (closure, lambda, etc) on each node: - 1st pass: only try running `function` on all nodes which are both synced and online. - 2nd pass: try running `function` on all nodes that are online, but not necessarily synced. - 3rd pass: for each offline node, try refreshing its status and then running `function` on it. So, it turns out that if the `CandidateBeaconNode::refresh_status` function from the routine update service is hogging the write-lock, the 1st pass gets blocked whilst trying to read the status of the first node. 
So, nodes that should be left until the 3rd pass are blocking the process of the 1st and 2nd passes, hence the behaviour described in #2926. ## Additional Info NA	2022-02-22 03:09:00 +00:00
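The three passes and the lock fix described in #3022 can be sketched with a standard-library `RwLock`. Everything here is a simplified assumption: the real code is asynchronous, and `Status`, `Candidate`, `poll`, and this `first_success` are hypothetical stand-ins. The point illustrated is that `refresh_status` polls the node before taking the write lock, so a slow poll cannot block readers in the first two passes.

```rust
use std::sync::RwLock;

/// Simplified node status (illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Synced,
    Unsynced,
    Offline,
}

struct Candidate {
    endpoint: String,
    status: RwLock<Status>,
}

impl Candidate {
    /// Short-lived read lock: copy the status out and release.
    fn status(&self) -> Status {
        *self.status.read().unwrap()
    }

    /// Poll first (possibly slow, no lock held), then take the write
    /// lock only for the assignment.
    fn refresh_status(&self, poll: impl Fn(&str) -> Status) -> Status {
        let new_status = poll(&self.endpoint);
        *self.status.write().unwrap() = new_status;
        new_status
    }
}

/// Run `function` on candidates in three passes, returning the first success.
fn first_success<T>(
    candidates: &[Candidate],
    poll: impl Fn(&str) -> Status,
    function: impl Fn(&str) -> Option<T>,
) -> Option<T> {
    // 1st pass: nodes that are online and synced.
    for c in candidates.iter().filter(|c| c.status() == Status::Synced) {
        if let Some(value) = function(&c.endpoint) {
            return Some(value);
        }
    }
    // 2nd pass: nodes that are online but not synced.
    for c in candidates.iter().filter(|c| c.status() == Status::Unsynced) {
        if let Some(value) = function(&c.endpoint) {
            return Some(value);
        }
    }
    // 3rd pass: refresh each offline node, then try it if it came back.
    for c in candidates.iter().filter(|c| c.status() == Status::Offline) {
        if c.refresh_status(&poll) != Status::Offline {
            if let Some(value) = function(&c.endpoint) {
                return Some(value);
            }
        }
    }
    None
}
```

This sketch does not mark nodes offline on failure; it only shows the pass ordering and the narrowed lock scope.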
Michael Sproul	69288f6164	VC: don't warn if BN config doesn't match exactly (#2952 ) ## Proposed Changes Remove the check for exact equality on the beacon node spec when polling `/config/spec` from the VC. This check was always overzealous, and mostly served to check that the BN was configured for upcoming forks. I've replaced it by explicit checks of the `altair_fork_epoch` and `bellatrix_fork_epoch` instead. ## Additional Info We should come back to this and clean it up so that we can retain compatibility while removing the field `default`s we installed.	2022-01-24 22:33:04 +00:00
Michael Sproul	c0122e1a52	Refine VC->BN config check (#2636 ) ## Proposed Changes Instead of checking for strict equality between a BN's spec and the VC's local spec, just check the genesis fork version. This prevents us from failing eagerly for minor differences, while still protecting the VC from connecting to a completely incompatible BN. A warning is retained for the previous case where the specs are not exactly equal, which is to be expected if e.g. running against Infura before Infura configures the mainnet Altair fork epoch.	2021-09-27 04:22:07 +00:00
realbigsean	50321c6671	Updates to make crates publishable (#2472 ) ## Issue Addressed Related to: #2259 Made an attempt at all the necessary updates here to publish the crates to crates.io. I incremented the minor versions on all the crates that have been previously published. We still might run into some issues as we try to publish because I'm not able to test this out but I think it's a good starting point. ## Proposed Changes - Add description and license to `ssz_types` and `serde_util` - rename `serde_util` to `eth2_serde_util` - increment minor versions - remove path dependencies - remove patch dependencies ## Additional Info Crates published: - [x] `tree_hash` -- need to publish `tree_hash_derive` and `eth2_hashing` first - [x] `eth2_ssz_types` -- need to publish `eth2_serde_util` first - [x] `tree_hash_derive` - [x] `eth2_ssz` - [x] `eth2_ssz_derive` - [x] `eth2_serde_util` - [x] `eth2_hashing` Co-authored-by: realbigsean <seananderson33@gmail.com>	2021-09-03 01:10:25 +00:00
Michael Sproul	b4689e20c6	Altair consensus changes and refactors (#2279 ) ## Proposed Changes Implement the consensus changes necessary for the upcoming Altair hard fork. ## Additional Info This is quite a heavy refactor, with pivotal types like the `BeaconState` and `BeaconBlock` changing from structs to enums. This ripples through the whole codebase with field accesses changing to methods, e.g. `state.slot` => `state.slot()`. Co-authored-by: realbigsean <seananderson33@gmail.com>	2021-07-09 06:15:32 +00:00
Pawan Dhananjay	fdaeec631b	Monitoring service api (#2251 ) ## Issue Addressed N/A ## Proposed Changes Adds a client-side API for collecting system and process metrics and pushing them to a monitoring service.	2021-05-26 05:58:41 +00:00
Michael Sproul	f9d60f5436	VC: accept unknown fields in chain spec (#2277 ) ## Issue Addressed Closes #2274 ## Proposed Changes * Modify the `YamlConfig` to collect unknown fields into an `extra_fields` map, instead of failing hard. * Log a debug message if there are extra fields returned to the VC from one of its BNs. This restores Lighthouse's compatibility with Teku beacon nodes (and therefore Infura)	2021-03-26 04:53:57 +00:00
Paul Hauner	c2eac8e5bd	Remove duplicate log in BN fallback (#2116 ) ## Issue Addressed NA ## Proposed Changes - Removes a duplicated log in the fallback code for the VC. - Updates the text in the remaining de-duped log. ## Additional Info Example ``` Dec 23 05:19:54.003 WARN Beacon node is syncing endpoint: http://xxxx:5052/, head_slot: 88224, sync_distance: 161774 Dec 23 05:19:54.003 WARN Beacon node is not synced endpoint: http://xxxxx:5052/ ```	2021-01-06 03:01:48 +00:00
Paul Hauner	a62dc65ca4	BN Fallback v2 (#2080 ) ## Issue Addressed - Resolves #1883 ## Proposed Changes This follows on from @blacktemplar's work in #2018. - Allows the VC to connect to multiple BNs for redundancy. - Update the simulator so some nodes always need to rely on their fallback. - Adds some extra deprecation warnings for `--eth1-endpoint` - Pass `SignatureBytes` as a reference instead of by value. ## Additional Info NA Co-authored-by: blacktemplar <blacktemplar@a1.net>	2020-12-18 09:17:03 +00:00

17 Commits