3052db29fe
## Issue Addressed
Closes https://github.com/sigp/lighthouse/issues/4291, part of #3613.
## Proposed Changes
- Implement the `el_offline` field on `/eth/v1/node/syncing`. We set `el_offline=true` if:
  - The EL's internal status is `Offline` or `AuthFailed`, _or_
  - The most recent call to `newPayload` resulted in an error (more on this in a moment).
- Use the `el_offline` field in the VC to mark nodes with offline ELs as _unsynced_. These nodes will still be used, but only after synced nodes.
- Overhaul the usage of `RequireSynced` so that `::No` is used almost everywhere. The `--allow-unsynced` flag was broken and had the opposite effect to intended, so it has been deprecated.
- Add tests for the EL being offline on the upcheck call, and being offline due to the newPayload check.
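
The `el_offline` rule described above can be sketched as a small predicate. This is a minimal illustration only: `EngineStatus` and the function `el_offline` are hypothetical names chosen for the example, and the `Online` variant is an assumption standing in for any status other than the two named in the list.

```rust
/// Hypothetical stand-in for the EL's internal status. Only `Offline` and
/// `AuthFailed` are named in the description; `Online` is an assumption.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EngineStatus {
    Online,
    Offline,
    AuthFailed,
}

/// Hypothetical helper returning the value to report as `el_offline`.
///
/// The EL is reported offline if its status is `Offline` or `AuthFailed`,
/// or if the most recent `newPayload` call resulted in an error.
fn el_offline(status: EngineStatus, last_new_payload_errored: bool) -> bool {
    matches!(status, EngineStatus::Offline | EngineStatus::AuthFailed)
        || last_new_payload_errored
}
```

Under this sketch, an EL whose status is `Online` is still reported as offline when its last `newPayload` call failed, which is the case the status check alone would miss.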
## Why track `newPayload` errors?
Tracking the EL's online/offline status is too coarse-grained to be useful in practice, because:
- If the EL is timing out on some calls, it's unlikely to time out on the `upcheck` call, which is _just_ `eth_syncing`. Every failed call is followed by an upcheck (see `beacon_node/execution_layer/src/engines.rs`, lines 372-380, at commit `693886b941`), which would have the effect of masking the failure and keeping the status _online_.
- The `newPayload` call is the most likely to time out. It's the call in which ELs tend to do most of their work (often 1-2 seconds), with `forkchoiceUpdated` usually returning much faster (<50ms).
- If `newPayload` is failing consistently (e.g. timing out) then this is a good indication that either the node's EL is in trouble, or the network as a whole is. In the first case validator clients _should_ prefer other BNs if they have one available. In the second case, all of their BNs will likely report `el_offline` and they'll just have to proceed with trying to use them.
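
The preference described above (use synced nodes first, fall back to unsynced ones, including those reporting `el_offline`) can be sketched as a stable sort. This is an illustrative sketch under assumptions, not the VC's actual fallback code: the `Candidate` struct and `order_candidates` function are hypothetical.

```rust
/// Hypothetical view of a beacon node candidate as seen by the VC.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Candidate {
    name: String,
    /// False if the BN is syncing or reports `el_offline`.
    synced: bool,
}

/// Returns candidates with synced nodes first and unsynced nodes after them.
/// Unsynced nodes are kept, so they can still be tried as a last resort.
fn order_candidates(candidates: &[Candidate]) -> Vec<&Candidate> {
    let mut ordered: Vec<&Candidate> = candidates.iter().collect();
    // `sort_by_key` is stable and `false < true`, so keying on `!synced`
    // moves synced nodes to the front while preserving the original order
    // within each group.
    ordered.sort_by_key(|c| !c.synced);
    ordered
}
```

If every candidate is unsynced, as in the network-wide failure case, the order is unchanged and the VC simply proceeds through the list.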
## Additional Changes
- Add the utility method `ForkName::latest`, which is convenient for writing tests and is probably useful elsewhere too.
- Delete some stale comments from when we used to support multiple execution nodes.
83 lines
2.8 KiB
Rust
use crate::beacon_node_fallback::CandidateError;
use eth2::BeaconNodeHttpClient;
use slog::{debug, error, warn, Logger};
use slot_clock::SlotClock;

/// A distance in slots.
const SYNC_TOLERANCE: u64 = 4;

/// Returns
///
///  `Ok(())` if the beacon node is synced and ready for action,
///  `Err(CandidateError::Offline)` if the beacon node is unreachable,
///  `Err(CandidateError::NotSynced)` if the beacon node indicates that it is syncing **AND**
///                                  it is more than `SYNC_TOLERANCE` behind the highest
///                                  known slot.
///
///  The second condition means that even if the beacon node thinks that it's syncing, we'll still
///  try to use it if it's close enough to the head.
pub async fn check_synced<T: SlotClock>(
    beacon_node: &BeaconNodeHttpClient,
    slot_clock: &T,
    log_opt: Option<&Logger>,
) -> Result<(), CandidateError> {
    let resp = match beacon_node.get_node_syncing().await {
        Ok(resp) => resp,
        Err(e) => {
            if let Some(log) = log_opt {
                warn!(
                    log,
                    "Unable connect to beacon node";
                    "error" => %e
                )
            }

            return Err(CandidateError::Offline);
        }
    };

    // Default EL status to "online" for backwards-compatibility with BNs that don't include it.
    let el_offline = resp.data.el_offline.unwrap_or(false);
    let bn_is_synced = !resp.data.is_syncing || (resp.data.sync_distance.as_u64() < SYNC_TOLERANCE);
    let is_synced = bn_is_synced && !el_offline;

    if let Some(log) = log_opt {
        if !is_synced {
            debug!(
                log,
                "Beacon node sync status";
                "status" => format!("{:?}", resp),
            );

            warn!(
                log,
                "Beacon node is not synced";
                "sync_distance" => resp.data.sync_distance.as_u64(),
                "head_slot" => resp.data.head_slot.as_u64(),
                "endpoint" => %beacon_node,
                "el_offline" => el_offline,
            );
        }

        if let Some(local_slot) = slot_clock.now() {
            let remote_slot = resp.data.head_slot + resp.data.sync_distance;
            if remote_slot + 1 < local_slot || local_slot + 1 < remote_slot {
                error!(
                    log,
                    "Time discrepancy with beacon node";
                    "msg" => "check the system time on this host and the beacon node",
                    "beacon_node_slot" => remote_slot,
                    "local_slot" => local_slot,
                    "endpoint" => %beacon_node,
                );
            }
        }
    }

    if is_synced {
        Ok(())
    } else {
        Err(CandidateError::NotSynced)
    }
}
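
To make the two decisions in `check_synced` easier to reason about in isolation, they can be restated as pure functions. This is only a sketch: the function names `is_synced` and `has_time_discrepancy` are hypothetical, and plain `u64` values stand in for the slot types used by the real response.

```rust
/// A distance in slots, mirroring the constant in the file above.
const SYNC_TOLERANCE: u64 = 4;

/// Hypothetical pure restatement of the sync decision.
/// `el_offline` is `None` when the BN does not report the field, in which
/// case the EL is assumed to be online for backwards-compatibility.
fn is_synced(is_syncing: bool, sync_distance: u64, el_offline: Option<bool>) -> bool {
    let el_offline = el_offline.unwrap_or(false);
    let bn_is_synced = !is_syncing || sync_distance < SYNC_TOLERANCE;
    bn_is_synced && !el_offline
}

/// Hypothetical pure restatement of the clock check: the BN's view of the
/// current slot (head plus sync distance) may differ from the local slot
/// by at most one slot in either direction.
fn has_time_discrepancy(head_slot: u64, sync_distance: u64, local_slot: u64) -> bool {
    let remote_slot = head_slot + sync_distance;
    remote_slot + 1 < local_slot || local_slot + 1 < remote_slot
}
```

Because the comparison is a strict `<`, a syncing node whose distance is exactly `SYNC_TOLERANCE` is treated as not synced in this sketch, matching the expression in the function above.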