lighthouse/beacon_node/execution_layer/src/engines.rs
Divma 8c69d57c2c Pause sync when EE is offline (#3428)
## Issue Addressed

#3032

## Proposed Changes

Pause sync when the EE is offline. The changes have three main parts:
- Online/offline notification system
- Pause sync
- Resume sync

#### Online/offline notification system
- The engine state is now wrapped in a new struct `State` that ensures every change is correctly reported to watchers. Notifications are only sent when the state actually changes. As before, the state sits behind an `RwLock`, which remains the synchronization mechanism.
- The notification channel itself is a [tokio::sync::watch](https://docs.rs/tokio/latest/tokio/sync/watch/index.html), which keeps only the latest value in the receiver. This way we don't need to worry about message ordering, backlogs, etc.
- Sync waits for state changes concurrently with its normal messages (see the sketch after this list).
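
To illustrate the last point, here is a minimal sketch of a sync-style event loop that polls the `WatchStream` alongside its ordinary message channel. `SyncMessage`, the channel wiring, and the pause/resume hooks are illustrative stand-ins; only `EngineState` comes from this file:

```rust
use tokio::sync::mpsc;
use tokio_stream::{wrappers::WatchStream, StreamExt};

/// Illustrative stand-in for the messages sync normally processes.
enum SyncMessage {
    RpcBlock,
    BatchProcessed,
}

/// Simplified sync event loop: ordinary messages and engine state changes are
/// polled concurrently, so online/offline transitions are observed promptly.
async fn sync_loop(
    mut input: mpsc::UnboundedReceiver<SyncMessage>,
    mut engine_state: WatchStream<EngineState>,
) {
    loop {
        tokio::select! {
            Some(msg) = input.recv() => {
                // Normal sync work (lookups, range batches, ...).
                let _ = msg;
            }
            Some(state) = engine_state.next() => match state {
                // Hypothetical hooks; the real logic lives in the sync manager.
                EngineState::Online => { /* resume: re-enable lookups, flush held batches */ }
                EngineState::Offline => { /* pause: drop lookups, stop sending batches */ }
            }
        }
    }
}
```

Because the watch channel only retains the latest value, a burst of offline/online flaps collapses into whatever the most recent state is by the time the loop gets to it.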

#### Pause Sync
Sync has four components; pausing is handled differently in each:
- **Block lookups**: Disabled while the EE is offline. We drop current requests and don't search for new blocks. Block lookups are infrequent, and I don't think it's worth the extra logic of keeping them and delaying processing. If we later see that this is required, we can add it.
- **Parent lookups**: Disabled while the EE is offline. We drop current requests and don't search for new parents. Parent lookups are even less frequent, and I don't think it's worth the extra logic of keeping them and delaying processing. If we later see that this is required, we can add it.
- **Range**: Chains don't send batches to the beacon processor for processing. This is easily done by guarding the channel to the beacon processor and giving chains access to it only while the EE is responsive (see the sketch after this list). I find this the simplest and most powerful approach: we don't need new sync states, and chain segments added while the EE is offline follow the same logic without having to synchronize a shared state among them. Another advantage of a passive pause over an active pause is that we still keep track of actively advertised chain segments, so on resume we don't need to re-evaluate all our peers.
- **Backfill**: Not affected by EE state; we don't pause.
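
To make the "guarded channel" idea concrete, here is a rough sketch of a passive pause for range sync. Every type and method name (`ProcessBatch`, `BeaconProcessorGuard`, `processor`) is a hypothetical stand-in, not the actual sync code:

```rust
use tokio::sync::mpsc;

/// Illustrative work item destined for the beacon processor.
struct ProcessBatch;

/// Hands out the beacon-processor sender only while the EE is online, so chains
/// simply cannot submit batches while it is offline (a passive pause).
struct BeaconProcessorGuard {
    sender: mpsc::Sender<ProcessBatch>,
    engine_online: bool,
}

impl BeaconProcessorGuard {
    /// Updated from the engine state watcher.
    fn set_online(&mut self, online: bool) {
        self.engine_online = online;
    }

    /// Chains ask for the sender through this method; `None` means "hold the
    /// batch and try again on resume".
    fn processor(&self) -> Option<&mpsc::Sender<ProcessBatch>> {
        self.engine_online.then_some(&self.sender)
    }
}
```

Because pausing is just "the sender is unavailable", chain segments added while the EE is offline need no extra bookkeeping; they hit the same guard as everyone else.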

#### Resume Sync
- **Block lookups**: Enabled again.
- **Parent lookups**: Enabled again.
- **Range**: Active resume. Since the only thing pausing range really does is stop sending batches for processing, resuming makes every chain that is holding ready-for-processing batches send them (see the sketch below).
- **Backfill**: Not affected by EE state; no need to resume.
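
Continuing the hypothetical sketch above, the active part of resuming range sync is each chain flushing the batches it held while the sender was unavailable:

```rust
/// Illustrative syncing chain holding batches that are ready for processing.
struct SyncingChain {
    ready_batches: Vec<ProcessBatch>,
}

impl SyncingChain {
    /// On resume, send every batch that was held while the EE was offline.
    fn resume(&mut self, guard: &BeaconProcessorGuard) {
        if let Some(sender) = guard.processor() {
            for batch in self.ready_batches.drain(..) {
                // `try_send` keeps the sketch simple; the real code goes through
                // the beacon processor's own work queue.
                let _ = sender.try_send(batch);
            }
        }
    }
}
```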

## Additional Info

**QUESTION**: Originally I made this notify and change on the synced state, but @pawanjay176, in talks with @paulhauner, concluded that we only need to track online/offline states. The upcheck function mentions extra checks to keep a very up-to-date sync status to aid the networking stack. However, the only need the networking stack has is this one. I added a TODO to review whether the extra check can be removed.

Next gen of #3094

Will work best with #3439 

Co-authored-by: Pawan Dhananjay <pawandhananjay@gmail.com>
2022-08-24 23:34:56 +00:00

//! Provides generic behaviour for multiple execution engines, specifically fallback behaviour.

use crate::engine_api::{
    Error as EngineApiError, ForkchoiceUpdatedResponse, PayloadAttributes, PayloadId,
};
use crate::HttpJsonRpc;
use lru::LruCache;
use slog::{debug, error, info, Logger};
use std::future::Future;
use std::sync::Arc;
use task_executor::TaskExecutor;
use tokio::sync::{watch, Mutex, RwLock};
use tokio_stream::wrappers::WatchStream;
use types::{Address, ExecutionBlockHash, Hash256};

/// The number of payload IDs that will be stored for each `Engine`.
///
/// Since the size of each value is small (~100 bytes) a large number is used for safety.
const PAYLOAD_ID_LRU_CACHE_SIZE: usize = 512;

/// Stores the remembered state of an engine.
#[derive(Copy, Clone, PartialEq, Debug, Eq, Default)]
enum EngineStateInternal {
    Synced,
    #[default]
    Offline,
    Syncing,
    AuthFailed,
}

/// A subset of the engine state to inform other services if the engine is online or offline.
#[derive(Debug, Clone, PartialEq, Eq, Copy)]
pub enum EngineState {
    Online,
    Offline,
}

impl From<EngineStateInternal> for EngineState {
    fn from(state: EngineStateInternal) -> Self {
        match state {
            EngineStateInternal::Synced | EngineStateInternal::Syncing => EngineState::Online,
            EngineStateInternal::Offline | EngineStateInternal::AuthFailed => EngineState::Offline,
        }
    }
}

/// Wrapper structure that ensures changes to the engine state are correctly reported to watchers.
struct State {
    /// The actual engine state.
    state: EngineStateInternal,
    /// Notifier to watch the engine state.
    notifier: watch::Sender<EngineState>,
}

impl std::ops::Deref for State {
    type Target = EngineStateInternal;

    fn deref(&self) -> &Self::Target {
        &self.state
    }
}

impl Default for State {
    fn default() -> Self {
        let state = EngineStateInternal::default();
        let (notifier, _receiver) = watch::channel(state.into());
        State { state, notifier }
    }
}

impl State {
    /// Updates the state and notifies all watchers if the state has changed.
    pub fn update(&mut self, new_state: EngineStateInternal) {
        self.state = new_state;
        self.notifier.send_if_modified(|last_state| {
            let changed = *last_state != new_state.into(); // notify conditionally
            *last_state = new_state.into(); // update the state unconditionally
            changed
        });
    }

    /// Gives access to a channel containing whether the last state is online.
    ///
    /// This can be called several times.
    pub fn watch(&self) -> WatchStream<EngineState> {
        self.notifier.subscribe().into()
    }
}

#[derive(Copy, Clone, PartialEq, Debug)]
pub struct ForkChoiceState {
    pub head_block_hash: ExecutionBlockHash,
    pub safe_block_hash: ExecutionBlockHash,
    pub finalized_block_hash: ExecutionBlockHash,
}

#[derive(Hash, PartialEq, std::cmp::Eq)]
struct PayloadIdCacheKey {
    pub head_block_hash: ExecutionBlockHash,
    pub timestamp: u64,
    pub prev_randao: Hash256,
    pub suggested_fee_recipient: Address,
}

#[derive(Debug)]
pub enum EngineError {
    Offline,
    Api { error: EngineApiError },
    BuilderApi { error: EngineApiError },
    Auth,
}

/// An execution engine.
pub struct Engine {
    pub api: HttpJsonRpc,
    payload_id_cache: Mutex<LruCache<PayloadIdCacheKey, PayloadId>>,
    state: RwLock<State>,
    latest_forkchoice_state: RwLock<Option<ForkChoiceState>>,
    executor: TaskExecutor,
    log: Logger,
}

impl Engine {
    /// Creates a new, offline engine.
    pub fn new(api: HttpJsonRpc, executor: TaskExecutor, log: &Logger) -> Self {
        Self {
            api,
            payload_id_cache: Mutex::new(LruCache::new(PAYLOAD_ID_LRU_CACHE_SIZE)),
            state: Default::default(),
            latest_forkchoice_state: Default::default(),
            executor,
            log: log.clone(),
        }
    }

    /// Gives access to a channel containing the last engine state.
    ///
    /// This can be called several times.
    pub async fn watch_state(&self) -> WatchStream<EngineState> {
        self.state.read().await.watch()
    }

    pub async fn get_payload_id(
        &self,
        head_block_hash: ExecutionBlockHash,
        timestamp: u64,
        prev_randao: Hash256,
        suggested_fee_recipient: Address,
    ) -> Option<PayloadId> {
        self.payload_id_cache
            .lock()
            .await
            .get(&PayloadIdCacheKey {
                head_block_hash,
                timestamp,
                prev_randao,
                suggested_fee_recipient,
            })
            .cloned()
    }

    pub async fn notify_forkchoice_updated(
        &self,
        forkchoice_state: ForkChoiceState,
        payload_attributes: Option<PayloadAttributes>,
        log: &Logger,
    ) -> Result<ForkchoiceUpdatedResponse, EngineApiError> {
        let response = self
            .api
            .forkchoice_updated_v1(forkchoice_state, payload_attributes)
            .await?;

        if let Some(payload_id) = response.payload_id {
            if let Some(key) =
                payload_attributes.map(|pa| PayloadIdCacheKey::new(&forkchoice_state, &pa))
            {
                self.payload_id_cache.lock().await.put(key, payload_id);
            } else {
                debug!(
                    log,
                    "Engine returned unexpected payload_id";
                    "payload_id" => ?payload_id
                );
            }
        }

        Ok(response)
    }

    async fn get_latest_forkchoice_state(&self) -> Option<ForkChoiceState> {
        *self.latest_forkchoice_state.read().await
    }

    pub async fn set_latest_forkchoice_state(&self, state: ForkChoiceState) {
        *self.latest_forkchoice_state.write().await = Some(state);
    }

    async fn send_latest_forkchoice_state(&self) {
        let latest_forkchoice_state = self.get_latest_forkchoice_state().await;

        if let Some(forkchoice_state) = latest_forkchoice_state {
            if forkchoice_state.head_block_hash == ExecutionBlockHash::zero() {
                debug!(
                    self.log,
                    "No need to call forkchoiceUpdated";
                    "msg" => "head does not have execution enabled",
                );
                return;
            }

            info!(
                self.log,
                "Issuing forkchoiceUpdated";
                "forkchoice_state" => ?forkchoice_state,
            );

            // For simplicity, payload attributes are never included in this call. It may be
            // reasonable to include them in the future.
            if let Err(e) = self.api.forkchoice_updated_v1(forkchoice_state, None).await {
                debug!(
                    self.log,
                    "Failed to issue latest head to engine";
                    "error" => ?e,
                );
            }
        } else {
            debug!(
                self.log,
                "No head, not sending to engine";
            );
        }
    }

    /// Returns `true` if the engine has a "synced" status.
    pub async fn is_synced(&self) -> bool {
        **self.state.read().await == EngineStateInternal::Synced
    }

    /// Run the `EngineApi::upcheck` function if the node's last known state is not synced. This
    /// might be used to recover the node if offline.
    pub async fn upcheck(&self) {
        let state: EngineStateInternal = match self.api.upcheck().await {
            Ok(()) => {
                let mut state = self.state.write().await;
                if **state != EngineStateInternal::Synced {
                    info!(
                        self.log,
                        "Execution engine online";
                    );

                    // Send the node our latest forkchoice_state.
                    self.send_latest_forkchoice_state().await;
                } else {
                    debug!(
                        self.log,
                        "Execution engine online";
                    );
                }
                state.update(EngineStateInternal::Synced);
                **state
            }
            Err(EngineApiError::IsSyncing) => {
                let mut state = self.state.write().await;
                state.update(EngineStateInternal::Syncing);
                **state
            }
            Err(EngineApiError::Auth(err)) => {
                error!(
                    self.log,
                    "Failed jwt authorization";
                    "error" => ?err,
                );
                let mut state = self.state.write().await;
                state.update(EngineStateInternal::AuthFailed);
                **state
            }
            Err(e) => {
                error!(
                    self.log,
                    "Error during execution engine upcheck";
                    "error" => ?e,
                );
                let mut state = self.state.write().await;
                state.update(EngineStateInternal::Offline);
                **state
            }
        };

        debug!(
            self.log,
            "Execution engine upcheck complete";
            "state" => ?state,
        );
    }

    /// Run `func` on the node regardless of the node's current state.
    ///
    /// ## Note
    ///
    /// This function takes locks on `self.state`; holding a conflicting lock might cause a
    /// deadlock.
    pub async fn request<'a, F, G, H>(self: &'a Arc<Self>, func: F) -> Result<H, EngineError>
    where
        F: Fn(&'a Engine) -> G,
        G: Future<Output = Result<H, EngineApiError>>,
    {
        match func(self).await {
            Ok(result) => {
                // Take a clone *without* holding the read-lock since the `upcheck` function will
                // take a write-lock.
                let state: EngineStateInternal = **self.state.read().await;

                // Keep an up to date engine state.
                if state != EngineStateInternal::Synced {
                    // Spawn the upcheck in another task to avoid slowing down this request.
                    let inner_self = self.clone();
                    self.executor.spawn(
                        async move { inner_self.upcheck().await },
                        "upcheck_after_success",
                    );
                }

                Ok(result)
            }
            Err(error) => {
                error!(
                    self.log,
                    "Execution engine call failed";
                    "error" => ?error,
                );

                // The node just returned an error, run an upcheck so we can update the endpoint
                // state.
                //
                // Spawn the upcheck in another task to avoid slowing down this request.
                let inner_self = self.clone();
                self.executor.spawn(
                    async move { inner_self.upcheck().await },
                    "upcheck_after_error",
                );

                Err(EngineError::Api { error })
            }
        }
    }
}

impl PayloadIdCacheKey {
    fn new(state: &ForkChoiceState, attributes: &PayloadAttributes) -> Self {
        Self {
            head_block_hash: state.head_block_hash,
            timestamp: attributes.timestamp,
            prev_randao: attributes.prev_randao,
            suggested_fee_recipient: attributes.suggested_fee_recipient,
        }
    }
}

#[cfg(test)]
mod tests {
    use super::*;
    use tokio_stream::StreamExt;

    #[tokio::test]
    async fn test_state_notifier() {
        let mut state = State::default();
        let initial_state: EngineState = state.state.into();
        assert_eq!(initial_state, EngineState::Offline);
        state.update(EngineStateInternal::Synced);

        // A watcher that arrives after the first update.
        let mut watcher = state.watch();
        let new_state = watcher.next().await.expect("Last state is always present");
        assert_eq!(new_state, EngineState::Online);
    }
}