## Issue Addressed

NA

## Proposed Changes

Missed head votes on attestations are a well-known issue. The primary cause is a block getting set as the head *after* the attestation deadline. This PR aims to shorten the overall time between "block received" and "block set as head" by:

1. Persisting the head and fork choice *after* setting the canonical head.
    - Informal measurements show this takes ~200ms.
1. Pruning the op pool *after* setting the canonical head.
1. No longer persisting the op pool to disk during `BeaconChain::fork_choice`.
    - Informal measurements show this can take up to 1.2s.

I also add some metrics to help measure the effect of these changes. Rough sketches of the reordering, the `Drop`-based persistence and the metrics are included at the end of this description.

Persistence changes like this run the risk of breaking assumptions downstream. However, I have considered these risks and I think we're fine here. I will describe my reasoning for each change.

## Reasoning

### Change 1: Persisting the head and fork choice *after* setting the canonical head

Although the function is called `persist_head_and_fork_choice`, it only persists:

- Fork choice
- Head tracker
- Genesis block root

Since `BeaconChain::fork_choice_internal` does not modify these values between the original time we were persisting them and the current time, I assert that the change I've made is non-substantial in terms of what ends up on-disk. There's the possibility that some *other* thread has modified fork choice in the extra time we've given it, but that's totally fine. Since the only time we *read* those values from disk is during startup, I assert that this has no impact during runtime.

### Change 2: Pruning the op pool after setting the canonical head

Similar to the argument above, we don't modify the op pool during `BeaconChain::fork_choice_internal`, so it shouldn't matter when we prune. This change should be non-substantial.

### Change 3: No longer persisting the op pool to disk during `BeaconChain::fork_choice`

This change *is* substantial. With the proposed changes, we'll only be persisting the op pool to disk when we shut down cleanly (i.e., when the `BeaconChain` gets dropped). This means we'll save disk IO and time during usual operation, but a `kill -9` or similar "crash" will probably result in an out-of-date op pool when we reboot. An out-of-date op pool can only have an impact when producing blocks or aggregate attestations/sync committees. I think it's pretty reasonable that a crash might result in an out-of-date op pool, since:

- Crashes are fairly rare. Practically the only time I see LH suffer a full crash is when the OOM killer shows up, and that's a very serious event.
- It's generally quite rare to produce a block/aggregate immediately after a reboot. Just a few slots of runtime is probably enough to have a decent-enough op pool again.

## Additional Info

Credits to @macladson for the timings referenced here.
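To make the reordering in changes 1 and 2 concrete, here is a minimal sketch of the new shape of `fork_choice_internal`. All the types and method bodies below are simplified stand-ins rather than the real Lighthouse implementations; the point is only that the cheap in-memory head update now happens *before* the slow disk work.

```rust
/// Hypothetical stand-in for the real `BeaconChain`; every method body here
/// is a placeholder, not Lighthouse's actual implementation.
struct BeaconChain;

impl BeaconChain {
    fn update_canonical_head(&self) {
        // Make the new block the head so attestation production sees it.
    }

    fn persist_head_and_fork_choice(&self) {
        // Write fork choice, the head tracker and the genesis block root to
        // disk. Informal measurements put this at ~200ms.
    }

    fn prune_op_pool(&self) {
        // Drop operations that no longer apply on top of the new head.
    }

    /// Sketch of the reordered tail of `fork_choice_internal`: the in-memory
    /// head update runs first, the slower persistence/pruning work afterwards.
    fn fork_choice_internal(&self) {
        self.update_canonical_head();

        // These used to run *before* the head update, pushing "block set as
        // head" past the attestation deadline on slow iterations.
        self.persist_head_and_fork_choice();
        self.prune_op_pool();
    }
}

fn main() {
    BeaconChain.fork_choice_internal();
}
```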
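Change 3 can be sketched the same way. Assuming a hypothetical `persist_to_disk` helper (the real persistence goes through the `store` crate), the op pool is now written out only when the `BeaconChain` is dropped during a clean shutdown:

```rust
use std::sync::Mutex;

/// Hypothetical stand-in for the real operation pool.
struct OperationPool;

impl OperationPool {
    fn persist_to_disk(&self) {
        // Serialise the pool and write it to the database. Informal
        // measurements show this can take up to 1.2s, which is why it no
        // longer runs on every `fork_choice` call.
    }
}

struct BeaconChain {
    op_pool: Mutex<OperationPool>,
}

impl Drop for BeaconChain {
    /// With this PR the op pool is persisted only on a clean shutdown, when
    /// the `BeaconChain` is dropped. A `kill -9` skips this, leaving a stale
    /// pool on disk for the next boot.
    fn drop(&mut self) {
        if let Ok(op_pool) = self.op_pool.lock() {
            op_pool.persist_to_disk();
        }
    }
}

fn main() {
    let chain = BeaconChain { op_pool: Mutex::new(OperationPool) };
    drop(chain); // Clean shutdown: the pool is written exactly once, here.
}
```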
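For the new metrics, Lighthouse has its own metrics wrappers, but the idea can be sketched with the underlying `prometheus` crate directly. The metric name below is hypothetical and does not match the names added in this PR:

```rust
use prometheus::{register_histogram, Histogram};

fn main() {
    // Hypothetical histogram for timing the head-update path.
    let set_head_seconds: Histogram = register_histogram!(
        "beacon_fork_choice_set_head_seconds",
        "Time between fork choice finding a head and it being set"
    )
    .expect("metric can be registered");

    // The timer records the elapsed time into the histogram when observed,
    // so wrapping a section like this measures its duration.
    let timer = set_head_seconds.start_timer();
    // ... the work being measured, e.g. persist_head_and_fork_choice ...
    timer.observe_duration();
}
```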