lighthouse

Author	SHA1	Message	Date
realbigsean	33dd13c798	Refactor deneb block processing (#4511 ) * Revert "fix merge" This reverts commit `405e95b0ce`. * refactor deneb block processing * cargo fmt * fix ci	2023-07-25 10:51:10 -04:00
realbigsean	0f514cbb36	fixes after merge	2023-07-17 09:50:32 -04:00
realbigsean	405e95b0ce	fix merge	2023-07-14 16:15:28 -04:00
realbigsean	42f54ee561	fix merge conflict issues	2023-07-14 16:01:57 -04:00
realbigsean	a6f48f5ecb	Merge branch 'unstable' of https://github.com/sigp/lighthouse into merge-unstable-deneb-june-6th	2023-07-12 13:05:30 -04:00
Paul Hauner	c25825a539	Move the `BeaconProcessor` into a new crate (#4435 ) Replaces #4434. It is identical, but this PR has a smaller diff due to a curated commit history. ## Issue Addressed NA ## Proposed Changes This PR moves the scheduling logic for the `BeaconProcessor` into a new crate in `beacon_node/beacon_processor`. Previously it existed in the `beacon_node/network` crate. This addresses a circular-dependency problem where it's not possible to use the `BeaconProcessor` from the `beacon_chain` crate. The `network` crate depends on the `beacon_chain` crate (`network -> beacon_chain`), but importing the `BeaconProcessor` into the `beacon_chain` crate would create a circular dependancy of `beacon_chain -> network`. The `BeaconProcessor` was designed to provide queuing and prioritized scheduling for messages from the network. It has proven to be quite valuable and I believe we'd make Lighthouse more stable and effective by using it elsewhere. In particular, I think we should use the `BeaconProcessor` for: 1. HTTP API requests. 1. Scheduled tasks in the `BeaconChain` (e.g., state advance). Using the `BeaconProcessor` for these tasks would help prevent the BN from becoming overwhelmed and would also help it to prioritize operations (e.g., choosing to process blocks from gossip before responding to low-priority HTTP API requests). ## Additional Info This PR is intended to have zero impact on runtime behaviour. It aims to simply separate the scheduling code (i.e., the `BeaconProcessor`) from the business logic in the `network` crate (i.e., the `Worker` impls). Future PRs (see #4462) can build upon these works to actually use the `BeaconProcessor` for more operations. I've gone to some effort to use `git mv` to make the diff look more like "file was moved and modified" rather than "file was deleted and a new one added". This should reduce review burden and help maintain commit attribution.	2023-07-10 07:45:54 +00:00
realbigsean	cfe2452533	Merge branch 'remove-into-gossip-verified-block' of https://github.com/realbigsean/lighthouse into merge-unstable-deneb-june-6th	2023-07-06 16:51:35 -04:00
Eitan Seri-Levi	edd093293a	added debounce to log (#4269 ) ## Issue Addressed [#4259](https://github.com/sigp/lighthouse/issues/4259) ## Proposed Changes debounce spammy `Unable to send message to the beacon processor` log messages ## Additional Info We could potentially debounce other logs that have the potential to be "spammy". After some feedback we decided to additionally add the following change: create a newtype wrapper around `mpsc::Sender<BeaconWorkEvent<T>>`. When there is an error on the try_send method on the wrapper, we increase a counter metric with one label per work type.	2023-06-30 01:13:03 +00:00
Jimmy Chen	d1146ec8b5	Sync finalized sync to 2 epochs + 1 slot past our peer's finalized slot in order to finalize the chain locally	2023-06-28 16:15:37 +10:00
realbigsean	a62e52f319	Single blob lookups (#4152 ) * some blob reprocessing work * remove ForceBlockLookup * reorder enum match arms in sync manager * a lot more reprocessing work * impl logic for triggerng blob lookups along with block lookups * deal with rpc blobs in groups per block in the da checker. don't cache missing blob ids in the da checker. * make single block lookup generic * more work * add delayed processing logic and combine some requests * start fixing some compile errors * fix compilation in main block lookup mod * much work * get things compiling * parent blob lookups * fix compile * revert red/stevie changes * fix up sync manager delay message logic * add peer usefulness enum * should remove lookup refactor * consolidate retry error handling * improve peer scoring during certain failures in parent lookups * improve retry code * drop parent lookup if either req has a peer disconnect during download * refactor single block processed method * processing peer refactor * smol bugfix * fix some todos * fix lints * fix lints * fix compile in lookup tests * fix lints * fix lints * fix existing block lookup tests * renamings * fix after merge * cargo fmt * compilation fix in beacon chain tests * fix * refactor lookup tests to work with multiple forks and response types * make tests into macros * wrap availability check error * fix compile after merge * add random blobs * start fixing up lookup verify error handling * some bug fixes and the start of deneb only tests * make tests work for all forks * track information about peer source * error refactoring * improve peer scoring * fix test compilation * make sure blobs are sent for processing after stream termination, delete copied tests * add some tests and fix a bug * smol bugfixes and moar tests * add tests and fix some things * compile after merge * lots of refactoring * retry on invalid block/blob * merge unknown parent messages before current slot lookup * get tests compiling * penalize blob peer on invalid blobs * Check disk on in-memory cache miss * Update beacon_node/beacon_chain/src/data_availability_checker/overflow_lru_cache.rs * Update beacon_node/network/src/sync/network_context.rs Co-authored-by: Divma <26765164+divagant-martian@users.noreply.github.com> * fix bug in matching blocks and blobs in range sync * pr feedback * fix conflicts * upgrade logs from warn to crit when we receive incorrect response in range * synced_and_connected_within_tolerance -> should_search_for_block * remove todo * Fix Broken Overflow Tests * fix merge conflicts * checkpoint sync without alignment * add import * query for checkpoint state by slot rather than state root (teku doesn't serve by state root) * get state first and query by most recent block root * simplify delay logic * rename unknown parent sync message variants * rename parameter, block_slot -> slot * add some docs to the lookup module * use interval instead of sleep * drop request if blocks and blobs requests both return `None` for `Id` * clean up `find_single_lookup` logic * add lookup source enum * clean up `find_single_lookup` logic * add docs to find_single_lookup_request * move LookupSource our of param where unnecessary * remove unnecessary todo * query for block by `state.latest_block_header.slot` * fix lint * fix test * fix test * fix observed blob sidecars test * PR updates * use optional params instead of a closure * create lookup and trigger request in separate method calls * remove `LookupSource` * make sure duplicate lookups are not dropped --------- Co-authored-by: Pawan Dhananjay <pawandhananjay@gmail.com> Co-authored-by: Mark Mackey <mark@sigmaprime.io> Co-authored-by: Divma <26765164+divagant-martian@users.noreply.github.com>	2023-06-15 12:59:10 -04:00
Jimmy Chen	70c4ae35ab	Merge branch 'unstable' into deneb-free-blobs # Conflicts: # .github/workflows/docker.yml # .github/workflows/local-testnet.yml # .github/workflows/test-suite.yml # Cargo.lock # Cargo.toml # beacon_node/beacon_chain/src/beacon_chain.rs # beacon_node/beacon_chain/src/builder.rs # beacon_node/beacon_chain/src/test_utils.rs # beacon_node/execution_layer/src/engine_api/json_structures.rs # beacon_node/network/src/beacon_processor/mod.rs # beacon_node/network/src/beacon_processor/worker/gossip_methods.rs # beacon_node/network/src/sync/backfill_sync/mod.rs # beacon_node/store/src/config.rs # beacon_node/store/src/hot_cold_store.rs # common/eth2_network_config/Cargo.toml # consensus/ssz/src/decode/impls.rs # consensus/ssz_derive/src/lib.rs # consensus/ssz_derive/tests/tests.rs # consensus/ssz_types/src/serde_utils/mod.rs # consensus/tree_hash/src/impls.rs # consensus/tree_hash/src/lib.rs # consensus/types/Cargo.toml # consensus/types/src/beacon_state.rs # consensus/types/src/chain_spec.rs # consensus/types/src/eth_spec.rs # consensus/types/src/fork_name.rs # lcli/Cargo.toml # lcli/src/main.rs # lcli/src/new_testnet.rs # scripts/local_testnet/el_bootnode.sh # scripts/local_testnet/genesis.json # scripts/local_testnet/geth.sh # scripts/local_testnet/setup.sh # scripts/local_testnet/start_local_testnet.sh # scripts/local_testnet/vars.env # scripts/tests/doppelganger_protection.sh # scripts/tests/genesis.json # scripts/tests/vars.env # testing/ef_tests/Cargo.toml # validator_client/src/block_service.rs	2023-05-30 22:44:05 +10:00
Age Manning	616bee6757	Maintain trusted peers (#4159 ) ## Issue Addressed #4150 ## Proposed Changes Maintain trusted peers in the pruning logic. ~~In principle the changes here are not necessary as a trusted peer has a max score (100) and all other peers can have at most 0 (because we don't implement positive scores). This means that we should never prune trusted peers unless we have more trusted peers than the target peer count.~~ This change shifts this logic to explicitly never prune trusted peers which I expect is the intuitive behaviour. ~~I suspect the issue in #4150 arises when a trusted peer disconnects from us for one reason or another and then we remove that peer from our peerdb as it becomes stale. When it re-connects at some large time later, it is no longer a trusted peer.~~ Currently we do disconnect trusted peers, and this PR corrects this to maintain trusted peers in the pruning logic. As suggested in #4150 we maintain trusted peers in the db and thus we remember them even if they disconnect from us.	2023-05-03 04:12:10 +00:00
Emilia Hane	2672cf40bb	Better fix for debug tests	2023-02-15 11:47:56 +01:00
Emilia Hane	9e4abc79fb	Comment out tests that use system time	2023-02-14 14:12:50 +01:00
Emilia Hane	73c7ad73b8	Disable use of system time in tests	2023-02-14 13:33:38 +01:00
Emilia Hane	6beca6defc	Fix range sync tests	2023-02-10 09:41:24 +01:00
Emilia Hane	8365d76277	fixup! Debug tests	2023-02-10 09:39:22 +01:00
Emilia Hane	16cb9cfca2	fixup! Debug tests	2023-02-10 09:39:22 +01:00
Emilia Hane	7220f35ff6	Debug tests	2023-02-10 09:39:21 +01:00
Emilia Hane	3676ce78b5	Fix rebase conflicts	2023-02-10 09:39:21 +01:00
realbigsean	cbd09dc281	finish refactor	2023-01-21 04:48:25 -05:00
realbigsean	8a70d80a2f	Revert "Revert "renames, remove , wrap BlockWrapper enum to make descontruction private"" This reverts commit `1931a442dc`.	2022-12-28 10:31:18 -05:00
realbigsean	1931a442dc	Revert "renames, remove , wrap BlockWrapper enum to make descontruction private" This reverts commit `5b3b34a9d7`.	2022-12-28 10:30:36 -05:00
realbigsean	5b3b34a9d7	renames, remove , wrap BlockWrapper enum to make descontruction private	2022-12-28 10:28:45 -05:00
Divma	240854750c	cleanup: remove unused imports, unusued fields (#3834 )	2022-12-23 17:16:10 -05:00
Diva M	cd6655dba9	handle no blobs from peers instead of empty blobs in range requests	2022-12-22 17:30:04 -05:00
realbigsean	a0d4aecf30	requests block + blob always post eip4844	2022-12-07 15:30:08 -05:00
realbigsean	2157d91b43	process single block and blob	2022-11-30 11:51:18 -05:00
Divma	bf5005244e	Blob syncing (#24 ) * add a rt is_blob_batch * use the mixed type everywhere * glue * more glue * minor fixes * fix range tests * filling in the gaps * moore filling in the gaps	2022-11-24 07:45:38 -05:00
realbigsean	8d45e48775	cargo fix	2022-10-03 21:52:16 -04:00
realbigsean	fe6fc55449	fix compilation errors, rename capella -> shanghai, cleanup some rebase issues	2022-09-29 12:43:13 -04:00
realbigsean	ebc0ccd02a	some more sync boilerplate	2022-09-29 12:34:09 -04:00
Divma	8c69d57c2c	Pause sync when EE is offline (#3428 ) ## Issue Addressed #3032 ## Proposed Changes Pause sync when ee is offline. Changes include three main parts: - Online/offline notification system - Pause sync - Resume sync #### Online/offline notification system - The engine state is now guarded behind a new struct `State` that ensures every change is correctly notified. Notifications are only sent if the state changes. The new `State` is behind a `RwLock` (as before) as the synchronization mechanism. - The actual notification channel is a [tokio::sync::watch](https://docs.rs/tokio/latest/tokio/sync/watch/index.html) which ensures only the last value is in the receiver channel. This way we don't need to worry about message order etc. - Sync waits for state changes concurrently with normal messages. #### Pause Sync Sync has four components, pausing is done differently in each: - Block lookups: Disabled while in this state. We drop current requests and don't search for new blocks. Block lookups are infrequent and I don't think it's worth the extra logic of keeping these and delaying processing. If we later see that this is required, we can add it. - Parent lookups: Disabled while in this state. We drop current requests and don't search for new parents. Parent lookups are even less frequent and I don't think it's worth the extra logic of keeping these and delaying processing. If we later see that this is required, we can add it. - Range: Chains don't send batches for processing to the beacon processor. This is easily done by guarding the channel to the beacon processor and giving it access only if the ee is responsive. I find this the simplest and most powerful approach since we don't need to deal with new sync states and chain segments that are added while the ee is offline will follow the same logic without needing to synchronize a shared state among those. Another advantage of passive pause vs active pause is that we can still keep track of active advertised chain segments so that on resume we don't need to re-evaluate all our peers. - Backfill: Not affected by ee states, we don't pause. #### Resume Sync - Block lookups: Enabled again. - Parent lookups: Enabled again. - Range: Active resume. Since the only real pause range does is not sending batches for processing, resume makes all chains that are holding read-for-processing batches send them. - Backfill: Not affected by ee states, no need to resume. ## Additional Info QUESTION: Originally I made this to notify and change on synced state, but @pawanjay176 on talks with @paulhauner concluded we only need to check online/offline states. The upcheck function mentions extra checks to have a very up to date sync status to aid the networking stack. However, the only need the networking stack would have is this one. I added a TODO to review if the extra check can be removed Next gen of #3094 Will work best with #3439 Co-authored-by: Pawan Dhananjay <pawandhananjay@gmail.com>	2022-08-24 23:34:56 +00:00
Divma	f4ffa9e0b4	Handle processing results of non faulty batches (#3439 ) ## Issue Addressed Solves #3390 So after checking some logs @pawanjay176 got, we conclude that this happened because we blacklisted a chain after trying it "too much". Now here, in all occurrences it seems that "too much" means we got too many download failures. This happened very slowly, exactly because the batch is allowed to stay alive for very long times after not counting penalties when the ee is offline. The error here then was not that the batch failed because of offline ee errors, but that we blacklisted a chain because of download errors, which we can't pin on the chain but on the peer. This PR fixes that. ## Proposed Changes Adds a missing piece of logic so that if a chain fails for errors that can't be attributed to an objectively bad behavior from the peer, it is not blacklisted. The issue at hand occurred when new peers arrived claiming a head that had wrongfully blacklisted, even if the original peers participating in the chain were not penalized. Another notable change is that we need to consider a batch invalid if it processed correctly but its next non empty batch fails processing. Now since a batch can fail processing in non empty ways, there is no need to mark as invalid previous batches. Improves some logging as well. ## Additional Info We should do this regardless of pausing sync on ee offline/unsynced state. This is because I think it's almost impossible to ensure a processing result will reach in a predictable order with a synced notification from the ee. Doing this handles what I think are inevitable data races when we actually pause sync This also fixes a return that reports which batch failed and caused us some confusion checking the logs	2022-08-12 00:56:38 +00:00
Paul Hauner	be4e261e74	Use async code when interacting with EL (#3244 ) ## Overview This rather extensive PR achieves two primary goals: 1. Uses the finalized/justified checkpoints of fork choice (FC), rather than that of the head state. 2. Refactors fork choice, block production and block processing to `async` functions. Additionally, it achieves: - Concurrent forkchoice updates to the EL and cache pruning after a new head is selected. - Concurrent "block packing" (attestations, etc) and execution payload retrieval during block production. - Concurrent per-block-processing and execution payload verification during block processing. - The `Arc`-ification of `SignedBeaconBlock` during block processing (it's never mutated, so why not?): - I had to do this to deal with sending blocks into spawned tasks. - Previously we were cloning the beacon block at least 2 times during each block processing, these clones are either removed or turned into cheaper `Arc` clones. - We were also `Box`-ing and un-`Box`-ing beacon blocks as they moved throughout the networking crate. This is not a big deal, but it's nice to avoid shifting things between the stack and heap. - Avoids cloning all the blocks in every chain segment during sync. - It also has the potential to clean up our code where we need to pass an owned block around so we can send it back in the case of an error (I didn't do much of this, my PR is already big enough 😅) - The `BeaconChain::HeadSafetyStatus` struct was removed. It was an old relic from prior merge specs. For motivation for this change, see https://github.com/sigp/lighthouse/pull/3244#issuecomment-1160963273 ## Changes to `canonical_head` and `fork_choice` Previously, the `BeaconChain` had two separate fields: ``` canonical_head: RwLock<Snapshot>, fork_choice: RwLock<BeaconForkChoice> ``` Now, we have grouped these values under a single struct: ``` canonical_head: CanonicalHead { cached_head: RwLock<Arc<Snapshot>>, fork_choice: RwLock<BeaconForkChoice> } ``` Apart from ergonomics, the only actual change here is wrapping the canonical head snapshot in an `Arc`. This means that we no longer need to hold the `cached_head` (`canonical_head`, in old terms) lock when we want to pull some values from it. This was done to avoid deadlock risks by preventing functions from acquiring (and holding) the `cached_head` and `fork_choice` locks simultaneously. ## Breaking Changes ### The `state` (root) field in the `finalized_checkpoint` SSE event Consider the scenario where epoch `n` is just finalized, but `start_slot(n)` is skipped. There are two state roots we might in the `finalized_checkpoint` SSE event: 1. The state root of the finalized block, which is `get_block(finalized_checkpoint.root).state_root`. 4. The state root at slot of `start_slot(n)`, which would be the state from (1), but "skipped forward" through any skip slots. Previously, Lighthouse would choose (2). However, we can see that when [Teku generates that event](`de2b2801c8/data/beaconrestapi/src/main/java/tech/pegasys/teku/beaconrestapi/handlers/v1/events/EventSubscriptionManager.java (L171-L182)`) it uses [`getStateRootFromBlockRoot`](`de2b2801c8/data/provider/src/main/java/tech/pegasys/teku/api/ChainDataProvider.java (L336-L341)`) which uses (1). I have switched Lighthouse from (2) to (1). I think it's a somewhat arbitrary choice between the two, where (1) is easier to compute and is consistent with Teku. ## Notes for Reviewers I've renamed `BeaconChain::fork_choice` to `BeaconChain::recompute_head`. Doing this helped ensure I broke all previous uses of fork choice and I also find it more descriptive. It describes an action and can't be confused with trying to get a reference to the `ForkChoice` struct. I've changed the ordering of SSE events when a block is received. It used to be `[block, finalized, head]` and now it's `[block, head, finalized]`. It was easier this way and I don't think we were making any promises about SSE event ordering so it's not "breaking". I've made it so fork choice will run when it's first constructed. I did this because I wanted to have a cached version of the last call to `get_head`. Ensuring `get_head` has been run at least once means that the cached values doesn't need to wrapped in an `Option`. This was fairly simple, it just involved passing a `slot` to the constructor so it knows when it's being run. When loading a fork choice from the store and a slot clock isn't handy I've just used the `slot` that was saved in the `fork_choice_store`. That seems like it would be a faithful representation of the slot when we saved it. I added the `genesis_time: u64` to the `BeaconChain`. It's small, constant and nice to have around. Since we're using FC for the fin/just checkpoints, we no longer get the `0x00..00` roots at genesis. You can see I had to remove a work-around in `ef-tests` here: b56be3bc2. I can't find any reason why this would be an issue, if anything I think it'll be better since the genesis-alias has caught us out a few times (0x00..00 isn't actually a real root). Edit: I did find a case where the `network` expected the 0x00..00 alias and patched it here: 3f26ac3e2. You'll notice a lot of changes in tests. Generally, tests should be functionally equivalent. Here are the things creating the most diff-noise in tests: - Changing tests to be `tokio::async` tests. - Adding `.await` to fork choice, block processing and block production functions. - Refactor of the `canonical_head` "API" provided by the `BeaconChain`. E.g., `chain.canonical_head.cached_head()` instead of `chain.canonical_head.read()`. - Wrapping `SignedBeaconBlock` in an `Arc`. - In the `beacon_chain/tests/block_verification`, we can't use the `lazy_static` `CHAIN_SEGMENT` variable anymore since it's generated with an async function. We just generate it in each test, not so efficient but hopefully insignificant. I had to disable `rayon` concurrent tests in the `fork_choice` tests. This is because the use of `rayon` and `block_on` was causing a panic. Co-authored-by: Mac L <mjladson@pm.me>	2022-07-03 05:36:50 +00:00
Divma	7366266bd1	keep failed finalized chains to avoid retries (#3142 ) ## Issue Addressed In very rare occasions we've seen most if not all our peers in a chain with which we don't agree. Purging these peers can take a very long time: number of retries of the chain. Meanwhile sync is caught in a loop trying the chain again and again. This makes it so that we fast track purging peers via registering the failed chain to prevent retrying for some time (30 seconds). Longer times could be dangerous since a chain can fail if a batch fails to download for example. In this case, I think it's still acceptable to fast track purging peers since they are nor providing the required info anyway Co-authored-by: Divma <26765164+divagant-martian@users.noreply.github.com>	2022-04-13 01:10:55 +00:00
Divma	4bf1af4e85	Custom RPC request management for sync (#3029 ) ## Proposed Changes Make `lighthouse_network` generic over request ids, now usable by sync	2022-03-02 22:07:17 +00:00
Divma	56d596ee42	Unban peers at the swarm level when purged (#2855 ) ## Issue Addressed #2840	2021-12-20 23:45:21 +00:00
Divma	413b0b5b2b	Correctly update range status when outdated chains are removed (#2827 ) We were batch removing chains when purging, and then updating the status of the collection for each of those. This makes the range status be out of sync with the real status. This represented no harm to the global sync status, but I've changed it to comply with a correct debug assertion that I got triggered while doing some testing. Also added tests and improved code quality as per @paulhauner 's suggestions.	2021-11-26 01:13:49 +00:00
Divma	31386277c3	Sync wrong dbg assertion (#2821 ) ## Issue Addressed Running a beacon node I triggered a sync debug panic. And so finally the time to create tests for sync arrived. Fortunately, te bug was not in the sync algorithm itself but a wrong assertion ## Proposed Changes - Split Range's impl from the BeaconChain via a trait. This is needed for testing. The TestingRig/Harness is way bigger than needed and does not provide the modification functionalities that are needed to test sync. I find this simpler, tho some could disagree. - Add a regression test for sync that fails before the changes. - Fix the wrong assertion.	2021-11-19 02:38:25 +00:00
Age Manning	df40700ddd	Rename eth2_libp2p to lighthouse_network (#2702 ) ## Description The `eth2_libp2p` crate was originally named and designed to incorporate a simple libp2p integration into lighthouse. Since its origins the crates purpose has expanded dramatically. It now houses a lot more sophistication that is specific to lighthouse and no longer just a libp2p integration. As of this writing it currently houses the following high-level lighthouse-specific logic: - Lighthouse's implementation of the eth2 RPC protocol and specific encodings/decodings - Integration and handling of ENRs with respect to libp2p and eth2 - Lighthouse's discovery logic, its integration with discv5 and logic about searching and handling peers. - Lighthouse's peer manager - This is a large module handling various aspects of Lighthouse's network, such as peer scoring, handling pings and metadata, connection maintenance and recording, etc. - Lighthouse's peer database - This is a collection of information stored for each individual peer which is specific to lighthouse. We store connection state, sync state, last seen ips and scores etc. The data stored for each peer is designed for various elements of the lighthouse code base such as syncing and the http api. - Gossipsub scoring - This stores a collection of gossipsub 1.1 scoring mechanisms that are continuously analyssed and updated based on the ethereum 2 networks and how Lighthouse performs on these networks. - Lighthouse specific types for managing gossipsub topics, sync status and ENR fields - Lighthouse's network HTTP API metrics - A collection of metrics for lighthouse network monitoring - Lighthouse's custom configuration of all networking protocols, RPC, gossipsub, discovery, identify and libp2p. Therefore it makes sense to rename the crate to be more akin to its current purposes, simply that it manages the majority of Lighthouse's network stack. This PR renames this crate to `lighthouse_network` Co-authored-by: Paul Hauner <paul@paulhauner.com>	2021-10-19 00:30:39 +00:00
Michael Sproul	9667dc2f03	Implement checkpoint sync (#2244 ) ## Issue Addressed Closes #1891 Closes #1784 ## Proposed Changes Implement checkpoint sync for Lighthouse, enabling it to start from a weak subjectivity checkpoint. ## Additional Info - [x] Return unavailable status for out-of-range blocks requested by peers (#2561) - [x] Implement sync daemon for fetching historical blocks (#2561) - [x] Verify chain hashes (either in `historical_blocks.rs` or the calling module) - [x] Consistency check for initial block + state - [x] Fetch the initial state and block from a beacon node HTTP endpoint - [x] Don't crash fetching beacon states by slot from the API - [x] Background service for state reconstruction, triggered by CLI flag or API call. Considered out of scope for this PR: - Drop the requirement to provide the `--checkpoint-block` (this would require some pretty heavy refactoring of block verification) Co-authored-by: Diva M <divma@protonmail.com>	2021-09-22 00:37:28 +00:00
Paul Hauner	a764c3b247	Handle early blocks (#2155 ) ## Issue Addressed NA ## Problem this PR addresses There's an issue where Lighthouse is banning a lot of peers due to the following sequence of events: 1. Gossip block 0xabc arrives ~200ms early - It is propagated across the network, with respect to [`MAXIMUM_GOSSIP_CLOCK_DISPARITY`](https://github.com/ethereum/eth2.0-specs/blob/v1.0.0/specs/phase0/p2p-interface.md#why-is-there-maximum_gossip_clock_disparity-when-validating-slot-ranges-of-messages-in-gossip-subnets). - However, it is not imported to our database since the block is early. 2. Attestations for 0xabc arrive, but the block was not imported. - The peer that sent the attestation is down-voted. - Each unknown-block attestation causes a score loss of 1, the peer is banned at -100. - When the peer is on an attestation subnet there can be hundreds of attestations, so the peer is banned quickly (before the missed block can be obtained via rpc). ## Potential solutions I can think of three solutions to this: 1. Wait for attestation-queuing (#635) to arrive and solve this. - Easy - Not immediate fix. - Whilst this would work, I don't think it's a perfect solution for this particular issue, rather (3) is better. 1. Allow importing blocks with a tolerance of `MAXIMUM_GOSSIP_CLOCK_DISPARITY`. - Easy - ~~I have implemented this, for now.~~ 1. If a block is verified for gossip propagation (i.e., signature verified) and it's within `MAXIMUM_GOSSIP_CLOCK_DISPARITY`, then queue it to be processed at the start of the appropriate slot. - More difficult - Feels like the best solution, I will try to implement this. This PR takes approach (3). ## Changes included - Implement the `block_delay_queue`, based upon a [`DelayQueue`](https://docs.rs/tokio-util/0.6.3/tokio_util/time/delay_queue/struct.DelayQueue.html) which can store blocks until it's time to import them. - Add a new `DelayedImportBlock` variant to the `beacon_processor::WorkEvent` enum to handle this new event. - In the `BeaconProcessor`, refactor a `tokio::select!` to a struct with an explicit `Stream` implementation. I experienced some issues with `tokio::select!` in the block delay queue and I also found it hard to debug. I think this explicit implementation is nicer and functionally equivalent (apart from the fact that `tokio::select!` randomly chooses futures to poll, whereas now we're deterministic). - Add a testing framework to the `beacon_processor` module that tests this new block delay logic. I also tested a handful of other operations in the beacon processor (attns, slashings, exits) since it was super easy to copy-pasta the code from the `http_api` tester. - To implement these tests I added the concept of an optional `work_journal_tx` to the `BeaconProcessor` which will spit out a log of events. I used this in the tests to ensure that things were happening as I expect. - The tests are a little racey, but it's hard to avoid that when testing timing-based code. If we see CI failures I can revise. I haven't observed any failures due to races on my machine or on CI yet. - To assist with testing I allowed for directly setting the time on the `ManualSlotClock`. - I gave the `beacon_processor::Worker` a `Toolbox` for two reasons; (a) it avoids changing tons of function sigs when you want to pass a new object to the worker and (b) it seemed cute.	2021-02-24 03:08:52 +00:00
divma	8fcd22992c	No string in slog (#2017 ) ## Issue Addressed Following slog's documentation, this should help a bit with string allocations. I left it run for two days and mem usage is lower. This is of course anecdotal, but shouldn't harm anyway ## Proposed Changes remove `String` creation in logs when possible	2020-11-30 10:33:00 +00:00
divma	d727e55abe	Move some rpc processing to the beacon_processor (#1936 ) ## Issue Addressed `BlocksByRange` requests were the main culprit of a series of timeouts to peer's requests in general because they produce build up in the router's processor. Those were moved to the blocking executor but a task is being spawned for each; also not ideal since the amount of resources we give to those is not controlled ## Proposed Changes - Move `BlocksByRange` and `BlocksByRoots` to the `beacon_processor`. The processor crafts the responses and sends them. - Move too the processing of `StatusMessage`s from other peers. This is a fast operation but it can also build up and won't scale if we keep it in the router (processing one at the time). These don't need to send an answer, so there is no harm in processing them "later" if that were to happen. Sending responses to status requests is still in the router, so we answer as soon as we see them. - Some "extras" that are basically clean up: - Split the `Worker` logic in sync methods (chain processing and rpc blocks), gossip methods (the majority of methods) and rpc methods (the new ones) - Move the `status_message` function previously provided by the router's processor to a more central place since it is used by the router, sync, network_context and beacon_processor - Some spelling ## Additional Info What's left to decide/test more thoroughly is the length of the queues and the priority rules. @paulhauner suggested at some point to put status above attestations, and @AgeManning had described an importance of "protecting gossipsub" so my solution is leaving status requests in the router and RPC methods below attestations. Slashings and Exits are at the end.	2020-11-19 23:33:44 +00:00
divma	8a16548715	Misc Peer sync info adjustments (#1896 ) ## Issue Addressed #1856 ## Proposed Changes - For clarity, the router's processor now only decides if a peer is compatible and it disconnects it or sends it to sync accordingly. No logic here regarding how useful is the peer. - Update peer_sync_info's rules - Add an `IrrelevantPeer` sync status to account for incompatible peers (maybe this should be "IncompatiblePeer" now that I think about it?) this state is update upon receiving an internal goodbye in the peer manager - Misc code cleanups - Reduce the need to create `StatusMessage`s (and thus, `Arc` accesses ) - Add missing calls to update the global sync state The overall effect should be: - More peers recognized as Behind, and less as Unknown - Peers identified as incompatible	2020-11-13 09:00:10 +00:00
divma	6c0c050fbb	Tweak head syncing (#1845 ) ## Issue Addressed Fixes head syncing ## Proposed Changes - Get back to statusing peers after removing chain segments and making the peer manager deal with status according to the Sync status, preventing an old known deadlock - Also a bug where a chain would get removed if the optimistic batch succeeds being empty ## Additional Info Tested on Medalla and looking good	2020-11-01 23:37:39 +00:00
divma	9f45ac2f5e	More sync edge cases + prettify range (#1834 ) ## Issue Addressed Sync edge case when we get an empty optimistic batch that passes validation and is inside the download buffer. Eventually the chain would reach the batch and treat it as an ugly state. ## Proposed Changes - Handle the edge case advancing the chain's target + code clarification - Some largey changes for readability + ergonomics since rust has try ops - Better handling of bad batch and chain states	2020-10-29 02:29:24 +00:00
divma	2acf75785c	More sync updates (#1791 ) ## Issue Addressed #1614 and a couple of sync-stalling problems, the most important is a cyclic dependency between the sync manager and the peer manager	2020-10-20 22:34:18 +00:00
Paul Hauner	da44821e39	Clean up obsolete TODOs (#1734 ) Squashed commit of the following: commit f99373cbaec9adb2bdbae3f7e903284327962083 Author: Age Manning <Age@AgeManning.com> Date: Mon Oct 5 18:44:09 2020 +1100 Clean up obsolute TODOs	2020-10-05 21:08:14 +11:00

1 2

78 Commits