## Overview
There are forked chains which get referenced by blocks and attestations on a network. Typically if these chains are very long, we stop looking up the chain and downvote the peer. In extreme circumstances, many peers are on many chains, the chains can be very deep and become time consuming performing lookups.
This PR adds a cache to known failed chain lookups. This prevents us from starting a parent-lookup (or stopping one half way through) if we have attempted the chain lookup in the past.
## Description
Currently lighthouse load-balances across peers a single finalized chain. The chain is selected via the most peers. Once synced to the latest finalized epoch Lighthouse creates chains amongst its peers and syncs them all in parallel amongst each peer (grouped by their current head block).
This is typically fast and relatively efficient under normal operations. However if the chain has not finalized in a long time, the head chains can grow quite long. Peer's head chains will update every slot as new blocks are added to the head. Syncing all head chains in parallel is a bottleneck and highly inefficient in block duplication leads to RPC timeouts when attempting to handle all new heads chains at once.
This PR limits the parallelism of head syncing chains to 2. We now sync at most two head chains at a time. This allows for the possiblity of sync progressing alongside a peer being slow and holding up one chain via RPC timeouts.
The changes are somewhat simple but should solve two issues:
- When quickly changing between chains once and a second time back again, batchIds would collide and cause havoc.
- If we got an out of range response from a peer, sync would remain in syncing but without advancing
Changes:
- remove the batch id. Identify each batch (inside a chain) by its starting epoch. Target epochs for downloading and processing now advance by EPOCHS_PER_BATCH
- for the same reason, move the "to_be_downloaded_id" to be an epoch
- remove a sneaky line that dropped an out of range batch without downloading it
- bonus: put the chain_id in the log given to the chain. This is why explicitly logging the chain_id is removed
## Proposed Changes
To mitigate the impact of minority forks on RAM and disk usage, this change rejects blocks whose parent lies more than 320 slots (10 epochs, ~1 hour) in the past. The behaviour is configurable via `lighthouse bn --max-skip-slots N`, and can be turned off entirely using `--max-skip-slots none`.
Co-authored-by: Paul Hauner <paul@paulhauner.com>
## Issue Addressed
NA
## Proposed Changes
Moves beacon block processing over to the newly-added `GossipProcessor`. This moves the task off the core executor onto the blocking one.
## Additional Info
- With this PR, gossip blocks are being ignored during sync.
## Issue Addressed
#1384
Only catch, as currently implemented, when dialing the multiaddr nodes, there is no way to ask the peer manager if they are already connected or dialing
## Issue Addressed
N/A
## Proposed Changes
Introduces the `GossipProcessor`, a multi-threaded (multi-tasked?), non-blocking processor for some messages from the network which require verification and import into the `BeaconChain`.
Initial testing indicates that this massively improves system stability by (a) moving block tasks from the normal executor (b) spreading out attestation load.
## Additional Info
TBC
## Issue Addressed
#1028
A bit late, but I think if `BlockError` had a kind (the current `BlockError` minus everything on the variants that comes directly from the block) and the original block, more clones could be removed
## Issue Addressed
Sync was breaking occasionally. The root cause appears to be identify crashing as events we being sent to the protocol after nodes were banned. Have not been able to reproduce sync issues since this update.
## Proposed Changes
Only send messages to sub-behaviour protocols if the peer manager thinks the peer is connected. All other messages are dropped.
## Issue Addressed
Recurring sync loop and invalid batch downloading
## Proposed Changes
Shifts the batches to include the first slot of each epoch. This ensures the finalized is always downloaded once a chain has completed syncing.
Also add in logic to prevent re-dialing disconnected peers. Non-performant peers get disconnected during sync, this prevents re-connection to these during sync.
## Additional Info
N/A
## Issue Addressed
N/A
## Proposed Changes
This provides a number of corrections and improvements to gossipsub. Specifically
- Enables options for greater privacy around the message author
- Provides greater flexibility on message validation
- Prevents unvalidated messages from being gossiped
- Shifts the duplicate cache to a time-based cache inside gossipsub
- Updates the message-id to handle bytes
- Bug fixes related to mesh maintenance and topic subscription. This should improve our attestation inclusion rate.
## Issue Addressed
#1388 partially (eth2_libp2p & network)
## Proposed Changes
TLDR at the end
- *Complex types* are 3 on the handlers/Behaviours but the types are `Poll<ComplexType>` where `ComplexType` comes from the traits of libp2p. Those, I don't thing are worth an alias. A couple more were from using tokio combinators and were removed writing things the async way and using [`BoxFuture`](https://docs.rs/futures/0.3.5/futures/future/type.BoxFuture.html)
- The *cognitive complexity*.. I tried to address those before (they come from the poll functions too) and tbh they are cognitively simpler to understand the way they are now. Moving separate parts to functions doesn't add much since that code is not repeated and they all do early returns. If moved those returns would now need to be wrapped in an Option, probably, and checked to be returned again. I would leave them like that but that's just preference.
- *Too many arguments*: They are not easily put together in a wrapping struct since the parameters don't relate semantically (Ex: fn new with a log, a reference to the chain, a peer, etc) but some may differ.
- *Needless returns* were indeed needless
## Additional Info
TLDR: removed needless return, used BoxFuture and async, left the rest untouched since those lgtm
* Process exits and slashings off the network
* Fix rest_api tests
* Add op verification tests
* Add tests for pruning of slashings in the op pool
* Address Paul's review comments
* Update `milagro_bls` to new release (#1183)
* Update milagro_bls to new release
Signed-off-by: Kirk Baird <baird.k@outlook.com>
* Tidy up fake cryptos
Signed-off-by: Kirk Baird <baird.k@outlook.com>
* move SecretHash to bls and put plaintext back
Signed-off-by: Kirk Baird <baird.k@outlook.com>
* Update v0.12.0 to v0.12.1
* Add compute_subnet_for_attestation
* Replace CommitteeIndex topic with Attestation
* Fix warnings
* Fix attestation service tests
* fmt
* Appease clippy
* return error from validator_subscriptions
* move state out of loop
* Fix early break on error
* Get state from slot clock
* Fix beacon state in attestation tests
* Add failing test for lookahead > 1
* Minor change
* Address some review comments
* Add subnet verification to beacon chain
* Move subnet verification to processor
* Pass committee_count_at_slot to ValidatorDuty and ValidatorSubscription
* Pass subnet id for publishing attestations
* Fix attestation service tests
* Fix more tests
* Fix fork choice test
* Remove unused code
* Remove more unused and expensive code
Co-authored-by: Kirk Baird <baird.k@outlook.com>
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
Co-authored-by: Age Manning <Age@AgeManning.com>
Co-authored-by: Paul Hauner <paul@paulhauner.com>
* Layer do_atomically() abstractions properly
* Reduce allocs and DRY get_key_for_col()
* Parameterize HotColdDB with hot and cold item stores
* -impl Store for MemoryStore
* Replace Store uses with HotColdDB
* Ditch Store trait
* cargo fmt
* Style fix
* Readd missing dep that broke the build
* Update `milagro_bls` to new release (#1183)
* Update milagro_bls to new release
Signed-off-by: Kirk Baird <baird.k@outlook.com>
* Tidy up fake cryptos
Signed-off-by: Kirk Baird <baird.k@outlook.com>
* move SecretHash to bls and put plaintext back
Signed-off-by: Kirk Baird <baird.k@outlook.com>
* Update v0.12.0 to v0.12.1
* Use ssz types for Request and error types
* Fix errors
* Constrain BlocksByRangeRequest count to MAX_REQUEST_BLOCKS
* Fix issues after rebasing
* Address review comments
Co-authored-by: Kirk Baird <baird.k@outlook.com>
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
Co-authored-by: Age Manning <Age@AgeManning.com>
* add retry logic to peer discovery and an expiration time for peers
* Restructure discovery
* Add mac build to CI
* Always return an error for Health when not linux
* Change macos workflow
* Rename macos tests
* Update DiscoverPeers messages to pass Instants. Implement PartialEq for AttServiceMessage
* update discover peer queueing to always check existing messages and extend min_ttl as necessary
* update method name and comment
* Correct merge issues
* Add subnet id check to partialeq, fix discover peer message dups
* fix discover peer message dups
* fix discover peer message dups for real this time
Co-authored-by: Age Manning <Age@AgeManning.com>
Co-authored-by: Paul Hauner <paul@paulhauner.com>
* wip: mwake the request id optional
* make the request_id optional
* cleanup
* address clippy lints inside rpc
* WIP: Separate sent RPC events from received ones
* WIP: Separate sent RPC events from received ones
* cleanup
* Separate request ids from substream ids
* Make RPC's message handling independent of RequestIds
* Change behaviour RPC events to be more outside-crate friendly
* Propage changes across the network + router + processor
* Propage changes across the network + router + processor
* fmt
* "tiny" refactor
* more tiny refactors
* fmt eth2-libp2p
* wip: propagating changes
* wip: propagating changes
* cleaning up
* more cleanup
* fmt
* tests HOT fix
Co-authored-by: Age Manning <Age@AgeManning.com>
* Add logging on shutdown
* Replace tokio::spawn with handle.spawn
* Upgrade tokio
* Add a task executor
* Beacon chain tasks use task executor
* Validator client tasks use task executor
* Rename runtime_handle to executor
* Add duration histograms; minor fixes
* Cleanup
* Fix logs
* Fix tests
* Remove random file
* Get enr dependency instead of libp2p
* Address some review comments
* Libp2p takes a TaskExecutor
* Ugly fix libp2p tests
* Move TaskExecutor to own file
* Upgrade Dockerfile rust version
* Minor fixes
* Revert "Ugly fix libp2p tests"
This reverts commit 58d4bb690f52de28d893943b7504d2d0c6621429.
* Pretty fix libp2p tests
* Add spawn_without_exit; change Counter to Gauge
* Tidy
* Move log from RuntimeContext to TaskExecutor
* Fix errors
* Replace histogram with int_gauge for async tasks
* Fix todo
* Fix memory leak in test by exiting all spawned tasks at the end
* Remove redundant method
* Pull out a method out of a struct
* More precise db access abstractions
* Move fake trait method out of it
* cargo fmt
* Fix compilation error after refactoring
* Move another fake method out the Store trait
* Get rid of superfluous method
* Fix refactoring bug
* Rename: SimpleStoreItem -> StoreItem
* Get rid of the confusing DiskStore type alias
* Get rid of SimpleDiskStore type alias
* Correction: A method took both self and a ref to Self