lighthouse

Author	SHA1	Message	Date
Age Manning	059d9ec1b1	Gossipsub scoring improvements (#2391 ) * Tweak gossipsub parameters for improved scoring * Modify gossip history * Update settings * Make mesh window constant * Decrease the mesh message deliveries weight * Fmt	2021-07-15 16:43:18 +10:00
Age Manning	c62810b408	Update to Libp2p to 39.1 (#2448 ) * Adjust beacon node timeouts for validator client HTTP requests (#2352) Resolves #2313 Provide `BeaconNodeHttpClient` with a dedicated `Timeouts` struct. This will allow granular adjustment of the timeout duration for different calls made from the VC to the BN. These can be either a constant value or a ratio of the slot duration. Improve timeout performance by using these adjusted timeout durations only when a fallback endpoint is available. Add a CLI flag called `use-long-timeouts` to revert to the old behavior. Additionally set the default `BeaconNodeHttpClient` timeouts to be the slot duration of the network, rather than a constant 12 seconds. This will allow it to adjust to different network specifications. Co-authored-by: Paul Hauner <paul@paulhauner.com> * Use read_recursive locks in database (#2417) Closes #2245 Replace all calls to `RwLock::read` in the `store` crate with `RwLock::read_recursive`. * Unfortunately we can't run the deadlock detector on CI because it's pinned to an old Rust 1.51.0 nightly which cannot compile Lighthouse (one of our deps uses `ptr::addr_of!` which is too new). A fun side-project at some point might be to update the deadlock detector. * The reason I think we haven't seen this deadlock (at all?) in practice is that _writes_ to the database's split point are quite infrequent, and a concurrent write is required to trigger the deadlock. The split point is only written when finalization advances, which is once per epoch (every ~6 minutes), and state reads are also quite sporadic. Perhaps we've just been incredibly lucky, or there's something about the timing of state reads vs database migration that protects us. 
* I wrote a few small programs to demo the deadlock, and the effectiveness of the `read_recursive` fix: https://github.com/michaelsproul/relock_deadlock_mvp * [The docs for `read_recursive`](https://docs.rs/lock_api/0.4.2/lock_api/struct.RwLock.html#method.read_recursive) warn of starvation for writers. I think in order for starvation to occur the database would have to be spammed with so many state reads that it's unable to ever clear them all and find time for a write, in which case migration of states to the freezer would cease. If an attack could be performed to trigger this starvation then it would likely trigger a deadlock in the current code, and I think ceasing migration is preferable to deadlocking in this extreme situation. In practice neither should occur due to protection from spammy peers at the network layer. Nevertheless, it would be prudent to run this change on the testnet nodes to check that it doesn't cause accidental starvation. * Return more detail when invalid data is found in the DB during startup (#2445) - Resolves #2444 Adds some more detail to the error message returned when the `BeaconChainBuilder` is unable to access or decode block/state objects during startup. NA * Use hardware acceleration for SHA256 (#2426) Modify the SHA256 implementation in `eth2_hashing` so that it switches between `ring` and `sha2` to take advantage of [x86_64 SHA extensions](https://en.wikipedia.org/wiki/Intel_SHA_extensions). The extensions are available on modern Intel and AMD CPUs, and seem to provide a considerable speed-up: on my Ryzen 5950X it dropped state tree hashing times by about 30% from 35ms to 25ms (on Prater). The extensions became available in the `sha2` crate [last year](https://www.reddit.com/r/rust/comments/hf2vcx/ann_rustcryptos_sha1_and_sha2_now_support/), and are not available in Ring, which uses a [pure Rust implementation of sha2](https://github.com/briansmith/ring/blob/main/src/digest/sha2.rs). 
Ring is faster on CPUs that lack the extensions so I've implemented a runtime switch to use `sha2` only when the extensions are available. The runtime switching seems to impose a minuscule penalty (see the benchmarks linked below). * Start a release checklist (#2270) NA Add a checklist to the release draft created by CI. I know @michaelsproul was also working on this and I suspect @realbigsean also might have useful input. NA * Serious banning * fmt Co-authored-by: Mac L <mjladson@pm.me> Co-authored-by: Paul Hauner <paul@paulhauner.com> Co-authored-by: Michael Sproul <michael@sigmaprime.io>	2021-07-15 16:43:18 +10:00
Age Manning	3c0d3227ab	Global Network Behaviour Refactor (#2442 ) * Network upgrades (#2345) * Discovery patch (#2382) * Upgrade libp2p and unstable gossip * Network protocol upgrades * Correct dependencies, reduce incoming bucket limit * Clean up dirty DHT entries before repopulating * Update cargo lock * Update lockfile * Update ENR dep * Update deps to specific versions * Update test dependencies * Update docker rust, and remote signer tests * More remote signer test fixes * Temp commit * Update discovery * Remove cached enrs after dialing * Increase the session capacity, for improved efficiency * Bleeding edge discovery (#2435) * Update discovery banning logic and tokio * Update to latest discovery * Shift to latest discovery * Fmt * Initial re-factor of the behaviour * More progress * Missed changes * First draft * Discovery as a behaviour * Adding back event waker (not convinced it's necessary, but have made this many changes already) * Corrections * Speed up discovery * Remove double log * Fmt * After disconnect inform swarm about ban * More fmt * Appease clippy * Improve ban handling * Update tests * Update cargo.lock * Correct tests * Downgrade log	2021-07-15 16:43:17 +10:00
Pawan Dhananjay	64226321b3	Relax requirement for enr fork digest predicate (#2433 )	2021-07-15 16:43:17 +10:00
Age Manning	c1d2e35c9e	Bleeding edge discovery (#2435 ) * Update discovery banning logic and tokio * Update to latest discovery * Shift to latest discovery * Fmt	2021-07-15 16:43:17 +10:00
Age Manning	f4bc9db16d	Change the window mode of yamux (#2390 )	2021-07-15 16:43:17 +10:00
Age Manning	6fb48b45fa	Discovery patch (#2382 ) * Upgrade libp2p and unstable gossip * Network protocol upgrades * Correct dependencies, reduce incoming bucket limit * Clean up dirty DHT entries before repopulating * Update cargo lock * Update lockfile * Update ENR dep * Update deps to specific versions * Update test dependencies * Update docker rust, and remote signer tests * More remote signer test fixes * Temp commit * Update discovery * Remove cached enrs after dialing * Increase the session capacity, for improved efficiency	2021-07-15 16:43:17 +10:00
Age Manning	4aa06c9555	Network upgrades (#2345 )	2021-07-15 16:43:10 +10:00
Paul Hauner	b0f5c4c776	Clarify eth1 error message (#2461 ) ## Issue Addressed - Closes #2452 ## Proposed Changes Addresses: https://github.com/sigp/lighthouse/issues/2452#issuecomment-879873511 ## Additional Info NA	2021-07-15 04:22:06 +00:00
realbigsean	a3a7f39b0d	[Altair] Sync committee pools (#2321 ) Add pools supporting sync committees: - naive sync aggregation pool - observed sync contributions pool - observed sync contributors pool - observed sync aggregators pool Add SSZ types and tests related to sync committee signatures. Co-authored-by: Michael Sproul <michael@sigmaprime.io> Co-authored-by: realbigsean <seananderson33@gmail.com>	2021-07-15 00:52:02 +00:00
Paul Hauner	fc4c611476	Remove msg about longer sync with remote eth1 nodes (#2453 ) ## Issue Addressed - Resolves #2452 ## Proposed Changes I've seen a few people confused by this and I don't think the message is really worth it. ## Additional Info NA	2021-07-14 05:24:09 +00:00
divma	304fb05e44	Maintain attestations that reference unknown blocks (#2319 ) ## Issue Addressed #635 ## Proposed Changes - Keep attestations that reference a block we have not seen for 30secs before being re processed - If we do import the block before that time elapses, it is reprocessed in that moment - The first time it fails, do nothing wrt to gossipsub propagation or peer downscoring. If after being re processed it fails, downscore with a `LowToleranceError` and ignore the message.	2021-07-14 05:24:08 +00:00
Paul Hauner	9656ffee7c	Metrics for sync aggregate fullness (#2439 ) ## Issue Addressed NA ## Proposed Changes Adds a metric to see how many set bits are in the sync aggregate for each beacon block being imported. ## Additional Info NA	2021-07-13 02:22:55 +00:00
Paul Hauner	27aec1962c	Add more detail to "Prior attestation known" log (#2447 ) ## Issue Addressed NA ## Proposed Changes Adds more detail to the log when an attestation is ignored due to a prior one being known. This will help identify which validators are causing the issue. ## Additional Info NA	2021-07-13 01:02:03 +00:00
Paul Hauner	a7b7134abb	Return more detail when invalid data is found in the DB during startup (#2445 ) ## Issue Addressed - Resolves #2444 ## Proposed Changes Adds some more detail to the error message returned when the `BeaconChainBuilder` is unable to access or decode block/state objects during startup. ## Additional Info NA	2021-07-12 07:31:27 +00:00
Michael Sproul	371c216ac3	Use read_recursive locks in database (#2417 ) ## Issue Addressed Closes #2245 ## Proposed Changes Replace all calls to `RwLock::read` in the `store` crate with `RwLock::read_recursive`. ## Additional Info * Unfortunately we can't run the deadlock detector on CI because it's pinned to an old Rust 1.51.0 nightly which cannot compile Lighthouse (one of our deps uses `ptr::addr_of!` which is too new). A fun side-project at some point might be to update the deadlock detector. * The reason I think we haven't seen this deadlock (at all?) in practice is that _writes_ to the database's split point are quite infrequent, and a concurrent write is required to trigger the deadlock. The split point is only written when finalization advances, which is once per epoch (every ~6 minutes), and state reads are also quite sporadic. Perhaps we've just been incredibly lucky, or there's something about the timing of state reads vs database migration that protects us. * I wrote a few small programs to demo the deadlock, and the effectiveness of the `read_recursive` fix: https://github.com/michaelsproul/relock_deadlock_mvp * [The docs for `read_recursive`](https://docs.rs/lock_api/0.4.2/lock_api/struct.RwLock.html#method.read_recursive) warn of starvation for writers. I think in order for starvation to occur the database would have to be spammed with so many state reads that it's unable to ever clear them all and find time for a write, in which case migration of states to the freezer would cease. If an attack could be performed to trigger this starvation then it would likely trigger a deadlock in the current code, and I think ceasing migration is preferable to deadlocking in this extreme situation. In practice neither should occur due to protection from spammy peers at the network layer. Nevertheless, it would be prudent to run this change on the testnet nodes to check that it doesn't cause accidental starvation.	2021-07-12 07:31:26 +00:00
Mac L	b3c7e59a5b	Adjust beacon node timeouts for validator client HTTP requests (#2352 ) ## Issue Addressed Resolves #2313 ## Proposed Changes Provide `BeaconNodeHttpClient` with a dedicated `Timeouts` struct. This will allow granular adjustment of the timeout duration for different calls made from the VC to the BN. These can be either a constant value or a ratio of the slot duration. Improve timeout performance by using these adjusted timeout durations only when a fallback endpoint is available. Add a CLI flag called `use-long-timeouts` to revert to the old behavior. ## Additional Info Additionally set the default `BeaconNodeHttpClient` timeouts to be the slot duration of the network, rather than a constant 12 seconds. This will allow it to adjust to different network specifications. Co-authored-by: Paul Hauner <paul@paulhauner.com>	2021-07-12 01:47:48 +00:00
Michael Sproul	b4689e20c6	Altair consensus changes and refactors (#2279 ) ## Proposed Changes Implement the consensus changes necessary for the upcoming Altair hard fork. ## Additional Info This is quite a heavy refactor, with pivotal types like the `BeaconState` and `BeaconBlock` changing from structs to enums. This ripples through the whole codebase with field accesses changing to methods, e.g. `state.slot` => `state.slot()`. Co-authored-by: realbigsean <seananderson33@gmail.com>	2021-07-09 06:15:32 +00:00
Paul Hauner	78e5c0c157	Capture a missed VC error (#2436 ) ## Issue Addressed Related to #2430, #2394 ## Proposed Changes As per https://github.com/sigp/lighthouse/issues/2430#issuecomment-875323615, ensure that the `ProductionValidatorClient::new` error raises a log and shuts down the VC. Also, I implemented `spawn_ignoring_error`, as per @michaelsproul's suggestion in https://github.com/sigp/lighthouse/pull/2436#issuecomment-876084419. I got unlucky and CI picked up a [new rustsec vuln](https://rustsec.org/advisories/RUSTSEC-2021-0072). To fix this, I had to update the following crates: - `tokio` - `web3` - `tokio-compat-02` ## Additional Info NA	2021-07-09 03:20:24 +00:00
Mac L	406e3921d9	Use forwards iterator for state root lookups (#2422 ) ## Issue Addressed #2377 ## Proposed Changes Implement the same code used for block root lookups (from #2376) to state root lookups in order to improve performance and reduce associated memory spikes (e.g. from certain HTTP API requests). ## Additional Changes - Tests using `rev_iter_state_roots` and `rev_iter_block_roots` have been refactored to use their `forwards` versions instead. - The `rev_iter_state_roots` and `rev_iter_block_roots` functions are now unused and have been removed. - The `state_at_slot` function has been changed to use the `forwards` iterator. ## Additional Info - Some tests still need to be refactored to use their `forwards_iter` versions. These tests start their iteration from a specific beacon state and thus use the `rev_iter_state_roots_from` and `rev_iter_block_roots_from` functions. If they can be refactored, those functions can also be removed.	2021-07-06 02:38:53 +00:00
Age Manning	73d002ef92	Update outdated dependencies (#2425 ) This updates some older dependencies to address a few cargo audit warnings. The majority of warnings come from network dependencies which will be addressed in #2389. This PR contains some minor dep updates that are not network related. Co-authored-by: Michael Sproul <michael@sigmaprime.io>	2021-07-05 00:54:17 +00:00
realbigsean	b84ff9f793	rust 1.53.0 updates (#2411 ) ## Issue Addressed `make lint` failing on rust 1.53.0. ## Proposed Changes 1.53.0 updates ## Additional Info I haven't figured out why yet, but we are now hitting the recursion limit in a few crates, so I had to add `#![recursion_limit = "256"]` in a few places. Co-authored-by: realbigsean <seananderson33@gmail.com> Co-authored-by: Michael Sproul <michael@sigmaprime.io>	2021-06-18 05:58:01 +00:00
Michael Sproul	3dc1eb5eb6	Ignore inactive validators in validator monitor (#2396 ) ## Proposed Changes A user on Discord (`@ChewsMacRibs`) reported that the validator monitor was logging `WARN Attested to an incorrect head` for their validator while it was awaiting activation. This PR modifies the monitor so that it ignores inactive validators, by the logic that they are either awaiting activation, or have already exited. Either way, there's no way for an inactive validator to have their attestations included on chain, so no need for the monitor to report on them. ## Additional Info To reproduce the bug requires registering validator keys manually with `--validator-monitor-pubkeys`. I don't think the bug will present itself with `--validator-monitor-auto`.	2021-06-17 02:10:48 +00:00
Jack	98ab00cc52	Handle Geth pre-EIP-155 block sync error condition (#2304 ) ## Issue Addressed #2293 ## Proposed Changes - Modify the handler for the `eth_chainId` RPC (i.e., `get_chain_id`) to explicitly match against the Geth error string returned for pre-EIP-155 synced Geth nodes - ~~Add a new helper function, `rpc_error_msg`, to aid in the above point~~ - Refactor `response_result` into `response_result_or_error` and patch reliant RPC handlers accordingly (thanks to @pawanjay176) ## Additional Info Geth, as of Pangaea Expanse (v1.10.0), returns an explicit error when it is not synced past the EIP-155 block (2675000). Previously, Geth simply returned a chain ID of 0 (which was obviously much easier to handle on Lighthouse's part). Co-authored-by: Paul Hauner <paul@paulhauner.com>	2021-06-17 02:10:47 +00:00
realbigsean	b1657a60e9	Reorg events (#2090 ) ## Issue Addressed Resolves #2088 ## Proposed Changes Add the `chain_reorg` SSE event topic ## Additional Info Co-authored-by: realbigsean <seananderson33@gmail.com> Co-authored-by: Paul Hauner <paul@paulhauner.com>	2021-06-17 02:10:46 +00:00
divma	3261eff0bf	split outbound and inbound codecs encoded types (#2410 ) Splits the inbound and outbound requests, for maintainability.	2021-06-17 00:40:16 +00:00
Paul Hauner	3b600acdc5	v1.4.0 (#2402 ) ## Issue Addressed NA ## Proposed Changes - Bump versions and update `Cargo.lock` ## Additional Info NA ## TODO - [x] Ensure #2398 gets merged successfully	2021-06-10 01:44:49 +00:00
Paul Hauner	93100f221f	Make less logs for attn with unknown head (#2395 ) ## Issue Addressed NA ## Proposed Changes I am starting to see a lot of slog-async overflows (i.e., too many logs) on Prater whenever we see attestations for an unknown block. Since these logs are identical (except for peer id) and we expose volume/count of these errors via `metrics::GOSSIP_ATTESTATION_ERRORS_PER_TYPE`, I took the following actions to remove them from `DEBUG` logs: - Push the "Attestation for unknown block" log to trace. - Add a debug log in `search_for_block`. In effect, this should serve as a de-duped version of the previous, downgraded log. ## Additional Info TBC	2021-06-07 02:34:09 +00:00
Pawan Dhananjay	502402c6b9	Fix options for `--eth1-endpoints` flag (#2392 ) ## Issue Addressed N/A ## Proposed Changes Set `config.sync_eth1_chain` to true when using just the `--eth1-endpoints` flag (without `--eth1`).	2021-06-04 00:10:59 +00:00
Paul Hauner	f6280aa663	v1.4.0-rc.0 (#2379 ) ## Issue Addressed NA ## Proposed Changes Bump versions. ## Additional Info This is not exactly the v1.4.0 release described in [Lighthouse Update #36](https://lighthouse.sigmaprime.io/update-36.html). Whilst it contains: - Beta Windows support - A reduction in Eth1 queries - A reduction in memory footprint It does not contain: - Altair - Doppelganger Protection - The remote signer We have decided to release some features early. This is primarily due to the desire to allow users to benefit from the memory saving improvements as soon as possible. ## TODO - [x] Wait for #2340, #2356 and #2376 to merge and then rebase on `unstable`. - [x] Ensure discovery issues are fixed (see #2388) - [x] Ensure https://github.com/sigp/lighthouse/pull/2382 is merged/removed. - [x] Ensure https://github.com/sigp/lighthouse/pull/2383 is merged/removed. - [x] Ensure https://github.com/sigp/lighthouse/pull/2384 is merged/removed. - [ ] Double-check eth1 cache is carried between boots	2021-06-03 00:13:02 +00:00
Paul Hauner	90ea075c62	Revert "Network protocol upgrades (#2345 )" (#2388 ) ## Issue Addressed NA ## Proposed Changes Reverts #2345 in the interests of getting v1.4.0 out this week. Once we have released that, we can go back to testing this again. ## Additional Info NA	2021-06-02 01:07:28 +00:00
Paul Hauner	d34f922c1d	Add early check for RPC block relevancy (#2289 ) ## Issue Addressed NA ## Proposed Changes When observing `jemallocator` heap profiles and Grafana, it became clear that Lighthouse is spending significant RAM/CPU on processing blocks from the RPC. On investigation, it seems that we are loading the parent of the block before we check to see if the block is already known. This is a big waste of resources. This PR adds an additional `check_block_relevancy` call as the first thing we do when we try to process a `SignedBeaconBlock` via the RPC (or other similar methods). Ultimately, `check_block_relevancy` will be called again later in the block processing flow. It's a very light function and I don't think trying to optimize it out is worth the risk of a bad block slipping through. Also adds a `New RPC block received` info log when we process a new RPC block. This seems like interesting and infrequent info. ## Additional Info NA	2021-06-02 01:07:27 +00:00
Paul Hauner	bf4e02e2cc	Return a specific error for frozen attn states (#2384 ) ## Issue Addressed NA ## Proposed Changes Return a very specific error when an attestation reads shuffling from a frozen `BeaconState`. Previously, this was returning `MissingBeaconState` which indicates a much more serious issue. ## Additional Info Since `get_inconsistent_state_for_attestation_verification_only` is only called once in `BeaconChain::with_committee_cache`, it is quite easy to reason about the impact of this change.	2021-06-01 06:59:43 +00:00
Paul Hauner	ba9c4c5eea	Return more detail in Eth1 HTTP errors (#2383 ) ## Issue Addressed NA ## Proposed Changes Whilst investigating #2372, I [learned](https://github.com/sigp/lighthouse/issues/2372#issuecomment-851725049) that the error message returned from some failed Eth1 requests are always `NotReachable`. This makes debugging quite painful. This PR adds more detail to these errors. For example: - Bad infura key: `ERRO Failed to update eth1 cache error: Failed to update Eth1 service: "All fallback errored: https://mainnet.infura.io/ => EndpointError(RequestFailed(\"Response HTTP status was not 200 OK: 401 Unauthorized.\"))", retry_millis: 60000, service: eth1_rpc` - Unreachable server: `ERRO Failed to update eth1 cache error: Failed to update Eth1 service: "All fallback errored: http://127.0.0.1:8545/ => EndpointError(RequestFailed(\"Request failed: reqwest::Error { kind: Request, url: Url { scheme: \\\"http\\\", cannot_be_a_base: false, username: \\\"\\\", password: None, host: Some(Ipv4(127.0.0.1)), port: Some(8545), path: \\\"/\\\", query: None, fragment: None }, source: hyper::Error(Connect, ConnectError(\\\"tcp connect error\\\", Os { code: 111, kind: ConnectionRefused, message: \\\"Connection refused\\\" })) }\"))", retry_millis: 60000, service: eth1_rpc` - Bad server: `ERRO Failed to update eth1 cache error: Failed to update Eth1 service: "All fallback errored: http://127.0.0.1:8545/ => EndpointError(RequestFailed(\"Response HTTP status was not 200 OK: 501 Not Implemented.\"))", retry_millis: 60000, service: eth1_rpc` ## Additional Info NA	2021-06-01 06:59:41 +00:00
Paul Hauner	4c7bb4984c	Use the forwards iterator more often (#2376 ) ## Issue Addressed NA ## Primary Change When investigating memory usage, I noticed that retrieving a block from an early slot (e.g., slot 900) would cause a sharp increase in the memory footprint (from 400mb to 800mb+) which seemed to be ever-lasting. After some investigation, I found that the reverse iteration from the head back to that slot was the likely culprit. To counter this, I've switched the `BeaconChain::block_root_at_slot` to use the forwards iterator, instead of the reverse one. I also noticed that the networking stack is using `BeaconChain::root_at_slot` to check if a peer is relevant (`check_peer_relevance`). Perhaps the steep, seemingly-random-but-consistent increases in memory usage are caused by the use of this function. Using the forwards iterator with the HTTP API alleviated the sharp increases in memory usage. It also made the response much faster (before, it felt like it took 1-2s; now it feels instant). ## Additional Changes In the process I also noticed that we have two functions for getting block roots: - `BeaconChain::block_root_at_slot`: returns `None` for a skip slot. - `BeaconChain::root_at_slot`: returns the previous root for a skip slot. I unified these two functions into `block_root_at_slot` and added the `WhenSlotSkipped` enum. Now, the caller must be explicit about the skip-slot behaviour when requesting a root. Additionally, I replaced `vec![]` with `Vec::with_capacity` in `store::chunked_vector::range_query`. I stumbled across this whilst debugging and made this modification to see what effect it would have (not much). It seems like a decent change to keep around, but I'm not concerned either way. Also, `BeaconChain::get_ancestor_block_root` is unused, so I got rid of it 🗑️. ## Additional Info I haven't done the same for state roots here. Whilst it's possible and a good idea, it's more work since the fwds iterators are presently block-roots-specific. 
Whilst there are a few places where a reverse iteration of state roots could be triggered (e.g., attestation production, HTTP API), they're nowhere near as common as the `check_peer_relevance` call. As such, I think we should get this PR merged first, then come back for the state root iters. I made an issue here https://github.com/sigp/lighthouse/issues/2377.	2021-05-31 04:18:20 +00:00
Kevin Lu	320a683e72	Minimum Outbound-Only Peers Requirement (#2356 ) ## Issue Addressed #2325 ## Proposed Changes This pull request changes the behavior of the Peer Manager by including a minimum outbound-only peers requirement. The peer manager will continue querying for peers if this outbound-only target number hasn't been met. Additionally, when peers are being removed, an outbound-only peer will not be disconnected if doing so brings us below the minimum. ## Additional Info Unit test for heartbeat function tests that disconnection behavior is correct. Continual querying for peers if outbound-only hasn't been met is not directly tested, but indirectly through unit testing of the helper function that counts the number of outbound-only peers. EDIT: Am concerned about the behavior of ```update_peer_scores```. If we have connected to a peer with a score below the disconnection threshold (-20), then its connection status will remain connected, while its score state will change to disconnected. ```rust let previous_state = info.score_state(); // Update scores info.score_update(); Self::handle_score_transitions( previous_state, peer_id, info, &mut to_ban_peers, &mut to_unban_peers, &mut self.events, &self.log, ); ``` ```previous_state``` will be set to Disconnected, and then because ```handle_score_transitions``` only changes connection status for a peer if the state changed, the peer remains connected. Then in the heartbeat code, because we only disconnect healthy peers if we have too many peers, these peers don't get disconnected. I'm not sure realistically how often this scenario would occur, but it might be better to adjust the logic to account for scenarios where the score state implies a connection status different from the current connection status. Co-authored-by: Kevin Lu <kevlu93@gmail.com>	2021-05-31 04:18:19 +00:00
Mac L	0847986936	Reduce outbound requests to eth1 endpoints (#2340 ) ## Issue Addressed #2282 ## Proposed Changes Reduce the outbound requests made to eth1 endpoints by caching the results from `eth_chainId` and `net_version`. Further reduce the overall request count by increasing `auto_update_interval_millis` from `7_000` (7 seconds) to `60_000` (1 minute). This will result in a reduction from ~2000 requests per hour to 360 requests per hour (during normal operation). A reduction of 82%. ## Additional Info If an endpoint fails, its state is dropped from the cache and the `eth_chainId` and `net_version` calls will be made for that endpoint again during the regular update cycle (once per minute) until it is back online. Co-authored-by: Paul Hauner <paul@paulhauner.com>	2021-05-31 04:18:18 +00:00
Age Manning	ec5cceba50	Correct issue with dialing peers (#2375 ) The ordering of adding new peers to the peerdb and deciding when to dial them was not considered in a previous update. This adds the condition that if a peer is not in the peer-db then it is an acceptable peer to dial. This makes #2374 obsolete.	2021-05-29 07:25:06 +00:00
Age Manning	d12e746b50	Network protocol upgrades (#2345 ) This provides a number of upgrades to gossipsub and discovery. The updates are extensive and this needs thorough testing.	2021-05-28 22:02:10 +00:00
Paul Hauner	456b313665	Tune GNU malloc (#2299 ) ## Issue Addressed NA ## Proposed Changes Modify the configuration of [GNU malloc](https://www.gnu.org/software/libc/manual/html_node/The-GNU-Allocator.html) to reduce memory footprint. - Set `M_ARENA_MAX` to 4. - This reduces memory fragmentation at the cost of contention between threads. - Set `M_MMAP_THRESHOLD` to 2mb - This means that any allocation >= 2mb is allocated via an anonymous mmap, instead of on the heap/arena. This reduces memory fragmentation since we don't need to keep growing the heap to find big contiguous slabs of free memory. - ~~Run `malloc_trim` every 60 seconds.~~ - ~~This shaves unused memory from the top of the heap, preventing the heap from constantly growing.~~ - Removed, see: https://github.com/sigp/lighthouse/pull/2299#issuecomment-825322646 Note: this only provides memory savings on the Linux (glibc) platform. ## Additional Info I'm going to close #2288 in favor of this for the following reasons: - I've managed to get the memory footprint smaller here than with jemalloc. - This PR seems to be less of a dramatic change than bringing in the jemalloc dep. - The changes in this PR are strictly runtime changes, so we can create CLI flags which disable them completely. Since this change is wide-reaching and complex, it's nice to have an easy "escape hatch" if there are undesired consequences. ## TODO - [x] Allow configuration via CLI flags - [x] Test on Mac - [x] Test on RasPi. - [x] Determine if GNU malloc is present? - I'm not quite sure how to detect for glibc.. This issue suggests we can't really: https://github.com/rust-lang/rust/issues/33244 - [x] Make a clear argument regarding the effect of this on CPU utilization. - [x] Test with higher `M_ARENA_MAX` values. - [x] Test with longer trim intervals - [x] Add some stats about memory savings - [x] Remove `malloc_trim` calls & code	2021-05-28 05:59:45 +00:00
Pawan Dhananjay	fdaeec631b	Monitoring service api (#2251 ) ## Issue Addressed N/A ## Proposed Changes Adds a client side api for collecting system and process metrics and pushing it to a monitoring service.	2021-05-26 05:58:41 +00:00
Age Manning	55aada006f	More stringent dialing (#2363 ) * More stringent dialing * Cover cached enr dialing	2021-05-26 14:21:44 +10:00
ethDreamer	ba55e140ae	Enable Compatibility with Windows (#2333 ) ## Issue Addressed Windows incompatibility. ## Proposed Changes On windows, lighthouse needs to default to STDIN as tty doesn't exist. Also Windows uses ACLs for file permissions. So to mirror chmod 600, we will remove every entry in a file's ACL and add only a single SID that is an alias for the file owner. Beyond that, there were several changes made to different unit tests because windows has slightly different error messages as well as frustrating nuances around killing a process :/ ## Additional Info Tested on my Windows VM and it appears to work, also compiled & tested on Linux with these changes. Permissions look correct on both platforms now. Just waiting for my validator to activate on Prater so I can test running full validator client on windows. Co-authored-by: ethDreamer <37123614+ethDreamer@users.noreply.github.com> Co-authored-by: Michael Sproul <micsproul@gmail.com>	2021-05-19 23:05:16 +00:00
ethDreamer	cb47388ad7	Updated to comply with new clippy formatting rules (#2336 ) ## Issue Addressed The latest version of Rust has new clippy rules & the codebase isn't up to date with them. ## Proposed Changes Small formatting changes that clippy tells me are functionally equivalent	2021-05-10 00:53:09 +00:00
Mac L	bacc38c3da	Add testing for beacon node and validator client CLI flags (#2311 ) ## Issue Addressed N/A ## Proposed Changes Add unit tests for the various CLI flags associated with the beacon node and validator client. These changes require the addition of two new flags: `dump-config` and `immediate-shutdown`. ## Additional Info Both `dump-config` and `immediate-shutdown` are marked as hidden since they should only be used in testing and other advanced use cases. Note: This requires changing `main.rs` so that the flags can adjust the program behavior as necessary. Co-authored-by: Paul Hauner <paul@paulhauner.com>	2021-05-06 00:36:22 +00:00
Mac L	4cc613d644	Add `SensitiveUrl` to redact user secrets from endpoints (#2326 ) ## Issue Addressed #2276 ## Proposed Changes Add the `SensitiveUrl` struct which wraps `Url` and implements custom `Display` and `Debug` traits to redact user secrets from being logged in eth1 endpoints, beacon node endpoints and metrics. ## Additional Info This also includes a small rewrite of the eth1 crate to make requests using `Url` instead of `&str`. Some error messages have also been changed to remove `Url` data.	2021-05-04 01:59:51 +00:00
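The redaction idea described in the `SensitiveUrl` commit above can be sketched with the standard library alone. This is only an illustration under stated assumptions: `RedactedUrl`, `expose_full` and the hand-rolled string splitting are hypothetical names and logic invented for this sketch, not Lighthouse's actual `SensitiveUrl`, which wraps `Url`.

```rust
use std::fmt;

/// Hypothetical stand-in for a URL wrapper that hides secrets when logged.
/// Assumption: keeping only scheme and host is an acceptable redaction.
pub struct RedactedUrl {
    full: String,
    redacted: String,
}

impl RedactedUrl {
    /// Stores the full URL and pre-computes a form without userinfo, path or query.
    pub fn new(full: &str) -> Self {
        let (scheme, rest) = match full.split_once("://") {
            Some((scheme, rest)) => (scheme, rest),
            None => ("", full),
        };
        // The authority part ends at the first '/', '?' or '#'.
        let authority = rest
            .split(|c: char| c == '/' || c == '?' || c == '#')
            .next()
            .unwrap_or("");
        // Drop any "user:password@" prefix.
        let host = authority.rsplit('@').next().unwrap_or(authority);
        let redacted = if scheme.is_empty() {
            host.to_string()
        } else {
            format!("{}://{}/", scheme, host)
        };
        Self { full: full.to_string(), redacted }
    }

    /// Explicit, greppable access to the secret-bearing URL for real requests.
    pub fn expose_full(&self) -> &str {
        &self.full
    }
}

// Both formatting traits print only the redacted form, so `{}` and `{:?}`
// in log statements cannot leak the secret by accident.
impl fmt::Display for RedactedUrl {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.redacted)
    }
}

impl fmt::Debug for RedactedUrl {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "RedactedUrl({})", self.redacted)
    }
}
```

The design point is that the safe behaviour is the default: callers must opt in to the full string by name, while any formatting path yields the redacted one.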
ethDreamer	0aa8509525	Filter Disconnected Peers from Discv5 DHT (#2219 ) ## Issue Addressed #2107 ## Proposed Change The peer manager will mark peers as disconnected in the discv5 DHT when they disconnect or dial fails ## Additional Info Rationale for this particular change is explained in my comment on #2107	2021-04-28 04:07:37 +00:00
realbigsean	2c2c443718	404's on API requests for slots that have been skipped or orphaned (#2272 ) ## Issue Addressed Resolves #2186 ## Proposed Changes 404 for any block-related information on a slot that was skipped or orphaned Affected endpoints: - `/eth/v1/beacon/blocks/{block_id}` - `/eth/v1/beacon/blocks/{block_id}/root` - `/eth/v1/beacon/blocks/{block_id}/attestations` - `/eth/v1/beacon/headers/{block_id}` ## Additional Info Co-authored-by: realbigsean <seananderson33@gmail.com>	2021-04-25 03:59:59 +00:00
Paul Hauner	3a24ca5f14	v1.3.0 (#2310 ) ## Issue Addressed NA ## Proposed Changes Bump versions. ## Additional Info This is a minor release (not patch) due to the very slight change introduced by #2291.	2021-04-13 22:46:34 +00:00
Michael Sproul	3b901dc5ec	Pack attestations into blocks in parallel (#2307 ) ## Proposed Changes Use two instances of max cover when packing attestations into blocks: one for the previous epoch, and one for the current epoch. This reduces the amount of computation done by roughly half due to the `O(n^2)` running time of max cover (`2 * (n/2)^2 = n^2/2`). This should help alleviate some load on block proposal, particularly on Prater.	2021-04-13 05:27:42 +00:00
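The "roughly half" figure in the commit above follows from the quadratic cost model it states. A minimal sketch of that arithmetic, assuming a cost of exactly `n * n` units per max-cover pass (the function names are hypothetical and nothing here is Lighthouse code):

```rust
/// Modelled cost of one quadratic-time pass over `n` attestations (arbitrary units).
pub fn quadratic_cost(n: u64) -> u64 {
    n * n
}

/// Modelled cost when the attestations are split by epoch and each group
/// gets its own independent pass.
pub fn split_cost(prev_epoch: u64, curr_epoch: u64) -> u64 {
    quadratic_cost(prev_epoch) + quadratic_cost(curr_epoch)
}
```

With 128 attestations split evenly, `split_cost(64, 64)` is 8192 against `quadratic_cost(128)` at 16384, matching `2 * (n/2)^2 = n^2/2`. An uneven split saves less under this model: `split_cost(96, 32)` is 10240.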
Paul Hauner	c1203f5e52	Add specific log and metric for delayed blocks (#2308 ) ## Issue Addressed NA ## Proposed Changes - Adds a specific log and metric for when a block is enshrined as head with a delay that will cause bad attestations - We technically already expose this information, but it's a little tricky to determine during debugging. This makes it nice and explicit. - Fixes a minor reporting bug with the validator monitor where it was expecting agg. attestations too early (at half-slot rather than two-thirds-slot). ## Additional Info NA	2021-04-13 02:16:59 +00:00
Paul Hauner	0df7be1814	Add check for aggregate target (#2306 ) ## Issue Addressed NA ## Proposed Changes - Ensure that the [target consistency check](`b356f52c5c`) is always performed on aggregates. - Add a regression test. ## Additional Info NA	2021-04-13 00:24:39 +00:00
Age Manning	aaa14073ff	Clean up warnings (#2240 ) This is a small PR that cleans up compiler warnings. The most controversial change is removing the `data_dir` field from the `BeaconChainBuilder`. It was removed because it was never read. Co-authored-by: Paul Hauner <paul@paulhauner.com> Co-authored-by: Herman Junge <hermanjunge@protonmail.com> Co-authored-by: Michael Sproul <michael@sigmaprime.io>	2021-04-12 00:57:43 +00:00
Mac L	f6f64cf0f5	Correcting `disable-enr-auto-update` flag definition (#2303 ) ## Issue Addressed N/A ## Proposed Changes Correct the `disable-enr-auto-update` boolean flag so that it no longer requires a value. Previously it would require a value which was never used. ## Additional Info Flag is read here: https://github.com/sigp/lighthouse/blob/unstable/beacon_node/src/config.rs#L585-L587	2021-04-11 23:52:29 +00:00
Paul Hauner	e7e5878953	Avoid BeaconState clone during metrics scrape (#2298 ) ## Issue Addressed Which issue # does this PR address? ## Proposed Changes Avoids cloning the `BeaconState` each time Prometheus scrapes our metrics (generally every 5s 😱). I think the original motivation behind this was "don't hold the lock on the head whilst we do computation on it", however I think this is flawed since our computation here is so small that it'll be quicker than the clone. The primary motivation here is to maintain a small memory footprint by holding less in memory (i.e., the cloned `BeaconState`) and to avoid the fragmentation-creep that occurs when cloning the big contiguous slabs of memory in the `BeaconState`. I also collapsed the active/slashed/withdrawn counters into a single loop to increase efficiency. ## Additional Info NA	2021-04-07 01:02:56 +00:00
Pawan Dhananjay	95a362213d	Fix local testnet scripts (#2229 ) ## Issue Addressed Resolves #2094 ## Proposed Changes Fixes scripts for creating local testnets. Adds an option in `lighthouse boot_node` to run with a previously generated enr.	2021-03-30 05:17:58 +00:00
Paul Hauner	9eb1945136	v1.2.2 (#2287 ) ## Issue Addressed NA ## Proposed Changes - Bump versions ## Additional Info NA	2021-03-30 04:07:03 +00:00
Paul Hauner	3d239b85ac	Allow for a clock disparity on the duties endpoints (#2283 ) ## Issue Addressed Resolves #2280 ## Proposed Changes Allows for API consumers to call the proposer/attester duties endpoints [`MAXIMUM_GOSSIP_CLOCK_DISPARITY`](`b34a79dc0b/beacon_node/beacon_chain/src/beacon_chain.rs (L99-L102)`) earlier than the current epoch. For additional reasoning, see https://github.com/sigp/lighthouse/issues/2280#issuecomment-805358897. ## Additional Info NA	2021-03-29 23:42:35 +00:00
Paul Hauner	03cefd0065	Expand observed attestations capacity (#2266 ) ## Issue Addressed NA ## Proposed Changes I noticed the following error on one of our nodes: ``` Mar 18 00:03:35 ip-xxxx lighthouse-bn[333503]: Mar 18 00:03:35.103 ERRO Unable to validate aggregate error: ObservedAttestersError(EpochTooLow { epoch: Epoch(23961), lowest_permissible_epoch: Epoch(23962) }), peer_id: 16Uiu2HAm5GL5KzPLhvfg9MBBFSpBqTVGRFSiTg285oezzWcZzwEv ``` The slot during this log was 766,815 (the last slot of the epoch). I believe this is due to an off-by-one error in `observed_attesters` where we were failing to provide enough capacity to store observations from the previous, current and next epochs. See code comments for further reasoning. Here's a link to the spec: https://github.com/ethereum/eth2.0-specs/blob/v1.0.1/specs/phase0/p2p-interface.md#beacon_aggregate_and_proof ## Additional Info NA	2021-03-29 23:42:34 +00:00
Michael Sproul	f9d60f5436	VC: accept unknown fields in chain spec (#2277 ) ## Issue Addressed Closes #2274 ## Proposed Changes * Modify the `YamlConfig` to collect unknown fields into an `extra_fields` map, instead of failing hard. * Log a debug message if there are extra fields returned to the VC from one of its BNs. This restores Lighthouse's compatibility with Teku beacon nodes (and therefore Infura)	2021-03-26 04:53:57 +00:00
Paul Hauner	b34a79dc0b	v1.2.1 (#2263 ) ## Issue Addressed NA ## Proposed Changes - Bump version. - Add some new ENR for Prater - Afri: https://github.com/eth2-clients/eth2-testnets/pull/42 - Prysm: https://github.com/eth2-clients/eth2-testnets/pull/43 - Apply the fixes from #2181 to the no-eth1-sim to try fix CI issues. ## Additional Info NA	2021-03-18 04:20:46 +00:00
Paul Hauner	015ab7d0a7	Optimize validator duties (#2243 ) ## Issue Addressed Closes #2052 ## Proposed Changes - Refactor the attester/proposer duties endpoints in the BN - Performance improvements - Fixes some potential inconsistencies with the dependent root fields. - Removes `http_api::beacon_proposer_cache` and just uses the one on the `BeaconChain` instead. - Move the code for the proposer/attester duties endpoints into separate files, for readability. - Refactor the `DutiesService` in the VC - Required to reduce the delay on broadcasting new blocks. - Gets rid of the `ValidatorDuty` shim struct that came about when we adopted the standard API. - Separate block/attestation duty tasks so that they don't block each other when one is slow. - In the VC, use `PublicKeyBytes` to represent validators instead of `PublicKey`. `PublicKey` is a legit crypto object whilst `PublicKeyBytes` is just a byte-array, it's much faster to clone/hash `PublicKeyBytes` and this change has had a significant impact on runtimes. - Unfortunately this has created lots of dust changes. - In the BN, store `PublicKeyBytes` in the `beacon_proposer_cache` and allow access to them. The HTTP API always sends `PublicKeyBytes` over the wire and the conversion from `PublicKey` -> `PublickeyBytes` is non-trivial, especially when queries have 100s/1000s of validators (like Pyrmont). - Add the `state_processing::state_advance` mod which dedups a lot of the "apply `n` skip slots to the state" code. - This also fixes a bug with some functions which were failing to include a state root as per [this comment](`072695284f/consensus/state_processing/src/state_advance.rs (L69-L74)`). I couldn't find any instance of this bug that resulted in anything more severe than keying a shuffling cache by the wrong block root. - Swap the VC block service to use `mpsc` from `tokio` instead of `futures`. This is consistent with the rest of the code base. 
~~This PR reduces the size of the codebase 🎉~~ It used to reduce the size of the code base before I added more comments. ## Observations on Pyrmont - Proposer duties times down from peaks of 450ms to consistent <1ms. - Current epoch attester duties times down from >1s peaks to a consistent 20-30ms. - Block production down from +600ms to 100-200ms. ## Additional Info - ~~Blocked on #2241~~ - ~~Blocked on #2234~~ ## TODO - [x] ~~Refactor this into some smaller PRs?~~ Leaving this as-is for now. - [x] Address `per_slot_processing` roots. - [x] Investigate slow next epoch times. Not getting added to cache on block processing? - [x] Consider [this](`072695284f/beacon_node/store/src/hot_cold_store.rs (L811-L812)`) in the scenario of replacing the state roots Co-authored-by: pawan <pawandhananjay@gmail.com> Co-authored-by: Michael Sproul <michael@sigmaprime.io>	2021-03-17 05:09:57 +00:00
Michael Sproul	3919737978	Release v1.2.0 (#2249 ) ## Proposed Changes Release v1.2.0 unchanged from the release candidate.	2021-03-10 01:28:32 +00:00
Michael Sproul	770a2ca030	Fix proposer cache priming upon state advance (#2252 ) ## Proposed Changes While investigating an incorrect head + target vote for the epoch boundary block 708544, I noticed that the state advance failed to prime the proposer cache, as per these logs: ``` Mar 09 21:42:47.448 DEBG Subscribing to subnet target_slot: 708544, subnet: Y, service: attestation_service Mar 09 21:49:08.063 DEBG Advanced head state one slot current_slot: 708543, state_slot: 708544, head_root: 0xaf5e69de09f384ee3b4fb501458b7000c53bb6758a48817894ec3d2b030e3e6f, service: state_advance Mar 09 21:49:08.063 DEBG Completed state advance initial_slot: 708543, advanced_slot: 708544, head_root: 0xaf5e69de09f384ee3b4fb501458b7000c53bb6758a48817894ec3d2b030e3e6f, service: state_advance Mar 09 21:49:14.787 DEBG Proposer shuffling cache miss block_slot: 708544, block_root: 0x9b14bf68667ab1d9c35e6fd2c95ff5d609aa9e8cf08e0071988ae4aa00b9f9fe, parent_slot: 708543, parent_root: 0xaf5e69de09f384ee3b4fb501458b7000c53bb6758a48817894ec3d2b030e3e6f, service: beacon Mar 09 21:49:14.800 DEBG Successfully processed gossip block root: 0x9b14bf68667ab1d9c35e6fd2c95ff5d609aa9e8cf08e0071988ae4aa00b9f9fe, slot: 708544, graffiti: , service: beacon Mar 09 21:49:14.800 INFO New block received hash: 0x9b14…f9fe, slot: 708544 Mar 09 21:49:14.984 DEBG Head beacon block slot: 708544, root: 0x9b14…f9fe, finalized_epoch: 22140, finalized_root: 0x28ec…29a7, justified_epoch: 22141, justified_root: 0x59db…e451, service: beacon Mar 09 21:49:15.055 INFO Unaggregated attestation validator: XXXXX, src: api, slot: 708544, epoch: 22142, delay_ms: 53, index: Y, head: 0xaf5e69de09f384ee3b4fb501458b7000c53bb6758a48817894ec3d2b030e3e6f, service: val_mon Mar 09 21:49:17.001 DEBG Slot timer sync_state: Synced, current_slot: 708544, head_slot: 708544, head_block: 0x9b14…f9fe, finalized_epoch: 22140, finalized_root: 0x28ec…29a7, peers: 55, service: slot_notifier ``` The reason for this is that the condition was backwards, 
so that whole block of code was unreachable. Looking at the attestations for the block included in the block after, we can see that lots of validators missed it. Some of them may be Lighthouse v1.1.1-v1.2.0-rc.0, but it's probable that they would have missed even with the proposer cache primed, given how late the block 708544 arrived (the cache miss occurred 3.787s after the slot start): https://beaconcha.in/block/708545#attestations	2021-03-10 00:20:50 +00:00
Michael Sproul	786e25ea08	Release candidate v1.2.0-rc.0 (#2248 ) Prepare for v1.2.0 with this release candidate. To be merged after #2247 and #2246 Co-authored-by: Age Manning <Age@AgeManning.com>	2021-03-08 06:27:50 +00:00
Age Manning	babd153352	Prevent adding and dialing bootnodes when discovery is disabled (#2247 ) This is a small PR which prevents unwanted bootnodes from being added to the DHT and being dialed when the `--disable-discovery` flag is set. The main reason one would want to disable discovery is to connect to a fixed set of peers. Currently, regardless of what the user does, Lighthouse will populate its DHT with previously known peers and also fill it with the spec's bootnodes. It will then dial the bootnodes that are capable of being dialed. This prevents testing with a fixed peer list. This PR prevents these excess nodes from being added and dialed if the user has set `--disable-discovery`.	2021-03-08 06:27:49 +00:00
Paul Hauner	e4eb0eb168	Use advanced state for block production (#2241 ) ## Issue Addressed NA ## Proposed Changes - Use the pre-states from #2174 during block production. - Running this on Pyrmont shows block production times dropping from ~550ms to ~150ms. - Create `crit` and `warn` logs when a block is published to the API later than we expect. - On mainnet we are issuing a warn if the block is published more than 1s later than the slot start and a crit for more than 3s. - Rename some methods on the `SnapshotCache` for clarity. - Add the ability to pass the state root to `BeaconChain::produce_block_on_state` to avoid computing a state root. This is a very common LH optimization. - Add a metric that tracks how late we broadcast blocks received from the HTTP API. This is technically a duplicate of a `ValidatorMonitor` log, but I wanted to have it for the case where we aren't monitoring validators too.	2021-03-04 04:43:31 +00:00
Michael Sproul	363f15f362	Use the database to persist the pubkey cache (#2234 ) ## Issue Addressed Closes #1787 ## Proposed Changes * Abstract the `ValidatorPubkeyCache` over a "backing" which is either a file (legacy), or the database. * Implement a migration from schema v2 to schema v3, whereby the contents of the cache file are copied to the DB, and then the file is deleted. The next release to include this change must be a minor version bump, and we will need to warn users of the inability to downgrade (this is our first DB schema change since mainnet genesis). * Move the schema migration code from the `store` crate into the `beacon_chain` crate so that it can access the datadir and the `ValidatorPubkeyCache`, etc. It gets injected back into the `store` via a closure (similar to what we do in fork choice).	2021-03-04 01:25:12 +00:00
Age Manning	1c507c588e	Update to the latest libp2p (#2239 ) Updates to the latest libp2p and ignores RUSTSEC-2020-0146 from cargo-audit Co-authored-by: Michael Sproul <michael@sigmaprime.io>	2021-03-02 05:59:49 +00:00
realbigsean	ed9b245de0	update tokio-stream to 0.1.3 and use `BroadcastStream` (#2212 ) ## Issue Addressed Resolves #2189 ## Proposed Changes use tokio's `BroadcastStream` ## Additional Info N/A Co-authored-by: realbigsean <seananderson33@gmail.com>	2021-03-01 01:58:05 +00:00
Michael Sproul	2f077b11fe	Allow HTTP API to return SSZ blocks (#2209 ) ## Issue Addressed Implements https://github.com/ethereum/eth2.0-APIs/pull/125 ## Proposed Changes Optionally return SSZ bytes from the `beacon/blocks` endpoint.	2021-02-24 04:15:14 +00:00
realbigsean	5bc93869c8	Update ValidatorStatus to match the v1 API (#2149 ) ## Issue Addressed N/A ## Proposed Changes We are currently a bit off of the standard API spec because we have [this](https://hackmd.io/bQxMDRt1RbS1TLno8K4NPg?view) proposal implemented for validator status. Based on discussion [here](https://github.com/ethereum/eth2.0-APIs/pull/94), it looks like this won't be added to the spec until v2, so this PR implements [this](https://hackmd.io/ofFJ5gOmQpu1jjHilHbdQQ) validator status logic instead ## Additional Info N/A Co-authored-by: realbigsean <seananderson33@gmail.com>	2021-02-24 04:15:13 +00:00
Paul Hauner	a764c3b247	Handle early blocks (#2155 ) ## Issue Addressed NA ## Problem this PR addresses There's an issue where Lighthouse is banning a lot of peers due to the following sequence of events: 1. Gossip block 0xabc arrives ~200ms early - It is propagated across the network, with respect to [`MAXIMUM_GOSSIP_CLOCK_DISPARITY`](https://github.com/ethereum/eth2.0-specs/blob/v1.0.0/specs/phase0/p2p-interface.md#why-is-there-maximum_gossip_clock_disparity-when-validating-slot-ranges-of-messages-in-gossip-subnets). - However, it is not imported to our database since the block is early. 2. Attestations for 0xabc arrive, but the block was not imported. - The peer that sent the attestation is down-voted. - Each unknown-block attestation causes a score loss of 1, the peer is banned at -100. - When the peer is on an attestation subnet there can be hundreds of attestations, so the peer is banned quickly (before the missed block can be obtained via rpc). ## Potential solutions I can think of three solutions to this: 1. Wait for attestation-queuing (#635) to arrive and solve this. - Easy - Not immediate fix. - Whilst this would work, I don't think it's a perfect solution for this particular issue, rather (3) is better. 1. Allow importing blocks with a tolerance of `MAXIMUM_GOSSIP_CLOCK_DISPARITY`. - Easy - ~~I have implemented this, for now.~~ 1. If a block is verified for gossip propagation (i.e., signature verified) and it's within `MAXIMUM_GOSSIP_CLOCK_DISPARITY`, then queue it to be processed at the start of the appropriate slot. - More difficult - Feels like the best solution, I will try to implement this. This PR takes approach (3). ## Changes included - Implement the `block_delay_queue`, based upon a [`DelayQueue`](https://docs.rs/tokio-util/0.6.3/tokio_util/time/delay_queue/struct.DelayQueue.html) which can store blocks until it's time to import them. - Add a new `DelayedImportBlock` variant to the `beacon_processor::WorkEvent` enum to handle this new event. 
- In the `BeaconProcessor`, refactor a `tokio::select!` to a struct with an explicit `Stream` implementation. I experienced some issues with `tokio::select!` in the block delay queue and I also found it hard to debug. I think this explicit implementation is nicer and functionally equivalent (apart from the fact that `tokio::select!` randomly chooses futures to poll, whereas now we're deterministic). - Add a testing framework to the `beacon_processor` module that tests this new block delay logic. I also tested a handful of other operations in the beacon processor (attns, slashings, exits) since it was super easy to copy-pasta the code from the `http_api` tester. - To implement these tests I added the concept of an optional `work_journal_tx` to the `BeaconProcessor` which will spit out a log of events. I used this in the tests to ensure that things were happening as I expect. - The tests are a little racey, but it's hard to avoid that when testing timing-based code. If we see CI failures I can revise. I haven't observed any failures due to races on my machine or on CI yet. - To assist with testing I allowed for directly setting the time on the `ManualSlotClock`. - I gave the `beacon_processor::Worker` a `Toolbox` for two reasons; (a) it avoids changing tons of function sigs when you want to pass a new object to the worker and (b) it seemed cute.	2021-02-24 03:08:52 +00:00
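The decision made by approach (3) in the commit above can be sketched as a pure function. This is a simplified illustration, not the `block_delay_queue` itself: `EarlyBlockAction` and `classify_block` are hypothetical names, times are expressed as offsets from an arbitrary reference point, and the tolerance is passed in rather than taken from the spec constant.

```rust
use std::time::Duration;

/// Hypothetical outcome for a gossip-verified block, based on its arrival time.
#[derive(Debug, PartialEq)]
pub enum EarlyBlockAction {
    /// The block's slot has already started: import immediately.
    ImportNow,
    /// The block is early but within tolerance: hold it for this long, then import.
    QueueFor(Duration),
    /// The block is earlier than the tolerance allows.
    TooEarly,
}

/// Classifies a block given the current time, the start of the block's slot
/// and the permitted clock disparity.
pub fn classify_block(
    now: Duration,
    slot_start: Duration,
    max_disparity: Duration,
) -> EarlyBlockAction {
    if now >= slot_start {
        return EarlyBlockAction::ImportNow;
    }
    // `slot_start > now` here, so the subtraction cannot underflow.
    let early_by = slot_start - now;
    if early_by <= max_disparity {
        EarlyBlockAction::QueueFor(early_by)
    } else {
        EarlyBlockAction::TooEarly
    }
}
```

For the scenario in the commit message (a block arriving about 200ms early), the sketch yields `QueueFor(200ms)` when the tolerance is at least that large, so the block is imported at slot start and later attestations for it no longer reference an unknown block.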
Paul Hauner	46920a84e8	v1.1.3 (#2217 ) ## Issue Addressed NA ## Proposed Changes Bump versions ## Additional Info NA	2021-02-22 06:21:38 +00:00
Paul Hauner	4362ea4f98	Fix false positive "State advance too slow" logs (#2218 ) ## Issue Addressed - Resolves #2214 ## Proposed Changes Fix the false positive warning log described in #2214. ## Additional Info NA	2021-02-21 23:47:53 +00:00
Paul Hauner	8949ae7c4e	Address ENR update loop (#2216 ) ## Issue Addressed - Resolves #2215 ## Proposed Changes Addresses a potential loop when the majority of peers indicate that we are contactable via an IPv6 address. See https://github.com/sigp/discv5/pull/62 for further rationale. ## Additional Info The alternative to this PR is to use `--disable-enr-auto-update` and then manually supply an `--enr-address` and `--enr-udp-port`. However, that requires the user to know their IP addresses in order for discovery to work properly. This might not be practical/achievable for some users, hence this hotfix.	2021-02-21 23:47:52 +00:00
Paul Hauner	8c6537e71d	v1.1.2 (#2213 ) ## Issue Addressed NA ## Proposed Changes Bump versions ## Additional Info NA	2021-02-19 00:49:32 +00:00
Paul Hauner	f8cc82f2b1	Switch back to warp with cors wildcard support (#2211 ) ## Issue Addressed - Resolves #2204 - Resolves #2205 ## Proposed Changes Switches to my fork of `warp` which contains support for cors wildcards: https://github.com/paulhauner/warp/tree/cors-wildcard I have a PR open on the `warp` repo but it hasn't had any interest from the maintainers as of yet: https://github.com/seanmonstar/warp/pull/726. I think running from a fork is the best we can do for now. ## Additional Info NA	2021-02-18 22:33:12 +00:00
Lion - dapplion	613382f304	Add slot offset computing to be downloaded slot (#2198 ) The current implementation assumes the range offset of slots downloaded on a batch to equal zero. This conflicts with the condition to consider this chain as synced. For finalized sync, it results in one extra batch being downloaded which can't be processed. CC @wemeetagain	2021-02-18 08:24:46 +00:00
Paul Hauner	f819ba5414	v1.1.1 (#2202 ) ## Issue Addressed NA ## Proposed Changes Bump versions	2021-02-16 00:09:02 +00:00
Pawan Dhananjay	4a357c9947	Upgrade rand_core (#2201 ) ## Issue Addressed N/A ## Proposed Changes Upgrade `rand_core` to latest version to fix https://rustsec.org/advisories/RUSTSEC-2021-0023	2021-02-15 20:34:49 +00:00
Paul Hauner	88cc222204	Advance state to next slot after importing block (#2174 ) ## Issue Addressed NA ## Proposed Changes Add an optimization to perform `per_slot_processing` from the leading-edge of block processing to the trailing-edge. Ultimately, this allows us to import the block at slot `n` faster because we used the tail-end of slot `n - 1` to perform `per_slot_processing`. Additionally, add a "block proposer cache" which allows us to cache the block proposer for some epoch. Since we're now doing trailing-edge `per_slot_processing`, we can prime this cache with the values for the next epoch before those blocks arrive (assuming those blocks don't have some weird forking). There were several ancillary changes required to achieve this: - Remove the `state_root` field of `BeaconSnapshot`, since there's no need to know it on a `pre_state` and in all other cases we can just read it from `block.state_root()`. - This caused some "dust" changes of `snapshot.beacon_state_root` to `snapshot.beacon_state_root()`, where the `BeaconSnapshot::beacon_state_root()` func just reads the state root from the block. - Rename `types::ShuffingId` to `AttestationShufflingId`. I originally did this because I added a `ProposerShufflingId` struct which turned out to be not so useful. I thought this new name was more descriptive so I kept it. - Address https://github.com/ethereum/eth2.0-specs/pull/2196 - Add a debug log when we get a block with an unknown parent. There was previously no logging around this case. - Add a function to `BeaconState` to compute all proposers for an epoch without re-computing the active indices for each slot. ## Additional Info - ~~Blocked on #2173~~ - ~~Blocked on #2179~~ That PR was wrapped into this PR. - There's potentially some places where we could avoid computing the proposer indices in `per_block_processing` but I haven't done this here. 
These would be an optimization beyond the issue at hand (improving block propagation times) and I think this PR is already doing enough. We can come back for that later. ## TODO - [x] Tidy, improve comments. - [x] ~~Try avoid computing proposer index in `per_block_processing`?~~	2021-02-15 07:17:52 +00:00
Paul Hauner	3000f3e5da	Dht persistence on drop (v2) (#2200 ) ## Issue Addressed NA ## Proposed Changes This is simply #2177 with a merge conflict fixed. Co-authored-by: realbigsean <seananderson33@gmail.com>	2021-02-15 06:09:55 +00:00
Paul Hauner	8e5c20b6d1	Update for clippy 1.50 (#2193 ) ## Issue Addressed NA ## Proposed Changes Rust 1.50 has landed 🎉 The shiny new `clippy` peers down upon us mere mortals with disgust. Brutish peasants wrapping our `usize`s in superfluous `Option`s... tsk tsk. I've performed the goat sacrifice and corrected our evil ways in this PR. Tonight we shall pray that Github Actions bestows the almighty green tick upon us. ## Additional Info NA Co-authored-by: realbigsean <seananderson33@gmail.com> Co-authored-by: Michael Sproul <michael@sigmaprime.io>	2021-02-15 00:09:12 +00:00
realbigsean	e20f64b21a	Update to tokio 1.1 (#2172 ) ## Issue Addressed resolves #2129 resolves #2099 addresses some of #1712 unblocks #2076 unblocks #2153 ## Proposed Changes - Updates all the dependencies mentioned in #2129, except for web3. They haven't merged their tokio 1.0 update because they are waiting on some dependencies of their own. Since we only use web3 in tests, I think updating it in a separate issue is fine. If they are able to merge soon though, I can update in this PR. - Updates `tokio_util` to 0.6.2 and `bytes` to 1.0.1. - We haven't made a discv5 release since merging tokio 1.0 updates so I'm using a commit rather than release atm. Edit: I think we should merge an update of `tokio_util` to 0.6.2 into discv5 before this release because it has panic fixes in `DelayQueue` --> PR in discv5: https://github.com/sigp/discv5/pull/58 ## Additional Info tokio 1.0 changes that required some changes in lighthouse: - `interval.next().await.is_some()` -> `interval.tick().await` - `sleep` future is now `!Unpin` -> https://github.com/tokio-rs/tokio/issues/3028 - `try_recv` has been temporarily removed from `mpsc` -> https://github.com/tokio-rs/tokio/issues/3350 - stream features have moved to `tokio-stream` and `broadcast::Receiver::into_stream()` has been temporarily removed -> https://github.com/tokio-rs/tokio/issues/2870 - I've copied over the `BroadcastStream` wrapper from this PR, but can update to use `tokio-stream` once it's merged https://github.com/tokio-rs/tokio/pull/3384 Co-authored-by: realbigsean <seananderson33@gmail.com>	2021-02-10 23:29:49 +00:00
Paul Hauner	e383ef3e91	Avoid temp allocations with slog (#2183 ) ## Issue Addressed Which issue # does this PR address? ## Proposed Changes Replaces use of `format!` in `slog` logging with its special no-allocation `?` and `%` shortcuts. According to a `heaptrack` analysis today over a period of about an hour, this will reduce temporary allocations by at least 4%. ## Additional Info NA	2021-02-04 07:31:47 +00:00
Paul Hauner	ff35fbb121	Add metrics for beacon block propagation (#2173 ) ## Issue Addressed NA ## Proposed Changes Adds some metrics to track delays regarding: - LH processing of blocks - delays receiving blocks from other nodes. ## Additional Info NA	2021-02-04 05:33:56 +00:00
Akihito Nakano	1a22a096c6	Fix clippy errors on tests (#2160 ) ## Issue Addressed There are some clippy errors on tests. ## Proposed Changes Enable clippy check on tests and fix the errors. 💪	2021-01-28 23:31:06 +00:00
Paul Hauner	e4b62139d7	v1.1.0 (#2168 ) ## Issue Addressed NA ## Proposed Changes - Bump version - ~~Run `cargo update`~~ ## Additional Info NA	2021-01-21 02:37:08 +00:00
Paul Hauner	2b2a358522	Detailed validator monitoring (#2151 ) ## Issue Addressed - Resolves #2064 ## Proposed Changes Adds a `ValidatorMonitor` struct which provides additional logging and Grafana metrics for specific validators. Use `lighthouse bn --validator-monitor` to automatically enable monitoring for any validator that hits the [subnet subscription](https://ethereum.github.io/eth2.0-APIs/#/Validator/prepareBeaconCommitteeSubnet) HTTP API endpoint. Also, use `lighthouse bn --validator-monitor-pubkeys` to supply a list of validators which will always be monitored. See the new docs included in this PR for more info. ## TODO - [x] Track validator balance, `slashed` status, etc. - [x] ~~Register slashings in current epoch, not offense epoch~~ - [ ] Publish Grafana dashboard, update TODO link in docs - [x] ~~#2130 is merged into this branch, resolve that~~	2021-01-20 19:19:38 +00:00
Paul Hauner	1eb0915301	Fix bug from #2163 (#2165 ) ## Issue Addressed NA ## Proposed Changes Fixes a bug that I missed during a review in #2163. I found this bug by observing that nodes were receiving far fewer attestations (~1/2 of previous). I'm not certain on exactly how this mistake manifested in a reduction in attestations, but the mistake touches so much code that I think it's reasonable to declare that this is the cause of the observed issue (drop in attestations). ## Additional Info NA	2021-01-20 10:28:12 +00:00
Paul Hauner	b06559ae97	Disallow attestation production earlier than head (#2130 ) ## Issue Addressed The non-finality period on Pyrmont between epochs [`9114`](https://pyrmont.beaconcha.in/epoch/9114) and [`9182`](https://pyrmont.beaconcha.in/epoch/9182) was contributed to by all the `lighthouse_team` validators going down. The nodes saw excessive CPU and RAM usage, causing the system to kill the `lighthouse bn` process. The `Restart=on-failure` directive for `systemd` caused the process to bounce in ~10-30m intervals. Diagnosis with `heaptrack` showed that the `BeaconChain::produce_unaggregated_attestation` function was calling `store::beacon_state::get_full_state` and sometimes resulting in a tree hash cache allocation. These allocations were approximately the size of the host's physical memory and still allocated when `lighthouse bn` was killed by the OS. There was no CPU analysis (e.g., `perf`), but the `BeaconChain::produce_unaggregated_attestation` is very CPU-heavy so it is reasonable to assume it is the cause of the excessive CPU usage, too. ## Proposed Changes `BeaconChain::produce_unaggregated_attestation` has two paths: 1. Fast path: attesting to the head slot or later. 2. Slow path: attesting to a slot earlier than the head block. Path (2) is the only path that calls `store::beacon_state::get_full_state`, therefore it is the path causing this excessive CPU/RAM usage. This PR removes the current functionality of path (2) and replaces it with a static error (`BeaconChainError::AttestingPriorToHead`). This change reduces the generality of `BeaconChain::produce_unaggregated_attestation` (and therefore [`/eth/v1/validator/attestation_data`](https://ethereum.github.io/eth2.0-APIs/#/Validator/produceAttestationData)), but I argue that this functionality is an edge-case and arguably a violation of the [Honest Validator spec](https://github.com/ethereum/eth2.0-specs/blob/dev/specs/phase0/validator.md). 
It's possible that a validator goes back to a prior slot to "catch up" and submit some missed attestations. This change would prevent such behaviour, returning an error. My concerns with this catch-up behaviour are that it: - Is not specified as "honest validator" attesting behaviour. - Is risky for slashing (although all validator clients should have slashing protection and will eventually fail if they do not). - Disguises clock-sync issues between a BN and VC. ## Additional Info It's likely feasible to implement path (2) if we implement some sort of caching mechanism. This would be a multi-week task and this PR gets the issue patched in the short term. I haven't created an issue to add path (2); instead I think we should implement it if we get user demand.	2021-01-20 06:52:37 +00:00
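The fast-path/slow-path split described in #2130 can be sketched roughly as follows. This is only an illustration under assumptions: the enum name `AttestationError`, its fields, and the function `attestation_slot_check` are hypothetical, and the real `BeaconChainError::AttestingPriorToHead` may carry different data.

```rust
/// Hypothetical error type mirroring the static error named in the commit message.
#[derive(Debug, PartialEq, Eq)]
enum AttestationError {
    /// The requested slot is earlier than the head block's slot (the former slow path).
    AttestingPriorToHead { head_slot: u64, request_slot: u64 },
}

/// Accept requests for the head slot or later (fast path) and reject
/// anything earlier instead of loading a full historical state.
fn attestation_slot_check(head_slot: u64, request_slot: u64) -> Result<u64, AttestationError> {
    if request_slot >= head_slot {
        Ok(request_slot)
    } else {
        Err(AttestationError::AttestingPriorToHead {
            head_slot,
            request_slot,
        })
    }
}
```

Returning a cheap error keeps the expensive state load off the request path entirely, at the cost of the "catch up" behaviour the message discusses.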
Paul Hauner	d9f940613f	Represent slots in secs instead of millisecs (#2163 ) ## Issue Addressed NA ## Proposed Changes Copied from #2083, changes the config milliseconds_per_slot to seconds_per_slot to avoid errors when slot duration is not a multiple of a second. To avoid deserializing old serialized data (with milliseconds instead of seconds) the Serialize and Deserialize derive got removed from the Spec struct (isn't currently used anyway). This PR replaces #2083 for the purpose of fixing a merge conflict without requiring the input of @blacktemplar. ## Additional Info NA Co-authored-by: blacktemplar <blacktemplar@a1.net>	2021-01-19 09:39:51 +00:00
Paul Hauner	805e152f66	Simplify enum -> str with strum (#2164 ) ## Issue Addressed NA ## Proposed Changes As per #2100, uses derives from the strum library to implement AsRef<str> and AsStaticRef to easily get str values from enums without creating new Strings. Furthermore unifies all attestation error counters into one IntCounterVec vector. These works are originally by @blacktemplar, I've just created this PR so I can resolve some merge conflicts. ## Additional Info NA Co-authored-by: blacktemplar <blacktemplar@a1.net>	2021-01-19 06:33:58 +00:00
realbigsean	7a71977987	Clippy 1.49.0 updates and dht persistence test fix (#2156 ) ## Issue Addressed `test_dht_persistence` failing ## Proposed Changes Bind `NetworkService::start` to an underscore-prefixed variable rather than `_`. `_` was causing it to be dropped immediately. This was failing 5/100 times before this update, but I haven't been able to get it to fail after updating it. Co-authored-by: realbigsean <seananderson33@gmail.com>	2021-01-19 00:34:28 +00:00
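The difference between `_` and an underscore-prefixed name mentioned in #2156 is a general Rust rule, and a minimal sketch shows it. The `Service` type and `start` function here are hypothetical stand-ins, not Lighthouse's `NetworkService`.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Hypothetical stand-in for a handle that shuts something down when dropped.
struct Service {
    name: &'static str,
    log: Rc<RefCell<Vec<String>>>,
}

impl Drop for Service {
    fn drop(&mut self) {
        self.log.borrow_mut().push(format!("drop {}", self.name));
    }
}

fn start(name: &'static str, log: &Rc<RefCell<Vec<String>>>) -> Service {
    Service { name, log: Rc::clone(log) }
}

/// Records when each value is dropped relative to the following statement.
fn drop_order() -> Vec<String> {
    let log = Rc::new(RefCell::new(Vec::new()));
    {
        // `_` binds nothing, so the returned value is dropped at the end of this statement.
        let _ = start("wildcard", &log);
        log.borrow_mut().push("after wildcard".to_string());

        // `_service` is a real binding, so the value lives until the scope ends.
        let _service = start("named", &log);
        log.borrow_mut().push("after named".to_string());
    }
    let events = log.borrow().clone();
    events
}
```

In this sketch the wildcard value is dropped before the next statement runs, while the named one survives to the end of the block, which matches the behaviour the commit message describes.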
Pawan Dhananjay	28238d97b1	Disconnect from peers quicker on internet issues (#2147 ) ## Issue Addressed Fixes #2146 ## Proposed Changes Change ping timeout errors to return `LowToleranceErrors` so that we disconnect faster on internet failures/changes.	2021-01-13 08:09:10 +00:00
realbigsean	423dea169c	update smallvec (#2152 ) ## Issue Addressed `cargo audit` is failing because of a potential for an overflow in the version of `smallvec` we're using ## Proposed Changes Update to the latest version of `smallvec`, which has the fix Co-authored-by: realbigsean <seananderson33@gmail.com>	2021-01-11 23:32:11 +00:00
Arthur Woimbée	851a4dca3c	replace tempdir by tempfile (#2143 ) ## Issue Addressed Fixes #2141 Remove [tempdir](https://docs.rs/tempdir/0.3.7/tempdir/) in favor of [tempfile](https://docs.rs/tempfile/3.1.0/tempfile/). ## Proposed Changes `tempfile` has a slightly different api that makes creating temp folders with a name prefix a chore (`tempdir::TempDir::new("toto")` => `tempfile::Builder::new().prefix("toto").tempdir()`). So I removed temp folder name prefix where I deemed it not useful. Otherwise, the functionality is the same.	2021-01-06 06:36:11 +00:00
Age Manning	7e4b190df0	Reduce ping interval (#2132 ) ## Issue Addressed #2123 ## Description Reduces the TCP ping interval to increase our responsiveness to peer liveness changes.	2021-01-06 04:35:52 +00:00
realbigsean	588b90157d	Ssz state api endpoint (#2111 ) ## Issue Addressed Catching up to a recently merged API spec PR: https://github.com/ethereum/eth2.0-APIs/pull/119 ## Proposed Changes - Return an SSZ beacon state on `/eth/v1/debug/beacon/states/{stateId}` when passed this header: `accept: application/octet-stream`. - Requests to this endpoint with no `accept` header, or with an `accept` header whose value is `application/json` or `/`, will result in a JSON response ## Additional Info Co-authored-by: realbigsean <seananderson33@gmail.com>	2021-01-06 03:01:46 +00:00
Samuel E. Moelius	939fa717fd	`test_decode_malicious_status_message` improvements (#2104 ) ## Issue Addressed None ## Proposed Changes * Correct typo in one comment, elaborate some others. * Add asserts to ensure comments match code. * Eliminate one unnecessary `clone`. ## Additional Info None	2021-01-06 01:10:26 +00:00
Samuel E. Moelius	0245ddd37b	Fix typo in `ssz_snappy.rs` comment (#2103 ) ## Issue Addressed None ## Proposed Changes Correct a typo in `ssz_snappy.rs`. ## Additional Info Pedantry at its finest.	2021-01-06 01:10:24 +00:00
Paul Hauner	f183af20e3	Version v1.0.6 (#2126 ) ## Issue Addressed NA ## Proposed Changes - Bump versions - Run `cargo update` ## Additional Info NA	2020-12-28 23:38:02 +00:00
Akihito Nakano	78d17c3255	Tweak error messages for ease of investigation (#2122 ) ## Proposed Changes <!-- Please list or describe the changes introduced by this PR. --> Tweaked the error message for ease of investigation as `Failed to update eth1 cache` is used in multiple places. 😃	2020-12-28 01:25:33 +00:00
Paul Hauner	9ed65a64f8	Version v1.0.5 (#2117 ) ## Issue Addressed NA ## Proposed Changes - Bump versions to `v1.0.5` - Run `cargo update` ## Additional Info NA	2020-12-23 18:52:48 +00:00
Age Manning	2931b05582	Update libp2p (#2101 ) This is a little bit of a tip-of-the-iceberg PR. It houses a lot of code changes in the libp2p dependency. This needs a bit of thorough testing before merging. The primary code changes are: - General libp2p dependency update - Gossipsub refactor to shift compression into gossipsub providing performance improvements and improved API for handling compression Co-authored-by: Paul Hauner <paul@paulhauner.com>	2020-12-23 07:53:36 +00:00
Samuel E. Moelius	3381266998	Eliminate uses of `expect` in `ssz_snappy.rs` (#2105 ) ## Issue Addressed None ## Proposed Changes Eliminate three uses of `expect` in `ssz_snappy.rs`. ## Additional Info None	2020-12-22 02:28:37 +00:00
Michael Sproul	e5bf2576f1	Optimise tree hash caching for block production (#2106 ) ## Proposed Changes `@potuz` on the Eth R&D Discord observed that Lighthouse blocks on Pyrmont were always arriving at other nodes after at least 1 second. Part of this could be due to processing and slow propagation, but metrics also revealed that the Lighthouse nodes were usually taking 400-600ms to even just produce a block before broadcasting it. I tracked the slowness down to the lack of a pre-built tree hash cache (THC) on the states being used for block production. This was due to using the head state for block production, which lacks a THC in order to keep fork choice fast (cloning a THC takes at least 30ms for 100k validators). This PR modifies block production to clone a state from the snapshot cache rather than the head, which speeds things up by 200-400ms by avoiding the tree hash cache rebuild. In practice this seems to have cut block production time down to 300ms or less. Ideally we could _remove_ the snapshot from the cache (and save the 30ms), but it is required for when we re-process the block after signing it with the validator client. ## Alternatives I experimented with 2 alternatives to this approach, before deciding on it: * Alternative 1: ensure the `head` has a tree hash cache. This is too slow, as it imposes a +30ms hit on fork choice, which currently takes ~5ms (with occasional spikes). * Alternative 2: use `Arc<BeaconSnapshot>` in the snapshot cache and share snapshots between the cache and the `head`. This made fork choice blazing fast (1ms), and block production the same as in this PR, but had a negative impact on block processing which I don't think is worth it. It ended up being necessary to clone the full state from the snapshot cache during block production, imposing the +30ms penalty there _as well_ as in block production. In contrast, the approach in this PR should only impact block production, and it improves it! 
Yay for pareto improvements 🎉 ## Additional Info This commit (ac59dfa) is currently running on all the Lighthouse Pyrmont nodes, and I've added a dashboard to the Pyrmont grafana instance with the metrics. In future work we should optimise the attestation packing, which consumes around 30-60ms and is now a substantial contributor to the total.	2020-12-21 06:29:39 +00:00
Paul Hauner	a62dc65ca4	BN Fallback v2 (#2080 ) ## Issue Addressed - Resolves #1883 ## Proposed Changes This follows on from @blacktemplar's work in #2018. - Allows the VC to connect to multiple BN for redundancy. - Update the simulator so some nodes always need to rely on their fallback. - Adds some extra deprecation warnings for `--eth1-endpoint` - Pass `SignatureBytes` as a reference instead of by value. ## Additional Info NA Co-authored-by: blacktemplar <blacktemplar@a1.net>	2020-12-18 09:17:03 +00:00
Pawan Dhananjay	f998eff7ce	Subnet discovery fixes (#2095 ) ## Issue Addressed N/A ## Proposed Changes Fixes multiple issues related to discovering of subnet peers. 1. Subnet discovery retries after yielding no results 2. Metadata updates if peer send older metadata 3. peerdb stores the peer subscriptions from gossipsub	2020-12-17 00:39:15 +00:00
blacktemplar	3fcc517993	Fix Syncing Simulator (#2049 ) ## Issue Addressed NA ## Proposed Changes Fixes problems with slot times below 1 second which got revealed by running the syncing simulator with the default speedup time.	2020-12-16 05:37:38 +00:00
Michael Sproul	0c529b8d52	Add slasher broadcast (#2079 ) ## Issue Addressed Closes #2048 ## Proposed Changes * Broadcast slashings when the `--slasher-broadcast` flag is provided. * In the process of implementing this I refactored the slasher service into its own crate so that it could access the network code without creating a circular dependency. I moved the responsibility for putting slashings into the op pool into the service as well, as it makes sense for it to handle the whole slashing lifecycle.	2020-12-16 03:44:01 +00:00
Pawan Dhananjay	63eeb14a81	Improve eth1 fallback logging (#2096 ) ## Issue Addressed N/A ## Proposed Changes There seemed to be confusion among discord users on the eth1 fallback logging ``` WARN Error connecting to eth1 node. Trying fallback ..., endpoint: http://127.0.0.1:8545/, service: eth1_rpc ``` The assumption users seem to be making here is that it is trying the fallback and fallback=endpoint in the log. This PR improves the logging to be like ``` WARN Error connecting to eth1 node endpoint, endpoint: http://127.0.0.1:8545/, action: trying fallbacks, service: eth1_rpc ``` I think this is a bit more clear that the endpoint that failed is the one in the log.	2020-12-16 02:39:09 +00:00
divma	11c299cbf6	impl Resource Unavailable RPC error (#2072 ) ## Issue Addressed Related to #1891, The error is not in the spec yet (see ethereum/eth2.0-specs#2131) ## Proposed Changes Implement the proposed error, banning peers that send it ## Additional Info NA	2020-12-15 00:17:32 +00:00
blacktemplar	701843aaa0	Update dependencies (#2084 ) ## Issue Addressed Partially addresses dependencies mentioned in issue #1712. ## Proposed Changes Updates dependencies (including an update avoiding a vulnerability) + add tokio compatibility to `remote_signer_test`	2020-12-14 02:28:19 +00:00
Michael Sproul	1abc70e815	Version v1.0.4 (#2073 ) ## Proposed Changes Run cargo update and bump version in prep for v1.0.4 release ## Additional Info Planning to merge this commit to `unstable`, test on Pyrmont and canary nodes, then push to `stable`.	2020-12-10 04:01:40 +00:00
Age Manning	dfb588e521	Softer penalties for missing blocks (#2075 ) ## Issue Addressed Users are reporting errors for sending attestations to peers. If the clock sync is a little out or we receive attestations before blocks, peers are being too harshly penalized. They can get scored many times per missing block and we typically need these peers on subnets. ## Proposed Changes This removes the penalization for missing blocks with attestations. The penalty should be handled when #635 gets built as it will allow us to group attestations per missing block and penalize once.	2020-12-10 00:40:12 +00:00
Michael Sproul	aa45fa3ff7	Revert fork choice if disk write fails (#2068 ) ## Issue Addressed Closes #2028 Replaces #2059 ## Proposed Changes If writing to the database fails while importing a block, revert fork choice to the last version stored on disk. This prevents fork choice from being ahead of the blocks on disk. Having fork choice ahead is particularly bad if it is later successfully written to disk, because it renders the database corrupt (see #2028). ## Additional Info * This mitigation might fail if the head+fork choice haven't been persisted yet, which can only happen at first startup (see #2067) * This relies on it being OK for the head tracker to be ahead of fork choice. I figure this is tolerable because blocks only get added to the head tracker after successfully being written on disk _and_ to fork choice, so even if fork choice reverts a little bit, when the pruning algorithm runs, those blocks will still be on disk and OK to prune. The pruning algorithm also doesn't rely on heads being unique, technically it's OK for multiple blocks from the same linear chain segment to be present in the head tracker. This begs the question of #1785 (i.e. things would be simpler with the head tracker out of the way). Alternatively, this PR could just revert the head tracker as well (I'll look into this tomorrow).	2020-12-09 05:10:34 +00:00
Michael Sproul	82753f842d	Improve compile time (#1989 ) ## Issue Addressed Closes #1264 ## Proposed Changes * Milagro BLS: tweak the feature flags so that Milagro doesn't get compiled if we're using BLST. Profiling showed that it was consuming about 1 minute of CPU time out of 60 minutes of CPU time (real time ~15 mins). A 1.6% saving. * Reduce monomorphization: compiling for 3 different `EthSpec` types causes a heck of a lot of generic functions to be instantiated (monomorphized). Removing 2 of 3 cuts the LLVM+linking step from around 250 seconds to 180 seconds, a saving of 70 seconds (real time!). This applies only to `make` and not the CI build, because we test with the minimal spec on CI. * Update `web3` crate to v0.13. This is perhaps the most controversial change, because it requires axing some deposit contract tools from `lcli`. I suspect these tools weren't used much anyway, and could be maintained separately, but I'm also happy to revert this change. However, it does save us a lot of compile time. With #1839, we now have 3 versions of Tokio (and all of Tokio's deps). This change brings us down to 2 versions, but 1 should be achievable once web3 (and reqwest) move to Tokio 0.3. * Remove `lcli` from the Docker image. It's a dev tool and can be built from the repo if required.	2020-12-09 01:34:58 +00:00
Age Manning	4f85371ce8	Downgrades a valid log (#2057 ) ## Issue Addressed #2046 ## Proposed Changes The log was originally intended to verify the correct logic and ordering of events when scoring peers. The queued tasks can be structured in such a way that peers can be banned after they are disconnected. Therefore the error log is now downgraded to debug log.	2020-12-08 10:48:45 +00:00
divma	57489e620f	fix default network handling (#2029 ) ## Issue Addressed #1992 and #1987, and also to be considered a continuation of #1751 ## Proposed Changes many changed files but most are renaming to align the code with the semantics of `--network` - remove the `--network` default value (in clap) and instead set it after checking the `network` and `testnet-dir` flags - move `eth2_testnet_config` crate to `eth2_network_config` - move `Eth2TestnetConfig` to `Eth2NetworkConfig` - move `DEFAULT_HARDCODED_TESTNET` to `DEFAULT_HARDCODED_NETWORK` - `beacon_node`s `get_eth2_testnet_config` loads the `DEFAULT_HARDCODED_NETWORK` if there is no network nor testnet provided - `boot_node`s config loads the config same as the `beacon_node`, it was using the configuration only for preconfigured networks (That code is ~1 year old so I assume it was not intended) - removed a one year old comment stating we should try to emulate `https://github.com/eth2-clients/eth2-testnets/tree/master/nimbus/testnet1` it looks outdated (?) - remove `lighthouse`s `load_testnet_config` in favor of `get_eth2_network_config` to centralize that logic (It had differences) - some spelling ## Additional Info Both the command of #1992 and the scripts of #1987 seem to work fine, same as `bn` and `vc`	2020-12-08 05:41:10 +00:00
divma	f3200784b4	More metrics + RPC tweaks (#2041 ) ## Issue Addressed NA ## Proposed Changes This was mostly done to find the reason why LH was dropping peers from Nimbus. It proved to be useful so I think it's worth it. But there is also some functional stuff here - Add metrics for rpc errors per client, error type and direction - Add metrics for downscoring events per source type, client and penalty type - Add metrics for gossip validation results per client for non-accepted messages - Make the RPC handler return errors and requests/responses in the order we see them - Allow a small burst for the Ping rate limit, from 1 every 5 seconds to 2 every 10 seconds - Send rate limiting errors with a particular code and use that same code to identify them. I picked something different to 128 since that is most likely what other clients are using for their own errors - Remove some unused code in the `PeerAction` and the rpc handler - Remove the unused variant `RateLimited`. This was never produced directly, since the only way to get the request's protocol is via the handler. The handler upon receiving from LH a response with an error (rate limited in this case) emits this event with the missing info (It was always like this, just pointing out that we do downscore rate limiting errors regardless of the change) Metrics for Nimbus looked like this: Downscoring events: `increase(libp2p_peer_actions_per_client{client="Nimbus"}[5m])` ![image](https://user-images.githubusercontent.com/26765164/101210880-862bf280-3676-11eb-94c0-399f0bf5aa2e.png) RPC Errors: `increase(libp2p_rpc_errors_per_client{client="Nimbus"}[5m])` ![image](https://user-images.githubusercontent.com/26765164/101210997-ba071800-3676-11eb-847a-f32405ede002.png) Unaccepted gossip message: `increase(gossipsub_unaccepted_messages_per_client{client="Nimbus"}[5m])` ![image](https://user-images.githubusercontent.com/26765164/101211124-f470b500-3676-11eb-9459-132ecff058ec.png)	2020-12-08 03:55:50 +00:00
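To see why "2 every 10 seconds" permits a small burst while "1 every 5 seconds" does not, even though the long-run rate is the same, consider a generic token bucket. This is an illustrative sketch only: the `RateLimiter` type is hypothetical and is not claimed to be the algorithm Lighthouse's RPC rate limiter uses.

```rust
/// Generic token bucket (hypothetical; not Lighthouse's implementation).
struct RateLimiter {
    capacity: f64,      // maximum tokens, i.e. the allowed burst size
    refill_per_ms: f64, // tokens regained per millisecond
    tokens: f64,        // tokens currently available
    last_ms: u64,       // timestamp of the previous call
}

impl RateLimiter {
    /// Allow `max_requests` per `period_ms`, starting with a full bucket.
    fn new(max_requests: u32, period_ms: u64) -> Self {
        let capacity = f64::from(max_requests);
        Self {
            capacity,
            refill_per_ms: capacity / period_ms as f64,
            tokens: capacity,
            last_ms: 0,
        }
    }

    /// Returns true if a request arriving at `now_ms` is within the limit.
    fn allow(&mut self, now_ms: u64) -> bool {
        let elapsed = now_ms.saturating_sub(self.last_ms) as f64;
        self.tokens = (self.tokens + elapsed * self.refill_per_ms).min(self.capacity);
        self.last_ms = now_ms;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

With `RateLimiter::new(1, 5_000)` a second ping 100 ms after the first is rejected, whereas `RateLimiter::new(2, 10_000)` accepts it and rejects only a third.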
blacktemplar	a28e8decbf	update dependencies (#2032 ) ## Issue Addressed NA ## Proposed Changes Updates out of date dependencies. ## Additional Info See also https://github.com/sigp/lighthouse/issues/1712 for a list of dependencies that are still out of date and the reasons.	2020-12-07 08:20:33 +00:00
Michael Sproul	c1ec386d18	Pass failed gossip blocks to the slasher (#2047 ) ## Issue Addressed Closes #2042 ## Proposed Changes Pass blocks that fail gossip verification to the slasher. Blocks that are successfully verified are not passed immediately, but will be passed as part of full block verification.	2020-12-04 05:03:30 +00:00
Pawan Dhananjay	7933596c89	Add a purge-eth1-cache cli option (#2039 ) ## Issue Some eth1 clients are missing deposit logs on mainnet for multiple reasons (not fully synced, eth1 client issues) because of which we are getting `FailedToInsertDeposit` errors. Ideally, LH should pick up where it left off after pointing it to a nice eth1 client endpoint (which has all deposits). However, I have seen instances where LH keeps getting `FailedToInsertDeposit` even after switching to a good endpoint. Only deleting the beacon directory (which also wipes the eth1 cache) and resyncing the eth1 caches seems to be the solution. This wouldn't be great for mainnet if you have to sync your beacon node again as well. ## Proposed Changes Add a `--purge-eth1-db` option which just wipes the eth1 cache and doesn't touch the rest of the beacon db. Still need to investigate if and why LH isn't picking up where it left off for the deposit logs sync, but I think it would be good to have an option to just delete eth1 caches regardless.	2020-12-04 05:03:28 +00:00
realbigsean	fdfb81a74a	Server sent events (#1920 ) ## Issue Addressed Resolves #1434 (this is the last major feature in the standard spec. There are only a couple of places we may be off-spec due to recent spec changes or ongoing discussion) Partly addresses #1669 ## Proposed Changes - remove the websocket server - remove the `TeeEventHandler` and `NullEventHandler` - add server sent events according to the eth2 API spec ## Additional Info This is according to the currently unmerged PR here: https://github.com/ethereum/eth2.0-APIs/pull/117 Co-authored-by: realbigsean <seananderson33@gmail.com>	2020-12-04 00:18:58 +00:00
realbigsean	2b5c0df9e5	Validators endpoint status code (#2040 ) ## Issue Addressed Resolves #2035 ## Proposed Changes Update 405's to 400's for failures when we are parsing path params. ## Additional Info Haven't updated the same for non-standard endpoints Co-authored-by: realbigsean <seananderson33@gmail.com>	2020-12-03 23:10:08 +00:00
Age Manning	2682f46025	Fingerprint new client identify agent string (#2027 ) Nimbus have modified their identify agent string. This PR adds their new agent string to identify new nimbus peers.	2020-12-03 22:07:14 +00:00
Pawan Dhananjay	482695142a	Minor fixes (#2038 ) Fixes a couple of low hanging fruits. - Fixes #2037 - `validators-dir` and `secrets-dir` flags don't really need to depend upon each other - Fixes #2006 and Fixes #1995	2020-12-03 01:10:28 +00:00
blacktemplar	d8cda2d86e	Fix new clippy lints (#2036 ) ## Issue Addressed NA ## Proposed Changes Fixes new clippy lints in the whole project (mainly [manual_strip](https://rust-lang.github.io/rust-clippy/master/index.html#manual_strip) and [unnecessary_lazy_evaluations](https://rust-lang.github.io/rust-clippy/master/index.html#unnecessary_lazy_evaluations)). Furthermore, removes `to_string()` calls on literals when used with the `?`-operator.	2020-12-03 01:10:26 +00:00
Paul Hauner	b8bd80d2fb	Add Content-Type to metrics server (#2019 ) ## Issue Addressed - Resolves #2013 ## Proposed Changes Adds the `Content-Type text/plain` header as per #2013 ## Additional Info NA	2020-12-01 00:04:46 +00:00
Paul Hauner	65dcdc361b	Bump version to v1.0.3 (#2024 ) ## Issue Addressed NA ## Proposed Changes - Set version to `v1.0.3` - Run cargo update ## Additional Info - ~~Blocked on #2008~~	2020-11-30 22:55:10 +00:00
Age Manning	c718e81eaf	Add privacy option (#2016 ) Adds a `--privacy` CLI flag to the beacon node that users may opt into. This does two things: - Removes client identifying information from the identify libp2p protocol - Changes the default graffiti to "" if no graffiti is set.	2020-11-30 22:55:08 +00:00
Paul Hauner	77f3539654	Improve eth1 block sync (#2008 ) ## Issue Addressed NA ## Proposed Changes - Log about eth1 whilst waiting for genesis. - For the block and deposit caches, update them after each download instead of when all downloads are complete. - This prevents the case where a single timeout error can cause us to drop all previously downloaded blocks/deposits. - Set `max_log_requests_per_update` to avoid timeouts due to very large log counts in a response. - Set `max_blocks_per_update` to prevent a single update of the block cache from downloading an unreasonable number of blocks. - This shouldn't have any effect in normal use, it's just a safe-guard against bugs. - Increase the timeout for eth1 calls from 15s to 60s, as per @pawanjay176's experience with Infura. ## Additional Info NA	2020-11-30 20:29:17 +00:00
divma	8fcd22992c	No string in slog (#2017 ) ## Issue Addressed Following slog's documentation, this should help a bit with string allocations. I let it run for two days and mem usage is lower. This is of course anecdotal, but shouldn't harm anyway ## Proposed Changes remove `String` creation in logs when possible	2020-11-30 10:33:00 +00:00
Paul Hauner	85e69249e6	Drop discovery log to trace (#2007 ) ## Issue Addressed NA ## Proposed Changes This was causing: ``` Nov 28 21:56:08.154 ERRO slog-async: logger dropped messages due to channel overflow, count: 44, service: libp2p ``` ## Additional Info NA	2020-11-29 03:02:23 +00:00
Age Manning	f7183098ee	Bump to version v1.0.2 (#2001 ) Update lighthouse to version `v1.0.2`. There are two major updates in this version: - Updates to the task executor to tokio 0.3 and all sub-dependencies relying on core execution, including libp2p - Update BLST	2020-11-28 13:22:37 +00:00
Age Manning	a567f788bd	Upgrade to tokio 0.3 (#1839 ) ## Description This PR updates Lighthouse to tokio 0.3. It includes a number of dependency updates and some structural changes as to how we create and spawn tasks. This also brings with it a number of various improvements: - Discv5 update - Libp2p update - Fix for recompilation issues - Improved UPnP port mapping handling - Futures dependency update - Log downgrade to traces for rejecting peers when we've reached our max Co-authored-by: blacktemplar <blacktemplar@a1.net>	2020-11-28 05:30:57 +00:00
Paul Hauner	5a3b94cbb4	Update to v1.0.1, run cargo update	2020-11-27 21:16:59 +11:00
blacktemplar	38b15deccb	Fallback nodes for eth1 access (#1918 ) ## Issue Addressed part of #1883 ## Proposed Changes Adds a new cli argument `--eth1-endpoints` that can be used instead of `--eth1-endpoint` to specify a comma-separated list of endpoints. If the first endpoint returns an error for some request the other endpoints are tried in the given order. ## Additional Info Currently if the first endpoint fails the fallbacks are used silently (except for `try_fallback_test_endpoint` that is used in `do_update` which logs a `WARN` for each endpoint that is not reachable). A question is if we should add more logs so that the user gets warned if his main endpoint is for example just slow and sometimes hits timeouts.	2020-11-27 08:37:44 +00:00
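The "try the other endpoints in the given order" behaviour from #1918 can be sketched generically. The helper names `parse_endpoints` and `first_success` are hypothetical and are not taken from the Lighthouse codebase; the request itself is abstracted as a closure.

```rust
/// Split a comma-separated flag value into endpoints (hypothetical parsing).
fn parse_endpoints(arg: &str) -> Vec<&str> {
    arg.split(',')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .collect()
}

/// Try `request` against each endpoint in order and return the first success.
/// If every endpoint fails, return the errors in the order they occurred.
fn first_success<T, E>(
    endpoints: &[&str],
    mut request: impl FnMut(&str) -> Result<T, E>,
) -> Result<T, Vec<E>> {
    let mut errors = Vec::new();
    for &endpoint in endpoints {
        match request(endpoint) {
            Ok(value) => return Ok(value),
            // Remember the failure and fall through to the next endpoint.
            Err(e) => errors.push(e),
        }
    }
    Err(errors)
}
```

Collecting the per-endpoint errors rather than discarding them would make it straightforward to log a warning for each failed endpoint, which is the open question the commit message raises.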
Michael Sproul	1312844f29	Disable snappy in LevelDB to fix build issues (#1983 ) ## Proposed Changes A user on Discord reported build issues when trying to compile Lighthouse checked out to a path with spaces in it. I've fixed the issue upstream in `leveldb-sys` (https://github.com/skade/leveldb-sys/pull/22), but rather than waiting for a new release of the `leveldb` crate, we can also work around the issue by disabling Snappy in LevelDB, which we weren't using anyway. This may also have the side-effect of slightly improving compilation times, as LevelDB+Snappy was found to be a substantial contributor to build time (although I'm not sure how much was LevelDB and how much was Snappy).	2020-11-27 03:01:57 +00:00
Pawan Dhananjay	0589a14afe	Log better error message (#1981 ) ## Issue Addressed Fixes #1965 ## Proposed Changes Log an error and don't update eth1 caches if `chain_id = 0`	2020-11-26 23:13:46 +00:00
divma	fc07cc3fdf	Sync metrics (#1975 ) ## Issue Addressed - Add metrics to keep track of peer counts by sync type - Add metric to keep track of the number of syncing chains in range ## Proposed Changes Plug into the network metrics update interval and also update the counts for peers with respect to their sync status with us ## Additional Info For the peer counts - By the way it is implemented the numbers won't always match the total peer count in the `libp2p` metric. - Updating the gauge with every change is messy because it requires to be updated on connection (in the `eth2_libp2p` crate, while metrics are defined in the `network` crate) on Goodbye sent (for an `IrrelevantPeer`) either in the `beacon_processor` or the `peer_manager`, and on disconnection. Since this is not a critical metric I think counting once every second is enough. (If you think more accuracy is needed we can do it too, but it would be harder to maintain) ATM those look like this ![image](https://user-images.githubusercontent.com/26765164/100275387-22137b00-2f60-11eb-93b9-94b0f265240c.png)	2020-11-26 05:23:17 +00:00
Paul Hauner	26741944b1	Add metrics to VC (#1954 ) ## Issue Addressed NA ## Proposed Changes - Adds a HTTP server to the VC which provides Prometheus metrics. - Moves the health metrics into the `lighthouse_metrics` crate so it can be shared between BN/VC. - Sprinkle some metrics around the VC. - Update the book to indicate that we now have VC metrics. - Shifts the "waiting for genesis" logic later in the `ProductionValidatorClient::new_from_cli` - This is worth attention during the review. ## Additional Info - ~~`clippy` has some new lints that are failing. I'll deal with that in another PR.~~	2020-11-26 01:10:51 +00:00
divma	3b4afc27bf	Status race condition (#1967 ) ## Issue Addressed Sync stalls due to race conditions between dc notifications and status processing	2020-11-25 02:15:38 +00:00
Paul Hauner	c6baa0eed1	Bump to v1.0.0, run cargo update	2020-11-25 02:02:19 +11:00
Age Manning	a96893744c	Update bootnodes and boot_node cli (#1961 )	2020-11-25 02:01:37 +11:00
divma	6f890c398e	Sync Bug fixes (#1950 ) ## Issue Addressed Two issues related to empty batches - The chain's target was not being advanced when the batch was successful, empty and the chain didn't have an optimistic batch - Not switching finalized chains. We now switch finalized chains requiring a minimum work first	2020-11-24 02:11:31 +00:00
Paul Hauner	21617aa87f	Change --testnet flag to --network (#1751 ) ## Issue Addressed - Resolves #1689 ## Proposed Changes TBC ## Additional Info NA	2020-11-23 23:54:03 +00:00
Michael Sproul	7d644103c6	Tweak slasher DB schema and pruning (#1948 ) ## Issue Addressed Resolves #1890 ## Proposed Changes Change the slasher database schema to key indexed attestations by `(target_epoch, indexed_attestation_root)` instead of just `indexed_attestation_root`. This allows more straight-forward pruning (linear scan), that is also "re-entrant". By re-entrant, we mean that a pruning pass that gets stuck because of a `MapFull` error can attempt to commit midway, and be resumed later without issue. The previous pruning strategy for indexed attestations did not have this property. There was also a flaw in the previous pruning that could leave "zombie" indexed attestations in the database (ones not referenced by any attester record), which could build up and contribute to bloat (although in practice I think they occur quite infrequently). ## Additional Info During testing I noticed that a `MapFull` error can still occur during the commit of the transaction itself, which is irritating, but not unbearable. This PR should at least reduce the frequency with which users need to manually resize their DB, and if the `MapFull` on commit rears its ugly head too often we could use a dynamic strategy (temporarily increase the size of the map until the transaction commits). The extra bytes for the epoch make the database a bit heavier, so the size estimate docs have been updated to reflect this. This is also a breaking schema change, so anyone using a v0 database from a few hours ago will need to drop it and update 😅	2020-11-23 21:33:51 +00:00
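Why keying by `(target_epoch, indexed_attestation_root)` makes pruning a resumable linear scan can be illustrated with an ordered map. This sketch uses `std::collections::BTreeMap` as a stand-in for the slasher's on-disk database, and the `prune` function with its `max_deletes` budget is a hypothetical analogue of a pass that commits midway.

```rust
use std::collections::BTreeMap;

type Epoch = u64;
type Root = [u8; 32];

/// Delete up to `max_deletes` entries whose target epoch is below `min_epoch`.
/// Returns the number deleted; calling it again simply resumes the scan.
fn prune(
    db: &mut BTreeMap<(Epoch, Root), Vec<u8>>,
    min_epoch: Epoch,
    max_deletes: usize,
) -> usize {
    // Keys sort by epoch first, so every stale entry sits in one contiguous
    // range at the front of the map, ending just before (min_epoch, all-zero root).
    let stale: Vec<(Epoch, Root)> = db
        .range(..(min_epoch, [0u8; 32]))
        .take(max_deletes)
        .map(|(key, _)| *key)
        .collect();
    for key in &stale {
        db.remove(key);
    }
    stale.len()
}
```

Because each call removes a prefix of the stale range, an interrupted pass leaves the map in a state from which the next call can carry on, which is the "re-entrant" property the message describes.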

1692 Commits