## Proposed Changes
`@potuz` on the Eth R&D Discord observed that Lighthouse blocks on Pyrmont were always arriving at other nodes after at least 1 second. Part of this could be due to processing and slow propagation, but metrics also revealed that the Lighthouse nodes were usually taking 400-600ms to even just produce a block before broadcasting it.
I tracked the slowness down to the lack of a pre-built tree hash cache (THC) on the states being used for block production. This was due to using the head state for block production, which lacks a THC in order to keep fork choice fast (cloning a THC takes at least 30ms for 100k validators). This PR modifies block production to clone a state from the snapshot cache rather than the head, which speeds things up by 200-400ms by avoiding the tree hash cache rebuild. In practice this seems to have cut block production time down to 300ms or less. Ideally we could _remove_ the snapshot from the cache (and save the 30ms), but it is required for when we re-process the block after signing it with the validator client.
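As a rough sketch of the idea (the types, fields and method name below are simplified stand-ins, not Lighthouse's actual API):

```rust
/// Minimal sketch only: prefer a snapshot-cache state (which carries a pre-built
/// tree hash cache) over the head state when producing a block.
#[derive(Clone)]
struct BeaconState; // stand-in: imagine this optionally carries a tree hash cache

struct BeaconChain {
    /// Snapshot cache keyed by block root; these states keep their tree hash caches.
    snapshot_cache: Vec<([u8; 32], BeaconState)>,
    /// Head state, kept without a tree hash cache so fork choice stays fast.
    head_state: BeaconState,
    head_block_root: [u8; 32],
}

impl BeaconChain {
    fn state_for_block_production(&self) -> BeaconState {
        self.snapshot_cache
            .iter()
            .find(|(root, _)| *root == self.head_block_root)
            // Cloning keeps the cache and avoids the 200-400ms rebuild.
            .map(|(_, state)| state.clone())
            // Fall back to the head state; the tree hash cache must then be rebuilt.
            .unwrap_or_else(|| self.head_state.clone())
    }
}
```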
## Alternatives
I experimented with 2 alternatives to this approach, before deciding on it:
* Alternative 1: ensure the `head` has a tree hash cache. This is too slow, as it imposes a +30ms hit on fork choice, which currently takes ~5ms (with occasional spikes).
* Alternative 2: use `Arc<BeaconSnapshot>` in the snapshot cache and share snapshots between the cache and the `head`. This made fork choice blazing fast (1ms), and block production the same as in this PR, but had a negative impact on block processing which I don't think is worth it. It ended up being necessary to clone the full state from the snapshot cache during block processing, imposing the +30ms penalty there _as well_ as in block production.
In contrast, the approach in this PR should only impact block production, and it improves it! Yay for Pareto improvements 🎉
## Additional Info
This commit (ac59dfa) is currently running on all the Lighthouse Pyrmont nodes, and I've added a dashboard to the Pyrmont grafana instance with the metrics.
In future work we should optimise the attestation packing, which consumes around 30-60ms and is now a substantial contributor to the total.
## Issue Addressed
- Resolves #1883
## Proposed Changes
This follows on from @blacktemplar's work in #2018.
- Allows the VC to connect to multiple BNs for redundancy.
- Updates the simulator so some nodes always need to rely on their fallback.
- Adds some extra deprecation warnings for `--eth1-endpoint`.
- Passes `SignatureBytes` as a reference instead of by value.
## Additional Info
NA
Co-authored-by: blacktemplar <blacktemplar@a1.net>
## Issue Addressed
N/A
## Proposed Changes
Fixes multiple issues related to the discovery of subnet peers.
1. Subnet discovery retries after yielding no results.
2. Metadata updates if a peer sends older metadata.
3. The PeerDB stores peer subscriptions from gossipsub.
## Issue Addressed
NA
## Proposed Changes
Fixes problems with slot times below 1 second, which were revealed by running the syncing simulator with the default speedup time.
## Issue Addressed
Closes #2048
## Proposed Changes
* Broadcast slashings when the `--slasher-broadcast` flag is provided.
* In the process of implementing this I refactored the slasher service into its own crate so that it could access the network code without creating a circular dependency. I moved the responsibility for putting slashings into the op pool into the service as well, as it makes sense for it to handle the whole slashing lifecycle.
## Issue Addressed
N/A
## Proposed Changes
There seemed to be confusion among Discord users about the eth1 fallback logging:
```
WARN Error connecting to eth1 node. Trying fallback ..., endpoint: http://127.0.0.1:8545/, service: eth1_rpc
```
The assumption users seem to be making here is that the node is trying the fallback, and that the `endpoint` in the log is that fallback.
This PR improves the logging to be like
```
WARN Error connecting to eth1 node endpoint, endpoint: http://127.0.0.1:8545/, action: trying fallbacks, service: eth1_rpc
```
I think this makes it a bit clearer that the endpoint in the log is the one that failed.
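For illustration, the new message could be emitted with slog roughly like this (the logger handle and helper function are assumptions, not the actual call site):

```rust
// Sketch only: the `endpoint` field names the endpoint that just failed, and the
// fallback attempt is stated explicitly in `action` rather than implied by the message.
use slog::{warn, Logger};

fn log_eth1_endpoint_failure(log: &Logger, endpoint: &str) {
    warn!(
        log,
        "Error connecting to eth1 node endpoint";
        "endpoint" => endpoint,
        "action" => "trying fallbacks"
    );
}
```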
## Issue Addressed
Related to #1891. The error is not in the spec yet (see ethereum/eth2.0-specs#2131).
## Proposed Changes
Implement the proposed error, banning peers that send it
## Additional Info
NA
## Proposed Changes
Run cargo update and bump version in prep for v1.0.4 release
## Additional Info
Planning to merge this commit to `unstable`, test on Pyrmont and canary nodes, then push to `stable`.
## Issue Addressed
Users are reporting errors for sending attestations to peers. If clock sync is slightly off, or we receive attestations before their blocks, peers are penalized too harshly. They can get scored many times per missing block and we typically need these peers on subnets.
## Proposed Changes
This removes the penalization for missing blocks with attestations. The penalty should be handled when #635 gets built as it will allow us to group attestations per missing block and penalize once.
## Issue Addressed
Closes #2028
Replaces #2059
## Proposed Changes
If writing to the database fails while importing a block, revert fork choice to the last version stored on disk. This prevents fork choice from being ahead of the blocks on disk. Having fork choice ahead is particularly bad if it is later successfully written to disk, because it renders the database corrupt (see #2028).
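A minimal sketch of the recovery path, using simplified stand-in types (`ForkChoice`, `Store` and the error type are not Lighthouse's real types, and the real code is more involved):

```rust
/// Sketch only; the last persisted fork choice is kept available so it can be restored.
#[derive(Clone)]
struct ForkChoice;

struct Store {
    /// The last fork choice successfully persisted to disk.
    persisted_fork_choice: ForkChoice,
}

impl Store {
    fn put_block(&mut self) -> Result<(), String> {
        Err("database write failed".into()) // placeholder failure
    }
}

fn import_block(store: &mut Store, fork_choice: &mut ForkChoice) -> Result<(), String> {
    if let Err(e) = store.put_block() {
        // The block never reached disk, so roll the in-memory fork choice back to the
        // last persisted version; the database can then never lag behind fork choice.
        *fork_choice = store.persisted_fork_choice.clone();
        return Err(e);
    }
    Ok(())
}
```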
## Additional Info
* This mitigation might fail if the head+fork choice haven't been persisted yet, which can only happen at first startup (see #2067)
* This relies on it being OK for the head tracker to be ahead of fork choice. I figure this is tolerable because blocks only get added to the head tracker after successfully being written to disk _and_ to fork choice, so even if fork choice reverts a little bit, when the pruning algorithm runs, those blocks will still be on disk and OK to prune. The pruning algorithm also doesn't rely on heads being unique; technically it's OK for multiple blocks from the same linear chain segment to be present in the head tracker. This raises the question of #1785 (i.e. things would be simpler with the head tracker out of the way). Alternatively, this PR could just revert the head tracker as well (I'll look into this tomorrow).
## Issue Addressed
Closes #1264
## Proposed Changes
* Milagro BLS: tweak the feature flags so that Milagro doesn't get compiled if we're using BLST. Profiling showed that it was consuming about 1 minute of CPU time out of 60 minutes of CPU time (real time ~15 mins). A 1.6% saving.
* Reduce monomorphization: compiling for 3 different `EthSpec` types causes a heck of a lot of generic functions to be instantiated (monomorphized). Removing 2 of 3 cuts the LLVM+linking step from around 250 seconds to 180 seconds, a saving of 70 seconds (real time!). This applies only to `make` and not the CI build, because we test with the minimal spec on CI.
* Update `web3` crate to v0.13. This is perhaps the most controversial change, because it requires axing some deposit contract tools from `lcli`. I suspect these tools weren't used much anyway, and could be maintained separately, but I'm also happy to revert this change. However, it does save us a lot of compile time. With #1839, we now have 3 versions of Tokio (and all of Tokio's deps). This change brings us down to 2 versions, but 1 should be achievable once web3 (and reqwest) move to Tokio 0.3.
* Remove `lcli` from the Docker image. It's a dev tool and can be built from the repo if required.
## Issue Addressed
#2046
## Proposed Changes
The log was originally intended to verify the correct logic and ordering of events when scoring peers. The queued tasks can be structured in such a way that peers can be banned after they are disconnected. Therefore the error log is now downgraded to a debug log.
## Issue Addressed
#1992 and #1987, and also to be considered a continuation of #1751
## Proposed Changes
Many files are changed, but most changes are renames to align the code with the semantics of `--network`:
- remove the `--network` default value (in clap) and instead set it after checking the `network` and `testnet-dir` flags
- move `eth2_testnet_config` crate to `eth2_network_config`
- move `Eth2TestnetConfig` to `Eth2NetworkConfig`
- move `DEFAULT_HARDCODED_TESTNET` to `DEFAULT_HARDCODED_NETWORK`
- `beacon_node`'s `get_eth2_testnet_config` loads the `DEFAULT_HARDCODED_NETWORK` if neither a network nor a testnet dir is provided
- `boot_node`'s config loads the config the same way as the `beacon_node`; it was previously using the configuration only for preconfigured networks (that code is ~1 year old so I assume this was not intended)
- removed a one-year-old comment stating we should try to emulate `https://github.com/eth2-clients/eth2-testnets/tree/master/nimbus/testnet1`; it looks outdated (?)
- remove `lighthouse`'s `load_testnet_config` in favor of `get_eth2_network_config` to centralize that logic (it had differences)
- some spelling fixes
## Additional Info
Both the command of #1992 and the scripts of #1987 seem to work fine, same as `bn` and `vc`
## Issue Addressed
NA
## Proposed Changes
This was mostly done to find the reason why LH was dropping peers from Nimbus. It proved to be useful so I think it's worth it. But there is also some functional stuff here:
- Add metrics for rpc errors per client, error type and direction
- Add metrics for downscoring events per source type, client and penalty type
- Add metrics for gossip validation results per client for non-accepted messages
- Make the RPC handler return errors and requests/responses in the order we see them
- Allow a small burst for the Ping rate limit, from 1 every 5 seconds to 2 every 10 seconds
- Send rate limiting errors with a particular code and use that same code to identify them. I picked something different to 128 since that is most likely what other clients are using for their own errors
- Remove some unused code in the `PeerAction` and the rpc handler
- Remove the unused variant `RateLimited`. This was never produced directly, since the only way to get the request's protocol is via the handler. The handler, upon receiving from LH a response with an error (rate limited in this case), emits this event with the missing info (it was always like this, just pointing out that we do downscore rate limiting errors regardless of this change)
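As a rough illustration of the kind of metric added (this uses the `prometheus` crate directly; Lighthouse wraps this in its own metrics helpers, and the exact label set is an assumption):

```rust
use prometheus::{IntCounterVec, Opts, Registry};

/// Register a per-client RPC error counter, labelled by client, error type and direction.
fn register_rpc_error_metric(registry: &Registry) -> prometheus::Result<IntCounterVec> {
    let errors = IntCounterVec::new(
        Opts::new(
            "libp2p_rpc_errors_per_client",
            "RPC errors observed, by client, error type and direction",
        ),
        &["client", "error", "direction"],
    )?;
    registry.register(Box::new(errors.clone()))?;
    Ok(errors)
}

// Usage sketch: bump the counter whenever an RPC error is observed for a peer, e.g.
// errors.with_label_values(&["Nimbus", "stream_timeout", "outbound"]).inc();
```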
Metrics for Nimbus looked like this:
Downscoring events: `increase(libp2p_peer_actions_per_client{client="Nimbus"}[5m])`
![image](https://user-images.githubusercontent.com/26765164/101210880-862bf280-3676-11eb-94c0-399f0bf5aa2e.png)
RPC Errors: `increase(libp2p_rpc_errors_per_client{client="Nimbus"}[5m])`
![image](https://user-images.githubusercontent.com/26765164/101210997-ba071800-3676-11eb-847a-f32405ede002.png)
Unaccepted gossip message: `increase(gossipsub_unaccepted_messages_per_client{client="Nimbus"}[5m])`
![image](https://user-images.githubusercontent.com/26765164/101211124-f470b500-3676-11eb-9459-132ecff058ec.png)
## Issue Addressed
NA
## Proposed Changes
Updates out of date dependencies.
## Additional Info
See also https://github.com/sigp/lighthouse/issues/1712 for a list of dependencies that are still out of date and the reasons.
## Issue Addressed
Closes #2042
## Proposed Changes
Pass blocks that fail gossip verification to the slasher. Blocks that are successfully verified are not passed immediately, but will be passed as part of full block verification.
## Issue
Some eth1 clients are missing deposit logs on mainnet for multiple reasons (not fully synced, eth1 client issues), which causes us to get `FailedToInsertDeposit` errors.
Ideally, LH should pick up where it left off after pointing it to a nice eth1 client endpoint (which has all deposits).
However, I have seen instances where LH keeps getting `FailedToInsertDeposit` even after switching to a good endpoint. Only deleting the beacon directory (which also wipes the eth1 cache) and resyncing the eth1 caches seems to be the solution. This wouldn't be great for mainnet if you have to sync your beacon node again as well.
## Proposed Changes
Add a `--purge-eth1-db` option which just wipes the eth1 cache and doesn't touch the rest of the beacon db.
Still need to investigate if and why LH isn't picking up where it left off for the deposit logs sync, but I think it would be good to have an option to just delete eth1 caches regardless.
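A rough sketch of the flag and the purge step (assuming clap 2.x style flags and an illustrative cache file name; the real wiring in Lighthouse's config code will differ):

```rust
use clap::{App, Arg, ArgMatches};
use std::fs;
use std::io;
use std::path::Path;

fn cli<'a, 'b>() -> App<'a, 'b> {
    App::new("beacon_node").arg(
        Arg::with_name("purge-eth1-db")
            .long("purge-eth1-db")
            .help("Purges the eth1 block and deposit caches, leaving the rest of the beacon DB intact"),
    )
}

fn maybe_purge_eth1_cache(matches: &ArgMatches, beacon_dir: &Path) -> io::Result<()> {
    if matches.is_present("purge-eth1-db") {
        // Illustrative file name; only the eth1 cache is removed, the beacon DB is untouched.
        let eth1_cache = beacon_dir.join("eth1_cache.ssz");
        if eth1_cache.exists() {
            fs::remove_file(&eth1_cache)?;
        }
    }
    Ok(())
}
```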
## Issue Addressed
Resolves #1434 (this is the last major feature in the standard spec. There are only a couple of places we may be off-spec due to recent spec changes or ongoing discussion)
Partly addresses #1669
## Proposed Changes
- remove the websocket server
- remove the `TeeEventHandler` and `NullEventHandler`
- add server-sent events according to the eth2 API spec
## Additional Info
This is according to the currently unmerged PR here: https://github.com/ethereum/eth2.0-APIs/pull/117
Co-authored-by: realbigsean <seananderson33@gmail.com>
## Issue Addressed
Resolves #2035
## Proposed Changes
Update 405s to 400s for failures when we are parsing path params.
## Additional Info
Haven't updated the same for non-standard endpoints
Co-authored-by: realbigsean <seananderson33@gmail.com>
Fixes a couple of low-hanging fruit:
- Fixes #2037
- `validators-dir` and `secrets-dir` flags don't really need to depend upon each other
- Fixes #2006 and fixes #1995
Adds a `--privacy` CLI flag to the beacon node that users may opt into.
This does two things:
- Removes client identifying information from the identify libp2p protocol
- Changes the default graffiti to "" if no graffiti is set.
## Issue Addressed
NA
## Proposed Changes
- Log about eth1 whilst waiting for genesis.
- For the block and deposit caches, update them after each download instead of when *all* downloads are complete.
- This prevents the case where a single timeout error can cause us to drop *all* previously downloaded blocks/deposits.
- Set `max_log_requests_per_update` to avoid timeouts due to very large log counts in a response (sketched below).
- Set `max_blocks_per_update` to prevent a single update of the block cache from downloading an unreasonable number of blocks.
- This shouldn't have any effect in normal use, it's just a safe-guard against bugs.
- Increase the timeout for eth1 calls from 15s to 60s, as per @pawanjay176's experience with Infura.
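A minimal sketch of what these safety limits might look like as config fields (the field names follow the flags above, but the struct, default values and units are assumptions, not Lighthouse's exact config):

```rust
use std::time::Duration;

/// Sketch only: the exact struct and defaults in Lighthouse may differ.
struct Eth1Config {
    /// Cap on the number of logs requested in a single update, to avoid timeouts on
    /// responses with very large log counts.
    max_log_requests_per_update: Option<u64>,
    /// Cap on blocks downloaded in a single block-cache update, as a safe-guard against bugs.
    max_blocks_per_update: Option<u64>,
    /// Timeout for individual eth1 RPC calls (raised from 15s to 60s).
    rpc_timeout: Duration,
}

impl Default for Eth1Config {
    fn default() -> Self {
        Self {
            max_log_requests_per_update: Some(5_000), // illustrative value
            max_blocks_per_update: Some(8_192),       // illustrative value
            rpc_timeout: Duration::from_secs(60),
        }
    }
}
```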
## Additional Info
NA
## Issue Addressed
Following slog's documentation, this should help a bit with string allocations. I let it run for two days and memory usage is lower. This is of course anecdotal, but it shouldn't do any harm anyway.
## Proposed Changes
Remove `String` creation in logs when possible.
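A small sketch of the pattern using slog's `?` sigil (the function and variable names are illustrative):

```rust
use slog::{debug, Logger};

fn log_block(log: &Logger, slot: u64, block_root: [u8; 32]) {
    // Before: builds a String at the call site on every log call.
    // debug!(log, "Processed block"; "root" => format!("{:?}", block_root), "slot" => slot);

    // After: the `?` sigil lets slog serialize the value via `fmt::Debug` directly,
    // avoiding the intermediate `String`.
    debug!(log, "Processed block"; "root" => ?block_root, "slot" => slot);
}
```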
## Issue Addressed
NA
## Proposed Changes
This was causing:
```
Nov 28 21:56:08.154 ERRO slog-async: logger dropped messages due to channel overflow, count: 44, service: libp2p
```
## Additional Info
NA
Update lighthouse to version `v1.0.2`.
There are two major updates in this version:
- Updates the task executor to tokio 0.3 and all sub-dependencies relying on core execution, including libp2p
- Updates BLST
## Description
This PR updates Lighthouse to tokio 0.3. It includes a number of dependency updates and some structural changes as to how we create and spawn tasks.
This also brings with it a number of various improvements:
- Discv5 update
- Libp2p update
- Fix for recompilation issues
- Improved UPnP port mapping handling
- Futures dependency update
- Downgrade to trace logs when rejecting peers because we've reached our max
Co-authored-by: blacktemplar <blacktemplar@a1.net>
## Issue Addressed
part of #1883
## Proposed Changes
Adds a new CLI argument `--eth1-endpoints` that can be used instead of `--eth1-endpoint` to specify a comma-separated list of endpoints. If the first endpoint returns an error for some request, the other endpoints are tried in the given order.
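A minimal sketch of the fallback behaviour, under the assumption that each request is simply retried against the next endpoint in CLI order (the function and type names are illustrative, not the eth1 crate's real API):

```rust
/// Try `request` against each endpoint in order, returning the first success or all errors.
async fn first_success<T, E, F, Fut>(endpoints: &[String], mut request: F) -> Result<T, Vec<E>>
where
    F: FnMut(&str) -> Fut,
    Fut: std::future::Future<Output = Result<T, E>>,
{
    let mut errors = Vec::new();
    for endpoint in endpoints {
        match request(endpoint).await {
            Ok(value) => return Ok(value),
            Err(e) => errors.push(e),
        }
    }
    Err(errors)
}
```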
## Additional Info
Currently, if the first endpoint fails, the fallbacks are used silently (except for `try_fallback_test_endpoint`, which is used in `do_update` and logs a `WARN` for each endpoint that is not reachable). A question is whether we should add more logs so that the user gets warned if their main endpoint is, for example, just slow and sometimes hits timeouts.
## Proposed Changes
A user on Discord reported build issues when trying to compile Lighthouse checked out to a path with spaces in it. I've fixed the issue upstream in `leveldb-sys` (https://github.com/skade/leveldb-sys/pull/22), but rather than waiting for a new release of the `leveldb` crate, we can also work around the issue by disabling Snappy in LevelDB, which we weren't using anyway.
This may also have the side-effect of slightly improving compilation times, as LevelDB+Snappy was found to be a substantial contributor to build time (although I'm not sure how much was LevelDB and how much was Snappy).
## Issue Addressed
- Add metrics to keep track of peer counts by sync type
- Add metric to keep track of the number of syncing chains in range
## Proposed Changes
Plug into the network metrics update interval and also update the counts for peers with respect to their sync status with us.
## Additional Info
For the peer counts
- Because of the way it is implemented, the numbers won't always match the total peer count in the `libp2p` metric.
- Updating the gauge with every change is messy because it would need to be updated on connection (in the `eth2_libp2p` crate, while metrics are defined in the `network` crate), on Goodbye sent (for an `IrrelevantPeer`, either in the `beacon_processor` or the `peer_manager`), and on disconnection. Since this is not a critical metric I think counting once every second is enough. If you think more accuracy is needed we can do it too, but it would be harder to maintain.
ATM those look like this:
![image](https://user-images.githubusercontent.com/26765164/100275387-22137b00-2f60-11eb-93b9-94b0f265240c.png)
## Issue Addressed
NA
## Proposed Changes
- Adds an HTTP server to the VC which provides Prometheus metrics.
- Moves the health metrics into the `lighthouse_metrics` crate so it can be shared between BN/VC.
- Sprinkle some metrics around the VC.
- Update the book to indicate that we now have VC metrics.
- Shifts the "waiting for genesis" logic later in the `ProductionValidatorClient::new_from_cli`
- This is worth attention during the review.
## Additional Info
- ~~`clippy` has some new lints that are failing. I'll deal with that in another PR.~~
## Issue Addressed
Two issues related to empty batches:
- The chain's target was not being advanced when the batch was successful and empty, and the chain didn't have an optimistic batch.
- Not switching finalized chains. We now switch finalized chains, requiring a minimum amount of work first.
## Issue Addressed
Resolves #1890
## Proposed Changes
Change the slasher database schema to key indexed attestations by `(target_epoch, indexed_attestation_root)` instead of just `indexed_attestation_root`. This allows more straight-forward pruning (linear scan), that is also "re-entrant". By re-entrant, we mean that a pruning pass that gets stuck because of a `MapFull` error can attempt to commit midway, and be resumed later without issue. The previous pruning strategy for indexed attestations did not have this property. There was also a flaw in the previous pruning that could leave "zombie" indexed attestations in the database (ones not referenced by any attester record), which could build up and contribute to bloat (although in practice I think they occur quite infrequently).
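A sketch of the key layout (the exact encoding in the slasher crate may differ):

```rust
type Epoch = u64;
type Hash256 = [u8; 32];

/// Key indexed attestations by `(target_epoch, indexed_attestation_root)`, with the epoch
/// encoded big-endian first so keys sort by epoch and pruning up to some epoch is a
/// linear scan over a contiguous prefix of the key space.
fn indexed_attestation_key(target_epoch: Epoch, root: Hash256) -> [u8; 40] {
    let mut key = [0u8; 40];
    key[..8].copy_from_slice(&target_epoch.to_be_bytes());
    key[8..].copy_from_slice(&root);
    key
}
```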
## Additional Info
During testing I noticed that a `MapFull` error can still occur during the commit of the transaction itself, which is irritating, but not unbearable. This PR should at least reduce the frequency with which users need to manually resize their DB, and if the `MapFull` on commit rears its ugly head too often we could use a dynamic strategy (temporarily increase the size of the map until the transaction commits).
The extra bytes for the epoch make the database a bit heavier, so the size estimate docs have been updated to reflect this. This is also a breaking schema change, so anyone using a v0 database from a few hours ago will need to drop it and update 😅
This is an implementation of a slasher that lives inside the BN and can be enabled via `lighthouse bn --slasher`.
Features included in this PR:
- [x] Detection of attester slashing conditions (double votes, surrounds existing, surrounded by existing)
- [x] Integration into Lighthouse's attestation verification flow
- [x] Detection of proposer slashing conditions
- [x] Extraction of attestations from blocks as they are verified
- [x] Compression of chunks
- [x] Configurable history length
- [x] Pruning of old attestations and blocks
- [x] More tests
Future work:
* Focus on a slice of history separate from the most recent N epochs (e.g. epochs `current - K` to `current - M`)
* Run out-of-process
* Ingest attestations from the chain without a resync
Design notes are here https://hackmd.io/@sproul/HJSEklmPL
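For reference, the double-vote and surround conditions mentioned in the detection item above can be sketched as follows (simplified types; the real slasher operates on `IndexedAttestation`s and its chunked data structures):

```rust
#[derive(PartialEq)]
struct AttestationData {
    source_epoch: u64,
    target_epoch: u64,
    // ... remaining fields (slot, index, block roots) omitted ...
}

/// Double vote: two distinct attestations voting for the same target epoch.
fn is_double_vote(a: &AttestationData, b: &AttestationData) -> bool {
    a != b && a.target_epoch == b.target_epoch
}

/// Surround vote: `a` surrounds `b` if `a`'s source is older and its target is newer.
fn surrounds(a: &AttestationData, b: &AttestationData) -> bool {
    a.source_epoch < b.source_epoch && b.target_epoch < a.target_epoch
}
```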
## Issue Addressed
- Resolves #1424
## Proposed Changes
Add a `GET lighthouse/staking` endpoint that returns a 200 if the node is ready to stake (i.e., the `--eth1` flag is present) or a 404 otherwise.
Whilst the VC is waiting for the genesis time to start (i.e., when the genesis state is known), check the `lighthouse/staking` endpoint and log an error if the node isn't configured for staking.
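The response logic itself is trivial; a sketch (status codes from the `http` crate, with the wiring into the BN's API server omitted):

```rust
use http::StatusCode;

/// `GET lighthouse/staking`: 200 if the node is ready to stake (started with `--eth1`),
/// 404 otherwise.
fn staking_status(eth1_enabled: bool) -> StatusCode {
    if eth1_enabled {
        StatusCode::OK
    } else {
        StatusCode::NOT_FOUND
    }
}
```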
## Additional Info
NA
## Issue Addressed
NA
## Proposed Changes
Adds the `--import-all-attestations` flag which tells the `network::AttestationService` to import/aggregate all attestations after verification (instead of only ones for subnets that are relevant to local validators).
This is useful for testing/debugging and also for creating back-up nodes that should be all cached up and ready for any validator.
## Additional Info
NA
## Issue Addressed
We have a log saying we add a peer to a chain, and another one in case the chain is not syncing. To avoid needing to pair these two (and reduce log entries), simply log the chain's syncing state in the chain's KV.
## Issue Addressed
Closes #1719
## Proposed Changes
Lift the internal `RwLock`s and `Mutex`es from the `Observed*` data structures to resolve the race conditions described in #1719.
Most of this work was done by @paulhauner on his `lift-locks` branch, I merely updated it for the current `master` and checked over it.
## Additional Info
I think it would be prudent to test this on a testnet or two before mainnet launch, just to be sure that the extra lock contention doesn't negatively impact performance.
## Issue Addressed
- Resolves #1945
## Proposed Changes
- As per #1945, fix a log message from the metrics server that was falsely claiming to be from the api server.
- Ensure successful api request logs are published to debug, not trace. This is something I've wanted to do for a while.
## Additional Info
NA
## Issue Addressed
NA
## Proposed Changes
- Caches later blocks than is required by `ETH1_FOLLOW_DISTANCE`.
- Adds logging to `warn` if the eth1 cache is insufficiently primed.
- Use `max_by_key` instead of `max_by` in `BeaconChain::Eth1Chain` since it's simpler.
- Rename `voting_period_start_timestamp` to `voting_target_timestamp` for accuracy.
## Additional Info
The reason for eating into the `ETH1_FOLLOW_DISTANCE` and caching blocks that are closer to the head is the possibility of `SECONDS_PER_ETH1_BLOCK` being incorrect (as is the case for the Pyrmont testnet on Goerli).
If `SECONDS_PER_ETH1_BLOCK` is too short, we'll skip back too far from the head and skip over blocks that would be valid [`is_candidate_block`](https://github.com/ethereum/eth2.0-specs/blob/v1.0.0/specs/phase0/validator.md#eth1-data) blocks. This was the case on the Pyrmont testnet and resulted in Lighthouse choosing blocks that were about 30 minutes older than is ideal.
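For reference, the spec's check rendered in Rust with mainnet constants (this is a transcription of [`is_candidate_block`](https://github.com/ethereum/eth2.0-specs/blob/v1.0.0/specs/phase0/validator.md#eth1-data), not Lighthouse's code):

```rust
const SECONDS_PER_ETH1_BLOCK: u64 = 14;
const ETH1_FOLLOW_DISTANCE: u64 = 2048;

/// A block is a candidate if its timestamp falls within one follow distance's worth of
/// time, ending `ETH1_FOLLOW_DISTANCE * SECONDS_PER_ETH1_BLOCK` seconds before the
/// voting period start.
fn is_candidate_block(block_timestamp: u64, period_start: u64) -> bool {
    block_timestamp + SECONDS_PER_ETH1_BLOCK * ETH1_FOLLOW_DISTANCE <= period_start
        && block_timestamp + SECONDS_PER_ETH1_BLOCK * ETH1_FOLLOW_DISTANCE * 2 >= period_start
}
```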