## Proposed Changes
Allow Lighthouse to speculatively create blocks via the `/eth/v1/validators/blocks` endpoint by optionally skipping the RANDAO verification that we introduced in #2740. When `verify_randao=false` is passed as a query parameter the `randao_reveal` is not required to be present, and if present will only be lightly checked (must be a valid BLS sig). If `verify_randao` is omitted it defaults to true and Lighthouse behaves exactly as it did previously, hence this PR is backwards-compatible.
I'd like to get this change into `unstable` pretty soon as I've got 3 projects building on top of it:
- [`blockdreamer`](https://github.com/michaelsproul/blockdreamer), which mocks block production every slot in order to fingerprint clients
- analysis of Lighthouse's block packing _optimality_, which uses `blockdreamer` to extract interesting instances of the attestation packing problem
- analysis of Lighthouse's block packing _performance_ (as in speed) on the `tree-states` branch
## Additional Info
Having tested `blockdreamer` with Prysm, Nimbus and Teku I noticed that none of them verify the randao signature on `/eth/v1/validator/blocks`. I plan to open a PR to the `beacon-APIs` repo anyway so that this parameter can be standardised in case the other clients add RANDAO verification by default in future.
## Issue Addressed
No issue, just updating merge ASCII art.
## Proposed Changes
Updating ASCII art for merge.
## Additional Info
Please provide any additional information. For example, future considerations
or information useful for reviewers.
## Issue Addressed
This address an issue which was preventing checkpoint-sync.
When the node starts from checkpoint sync, the head block and the finalized block are the same value. We did not respect this when sending a `forkchoiceUpdated` (fcU) call to the EL and were expecting fork choice to hold the *finalized ancestor of the head* and returning an error when it didn't.
This PR uses *only fork choice* for sending fcU updates. This is actually quite nice and avoids some atomicity issues between `chain.canonical_head` and `chain.fork_choice`. Now, whenever `chain.fork_choice.get_head` returns a value we also cache the values required for the next fcU call.
## TODO
- [x] ~~Blocked on #3043~~
- [x] Ensure there isn't a warn message at startup.
## Issue Addressed
Resolves#2936
## Proposed Changes
Adds functionality for calling [`validator/prepare_beacon_proposer`](https://ethereum.github.io/beacon-APIs/?urls.primaryName=dev#/Validator/prepareBeaconProposer) in advance.
There is a `BeaconChain::prepare_beacon_proposer` method which, which called, computes the proposer for the next slot. If that proposer has been registered via the `validator/prepare_beacon_proposer` API method, then the `beacon_chain.execution_layer` will be provided the `PayloadAttributes` for us in all future forkchoiceUpdated calls. An artificial forkchoiceUpdated call will be created 4s before each slot, when the head updates and when a validator updates their information.
Additionally, I added strict ordering for calls from the `BeaconChain` to the `ExecutionLayer`. I'm not certain the `ExecutionLayer` will always maintain this ordering, but it's a good start to have consistency from the `BeaconChain`. There are some deadlock opportunities introduced, they are documented in the code.
## Additional Info
- ~~Blocked on #2837~~
Co-authored-by: realbigsean <seananderson33@GMAIL.com>
## Issue Addressed
Resolves#3015
## Proposed Changes
Add JWT token based authentication to engine api requests. The jwt secret key is read from the provided file and is used to sign tokens that are used for authenticated communication with the EL node.
- [x] Interop with geth (synced `merge-devnet-4` with the `merge-kiln-v2` branch on geth)
- [x] Interop with other EL clients (nethermind on `merge-devnet-4`)
- [x] ~Implement `zeroize` for jwt secrets~
- [x] Add auth server tests with `mock_execution_layer`
- [x] Get auth working with the `execution_engine_integration` tests
Co-authored-by: Paul Hauner <paul@paulhauner.com>
## Issue Addressed
Addresses spec changes from v1.1.0:
- https://github.com/ethereum/consensus-specs/pull/2830
- https://github.com/ethereum/consensus-specs/pull/2846
## Proposed Changes
* Downgrade the REJECT for `HeadBlockFinalized` to an IGNORE. This applies to both unaggregated and aggregated attestations.
## Additional Info
I thought about also changing the penalty for `UnknownTargetRoot` but I don't think it's reachable in practice.
## Issue Addressed
As discussed on last-night's consensus call, the testnets next week will target the [Kiln Spec v2](https://hackmd.io/@n0ble/kiln-spec).
Presently, we support Kiln V1. V2 is backwards compatible, except for renaming `random` to `prev_randao` in:
- https://github.com/ethereum/execution-apis/pull/180
- https://github.com/ethereum/consensus-specs/pull/2835
With this PR we'll no longer be compatible with the existing Kintsugi and Kiln testnets, however we'll be ready for the testnets next week. I raised this breaking change in the call last night, we are all keen to move forward and break things.
We now target the [`merge-kiln-v2`](https://github.com/MariusVanDerWijden/go-ethereum/tree/merge-kiln-v2) branch for interop with Geth. This required adding the `--http.aauthport` to the tester to avoid a port conflict at startup.
### Changes to exec integration tests
There's some change in the `merge-kiln-v2` version of Geth that means it can't compile on a vanilla Github runner. Bumping the `go` version on the runner solved this issue.
Whilst addressing this, I refactored the `testing/execution_integration` crate to be a *binary* rather than a *library* with tests. This means that we don't need to run the `build.rs` and build Geth whenever someone runs `make lint` or `make test-release`. This is nice for everyday users, but it's also nice for CI so that we can have a specific runner for these tests and we don't need to ensure *all* runners support everything required to build all execution clients.
## More Info
- [x] ~~EF tests are failing since the rename has broken some tests that reference the old field name. I have been told there will be new tests released in the coming days (25/02/22 or 26/02/22).~~
## Description
This PR adds a single, trivial commit (f5d2b27d78349d5a675a2615eba42cc9ae708094) atop #2986 to resolve a tests compile error. The original author (@ethDreamer) is AFK so I'm getting this one merged ☺️
Please see #2986 for more information about the other, significant changes in this PR.
Co-authored-by: Mark Mackey <mark@sigmaprime.io>
Co-authored-by: ethDreamer <37123614+ethDreamer@users.noreply.github.com>
## Issue Addressed
NA
## Proposed Changes
Adds the functionality to allow blocks to be validated/invalidated after their import as per the [optimistic sync spec](https://github.com/ethereum/consensus-specs/blob/dev/sync/optimistic.md#how-to-optimistically-import-blocks). This means:
- Updating `ProtoArray` to allow flipping the `execution_status` of ancestors/descendants based on payload validity updates.
- Creating separation between `execution_layer` and the `beacon_chain` by creating a `PayloadStatus` struct.
- Refactoring how the `execution_layer` selects a `PayloadStatus` from the multiple statuses returned from multiple EEs.
- Adding testing framework for optimistic imports.
- Add `ExecutionBlockHash(Hash256)` new-type struct to avoid confusion between *beacon block roots* and *execution payload hashes*.
- Add `merge` to [`FORKS`](c3a793fd73/Makefile (L17)) in the `Makefile` to ensure we test the beacon chain with merge settings.
- Fix some tests here that were failing due to a missing execution layer.
## TODO
- [ ] Balance tests
Co-authored-by: Mark Mackey <mark@sigmaprime.io>
## Proposed Changes
Lots of lint updates related to `flat_map`, `unwrap_or_else` and string patterns. I did a little more creative refactoring in the op pool, but otherwise followed Clippy's suggestions.
## Additional Info
We need this PR to unblock CI.
## Issue Addressed
NA
## Proposed Changes
This PR extends #3018 to address my review comments there and add automated integration tests with Geth (and other implementations, in the future).
I've also de-duplicated the "unused port" logic by creating an `common/unused_port` crate.
## Additional Info
I'm not sure if we want to merge this PR, or update #3018 and merge that. I don't mind, I'm primarily opening this PR to make sure CI works.
Co-authored-by: Mark Mackey <mark@sigmaprime.io>
## Issue Addressed
Closes#3014
## Proposed Changes
- Rename `receipt_root` to `receipts_root`
- Rename `execute_payload` to `notify_new_payload`
- This is slightly weird since we modify everything except the actual HTTP call to the engine API. That change is expected to be implemented in #2985 (cc @ethDreamer)
- Enable "random" tests for Bellatrix.
## Notes
This will break *partially* compatibility with Kintusgi testnets in order to gain compatibility with [Kiln](https://hackmd.io/@n0ble/kiln-spec) testnets. I think it will only break the BN APIs due to the `receipts_root` change, however it might have some other effects too.
Co-authored-by: Michael Sproul <micsproul@gmail.com>
## Issue Addressed
#2883
## Proposed Changes
* Added `suggested-fee-recipient` & `suggested-fee-recipient-file` flags to validator client (similar to graffiti / graffiti-file implementation).
* Added proposer preparation service to VC, which sends the fee-recipient of all known validators to the BN via [/eth/v1/validator/prepare_beacon_proposer](https://github.com/ethereum/beacon-APIs/pull/178) api once per slot
* Added [/eth/v1/validator/prepare_beacon_proposer](https://github.com/ethereum/beacon-APIs/pull/178) api endpoint and preparation data caching
* Added cleanup routine to remove cached proposer preparations when not updated for 2 epochs
## Additional Info
Changed the Implementation following the discussion in #2883.
Co-authored-by: pk910 <philipp@pk910.de>
Co-authored-by: Paul Hauner <paul@paulhauner.com>
Co-authored-by: Philipp K <philipp@pk910.de>
## Issue Addressed
This PR fixes the unnecessary `WARN Single block lookup failed` messages described here:
https://github.com/sigp/lighthouse/pull/2866#issuecomment-1008442640
## Proposed Changes
Add a new cache to the `BeaconChain` that tracks the block roots of blocks from before finalization. These could be blocks from the canonical chain (which might need to be read from disk), or old pre-finalization blocks that have been forked out.
The cache also stores a set of block roots for in-progress single block lookups, which duplicates some of the information from sync's `single_block_lookups` hashmap:
a836e180f9/beacon_node/network/src/sync/manager.rs (L192-L196)
On a live node you can confirm that the cache is working by grepping logs for the message: `Rejected attestation to finalized block`.
#2923
Which issue # does this PR address?
There's a redundant field on the BeaconChain called disabled_forks that was once part of our fork-aware networking (#953) but which is no longer used and could be deleted. so Removed all references to disabled_forks so that the code compiles and git grep disabled_forks returns no results.
## Proposed Changes
Please list or describe the changes introduced by this PR.
Removed all references of disabled_forks
Co-authored-by: Divma <26765164+divagant-martian@users.noreply.github.com>
## Issue Addressed
NA
## Proposed Changes
In https://github.com/sigp/lighthouse/pull/2832 we made some changes to the `SnapshotCache` to help deal with the one-block reorgs seen on mainnet (and testnets).
I believe the change in #2832 is good and we should keep it, but I think that in its present form it is causing the `SnapshotCache` to hold onto states that it doesn't need anymore. For example, a skip slot will result in one more `BeaconSnapshot` being stored in the cache.
This PR adds a new type of pruning that happens after a block is inserted to the cache. We will remove any snapshot from the cache that is a *grandparent* of the block being imported. Since we know the grandparent has two valid blocks built atop it, it is not at risk from a one-block re-org.
## Additional Info
NA
## Issue Addressed
NA
## Proposed Changes
We have observed occasions were under-resourced nodes will receive messages that were valid *at the time*, but later become invalidated due to long waits for a `BeaconProcessor` worker.
In this PR, we will check to see if the message was valid *at the time of receipt*. If it was initially valid but invalid now, we just ignore the message without penalizing the peer.
## Additional Info
NA
## Issue Addressed
NA
## Proposed Changes
Introduces a cache to attestation to produce atop blocks which will become the head, but are not fully imported (e.g., not inserted into the database).
Whilst attesting to a block before it's imported is rather easy, if we're going to produce that attestation then we also need to be able to:
1. Verify that attestation.
1. Respond to RPC requests for the `beacon_block_root`.
Attestation verification (1) is *partially* covered. Since we prime the shuffling cache before we insert the block into the early attester cache, we should be fine for all typical use-cases. However, it is possible that the cache is washed out before we've managed to insert the state into the database and then attestation verification will fail with a "missing beacon state"-type error.
Providing the block via RPC (2) is also partially covered, since we'll check the database *and* the early attester cache when responding a blocks-by-root request. However, we'll still omit the block from blocks-by-range requests (until the block lands in the DB). I *think* this is fine, since there's no guarantee that we return all blocks for those responses.
Another important consideration is whether or not the *parent* of the early attester block is available in the databse. If it were not, we might fail to respond to blocks-by-root request that are iterating backwards to collect a chain of blocks. I argue that *we will always have the parent of the early attester block in the database.* This is because we are holding the fork-choice write-lock when inserting the block into the early attester cache and we do not drop that until the block is in the database.
## Issue Addressed
NA
## Proposed Changes
Ensures full roots are printed, rather than shortened versions like `0x935b…d376`.
For example, it would be nice if we could do API queries based upon the roots shown in the `Beacon chain re-org` event:
```
Jan 05 12:36:52.224 WARN Beacon chain re-org reorg_distance: 2, new_slot: 2073184, new_head: 0x8a97…2dec, new_head_parent: 0xa985…7688, previous_slot: 2073183, previous_head: 0x935b…d376, service: beacon
Jan 05 13:35:05.832 WARN Beacon chain re-org reorg_distance: 1, new_slot: 2073475, new_head: 0x9207…c6b9, new_head_parent: 0xb2ce…839b, previous_slot: 2073474, previous_head: 0x8066…92f7, service: beacon
```
## Additional Info
We should eventually fix this project-wide, however this is a short-term patch.
## Proposed Changes
Update `superstruct` to bring in @realbigsean's fixes necessary for MEV-compatible private beacon block types (a la #2795).
The refactoring is due to another change in superstruct that allows partial getters to be auto-generated.
## Issue Addressed
N/A
## Proposed Changes
We are currently treating errors from the EL on `engine_executePayload` as `PayloadVerificationStatus::NotVerified`. This adds the block as a candidate head block in fork choice even if the EL explicitly rejected the block as invalid.
`PayloadVerificationStatus::NotVerified` should be only returned when the EL explicitly returns "syncing" imo. This PR propagates an error instead of returning `NotVerified` on EL all EL errors.
## Issue Addressed
Successor to #2431
## Proposed Changes
* Add a `BlockReplayer` struct to abstract over the intricacies of calling `per_slot_processing` and `per_block_processing` while avoiding unnecessary tree hashing.
* Add a variant of the forwards state root iterator that does not require an `end_state`.
* Use the `BlockReplayer` when reconstructing states in the database. Use the efficient forwards iterator for frozen states.
* Refactor the iterators to remove `Arc<HotColdDB>` (this seems to be neater than making _everything_ an `Arc<HotColdDB>` as I did in #2431).
Supplying the state roots allow us to avoid building a tree hash cache at all when reconstructing historic states, which saves around 1 second flat (regardless of `slots-per-restore-point`). This is a small percentage of worst-case state load times with 200K validators and SPRP=2048 (~15s vs ~16s) but a significant speed-up for more frequent restore points: state loads with SPRP=32 should be now consistently <500ms instead of 1.5s (a ~3x speedup).
## Additional Info
Required by https://github.com/sigp/lighthouse/pull/2628
## Proposed Changes
Remove the `is_first_block_in_epoch` logic from the balances cache update logic, as it was incorrect in the case of skipped slots. The updated code is simpler because regardless of whether the block is the first in the epoch we can check if an entry for the epoch boundary root already exists in the cache, and update the cache accordingly.
Additionally, to assist with flip-flopping justified epochs, move to cloning the balance cache rather than moving it. This should still be very fast in practice because the balances cache is a ~1.6MB `Vec`, and this operation is expected to only occur infrequently.
## Issue Addressed
Resolves: https://github.com/sigp/lighthouse/issues/2741
Includes: https://github.com/sigp/lighthouse/pull/2853 so that we can get ssz static tests passing here on v1.1.6. If we want to merge that first, we can make this diff slightly smaller
## Proposed Changes
- Changes the `justified_epoch` and `finalized_epoch` in the `ProtoArrayNode` each to an `Option<Checkpoint>`. The `Option` is necessary only for the migration, so not ideal. But does allow us to add a default logic to `None` on these fields during the database migration.
- Adds a database migration from a legacy fork choice struct to the new one, search for all necessary block roots in fork choice by iterating through blocks in the db.
- updates related to https://github.com/ethereum/consensus-specs/pull/2727
- We will have to update the persisted forkchoice to make sure the justified checkpoint stored is correct according to the updated fork choice logic. This boils down to setting the forkchoice store's justified checkpoint to the justified checkpoint of the block that advanced the finalized checkpoint to the current one.
- AFAICT there's no migration steps necessary for the update to allow applying attestations from prior blocks, but would appreciate confirmation on that
- I updated the consensus spec tests to v1.1.6 here, but they will fail until we also implement the proposer score boost updates. I confirmed that the previously failing scenario `new_finalized_slot_is_justified_checkpoint_ancestor` will now pass after the boost updates, but haven't confirmed _all_ tests will pass because I just quickly stubbed out the proposer boost test scenario formatting.
- This PR now also includes proposer boosting https://github.com/ethereum/consensus-specs/pull/2730
## Additional Info
I realized checking justified and finalized roots in fork choice makes it more likely that we trigger this bug: https://github.com/ethereum/consensus-specs/pull/2727
It's possible the combination of justified checkpoint and finalized checkpoint in the forkchoice store is different from in any block in fork choice. So when trying to startup our store's justified checkpoint seems invalid to the rest of fork choice (but it should be valid). When this happens we get an `InvalidBestNode` error and fail to start up. So I'm including that bugfix in this branch.
Todo:
- [x] Fix fork choice tests
- [x] Self review
- [x] Add fix for https://github.com/ethereum/consensus-specs/pull/2727
- [x] Rebase onto Kintusgi
- [x] Fix `num_active_validators` calculation as @michaelsproul pointed out
- [x] Clean up db migrations
Co-authored-by: realbigsean <seananderson33@gmail.com>
## Issue Addressed
N/A
## Proposed Changes
Changes required for the `merge-devnet-3`. Added some more non substantive renames on top of @realbigsean 's commit.
Note: this doesn't include the proposer boosting changes in kintsugi v3.
This devnet isn't running with the proposer boosting fork choice changes so if we are looking to merge https://github.com/sigp/lighthouse/pull/2822 into `unstable`, then I think we should just maintain this branch for the devnet temporarily.
Co-authored-by: realbigsean <seananderson33@gmail.com>
Co-authored-by: Paul Hauner <paul@paulhauner.com>
## Proposed Changes
In the event of a late block, keep the block in the snapshot cache by cloning it. This helps us process new blocks quickly in the event the late block was re-org'd.
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
## Issue Addressed
We were calculating justified balances incorrectly on cache misses in `set_justified_checkpoint`
## Proposed Changes
Use the `get_effective_balances` method as opposed to `state.balances`, which returns exact balances
Co-authored-by: realbigsean <seananderson33@gmail.com>
## Issue Addressed
New rust lints
## Proposed Changes
- Boxing some enum variants
- removing some unused fields (is the validator lockfile unused? seemed so to me)
## Additional Info
- some error fields were marked as dead code but are logged out in areas
- left some dead fields in our ef test code because I assume they are useful for debugging?
Co-authored-by: realbigsean <seananderson33@gmail.com>
Fix max packet sizes
Fix max_payload_size function
Add merge block test
Fix max size calculation; fix up test
Clear comments
Add a payload_size_function
Use safe arith for payload calculation
Return an error if block too big in block production
Separate test to check if block is over limit
* Fix fork choice after rebase
* Remove paulhauner warp dep
* Fix fork choice test compile errors
* Assume fork choice payloads are valid
* Add comment
* Ignore new tests
* Fix error in test skipping
* Fix arbitrary check kintsugi
* Add merge chain spec fields, and a function to determine which constant to use based on the state variant
* increment spec test version
* Remove `Transaction` enum wrapper
* Remove Transaction new-type
* Remove gas validations
* Add `--terminal-block-hash-epoch-override` flag
* Increment spec tests version to 1.1.5
* Remove extraneous gossip verification https://github.com/ethereum/consensus-specs/pull/2687
* - Remove unused Error variants
- Require both "terminal-block-hash-epoch-override" and "terminal-block-hash-override" when either flag is used
* - Remove a couple more unused Error variants
Co-authored-by: Paul Hauner <paul@paulhauner.com>
* update initializing from eth1 for merge genesis
* read execution payload header from file lcli
* add `create-payload-header` command to `lcli`
* fix base fee parsing
* Apply suggestions from code review
* default `execution_payload_header` bool to false when deserializing `meta.yml` in EF tests
Co-authored-by: Paul Hauner <paul@paulhauner.com>
* Add payload verification status to fork choice
* Pass payload verification status to import_block
* Add valid back-propagation
* Add head safety status latch to API
* Remove ExecutionLayerStatus
* Add execution info to client notifier
* Update notifier logs
* Change use of "hash" to refer to beacon block
* Shutdown on invalid finalized block
* Tidy, add comments
* Fix failing FC tests
* Allow blocks with unsafe head
* Fix forkchoiceUpdate call on startup
* Ignore payload errors
* Only return payload handle on valid response
* Push some engine logs down to debug
* Push ee fork choice log to debug
* Push engine call failure to debug
* Push some more errors to debug
* Fix panic at startup
* Reject some HTTP endpoints when EL is not ready
* Restrict more endpoints
* Add watchdog task
* Change scheduling
* Update to new schedule
* Add "syncing" concept
* Remove RequireSynced
* Add is_merge_complete to head_info
* Cache latest_head in Engines
* Call consensus_forkchoiceUpdate on startup
* Thread eth1_block_hash into interop genesis state
* Add merge-fork-epoch flag
* Build LH with minimal spec by default
* Add verbose logs to execution_layer
* Add --http-allow-sync-stalled flag
* Update lcli new-testnet to create genesis state
* Fix http test
* Fix compile errors in tests
* Start adding merge tests
* Expose MockExecutionLayer
* Add mock_execution_layer to BeaconChainHarness
* Progress with merge test
* Return more detailed errors with gas limit issues
* Use a better gas limit in block gen
* Ensure TTD is met in block gen
* Fix basic_merge tests
* Start geth testing
* Fix conflicts after rebase
* Remove geth tests
* Improve merge test
* Address clippy lints
* Make pow block gen a pure function
* Add working new test, breaking existing test
* Fix test names
* Add should_panic
* Don't run merge tests in debug
* Detect a tokio runtime when starting MockServer
* Fix clippy lint, include merge tests
* - Update the fork choice `ProtoNode` to include `is_merge_complete`
- Add database migration for the persisted fork choice
* update tests
* Small cleanup
* lints
* store execution block hash in fork choice rather than bool
Added Execution Payload from Rayonism Fork
Updated new Containers to match Merge Spec
Updated BeaconBlockBody for Merge Spec
Completed updating BeaconState and BeaconBlockBody
Modified ExecutionPayload<T> to use Transaction<T>
Mostly Finished Changes for beacon-chain.md
Added some things for fork-choice.md
Update to match new fork-choice.md/fork.md changes
ran cargo fmt
Added Missing Pieces in eth2_libp2p for Merge
fix ef test
Various Changes to Conform Closer to Merge Spec
## Issue Addressed
Resolves#2545
## Proposed Changes
Adds the long-overdue EF tests for fork choice. Although we had pretty good coverage via other implementations that closely followed our approach, it is nonetheless important for us to implement these tests too.
During testing I found that we were using a hard-coded `SAFE_SLOTS_TO_UPDATE_JUSTIFIED` value rather than one from the `ChainSpec`. This caused a failure during a minimal preset test. This doesn't represent a risk to mainnet or testnets, since the hard-coded value matched the mainnet preset.
## Failing Cases
There is one failing case which is presently marked as `SkippedKnownFailure`:
```
case 4 ("new_finalized_slot_is_justified_checkpoint_ancestor") from /home/paul/development/lighthouse/testing/ef_tests/consensus-spec-tests/tests/minimal/phase0/fork_choice/on_block/pyspec_tests/new_finalized_slot_is_justified_checkpoint_ancestor failed with NotEqual:
head check failed: Got Head { slot: Slot(40), root: 0x9183dbaed4191a862bd307d476e687277fc08469fc38618699863333487703e7 } | Expected Head { slot: Slot(24), root: 0x105b49b51bf7103c182aa58860b039550a89c05a4675992e2af703bd02c84570 }
```
This failure is due to #2741. It's not a particularly high priority issue at the moment, so we fix it after merging this PR.
## Issue Addressed
@paulhauner noticed that when we send head events, we use the block root from `new_head` in `fork_choice_internal`, but calculate `dependent_root` and `previous_dependent_root` using the `canonical_head`. This is normally fine because `new_head` updates the `canonical_head` in `fork_choice_internal`, but it's possible we have a reorg updating `canonical_head` before our head events are sent. So this PR ensures `dependent_root` and `previous_dependent_root` are always derived from the state associated with `new_head`.
Co-authored-by: realbigsean <seananderson33@gmail.com>
## Issue Addressed
Resolves#2612
## Proposed Changes
Implements both the checks mentioned in the original issue.
1. Verifies the `randao_reveal` in the beacon node
2. Cross checks the proposer index after getting back the block from the beacon node.
## Additional info
The block production time increases by ~10x because of the signature verification on the beacon node (based on the `beacon_block_production_process_seconds` metric) when running on a local testnet.
## Proposed Changes
Add several metrics for the number of attestations in the op pool. These give us a way to observe the number of valid, non-trivial attestations during block packing rather than just the size of the entire op pool.
## Issue Addressed
Mitigates #1096
## Proposed Changes
Add a flag to the beacon node called `--disable-lock-timeouts` which allows opting out of lock timeouts.
The lock timeouts serve a dual purpose:
1. They prevent any single operation from hogging the lock for too long. When a timeout occurs it logs a nasty error which indicates that there's suboptimal lock use occurring, which we can then act on.
2. They allow deadlock detection. We're fairly sure there are no deadlocks left in Lighthouse anymore but the timeout locks offer a safeguard against that.
However, timeouts on locks are not without downsides:
They allow for the possibility of livelock, particularly on slower hardware. If lock timeouts keep failing spuriously the node can be prevented from making any progress, even if it would be able to make progress slowly without the timeout. One particularly concerning scenario which could occur would be if a DoS attack succeeded in slowing block signature verification times across the network, and all Lighthouse nodes got livelocked because they timed out repeatedly. This could also occur on just a subset of nodes (e.g. dual core VPSs or Raspberri Pis).
By making the behaviour runtime configurable this PR allows us to choose the behaviour we want depending on circumstance. I suspect that long term we could make the timeout-free approach the default (#2381 moves in this direction) and just enable the timeouts on our testnet nodes for debugging purposes. This PR conservatively leaves the default as-is so we can gain some more experience before switching the default.
## Issue Addressed
NA
## Proposed Changes
Adds some more testing for Altair to the op pool. Credits to @michaelsproul for some appropriated efforts here.
## Additional Info
NA
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
## Proposed Changes
Clone the proposer pubkeys during backfill signature verification to reduce the time that the pubkey cache lock is held for. Cloning such a small number of pubkeys has negligible impact on the total running time, but greatly reduces lock contention.
On a Ryzen 5950X, the setup step seems to take around 180us regardless of whether the key is cloned or not, while the verification takes 7ms. When Lighthouse is limited to 10% of one core using `sudo cpulimit --pid <pid> --limit 10` the total time jumps up to 800ms, but the setup step remains only 250us. This means that under heavy load this PR could cut the time the lock is held for from 800ms to 250us, which is a huge saving of 99.97%!
## Issue Addressed
NA
## Proposed Changes
This PR is near-identical to https://github.com/sigp/lighthouse/pull/2652, however it is to be merged into `unstable` instead of `merge-f2f`. Please see that PR for reasoning.
I'm making this duplicate PR to merge to `unstable` in an effort to shrink the diff between `unstable` and `merge-f2f` by doing smaller, lead-up PRs.
## Additional Info
NA
## Issue Addressed
Fix#2585
## Proposed Changes
Provide a canonical version of test_logger that can be used
throughout lighthouse.
## Additional Info
This allows tests to conditionally emit logging data by adding
test_logger as the default logger. And then when executing
`cargo test --features logging/test_logger` log output
will be visible:
wink@3900x:~/lighthouse/common/logging/tests/test-feature-test_logger (Add-test_logger-as-feature-to-logging)
$ cargo test --features logging/test_logger
Finished test [unoptimized + debuginfo] target(s) in 0.02s
Running unittests (target/debug/deps/test_logger-e20115db6a5e3714)
running 1 test
Sep 10 12:53:45.212 INFO hi, module: test_logger:8
test tests::test_fn_with_logging ... ok
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
Doc-tests test-logger
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
Or, in normal scenarios where logging isn't needed, executing
`cargo test` the log output will not be visible:
wink@3900x:~/lighthouse/common/logging/tests/test-feature-test_logger (Add-test_logger-as-feature-to-logging)
$ cargo test
Finished test [unoptimized + debuginfo] target(s) in 0.02s
Running unittests (target/debug/deps/test_logger-02e02f8d41e8cf8a)
running 1 test
test tests::test_fn_with_logging ... ok
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
Doc-tests test-logger
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
## Issue Addressed
This PR addresses an issue found by @YorickDowne during testing of v2.0.0-rc.0.
Due to a lack of atomic database writes on checkpoint sync start-up, it was possible for the database to get into an inconsistent state from which it couldn't recover without `--purge-db`. The core of the issue was that the store's anchor info was being stored _before_ the `PersistedBeaconChain`. If a crash occured so that anchor info was stored but _not_ the `PersistedBeaconChain`, then on restart Lighthouse would think the database was unitialized and attempt to compare-and-swap a `None` value, but would actually find the stale info from the previous run.
## Proposed Changes
The issue is fixed by writing the anchor info, the split point, and the `PersistedBeaconChain` atomically on start-up. Some type-hinting ugliness was required, which could possibly be cleaned up in future refactors.
## Issue Addressed
Closes#2528
## Proposed Changes
- Add `BlockTimesCache` to provide block timing information to `BeaconChain`. This allows additional metrics to be calculated for blocks that are set as head too late.
- Thread the `seen_timestamp` of blocks received from RPC responses (except blocks from syncing) through to the sync manager, similar to what is done for blocks from gossip.
## Additional Info
This provides the following additional metrics:
- `BEACON_BLOCK_OBSERVED_SLOT_START_DELAY_TIME`
- The delay between the start of the slot and when the block was first observed.
- `BEACON_BLOCK_IMPORTED_OBSERVED_DELAY_TIME`
- The delay between when the block was first observed and when the block was imported.
- `BEACON_BLOCK_HEAD_IMPORTED_DELAY_TIME`
- The delay between when the block was imported and when the block was set as head.
The metric `BEACON_BLOCK_IMPORTED_SLOT_START_DELAY_TIME` was removed.
A log is produced when a block is set as head too late, e.g.:
```
Aug 27 03:46:39.006 DEBG Delayed head block set_as_head_delay: Some(21.731066ms), imported_delay: Some(119.929934ms), observed_delay: Some(3.864596988s), block_delay: 4.006257988s, slot: 1931331, proposer_index: 24294, block_root: 0x937602c89d3143afa89088a44bdf4b4d0d760dad082abacb229495c048648a9e, service: beacon
```
## Issue Addressed
Resolves#2552
## Proposed Changes
Offers some improvement in inclusion distance calculation in the validator monitor.
When registering an attestation from a block, instead of doing `block.slot() - attesstation.data.slot()` to get the inclusion distance, we now pass the parent block slot from the beacon chain and do `parent_slot.saturating_sub(attestation.data.slot())`. This allows us to give best effort inclusion distance in scenarios where the attestation was included right after a skip slot. Note that this does not give accurate results in scenarios where the attestation was included few blocks after the skip slot.
In this case, if the attestation slot was `b1` and was included in block `b2` with a skip slot in between, we would get the inclusion delay as 0 (by ignoring the skip slot) which is the best effort inclusion delay.
```
b1 <- missed <- b2
```
Here, if the attestation slot was `b1` and was included in block `b3` with a skip slot and valid block `b2` in between, then we would get the inclusion delay as 2 instead of 1 (by ignoring the skip slot).
```
b1 <- missed <- b2 <- b3
```
A solution for the scenario 2 would be to count number of slots between included slot and attestation slot ignoring the skip slots in the beacon chain and pass the value to the validator monitor. But I'm concerned that it could potentially lead to db accesses for older blocks in extreme cases.
This PR also uses the validator monitor data for logging per epoch inclusion distance. This is useful as we won't get inclusion data in post-altair summaries.
Co-authored-by: Michael Sproul <micsproul@gmail.com>
## Issue Addressed
NA
## Proposed Changes
Adds the ability to verify batches of aggregated/unaggregated attestations from the network.
When the `BeaconProcessor` finds there are messages in the aggregated or unaggregated attestation queues, it will first check the length of the queue:
- `== 1` verify the attestation individually.
- `>= 2` take up to 64 of those attestations and verify them in a batch.
Notably, we only perform batch verification if the queue has a backlog. We don't apply any artificial delays to attestations to try and force them into batches.
### Batching Details
To assist with implementing batches we modify `beacon_chain::attestation_verification` to have two distinct categories for attestations:
- *Indexed* attestations: those which have passed initial validation and were valid enough for us to derive an `IndexedAttestation`.
- *Verified* attestations: those attestations which were indexed *and also* passed signature verification. These are well-formed, interesting messages which were signed by validators.
The batching functions accept `n` attestations and then return `n` attestation verification `Result`s, where those `Result`s can be any combination of `Ok` or `Err`. In other words, we attempt to verify as many attestations as possible and return specific per-attestation results so peer scores can be updated, if required.
When we batch verify attestations, we first try to map all those attestations to *indexed* attestations. If any of those attestations were able to be indexed, we then perform batch BLS verification on those indexed attestations. If the batch verification succeeds, we convert them into *verified* attestations, disabling individual signature checking. If the batch fails, we convert to verified attestations with individual signature checking enabled.
Ultimately, we optimistically try to do a batch verification of attestation signatures and fall-back to individual verification if it fails. This opens an attach vector for "poisoning" the attestations and causing us to waste a batch verification. I argue that peer scoring should do a good-enough job of defending against this and the typical-case gains massively outweigh the worst-case losses.
## Additional Info
Before this PR, attestation verification took the attestations by value (instead of by reference). It turns out that this was unnecessary and, in my opinion, resulted in some undesirable ergonomics (e.g., we had to pass the attestation back in the `Err` variant to avoid clones). In this PR I've modified attestation verification so that it now takes a reference.
I refactored the `beacon_chain/tests/attestation_verification.rs` tests so they use a builder-esque "tester" struct instead of a weird macro. It made it easier for me to test individual/batch with the same set of tests and I think it was a nice tidy-up. Notably, I did this last to try and make sure my new refactors to *actual* production code would pass under the existing test suite.
## Issue Addressed
Closes#1891Closes#1784
## Proposed Changes
Implement checkpoint sync for Lighthouse, enabling it to start from a weak subjectivity checkpoint.
## Additional Info
- [x] Return unavailable status for out-of-range blocks requested by peers (#2561)
- [x] Implement sync daemon for fetching historical blocks (#2561)
- [x] Verify chain hashes (either in `historical_blocks.rs` or the calling module)
- [x] Consistency check for initial block + state
- [x] Fetch the initial state and block from a beacon node HTTP endpoint
- [x] Don't crash fetching beacon states by slot from the API
- [x] Background service for state reconstruction, triggered by CLI flag or API call.
Considered out of scope for this PR:
- Drop the requirement to provide the `--checkpoint-block` (this would require some pretty heavy refactoring of block verification)
Co-authored-by: Diva M <divma@protonmail.com>
## Issue Addressed
N/A
## Proposed Changes
Add functionality in the validator monitor to provide sync committee related metrics for monitored validators.
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
## Issue Addressed
NA
## Proposed Changes
Missed head votes on attestations is a well-known issue. The primary cause is a block getting set as the head *after* the attestation deadline.
This PR aims to shorten the overall time between "block received" and "block set as head" by:
1. Persisting the head and fork choice *after* setting the canonical head
- Informal measurements show this takes ~200ms
1. Pruning the op pool *after* setting the canonical head.
1. No longer persisting the op pool to disk during `BeaconChain::fork_choice`
- Informal measurements show this can take up to 1.2s.
I also add some metrics to help measure the effect of these changes.
Persistence changes like this run the risk of breaking assumptions downstream. However, I have considered these risks and I think we're fine here. I will describe my reasoning for each change.
## Reasoning
### Change 1: Persisting the head and fork choice *after* setting the canonical head
For (1), although the function is called `persist_head_and_fork_choice`, it only persists:
- Fork choice
- Head tracker
- Genesis block root
Since `BeaconChain::fork_choice_internal` does not modify these values between the original time we were persisting it and the current time, I assert that the change I've made is non-substantial in terms of what ends up on-disk. There's the possibility that some *other* thread has modified fork choice in the extra time we've given it, but that's totally fine.
Since the only time we *read* those values from disk is during startup, I assert that this has no impact during runtime.
### Change 2: Pruning the op pool after setting the canonical head
Similar to the argument above, we don't modify the op pool during `BeaconChain::fork_choice_internal` so it shouldn't matter when we prune. This change should be non-substantial.
### Change 3: No longer persisting the op pool to disk during `BeaconChain::fork_choice`
This change *is* substantial. With the proposed changes, we'll only be persisting the op pool to disk when we shut down cleanly (i.e., the `BeaconChain` gets dropped). This means we'll save disk IO and time during usual operation, but a `kill -9` or similar "crash" will probably result in an out-of-date op pool when we reboot. An out-of-date op pool can only have an impact when producing blocks or aggregate attestations/sync committees.
I think it's pretty reasonable that a crash might result in an out-of-date op pool, since:
- Crashes are fairly rare. Practically the only time I see LH suffer a full crash is when the OOM killer shows up, and that's a very serious event.
- It's generally quite rare to produce a block/aggregate immediately after a reboot. Just a few slots of runtime is probably enough to have a decent-enough op pool again.
## Additional Info
Credits to @macladson for the timings referenced here.
## Issue Addressed
Closes#2526
## Proposed Changes
If the head block fails to decode on start up, do two things:
1. Revert all blocks between the head and the most recent hard fork (to `fork_slot - 1`).
2. Reset fork choice so that it contains the new head, and all blocks back to the new head's finalized checkpoint.
## Additional Info
I tweaked some of the beacon chain test harness stuff in order to make it generic enough to test with a non-zero slot clock on start-up. In the process I consolidated all the various `new_` methods into a single generic one which will hopefully serve all future uses 🤞
## Issue Addressed
Which issue # does this PR address?
## Proposed Changes
- Add a counter metric to log when a block is received late from gossip.
- Also push a `DEBG` log for the above condition.
- Use Debug (`?`) instead of Display (`%`) for a bunch of logs in the beacon processor, so we don't have to deal with concatenated block roots.
- Add new ERRO and CRIT to HTTP API to alert users when they're publishing late blocks.
## Additional Info
NA
## Issue Addressed
NA
## Proposed Changes
When testing our (not-yet-released) Doppelganger implementation, I noticed that we aren't detecting attestations included in blocks (only those on the gossip network).
This is because during [block processing](e8c0d1f19b/beacon_node/beacon_chain/src/beacon_chain.rs (L2168)) we only update the `observed_attestations` cache with each attestation, but not the `observed_attesters` cache. This is the correct behaviour when we consider the [p2p spec](https://github.com/ethereum/eth2.0-specs/blob/v1.0.1/specs/phase0/p2p-interface.md):
> [IGNORE] There has been no other valid attestation seen on an attestation subnet that has an identical attestation.data.target.epoch and participating validator index.
We're doing the right thing here and still allowing attestations on gossip that we've seen in a block. However, this doesn't work so nicely for Doppelganger.
To resolve this, I've taken the following steps:
- Add a `observed_block_attesters` cache.
- Rename `observed_attesters` to `observed_gossip_attesters`.
## TODO
- [x] Add a test to ensure a validator that's been seen in a block attestation (but not a gossip attestation) returns `true` for `BeaconChain::validator_seen_at_epoch`.
- [x] Add a test to ensure `observed_block_attesters` isn't polluted via gossip attestations and vice versa.
Co-authored-by: realbigsean <seananderson33@gmail.com>
## Proposed Changes
* Implement the validator client and HTTP API changes necessary to support Altair
Co-authored-by: realbigsean <seananderson33@gmail.com>
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
## Issue Addressed
Resolves#2069
## Proposed Changes
- Adds a `--doppelganger-detection` flag
- Adds a `lighthouse/seen_validators` endpoint, which will make it so the lighthouse VC is not interopable with other client beacon nodes if the `--doppelganger-detection` flag is used, but hopefully this will become standardized. Relevant Eth2 API repo issue: https://github.com/ethereum/eth2.0-APIs/issues/64
- If the `--doppelganger-detection` flag is used, the VC will wait until the beacon node is synced, and then wait an additional 2 epochs. The reason for this is to make sure the beacon node is able to subscribe to the subnets our validators should be attesting on. I think an alternative would be to have the beacon node subscribe to all subnets for 2+ epochs on startup by default.
## Additional Info
I'd like to add tests and would appreciate feedback.
TODO: handle validators started via the API, potentially make this default behavior
Co-authored-by: realbigsean <seananderson33@gmail.com>
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
Co-authored-by: Paul Hauner <paul@paulhauner.com>
## Issue Addressed
N/A
## Proposed Changes
- Removing a bunch of unnecessary references
- Updated `Error::VariantError` to `Error::Variant`
- There were additional enum variant lints that I ignored, because I thought our variant names were fine
- removed `MonitoredValidator`'s `pubkey` field, because I couldn't find it used anywhere. It looks like we just use the string version of the pubkey (the `id` field) if there is no index
## Additional Info
Co-authored-by: realbigsean <seananderson33@gmail.com>
## Issue Addressed
- Resolves#2169
## Proposed Changes
Adds the `AttesterCache` to allow validators to produce attestations for older slots. Presently, some arbitrary restrictions can force validators to receive an error when attesting to a slot earlier than the present one. This can cause attestation misses when there is excessive load on the validator client or time sync issues between the VC and BN.
## Additional Info
NA
## Proposed Changes
Add the `sync_aggregate` from `BeaconBlock` to the bulk signature verifier for blocks. This necessitates a new signature set constructor for the sync aggregate, which is different from the others due to the use of [`eth2_fast_aggregate_verify`](https://github.com/ethereum/eth2.0-specs/blob/v1.1.0-alpha.7/specs/altair/bls.md#eth2_fast_aggregate_verify) for sync aggregates, per [`process_sync_aggregate`](https://github.com/ethereum/eth2.0-specs/blob/v1.1.0-alpha.7/specs/altair/beacon-chain.md#sync-aggregate-processing). I made the choice to return an optional signature set, with `None` representing the case where the signature is valid on account of being the point at infinity (requires no further checking).
To "dogfood" the changes and prevent duplication, the consensus logic now uses the signature set approach as well whenever it is required to verify signatures (which should only be in testing AFAIK). The EF tests pass with the code as it exists currently, but failed before I adapted the `eth2_fast_aggregate_verify` changes (which is good).
As a result of this change Altair block processing should be a little faster, and importantly, we will no longer accidentally verify signatures when replaying blocks, e.g. when replaying blocks from the database.
## Issue Addressed
NA
## Proposed Changes
This PR addresses two things:
1. Allows the `ValidatorMonitor` to work with Altair states.
1. Optimizes `altair::process_epoch` (see [code](https://github.com/paulhauner/lighthouse/blob/participation-cache/consensus/state_processing/src/per_epoch_processing/altair/participation_cache.rs) for description)
## Breaking Changes
The breaking changes in this PR revolve around one premise:
*After the Altair fork, it's not longer possible (given only a `BeaconState`) to identify if a validator had *any* attestation included during some epoch. The best we can do is see if that validator made the "timely" source/target/head flags.*
Whilst this seems annoying, it's not actually too bad. Finalization is based upon "timely target" attestations, so that's really the most important thing. Although there's *some* value in knowing if a validator had *any* attestation included, it's far more important to know about "timely target" participation, since this is what affects finality and justification.
For simplicity and consistency, I've also removed the ability to determine if *any* attestation was included from metrics and API endpoints. Now, all Altair and non-Altair states will simply report on the head/target attestations.
The following section details where we've removed fields and provides replacement values.
### Breaking Changes: Prometheus Metrics
Some participation metrics have been removed and replaced. Some were removed since they are no longer relevant to Altair (e.g., total attesting balance) and others replaced with gwei values instead of pre-computed values. This provides more flexibility at display-time (e.g., Grafana).
The following metrics were added as replacements:
- `beacon_participation_prev_epoch_head_attesting_gwei_total`
- `beacon_participation_prev_epoch_target_attesting_gwei_total`
- `beacon_participation_prev_epoch_source_attesting_gwei_total`
- `beacon_participation_prev_epoch_active_gwei_total`
The following metrics were removed:
- `beacon_participation_prev_epoch_attester`
- instead use `beacon_participation_prev_epoch_source_attesting_gwei_total / beacon_participation_prev_epoch_active_gwei_total`.
- `beacon_participation_prev_epoch_target_attester`
- instead use `beacon_participation_prev_epoch_target_attesting_gwei_total / beacon_participation_prev_epoch_active_gwei_total`.
- `beacon_participation_prev_epoch_head_attester`
- instead use `beacon_participation_prev_epoch_head_attesting_gwei_total / beacon_participation_prev_epoch_active_gwei_total`.
The `beacon_participation_prev_epoch_attester` endpoint has been removed. Users should instead use the pre-existing `beacon_participation_prev_epoch_target_attester`.
### Breaking Changes: HTTP API
The `/lighthouse/validator_inclusion/{epoch}/{validator_id}` endpoint loses the following fields:
- `current_epoch_attesting_gwei` (use `current_epoch_target_attesting_gwei` instead)
- `previous_epoch_attesting_gwei` (use `previous_epoch_target_attesting_gwei` instead)
The `/lighthouse/validator_inclusion/{epoch}/{validator_id}` endpoint lose the following fields:
- `is_current_epoch_attester` (use `is_current_epoch_target_attester` instead)
- `is_previous_epoch_attester` (use `is_previous_epoch_target_attester` instead)
- `is_active_in_current_epoch` becomes `is_active_unslashed_in_current_epoch`.
- `is_active_in_previous_epoch` becomes `is_active_unslashed_in_previous_epoch`.
## Additional Info
NA
## TODO
- [x] Deal with total balances
- [x] Update validator_inclusion API
- [ ] Ensure `beacon_participation_prev_epoch_target_attester` and `beacon_participation_prev_epoch_head_attester` work before Altair
Co-authored-by: realbigsean <seananderson33@gmail.com>
## Proposed Changes
Remove the remaining Altair `FIXME`s from consensus land.
1. Implement tree hash caching for the participation lists. This required some light type manipulation, including removing the `TreeHash` bound from `CachedTreeHash` which was purely descriptive.
2. Plumb the proposer index through Altair attestation processing, to avoid calculating it for _every_ attestation (potentially 128ms on large networks). This duplicates some work from #2431, but with the aim of getting it in sooner, particularly for the Altair devnets.
3. Removes two FIXMEs related to `superstruct` and cloning, which are unlikely to be particularly detrimental and will be tracked here instead: https://github.com/sigp/superstruct/issues/5
## Issue Addressed
#635
## Proposed Changes
- Keep attestations that reference a block we have not seen for 30secs before being re processed
- If we do import the block before that time elapses, it is reprocessed in that moment
- The first time it fails, do nothing wrt to gossipsub propagation or peer downscoring. If after being re processed it fails, downscore with a `LowToleranceError` and ignore the message.
## Issue Addressed
NA
## Proposed Changes
Adds a metric to see how many set bits are in the sync aggregate for each beacon block being imported.
## Additional Info
NA
## Issue Addressed
- Resolves#2444
## Proposed Changes
Adds some more detail to the error message returned when the `BeaconChainBuilder` is unable to access or decode block/state objects during startup.
## Additional Info
NA
## Proposed Changes
Implement the consensus changes necessary for the upcoming Altair hard fork.
## Additional Info
This is quite a heavy refactor, with pivotal types like the `BeaconState` and `BeaconBlock` changing from structs to enums. This ripples through the whole codebase with field accesses changing to methods, e.g. `state.slot` => `state.slot()`.
Co-authored-by: realbigsean <seananderson33@gmail.com>
## Issue Addressed
#2377
## Proposed Changes
Implement the same code used for block root lookups (from #2376) to state root lookups in order to improve performance and reduce associated memory spikes (e.g. from certain HTTP API requests).
## Additional Changes
- Tests using `rev_iter_state_roots` and `rev_iter_block_roots` have been refactored to use their `forwards` versions instead.
- The `rev_iter_state_roots` and `rev_iter_block_roots` functions are now unused and have been removed.
- The `state_at_slot` function has been changed to use the `forwards` iterator.
## Additional Info
- Some tests still need to be refactored to use their `forwards_iter` versions. These tests start their iteration from a specific beacon state and thus use the `rev_iter_state_roots_from` and `rev_iter_block_roots_from` functions. If they can be refactored, those functions can also be removed.
## Issue Addressed
`make lint` failing on rust 1.53.0.
## Proposed Changes
1.53.0 updates
## Additional Info
I haven't figure out why yet, we were now hitting the recursion limit in a few crates. So I had to add `#![recursion_limit = "256"]` in a few places
Co-authored-by: realbigsean <seananderson33@gmail.com>
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
## Proposed Changes
A user on Discord (`@ChewsMacRibs`) reported that the validator monitor was logging `WARN Attested to an incorrect head` for their validator while it was awaiting activation.
This PR modifies the monitor so that it ignores inactive validators, by the logic that they are either awaiting activation, or have already exited. Either way, there's no way for an inactive validator to have their attestations included on chain, so no need for the monitor to report on them.
## Additional Info
To reproduce the bug requires registering validator keys manually with `--validator-monitor-pubkeys`. I don't think the bug will present itself with `--validator-monitor-auto`.
## Issue Addressed
NA
## Proposed Changes
When observing `jemallocator` heap profiles and Grafana, it became clear that Lighthouse is spending significant RAM/CPU on processing blocks from the RPC. On investigation, it seems that we are loading the parent of the block *before* we check to see if the block is already known. This is a big waste of resources.
This PR adds an additional `check_block_relevancy` call as the first thing we do when we try to process a `SignedBeaconBlock` via the RPC (or other similar methods). Ultimately, `check_block_relevancy` will be called again later in the block processing flow. It's a very light function and I don't think trying to optimize it out is worth the risk of a bad block slipping through.
Also adds a `New RPC block received` info log when we process a new RPC block. This seems like interesting and infrequent info.
## Additional Info
NA
## Issue Addressed
NA
## Primary Change
When investigating memory usage, I noticed that retrieving a block from an early slot (e.g., slot 900) would cause a sharp increase in the memory footprint (from 400mb to 800mb+) which seemed to be ever-lasting.
After some investigation, I found that the reverse iteration from the head back to that slot was the likely culprit. To counter this, I've switched the `BeaconChain::block_root_at_slot` to use the forwards iterator, instead of the reverse one.
I also noticed that the networking stack is using `BeaconChain::root_at_slot` to check if a peer is relevant (`check_peer_relevance`). Perhaps the steep, seemingly-random-but-consistent increases in memory usage are caused by the use of this function.
Using the forwards iterator with the HTTP API alleviated the sharp increases in memory usage. It also made the response much faster (before it felt like to took 1-2s, now it feels instant).
## Additional Changes
In the process I also noticed that we have two functions for getting block roots:
- `BeaconChain::block_root_at_slot`: returns `None` for a skip slot.
- `BeaconChain::root_at_slot`: returns the previous root for a skip slot.
I unified these two functions into `block_root_at_slot` and added the `WhenSlotSkipped` enum. Now, the caller must be explicit about the skip-slot behaviour when requesting a root.
Additionally, I replaced `vec![]` with `Vec::with_capacity` in `store::chunked_vector::range_query`. I stumbled across this whilst debugging and made this modification to see what effect it would have (not much). It seems like a decent change to keep around, but I'm not concerned either way.
Also, `BeaconChain::get_ancestor_block_root` is unused, so I got rid of it 🗑️.
## Additional Info
I haven't also done the same for state roots here. Whilst it's possible and a good idea, it's more work since the fwds iterators are presently block-roots-specific.
Whilst there's a few places a reverse iteration of state roots could be triggered (e.g., attestation production, HTTP API), they're no where near as common as the `check_peer_relevance` call. As such, I think we should get this PR merged first, then come back for the state root iters. I made an issue here https://github.com/sigp/lighthouse/issues/2377.
## Issue Addressed
The latest version of Rust has new clippy rules & the codebase isn't up to date with them.
## Proposed Changes
Small formatting changes that clippy tells me are functionally equivalent
## Issue Addressed
N/A
## Proposed Changes
Add unit tests for the various CLI flags associated with the beacon node and validator client. These changes require the addition of two new flags: `dump-config` and `immediate-shutdown`.
## Additional Info
Both `dump-config` and `immediate-shutdown` are marked as hidden since they should only be used in testing and other advanced use cases.
**Note:** This requires changing `main.rs` so that the flags can adjust the program behavior as necessary.
Co-authored-by: Paul Hauner <paul@paulhauner.com>
## Issue Addressed
Resolves#2186
## Proposed Changes
404 for any block-related information on a slot that was skipped or orphaned
Affected endpoints:
- `/eth/v1/beacon/blocks/{block_id}`
- `/eth/v1/beacon/blocks/{block_id}/root`
- `/eth/v1/beacon/blocks/{block_id}/attestations`
- `/eth/v1/beacon/headers/{block_id}`
## Additional Info
Co-authored-by: realbigsean <seananderson33@gmail.com>
## Proposed Changes
Use two instances of max cover when packing attestations into blocks: one for the previous epoch, and one for the current epoch. This reduces the amount of computation done by roughly half due to the `O(n^2)` running time of max cover (`2 * (n/2)^2 = n^2/2`). This should help alleviate some load on block proposal, particularly on Prater.
## Issue Addressed
NA
## Proposed Changes
- Adds a specific log and metric for when a block is enshrined as head with a delay that will caused bad attestations
- We *technically* already expose this information, but it's a little tricky to determine during debugging. This makes it nice and explicit.
- Fixes a minor reporting bug with the validator monitor where it was expecting agg. attestations too early (at half-slot rather than two-thirds-slot).
## Additional Info
NA
## Issue Addressed
NA
## Proposed Changes
- Ensure that the [target consistency check](b356f52c5c) is always performed on aggregates.
- Add a regression test.
## Additional Info
NA
This is a small PR that cleans up compiler warnings.
The most controversial change is removing the `data_dir` field from the `BeaconChainBuilder`.
It was removed because it was never read.
Co-authored-by: Paul Hauner <paul@paulhauner.com>
Co-authored-by: Herman Junge <hermanjunge@protonmail.com>
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
## Issue Addressed
Which issue # does this PR address?
## Proposed Changes
Avoids cloning the `BeaconState` each time Prometheus scrapes our metrics (generally every 5s 😱).
I think the original motivation behind this was *"don't hold the lock on the head whilst we do computation on it"*, however I think is flawed since our computation here is so small that it'll be quicker than the clone.
The primary motivation here is to maintain a small memory footprint by holding less in memory (i.e., the cloned `BeaconState`) and to avoid the fragmentation-creep that occurs when cloning the big contiguous slabs of memory in the `BeaconState`.
I also collapsed the active/slashed/withdrawn counters into a single loop to increase efficiency.
## Additional Info
NA
## Issue Addressed
NA
## Proposed Changes
I noticed the following error on one of our nodes:
```
Mar 18 00:03:35 ip-xxxx lighthouse-bn[333503]: Mar 18 00:03:35.103 ERRO Unable to validate aggregate error: ObservedAttestersError(EpochTooLow { epoch: Epoch(23961), lowest_permissible_epoch: Epoch(23962) }), peer_id: 16Uiu2HAm5GL5KzPLhvfg9MBBFSpBqTVGRFSiTg285oezzWcZzwEv
```
The slot during this log was 766,815 (the last slot of the epoch). I believe this is due to an off-by-one error in `observed_attesters` where we were failing to provide enough capacity to store observations from the previous, current and next epochs. See code comments for further reasoning.
Here's a link to the spec: https://github.com/ethereum/eth2.0-specs/blob/v1.0.1/specs/phase0/p2p-interface.md#beacon_aggregate_and_proof
## Additional Info
NA
## Issue Addressed
Closes#2052
## Proposed Changes
- Refactor the attester/proposer duties endpoints in the BN
- Performance improvements
- Fixes some potential inconsistencies with the dependent root fields.
- Removes `http_api::beacon_proposer_cache` and just uses the one on the `BeaconChain` instead.
- Move the code for the proposer/attester duties endpoints into separate files, for readability.
- Refactor the `DutiesService` in the VC
- Required to reduce the delay on broadcasting new blocks.
- Gets rid of the `ValidatorDuty` shim struct that came about when we adopted the standard API.
- Separate block/attestation duty tasks so that they don't block each other when one is slow.
- In the VC, use `PublicKeyBytes` to represent validators instead of `PublicKey`. `PublicKey` is a legit crypto object whilst `PublicKeyBytes` is just a byte-array, it's much faster to clone/hash `PublicKeyBytes` and this change has had a significant impact on runtimes.
- Unfortunately this has created lots of dust changes.
- In the BN, store `PublicKeyBytes` in the `beacon_proposer_cache` and allow access to them. The HTTP API always sends `PublicKeyBytes` over the wire and the conversion from `PublicKey` -> `PublickeyBytes` is non-trivial, especially when queries have 100s/1000s of validators (like Pyrmont).
- Add the `state_processing::state_advance` mod which dedups a lot of the "apply `n` skip slots to the state" code.
- This also fixes a bug with some functions which were failing to include a state root as per [this comment](072695284f/consensus/state_processing/src/state_advance.rs (L69-L74)). I couldn't find any instance of this bug that resulted in anything more severe than keying a shuffling cache by the wrong block root.
- Swap the VC block service to use `mpsc` from `tokio` instead of `futures`. This is consistent with the rest of the code base.
~~This PR *reduces* the size of the codebase 🎉~~ It *used* to reduce the size of the code base before I added more comments.
## Observations on Prymont
- Proposer duties times down from peaks of 450ms to consistent <1ms.
- Current epoch attester duties times down from >1s peaks to a consistent 20-30ms.
- Block production down from +600ms to 100-200ms.
## Additional Info
- ~~Blocked on #2241~~
- ~~Blocked on #2234~~
## TODO
- [x] ~~Refactor this into some smaller PRs?~~ Leaving this as-is for now.
- [x] Address `per_slot_processing` roots.
- [x] Investigate slow next epoch times. Not getting added to cache on block processing?
- [x] Consider [this](072695284f/beacon_node/store/src/hot_cold_store.rs (L811-L812)) in the scenario of replacing the state roots
Co-authored-by: pawan <pawandhananjay@gmail.com>
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
## Proposed Changes
While investigating an incorrect head + target vote for the epoch boundary block 708544, I noticed that the state advance failed to prime the proposer cache, as per these logs:
```
Mar 09 21:42:47.448 DEBG Subscribing to subnet target_slot: 708544, subnet: Y, service: attestation_service
Mar 09 21:49:08.063 DEBG Advanced head state one slot current_slot: 708543, state_slot: 708544, head_root: 0xaf5e69de09f384ee3b4fb501458b7000c53bb6758a48817894ec3d2b030e3e6f, service: state_advance
Mar 09 21:49:08.063 DEBG Completed state advance initial_slot: 708543, advanced_slot: 708544, head_root: 0xaf5e69de09f384ee3b4fb501458b7000c53bb6758a48817894ec3d2b030e3e6f, service: state_advance
Mar 09 21:49:14.787 DEBG Proposer shuffling cache miss block_slot: 708544, block_root: 0x9b14bf68667ab1d9c35e6fd2c95ff5d609aa9e8cf08e0071988ae4aa00b9f9fe, parent_slot: 708543, parent_root: 0xaf5e69de09f384ee3b4fb501458b7000c53bb6758a48817894ec3d2b030e3e6f, service: beacon
Mar 09 21:49:14.800 DEBG Successfully processed gossip block root: 0x9b14bf68667ab1d9c35e6fd2c95ff5d609aa9e8cf08e0071988ae4aa00b9f9fe, slot: 708544, graffiti: , service: beacon
Mar 09 21:49:14.800 INFO New block received hash: 0x9b14…f9fe, slot: 708544
Mar 09 21:49:14.984 DEBG Head beacon block slot: 708544, root: 0x9b14…f9fe, finalized_epoch: 22140, finalized_root: 0x28ec…29a7, justified_epoch: 22141, justified_root: 0x59db…e451, service: beacon
Mar 09 21:49:15.055 INFO Unaggregated attestation validator: XXXXX, src: api, slot: 708544, epoch: 22142, delay_ms: 53, index: Y, head: 0xaf5e69de09f384ee3b4fb501458b7000c53bb6758a48817894ec3d2b030e3e6f, service: val_mon
Mar 09 21:49:17.001 DEBG Slot timer sync_state: Synced, current_slot: 708544, head_slot: 708544, head_block: 0x9b14…f9fe, finalized_epoch: 22140, finalized_root: 0x28ec…29a7, peers: 55, service: slot_notifier
```
The reason for this is that the condition was backwards, so that whole block of code was unreachable.
Looking at the attestations for the block included in the block after, we can see that lots of validators missed it. Some of them may be Lighthouse v1.1.1-v1.2.0-rc.0, but it's probable that they would have missed even with the proposer cache primed, given how late the block 708544 arrived (the cache miss occurred 3.787s after the slot start): https://beaconcha.in/block/708545#attestations
## Issue Addressed
NA
## Proposed Changes
- Use the pre-states from #2174 during block production.
- Running this on Pyrmont shows block production times dropping from ~550ms to ~150ms.
- Create `crit` and `warn` logs when a block is published to the API later than we expect.
- On mainnet we are issuing a warn if the block is published more than 1s later than the slot start and a crit for more than 3s.
- Rename some methods on the `SnapshotCache` for clarity.
- Add the ability to pass the state root to `BeaconChain::produce_block_on_state` to avoid computing a state root. This is a very common LH optimization.
- Add a metric that tracks how late we broadcast blocks received from the HTTP API. This is *technically* a duplicate of a `ValidatorMonitor` log, but I wanted to have it for the case where we aren't monitoring validators too.
## Issue Addressed
Closes#1787
## Proposed Changes
* Abstract the `ValidatorPubkeyCache` over a "backing" which is either a file (legacy), or the database.
* Implement a migration from schema v2 to schema v3, whereby the contents of the cache file are copied to the DB, and then the file is deleted. The next release to include this change must be a minor version bump, and we will need to warn users of the inability to downgrade (this is our first DB schema change since mainnet genesis).
* Move the schema migration code from the `store` crate into the `beacon_chain` crate so that it can access the datadir and the `ValidatorPubkeyCache`, etc. It gets injected back into the `store` via a closure (similar to what we do in fork choice).
## Issue Addressed
NA
## Problem this PR addresses
There's an issue where Lighthouse is banning a lot of peers due to the following sequence of events:
1. Gossip block 0xabc arrives ~200ms early
- It is propagated across the network, with respect to [`MAXIMUM_GOSSIP_CLOCK_DISPARITY`](https://github.com/ethereum/eth2.0-specs/blob/v1.0.0/specs/phase0/p2p-interface.md#why-is-there-maximum_gossip_clock_disparity-when-validating-slot-ranges-of-messages-in-gossip-subnets).
- However, it is not imported to our database since the block is early.
2. Attestations for 0xabc arrive, but the block was not imported.
- The peer that sent the attestation is down-voted.
- Each unknown-block attestation causes a score loss of 1, the peer is banned at -100.
- When the peer is on an attestation subnet there can be hundreds of attestations, so the peer is banned quickly (before the missed block can be obtained via rpc).
## Potential solutions
I can think of three solutions to this:
1. Wait for attestation-queuing (#635) to arrive and solve this.
- Easy
- Not immediate fix.
- Whilst this would work, I don't think it's a perfect solution for this particular issue, rather (3) is better.
1. Allow importing blocks with a tolerance of `MAXIMUM_GOSSIP_CLOCK_DISPARITY`.
- Easy
- ~~I have implemented this, for now.~~
1. If a block is verified for gossip propagation (i.e., signature verified) and it's within `MAXIMUM_GOSSIP_CLOCK_DISPARITY`, then queue it to be processed at the start of the appropriate slot.
- More difficult
- Feels like the best solution, I will try to implement this.
**This PR takes approach (3).**
## Changes included
- Implement the `block_delay_queue`, based upon a [`DelayQueue`](https://docs.rs/tokio-util/0.6.3/tokio_util/time/delay_queue/struct.DelayQueue.html) which can store blocks until it's time to import them.
- Add a new `DelayedImportBlock` variant to the `beacon_processor::WorkEvent` enum to handle this new event.
- In the `BeaconProcessor`, refactor a `tokio::select!` to a struct with an explicit `Stream` implementation. I experienced some issues with `tokio::select!` in the block delay queue and I also found it hard to debug. I think this explicit implementation is nicer and functionally equivalent (apart from the fact that `tokio::select!` randomly chooses futures to poll, whereas now we're deterministic).
- Add a testing framework to the `beacon_processor` module that tests this new block delay logic. I also tested a handful of other operations in the beacon processor (attns, slashings, exits) since it was super easy to copy-pasta the code from the `http_api` tester.
- To implement these tests I added the concept of an optional `work_journal_tx` to the `BeaconProcessor` which will spit out a log of events. I used this in the tests to ensure that things were happening as I expect.
- The tests are a little racey, but it's hard to avoid that when testing timing-based code. If we see CI failures I can revise. I haven't observed *any* failures due to races on my machine or on CI yet.
- To assist with testing I allowed for directly setting the time on the `ManualSlotClock`.
- I gave the `beacon_processor::Worker` a `Toolbox` for two reasons; (a) it avoids changing tons of function sigs when you want to pass a new object to the worker and (b) it seemed cute.
## Issue Addressed
NA
## Proposed Changes
Add an optimization to perform `per_slot_processing` from the *leading-edge* of block processing to the *trailing-edge*. Ultimately, this allows us to import the block at slot `n` faster because we used the tail-end of slot `n - 1` to perform `per_slot_processing`.
Additionally, add a "block proposer cache" which allows us to cache the block proposer for some epoch. Since we're now doing trailing-edge `per_slot_processing`, we can prime this cache with the values for the next epoch before those blocks arrive (assuming those blocks don't have some weird forking).
There were several ancillary changes required to achieve this:
- Remove the `state_root` field of `BeaconSnapshot`, since there's no need to know it on a `pre_state` and in all other cases we can just read it from `block.state_root()`.
- This caused some "dust" changes of `snapshot.beacon_state_root` to `snapshot.beacon_state_root()`, where the `BeaconSnapshot::beacon_state_root()` func just reads the state root from the block.
- Rename `types::ShuffingId` to `AttestationShufflingId`. I originally did this because I added a `ProposerShufflingId` struct which turned out to be not so useful. I thought this new name was more descriptive so I kept it.
- Address https://github.com/ethereum/eth2.0-specs/pull/2196
- Add a debug log when we get a block with an unknown parent. There was previously no logging around this case.
- Add a function to `BeaconState` to compute all proposers for an epoch without re-computing the active indices for each slot.
## Additional Info
- ~~Blocked on #2173~~
- ~~Blocked on #2179~~ That PR was wrapped into this PR.
- There's potentially some places where we could avoid computing the proposer indices in `per_block_processing` but I haven't done this here. These would be an optimization beyond the issue at hand (improving block propagation times) and I think this PR is already doing enough. We can come back for that later.
## TODO
- [x] Tidy, improve comments.
- [x] ~~Try avoid computing proposer index in `per_block_processing`?~~
## Issue Addressed
Which issue # does this PR address?
## Proposed Changes
Replaces use of `format!` in `slog` logging with it's special no-allocation `?` and `%` shortcuts. According to a `heaptrack` analysis today over about a period of an hour, this will reduce temporary allocations by at least 4%.
## Additional Info
NA
## Issue Addressed
NA
## Proposed Changes
Adds some metrics to track delays regarding:
- LH processing of blocks
- delays receiving blocks from other nodes.
## Additional Info
NA
## Issue Addressed
- Resolves#2064
## Proposed Changes
Adds a `ValidatorMonitor` struct which provides additional logging and Grafana metrics for specific validators.
Use `lighthouse bn --validator-monitor` to automatically enable monitoring for any validator that hits the [subnet subscription](https://ethereum.github.io/eth2.0-APIs/#/Validator/prepareBeaconCommitteeSubnet) HTTP API endpoint.
Also, use `lighthouse bn --validator-monitor-pubkeys` to supply a list of validators which will always be monitored.
See the new docs included in this PR for more info.
## TODO
- [x] Track validator balance, `slashed` status, etc.
- [x] ~~Register slashings in current epoch, not offense epoch~~
- [ ] Publish Grafana dashboard, update TODO link in docs
- [x] ~~#2130 is merged into this branch, resolve that~~
## Issue Addressed
The non-finality period on Pyrmont between epochs [`9114`](https://pyrmont.beaconcha.in/epoch/9114) and [`9182`](https://pyrmont.beaconcha.in/epoch/9182) was contributed to by all the `lighthouse_team` validators going down. The nodes saw excessive CPU and RAM usage, resulting in the system to kill the `lighthouse bn` process. The `Restart=on-failure` directive for `systemd` caused the process to bounce in ~10-30m intervals.
Diagnosis with `heaptrack` showed that the `BeaconChain::produce_unaggregated_attestation` function was calling `store::beacon_state::get_full_state` and sometimes resulting in a tree hash cache allocation. These allocations were approximately the size of the hosts physical memory and still allocated when `lighthouse bn` was killed by the OS.
There was no CPU analysis (e.g., `perf`), but the `BeaconChain::produce_unaggregated_attestation` is very CPU-heavy so it is reasonable to assume it is the cause of the excessive CPU usage, too.
## Proposed Changes
`BeaconChain::produce_unaggregated_attestation` has two paths:
1. Fast path: attesting to the head slot or later.
2. Slow path: attesting to a slot earlier than the head block.
Path (2) is the only path that calls `store::beacon_state::get_full_state`, therefore it is the path causing this excessive CPU/RAM usage.
This PR removes the current functionality of path (2) and replaces it with a static error (`BeaconChainError::AttestingPriorToHead`).
This change reduces the generality of `BeaconChain::produce_unaggregated_attestation` (and therefore [`/eth/v1/validator/attestation_data`](https://ethereum.github.io/eth2.0-APIs/#/Validator/produceAttestationData)), but I argue that this functionality is an edge-case and arguably a violation of the [Honest Validator spec](https://github.com/ethereum/eth2.0-specs/blob/dev/specs/phase0/validator.md).
It's possible that a validator goes back to a prior slot to "catch up" and submit some missed attestations. This change would prevent such behaviour, returning an error. My concerns with this catch-up behaviour is that it is:
- Not specified as "honest validator" attesting behaviour.
- Is behaviour that is risky for slashing (although, all validator clients *should* have slashing protection and will eventually fail if they do not).
- It disguises clock-sync issues between a BN and VC.
## Additional Info
It's likely feasible to implement path (2) if we implement some sort of caching mechanism. This would be a multi-week task and this PR gets the issue patched in the short term. I haven't created an issue to add path (2), instead I think we should implement it if we get user-demand.
## Issue Addressed
NA
## Proposed Changes
Copied from #2083, changes the config milliseconds_per_slot to seconds_per_slot to avoid errors when slot duration is not a multiple of a second. To avoid deserializing old serialized data (with milliseconds instead of seconds) the Serialize and Deserialize derive got removed from the Spec struct (isn't currently used anyway).
This PR replaces #2083 for the purpose of fixing a merge conflict without requiring the input of @blacktemplar.
## Additional Info
NA
Co-authored-by: blacktemplar <blacktemplar@a1.net>
## Issue Addressed
NA
## Proposed Changes
As per #2100, uses derives from the sturm library to implement AsRef<str> and AsStaticRef to easily get str values from enums without creating new Strings. Furthermore unifies all attestation error counter into one IntCounterVec vector.
These works are originally by @blacktemplar, I've just created this PR so I can resolve some merge conflicts.
## Additional Info
NA
Co-authored-by: blacktemplar <blacktemplar@a1.net>
## Issue Addressed
`test_dht_persistence` failing
## Proposed Changes
Bind `NetworkService::start` to an underscore prefixed variable rather than `_`. `_` was causing it to be dropped immediately
This was failing 5/100 times before this update, but I haven't been able to get it to fail after updating it
Co-authored-by: realbigsean <seananderson33@gmail.com>
## Proposed Changes
`@potuz` on the Eth R&D Discord observed that Lighthouse blocks on Pyrmont were always arriving at other nodes after at least 1 second. Part of this could be due to processing and slow propagation, but metrics also revealed that the Lighthouse nodes were usually taking 400-600ms to even just produce a block before broadcasting it.
I tracked the slowness down to the lack of a pre-built tree hash cache (THC) on the states being used for block production. This was due to using the head state for block production, which lacks a THC in order to keep fork choice fast (cloning a THC takes at least 30ms for 100k validators). This PR modifies block production to clone a state from the snapshot cache rather than the head, which speeds things up by 200-400ms by avoiding the tree hash cache rebuild. In practice this seems to have cut block production time down to 300ms or less. Ideally we could _remove_ the snapshot from the cache (and save the 30ms), but it is required for when we re-process the block after signing it with the validator client.
## Alternatives
I experimented with 2 alternatives to this approach, before deciding on it:
* Alternative 1: ensure the `head` has a tree hash cache. This is too slow, as it imposes a +30ms hit on fork choice, which currently takes ~5ms (with occasional spikes).
* Alternative 2: use `Arc<BeaconSnapshot>` in the snapshot cache and share snapshots between the cache and the `head`. This made fork choice blazing fast (1ms), and block production the same as in this PR, but had a negative impact on block processing which I don't think is worth it. It ended up being necessary to clone the full state from the snapshot cache during block production, imposing the +30ms penalty there _as well_ as in block production.
In contract, the approach in this PR should only impact block production, and it improves it! Yay for pareto improvements 🎉
## Additional Info
This commit (ac59dfa) is currently running on all the Lighthouse Pyrmont nodes, and I've added a dashboard to the Pyrmont grafana instance with the metrics.
In future work we should optimise the attestation packing, which consumes around 30-60ms and is now a substantial contributor to the total.
## Issue Addressed
Closes#2048
## Proposed Changes
* Broadcast slashings when the `--slasher-broadcast` flag is provided.
* In the process of implementing this I refactored the slasher service into its own crate so that it could access the network code without creating a circular dependency. I moved the responsibility for putting slashings into the op pool into the service as well, as it makes sense for it to handle the whole slashing lifecycle.
## Issue Addressed
Closes#2028
Replaces #2059
## Proposed Changes
If writing to the database fails while importing a block, revert fork choice to the last version stored on disk. This prevents fork choice from being ahead of the blocks on disk. Having fork choice ahead is particularly bad if it is later successfully written to disk, because it renders the database corrupt (see #2028).
## Additional Info
* This mitigation might fail if the head+fork choice haven't been persisted yet, which can only happen at first startup (see #2067)
* This relies on it being OK for the head tracker to be ahead of fork choice. I figure this is tolerable because blocks only get added to the head tracker after successfully being written on disk _and_ to fork choice, so even if fork choice reverts a little bit, when the pruning algorithm runs, those blocks will still be on disk and OK to prune. The pruning algorithm also doesn't rely on heads being unique, technically it's OK for multiple blocks from the same linear chain segment to be present in the head tracker. This begs the question of #1785 (i.e. things would be simpler with the head tracker out of the way). Alternatively, this PR could just revert the head tracker as well (I'll look into this tomorrow).
## Issue Addressed
Closes#2042
## Proposed Changes
Pass blocks that fail gossip verification to the slasher. Blocks that are successfully verified are not passed immediately, but will be passed as part of full block verification.
## Issue Addressed
Resolves#1434 (this is the last major feature in the standard spec. There are only a couple of places we may be off-spec due to recent spec changes or ongoing discussion)
Partly addresses #1669
## Proposed Changes
- remove the websocket server
- remove the `TeeEventHandler` and `NullEventHandler`
- add server sent events according to the eth2 API spec
## Additional Info
This is according to the currently unmerged PR here: https://github.com/ethereum/eth2.0-APIs/pull/117
Co-authored-by: realbigsean <seananderson33@gmail.com>
Fixes a couple of low hanging fruits.
- Fixes#2037
- `validators-dir` and `secrets-dir` flags don't really need to depend upon each other
- Fixes#2006 and Fixes#1995
## Issue Addressed
NA
## Proposed Changes
- Log about eth1 whilst waiting for genesis.
- For the block and deposit caches, update them after each download instead of when *all* downloads are complete.
- This prevents the case where a single timeout error can cause us to drop *all* previously download blocks/deposits.
- Set `max_log_requests_per_update` to avoid timeouts due to very large log counts in a response.
- Set `max_blocks_per_update` to prevent a single update of the block cache to download an unreasonable number of blocks.
- This shouldn't have any affect in normal use, it's just a safe-guard against bugs.
- Increase the timeout for eth1 calls from 15s to 60s, as per @pawanjay176's experience with Infura.
## Additional Info
NA
## Description
This PR updates Lighthouse to tokio 0.3. It includes a number of dependency updates and some structural changes as to how we create and spawn tasks.
This also brings with it a number of various improvements:
- Discv5 update
- Libp2p update
- Fix for recompilation issues
- Improved UPnP port mapping handling
- Futures dependency update
- Log downgrade to traces for rejecting peers when we've reached our max
Co-authored-by: blacktemplar <blacktemplar@a1.net>
This is an implementation of a slasher that lives inside the BN and can be enabled via `lighthouse bn --slasher`.
Features included in this PR:
- [x] Detection of attester slashing conditions (double votes, surrounds existing, surrounded by existing)
- [x] Integration into Lighthouse's attestation verification flow
- [x] Detection of proposer slashing conditions
- [x] Extraction of attestations from blocks as they are verified
- [x] Compression of chunks
- [x] Configurable history length
- [x] Pruning of old attestations and blocks
- [x] More tests
Future work:
* Focus on a slice of history separate from the most recent N epochs (e.g. epochs `current - K` to `current - M`)
* Run out-of-process
* Ingest attestations from the chain without a resync
Design notes are here https://hackmd.io/@sproul/HJSEklmPL
## Issue Addressed
Closes#1719
## Proposed Changes
Lift the internal `RwLock`s and `Mutex`es from the `Observed*` data structures to resolve the race conditions described in #1719.
Most of this work was done by @paulhauner on his `lift-locks` branch, I merely updated it for the current `master` and checked over it.
## Additional Info
I think it would be prudent to test this on a testnet or two before mainnet launch, just to be sure that the extra lock contention doesn't negatively impact performance.
## Issue Addressed
NA
## Proposed Changes
- Caches later blocks than is required by `ETH1_FOLLOW_DISTANCE`.
- Adds logging to `warn` if the eth1 cache is insufficiently primed.
- Use `max_by_key` instead of `max_by` in `BeaconChain::Eth1Chain` since it's simpler.
- Rename `voting_period_start_timestamp` to `voting_target_timestamp` for accuracy.
## Additional Info
The reason for eating into the `ETH1_FOLLOW_DISTANCE` and caching blocks that are closer to the head is due to possibility for `SECONDS_PER_ETH1_BLOCK` to be incorrect (as is the case for the Pyrmont testnet on Goerli).
If `SECONDS_PER_ETH1_BLOCK` is too short, we'll skip back too far from the head and skip over blocks that would be valid [`is_candidate_block`](https://github.com/ethereum/eth2.0-specs/blob/v1.0.0/specs/phase0/validator.md#eth1-data) blocks. This was the case on the Pyrmont testnet and resulted in Lighthouse choosing blocks that were about 30 minutes older than is ideal.
## Issue Addressed
*Should* address #1917
## Proposed Changes
Stops the `BackgroupMigrator` rx channel from backing up with big `BeaconState` messages.
Looking at some logs from my Medalla node, we can see a discrepancy between the head finalized epoch and the migrator finalized epoch:
```
Nov 17 16:50:21.606 DEBG Head beacon block slot: 129214, root: 0xbc7a…0b99, finalized_epoch: 4033, finalized_root: 0xf930…6562, justified_epoch: 4035, justified_root: 0x206b…9321, service: beacon
Nov 17 16:50:21.626 DEBG Batch processed service: sync, processed_blocks: 43, last_block_slot: 129214, chain: 8274002112260436595, first_block_slot: 129153, batch_epoch: 4036
Nov 17 16:50:21.626 DEBG Chain advanced processing_target: 4036, new_start: 4036, previous_start: 4034, chain: 8274002112260436595, service: sync
Nov 17 16:50:22.162 DEBG Completed batch received awaiting_batches: 5, blocks: 47, epoch: 4048, chain: 8274002112260436595, service: sync
Nov 17 16:50:22.162 DEBG Requesting batch start_slot: 129601, end_slot: 129664, downloaded: 0, processed: 0, state: Downloading(16Uiu2HAmG3C3t1McaseReECjAF694tjVVjkDoneZEbxNhWm1nZaT, 0 blocks, 1273), epoch: 4050, chain: 8274002112260436595, service: sync
Nov 17 16:50:22.654 DEBG Database compaction complete service: beacon
Nov 17 16:50:22.655 INFO Starting database pruning new_finalized_epoch: 2193, old_finalized_epoch: 2192, service: beacon
```
I believe this indicates that the migrator rx has a backed-up queue of `MigrationNotification` items which each contain a `BeaconState`.
## TODO
- [x] Remove finalized state requirement for op-pool
## Proposed Changes
In an attempt to fix OOM issues and database consistency issues observed by some users after the introduction of compaction in v0.3.4, this PR makes the following changes:
* Run compaction less often: roughly every 1024 epochs, including after long periods of non-finality. I think the division check proposed by Paul is pretty solid, and ensures we don't miss any events where we should be compacting. LevelDB lacks an easy way to check the size of the DB, which would be another good trigger.
* Make it possible to disable the compaction on finalization using `--auto-compact-db=false`
* Make it possible to trigger a manual, single-threaded foreground compaction on start-up using `--compact-db`
* Downgrade the pruning log to `DEBUG`, as it's particularly noisy during sync
I would like to ship these changes to affected users ASAP, and will document them further in the Advanced Database section of the book if they prove effective.
## Issue Addressed
Resolves#1809Resolves#1824Resolves#1818Resolves#1828 (hopefully)
## Proposed Changes
- add `validator_index` to the proposer duties endpoint
- add the ability to query for historical proposer duties
- `StateId` deserialization now fails with a 400 warp rejection
- add the `validator_balances` endpoint
- update the `aggregate_and_proofs` endpoint to accept an array
- updates the attester duties endpoint from a `GET` to a `POST`
- reduces the number of times we query for proposer duties from once per slot per validator to only once per slot
Co-authored-by: realbigsean <seananderson33@gmail.com>
Co-authored-by: Paul Hauner <paul@paulhauner.com>
## Issue Addressed
Closes#1866
## Proposed Changes
* Compact the database on finalization. This removes the deleted states from disk completely. Because it happens in the background migrator, it doesn't block other database operations while it runs. On my Medalla node it took about 1 minute and shrank the database from 90GB to 9GB.
* Fix an inefficiency in the pruning algorithm where it would always use the genesis checkpoint as the `old_finalized_checkpoint` when running for the first time after start-up. This would result in loading lots of states one-at-a-time back to genesis, and storing a lot of block roots in memory. The new code stores the old finalized checkpoint on disk and only uses genesis if no checkpoint is already stored. This makes it both backwards compatible _and_ forwards compatible -- no schema change required!
* Introduce two new `INFO` logs to indicate when pruning has started and completed. Users seem to want to know this information without enabling debug logs!
## Issue Addressed
- Related to #1691
## Proposed Changes
Adds the following API endpoints:
- `GET lighthouse/eth1/syncing`: status about how synced we are with Eth1.
- `GET lighthouse/eth1/block_cache`: all locally cached eth1 blocks.
- `GET lighthouse/eth1/deposit_cache`: all locally cached eth1 deposits.
Additionally:
- Moves some types from the `beacon_node/eth1` to the `common/eth2` crate, so they can be used in the API without duplication.
- Allow `update_deposit_cache` and `update_block_cache` to take an optional head block number to avoid duplicate requests.
## Additional Info
TBC
## Issue Addressed
Closes#800Closes#1713
## Proposed Changes
Implement the temporary state storage algorithm described in #800. Specifically:
* Add `DBColumn::BeaconStateTemporary`, for storing 0-length temporary marker values.
* Store intermediate states immediately as they are created, marked temporary. Delete the temporary flag if the block is processed successfully.
* Add a garbage collection process to delete leftover temporary states on start-up.
* Bump the database schema version to 2 so that a DB with temporary states can't accidentally be used with older versions of the software. The auto-migration is a no-op, but puts in place some infra that we can use for future migrations (e.g. #1784)
## Additional Info
There are two known race conditions, one potentially causing permanent faults (hopefully rare), and the other insignificant.
### Race 1: Permanent state marked temporary
EDIT: this has been fixed by the addition of a lock around the relevant critical section
There are 2 threads that are trying to store 2 different blocks that share some intermediate states (e.g. they both skip some slots from the current head). Consider this sequence of events:
1. Thread 1 checks if state `s` already exists, and seeing that it doesn't, prepares an atomic commit of `(s, s_temporary_flag)`.
2. Thread 2 does the same, but also gets as far as committing the state txn, finishing the processing of its block, and _deleting_ the temporary flag.
3. Thread 1 is (finally) scheduled again, and marks `s` as temporary with its transaction.
4.
a) The process is killed, or thread 1's block fails verification and the temp flag is not deleted. This is a permanent failure! Any attempt to load state `s` will fail... hope it isn't on the main chain! Alternatively (4b) happens...
b) Thread 1 finishes, and re-deletes the temporary flag. In this case the failure is transient, state `s` will disappear temporarily, but will come back once thread 1 finishes running.
I _hope_ that steps 1-3 only happen very rarely, and 4a even more rarely. It's hard to know
This once again begs the question of why we're using LevelDB (#483), when it clearly doesn't care about atomicity! A ham-fisted fix would be to wrap the hot and cold DBs in locks, which would bring us closer to how other DBs handle read-write transactions. E.g. [LMDB only allows one R/W transaction at a time](https://docs.rs/lmdb/0.8.0/lmdb/struct.Environment.html#method.begin_rw_txn).
### Race 2: Temporary state returned from `get_state`
I don't think this race really matters, but in `load_hot_state`, if another thread stores a state between when we call `load_state_temporary_flag` and when we call `load_hot_state_summary`, then we could end up returning that state even though it's only a temporary state. I can't think of any case where this would be relevant, and I suspect if it did come up, it would be safe/recoverable (having data is safer than _not_ having data).
This could be fixed by using a LevelDB read snapshot, but that would require substantial changes to how we read all our values, so I don't think it's worth it right now.
## Issue Addressed
Closes#1769Closes#1708
## Proposed Changes
Tweaks the op pool pruning so that the attestation pool is pruned against the wall-clock epoch instead of the finalized state's epoch. This should reduce the unbounded growth that we've seen during periods without finality.
Also fixes up the voluntary exit pruning as raised in #1708.
## Issue Addressed
Closes#1548
## Proposed Changes
Optimizes attester slashing choice by choosing the ones that cover the most amount of validators slashed, with the highest effective balances
## Additional Info
Initial pass, need to write a test for it
## Issue Addressed
Resolves#1776
## Proposed Changes
The beacon chain builder was using the canonical head's state root for the `genesis_state_root` field.
## Additional Info
## Issue Addressed
Closes#1557
## Proposed Changes
Modify the pruning algorithm so that it mutates the head-tracker _before_ committing the database transaction to disk, and _only if_ all the heads to be removed are still present in the head-tracker (i.e. no concurrent mutations).
In the process of writing and testing this I also had to make a few other changes:
* Use internal mutability for all `BeaconChainHarness` functions (namely the RNG and the graffiti), in order to enable parallel calls (see testing section below).
* Disable logging in harness tests unless the `test_logger` feature is turned on
And chose to make some clean-ups:
* Delete the `NullMigrator`
* Remove type-based configuration for the migrator in favour of runtime config (simpler, less duplicated code)
* Use the non-blocking migrator unless the blocking migrator is required. In the store tests we need the blocking migrator because some tests make asserts about the state of the DB after the migration has run.
* Rename `validators_keypairs` -> `validator_keypairs` in the `BeaconChainHarness`
## Testing
To confirm that the fix worked, I wrote a test using [Hiatus](https://crates.io/crates/hiatus), which can be found here:
https://github.com/michaelsproul/lighthouse/tree/hiatus-issue-1557
That test can't be merged because it inserts random breakpoints everywhere, but if you check out that branch you can run the test with:
```
$ cd beacon_node/beacon_chain
$ cargo test --release --test parallel_tests --features test_logger
```
It should pass, and the log output should show:
```
WARN Pruning deferred because of a concurrent mutation, message: this is expected only very rarely!
```
## Additional Info
This is a backwards-compatible change with no impact on consensus.
* Initial rebase
* Remove old code
* Correct release tests
* Rebase commit
* Remove eth2-testnet dep on eth2libp2p
* Remove crates lost in rebase
* Remove unused dep
## Issue Addressed
NA
## Proposed Changes
Addresses an interesting DoS vector raised by @protolambda by verifying that the head and target are consistent when processing aggregate attestations. This check prevents us from loading very old target blocks and doing lots of work to skip them to the current slot.
## Additional Info
NA
This commit was edited by Paul H when rebasing from master to
v0.3.0-staging.
Solution 2 proposed here: https://github.com/sigp/lighthouse/issues/1435#issuecomment-692317639
- Adds an optional `--wss-checkpoint` flag that takes a string `root:epoch`
- Verify that the given checkpoint exists in the chain, or that the the chain syncs through this checkpoint. If not, shutdown and prompt the user to purge state before restarting.
Co-authored-by: Paul Hauner <paul@paulhauner.com>
## Issue Addressed
Closes#673
## Proposed Changes
Store a schema version in the database so that future releases can check they're running against a compatible database version. This would also enable automatic migration on breaking database changes, but that's left as future work.
The database config is also stored in the database so that the `slots_per_restore_point` value can be checked for consistency, which closes#673
- Resolves#1550
- Resolves#824
- Resolves#825
- Resolves#1131
- Resolves#1411
- Resolves#1256
- Resolve#1177
- Includes the `ShufflingId` struct initially defined in #1492. That PR is now closed and the changes are included here, with significant bug fixes.
- Implement the https://github.com/ethereum/eth2.0-APIs in a new `http_api` crate using `warp`. This replaces the `rest_api` crate.
- Add a new `common/eth2` crate which provides a wrapper around `reqwest`, providing the HTTP client that is used by the validator client and for testing. This replaces the `common/remote_beacon_node` crate.
- Create a `http_metrics` crate which is a dedicated server for Prometheus metrics (they are no longer served on the same port as the REST API). We now have flags for `--metrics`, `--metrics-address`, etc.
- Allow the `subnet_id` to be an optional parameter for `VerifiedUnaggregatedAttestation::verify`. This means it does not need to be provided unnecessarily by the validator client.
- Move `fn map_attestation_committee` in `mod beacon_chain::attestation_verification` to a new `fn with_committee_cache` on the `BeaconChain` so the same cache can be used for obtaining validator duties.
- Add some other helpers to `BeaconChain` to assist with common API duties (e.g., `block_root_at_slot`, `head_beacon_block_root`).
- Change the `NaiveAggregationPool` so it can index attestations by `hash_tree_root(attestation.data)`. This is a requirement of the API.
- Add functions to `BeaconChainHarness` to allow it to create slashings and exits.
- Allow for `eth1::Eth1NetworkId` to go to/from a `String`.
- Add functions to the `OperationPool` to allow getting all objects in the pool.
- Add function to `BeaconState` to check if a committee cache is initialized.
- Fix bug where `seconds_per_eth1_block` was not transferring over from `YamlConfig` to `ChainSpec`.
- Add the `deposit_contract_address` to `YamlConfig` and `ChainSpec`. We needed to be able to return it in an API response.
- Change some uses of serde `serialize_with` and `deserialize_with` to a single use of `with` (code quality).
- Impl `Display` and `FromStr` for several BLS fields.
- Check for clock discrepancy when VC polls BN for sync state (with +/- 1 slot tolerance). This is not intended to be comprehensive, it was just easy to do.
- See #1434 for a per-endpoint overview.
- Seeking clarity here: https://github.com/ethereum/eth2.0-APIs/issues/75
- [x] Add docs for prom port to close#1256
- [x] Follow up on this #1177
- [x] ~~Follow up with #1424~~ Will fix in future PR.
- [x] Follow up with #1411
- [x] ~~Follow up with #1260~~ Will fix in future PR.
- [x] Add quotes to all integers.
- [x] Remove `rest_types`
- [x] Address missing beacon block error. (#1629)
- [x] ~~Add tests for lighthouse/peers endpoints~~ Wontfix
- [x] ~~Follow up with validator status proposal~~ Tracked in #1434
- [x] Unify graffiti structs
- [x] ~~Start server when waiting for genesis?~~ Will fix in future PR.
- [x] TODO in http_api tests
- [x] Move lighthouse endpoints off /eth/v1
- [x] Update docs to link to standard
- ~~Blocked on #1586~~
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
## Issue Addressed
Closes#1680
## Proposed Changes
This PR fixes a race condition in beacon node start-up whereby the pubkey cache could be created by the beacon chain builder before the `PersistedBeaconChain` was stored to disk. When the node restarted, it would find the persisted chain missing, and attempt to start from scratch, creating a new pubkey cache in the process. This call to `ValidatorPubkeyCache::new` would fail if the file already existed (which it did). I changed the behaviour so that pubkey cache initialization now doesn't care whether there's a file already in existence (it's only a cache after all). Instead it will truncate and recreate the file in the race scenario described.
## Issue Addressed
NA
## Proposed Changes
There are four new conditions introduced in v0.12.3:
1. _[REJECT]_ The attestation's epoch matches its target -- i.e. `attestation.data.target.epoch ==
compute_epoch_at_slot(attestation.data.slot)`
1. _[REJECT]_ The attestation's target block is an ancestor of the block named in the LMD vote -- i.e.
`get_ancestor(store, attestation.data.beacon_block_root, compute_start_slot_at_epoch(attestation.data.target.epoch)) == attestation.data.target.root`
1. _[REJECT]_ The committee index is within the expected range -- i.e. `data.index < get_committee_count_per_slot(state, data.target.epoch)`.
1. _[REJECT]_ The number of aggregation bits matches the committee size -- i.e.
`len(attestation.aggregation_bits) == len(get_beacon_committee(state, data.slot, data.index))`.
This PR implements new logic to suit (1) and (2). Tests are added for (3) and (4), although they were already implicitly enforced.
## Additional Info
- There's a bit of edge-case with target root verification that I raised here: https://github.com/ethereum/eth2.0-specs/pull/2001#issuecomment-699246659
- I've had to add an `--ignore` to `cargo audit` to get CI to pass. See https://github.com/sigp/lighthouse/issues/1669
## Issue Addressed
- Resolves#1616
## Proposed Changes
If we look at the function which persists fork choice and the canonical head to disk:
1db8daae0c/beacon_node/beacon_chain/src/beacon_chain.rs (L234-L280)
There is a race-condition which might cause the canonical head and fork choice values to be out-of-sync.
I believe this is the cause of #1616. I managed to recreate the issue and produce a database that was unable to sync under the `master` branch but able to sync with this branch.
These new changes solve the issue by ignoring the persisted `canonical_head_block_root` value and instead getting fork choice to generate it. This ensures that the canonical head is in-sync with fork choice.
## Additional Info
This is hotfix method that leaves some crusty code hanging around. Once this PR is merged (to satisfy the v0.2.x users) we should later update and merge #1638 so we can have a clean fix for the v0.3.x versions.
## Issue Addressed
- Resolves#1616
## Proposed Changes
Fixes a bug where we are unable to read the finalized block from fork choice.
## Detail
I had made an assumption that the finalized block always has a parent root of `None`:
e5fc6bab48/consensus/fork_choice/src/fork_choice.rs (L749-L752)
This was a faulty assumption, we don't set parent *roots* to `None`. Instead we *sometimes* set parent *indices* to `None`, depending if this pruning condition is satisfied:
e5fc6bab48/consensus/proto_array/src/proto_array.rs (L229-L232)
The bug manifested itself like this:
1. We attempt to get the finalized block from fork choice
1. We try to check that the block is descendant of the finalized block (note: they're the same block).
1. We expect the parent root to be `None`, but it's actually the parent root of the finalized root.
1. We therefore end up checking if the parent of the finalized root is a descendant of itself. (note: it's an *ancestor* not a *descendant*).
1. We therefore declare that the finalized block is not a descendant of (or eq to) the finalized block. Bad.
## Additional Info
In reflection, I made a poor assumption in the quest to obtain a probably negligible performance gain. The performance gain wasn't worth the risk and we got burnt.
## Issue Addressed
Partly addresses #1547
## Proposed Changes
This fix addresses the missing attestations at slot 0 of an epoch (also sometimes slot 1 when slot 0 was skipped).
There are 2 cases:
1. BN receives the block for the attestation slot after 4 seconds (1/3rd of the slot).
2. No block is proposed for this slot.
In both cases, when we produce the attestation, we pass the head state to the
`produce_unaggregated_attestation_for_block` function here
9833eca024/beacon_node/beacon_chain/src/beacon_chain.rs (L845-L850)
Since we don't advance the state in this function, we set `attestation.data.source = state.current_justified_checkpoint` which is atleast 2 epochs lower than current_epoch(wall clock epoch).
This attestation is invalid and cannot be included in a block because of this assert from the spec:
```python
if data.target.epoch == get_current_epoch(state):
assert data.source == state.current_justified_checkpoint
state.current_epoch_attestations.append(pending_attestation)
```
https://github.com/ethereum/eth2.0-specs/blob/dev/specs/phase0/beacon-chain.md#attestations
This PR changes the `produce_unaggregated_attestation_for_block` function to ensure that it advances the state before producing the attestation at the new epoch.
Running this on my node, have missed 0 attestations across all 8 of my validators in a 100 epoch period 🎉
To compare, I was missing ~14 attestations across all 8 validators in the same 100 epoch period before the fix.
Will report missed attestations if any after running for another 100 epochs tomorrow.
Converts the graffiti binary data to string before printing to logs.
## Issue Addressed
#1566
## Proposed Changes
Rather than converting graffiti to a vector the binary data less the last character is passed to String::from_utf_lossy(). This then allows us to call the to_string() function directly to give us the string
## Additional Info
Rust skills are fairly weak
The PR:
* Adds the ability to generate a crucial test scenario that isn't possible with `BeaconChainHarness` (i.e. two blocks occupying the same slot; previously forks necessitated skipping slots):
![image](https://user-images.githubusercontent.com/165678/88195404-4bce3580-cc40-11ea-8c08-b48d2e1d5959.png)
* New testing API: Instead of repeatedly calling add_block(), you generate a sorted `Vec<Slot>` and leave it up to the framework to generate blocks at those slots.
* Jumping backwards to an earlier epoch is a hard error, so that tests necessarily generate blocks in a epoch-by-epoch manner.
* Configures the test logger so that output is printed on the console in case a test fails. The logger also plays well with `--nocapture`, contrary to the existing testing framework
* Rewrites existing fork pruning tests to use the new API
* Adds a tests that triggers finalization at a non epoch boundary slot
* Renamed `BeaconChainYoke` to `BeaconChainTestingRig` because the former has been too confusing
* Fixed multiple tests (e.g. `block_production_different_shuffling_long`, `delete_blocks_and_states`, `shuffling_compatible_simple_fork`) that relied on a weird (and accidental) feature of the old `BeaconChainHarness` that attestations aren't produced for epochs earlier than the current one, thus masking potential bugs in test cases.
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
## Issue Addressed
Closes#1488
## Proposed Changes
* Prevent the pruning algorithm from over-eagerly deleting states at skipped slots when they are shared with the canonical chain.
* Add `debug` logging to the pruning algorithm so we have so better chance of debugging future issues from logs.
* Modify the handling of the "finalized state" in the beacon chain, so that it's always the state at the first slot of the finalized epoch (previously it was the state at the finalized block). This gives database pruning a clearer and cleaner view of things, and will marginally impact the pruning of the op pool, observed proposers, etc (in ways that are safe as far as I can tell).
* Remove duplicated `RevertedFinalizedEpoch` check from `after_finalization`
* Delete useless and unused `max_finality_distance`
* Add tests that exercise pruning with shared states at skip slots
* Delete unnecessary `block_strategy` argument from `add_blocks` and friends in the test harness (will likely conflict with #1380 slightly, sorry @adaszko -- but we can fix that)
* Bonus: add a `BeaconChain::with_head` method. I didn't end up needing it, but it turned out quite nice, so I figured we could keep it?
## Additional Info
Any users who have experienced pruning errors on Medalla will need to resync after upgrading to a release including this change. This should end unbounded `chain_db` growth! 🎉
## Issue Addressed
NA
## Proposed Changes
Sets the default max skips to 700 so that it can cover the 693 slot skip from `80894 - 80201`.
## Additional Info
NA
## Issue Addressed
NA
## Proposed Changes
- Fixes a mistake I made in #1530 which resulted us in *not* rejecting attestations that we intended to reject.
- Adds skip-slot checks for blocks earlier in import process, so it rejects gossip and RPC blocks.
## Additional Info
NA
## Proposed Changes
To mitigate the impact of minority forks on RAM and disk usage, this change rejects blocks whose parent lies more than 320 slots (10 epochs, ~1 hour) in the past. The behaviour is configurable via `lighthouse bn --max-skip-slots N`, and can be turned off entirely using `--max-skip-slots none`.
Co-authored-by: Paul Hauner <paul@paulhauner.com>
## Issue Addressed
NA
## Proposed Changes
- Adds a new function to allow getting a state with a bad state root history for attestation verification. This reduces unnecessary tree hashing during attestation processing, which accounted for 23% of memory allocations (by bytes) in a recent `heaptrack` observation.
- Don't clone caches on intermediate epoch-boundary states during block processing.
- Reject blocks that are known to fork choice earlier during gossip processing, instead of waiting until after state has been loaded (this only happens in edge-case).
- Avoid multiple re-allocations by creating a "forced" exact size iterator.
## Additional Info
NA
## Issue Addressed
- Resolves#1451
## Proposed Changes
- Restricts the `contains_block` and `contains_block` so they only indicate a block is present if it descends from the finalized root. This helps to ensure that fork choice never points to a block that has been pruned from the database.
- Resolves#1451
- Before importing a block, double-check that its parent is known and a descendant of the finalized root.
- Split a big, monolithic block verification test into smaller tests.
## Additional Notes
I suspect there would be a craftier way to do the `is_descendant_of_finalized` check, but we're a bit tight on time now and we can optimize later if it starts showing in benches.
## TODO
- [x] Tests
## Issue Addressed
#1419
## Proposed Changes
Creates a `--graffiti` cli flag in the validator client. If the flag is set, it overrides graffiti in the beacon node.
## Additional Info
## Issue Addressed
NA
## Proposed Changes
- When producing a block, go and ensure every attestation in the naive aggregation pool is included in the operation pool. This should help us increase the number of useful attestations in a block.
- Lift the `RwLock`s inside `NaiveAggregationPool` up into a single high-level lock. There were race conditions in the existing setup and it was hard to reason about.
## Additional Info
NA
## Issue Addressed
#1028
A bit late, but I think if `BlockError` had a kind (the current `BlockError` minus everything on the variants that comes directly from the block) and the original block, more clones could be removed
## Issue Addressed
NA
## Proposed Changes
- Refactor the `bls` crate to support multiple BLS "backends" (e.g., milagro, blst, etc).
- Removes some duplicate, unused code in `common/rest_types/src/validator.rs`.
- Removes the old "upgrade legacy keypairs" functionality (these were unencrypted keys that haven't been supported for a few testnets, no one should be using them anymore).
## Additional Info
Most of the files changed are just inconsequential changes to function names.
## TODO
- [x] Optimization levels
- [x] Infinity point: https://github.com/supranational/blst/issues/11
- [x] Ensure milagro *and* blst are tested via CI
- [x] What to do with unsafe code?
- [x] Test infinity point in signature sets
## Issue Addressed
This PR makes the `Eth1Chain::use_dummy_backend` field private. I believe this could be good to ensure the consistency of a Eth1Chain instance. 💡
## Issue Addressed
Closes#1319
## Proposed Changes
This issue:
1. Allows users to edit their Graffiti via the cli option `--graffiti`. If the graffiti is too long, lighthouse will not start and throw an error message. Otherwise, it will set the Graffiti to be the one provided by the user, right-padded with 0s.
2. Create a new `Graffiti` type and unify the code around it. With this type, everything is enforced at compile-time, and the code can be (I think...) panic-free! :)
## Additional info
Currently, only `&str` are supported, as this is the returned type by `.arg("graffiti")`.
Since this is user-input, I tried being as careful as I could. This is also why I created the `Graffiti` type, to make sure I could check as much as possible at compile time.
* Bake in Altona testnet (without genesis state)
* Add sig verification, without optimization
* Start integration with genesis service
* Update config.yml
* Fix eth2_testnet_config test
* Stop using default spec in genesis
* Fix lcli compile error
* Update min genesis time
* Fix typo
* Process exits and slashings off the network
* Fix rest_api tests
* Add op verification tests
* Add tests for pruning of slashings in the op pool
* Address Paul's review comments
* Update `milagro_bls` to new release (#1183)
* Update milagro_bls to new release
Signed-off-by: Kirk Baird <baird.k@outlook.com>
* Tidy up fake cryptos
Signed-off-by: Kirk Baird <baird.k@outlook.com>
* move SecretHash to bls and put plaintext back
Signed-off-by: Kirk Baird <baird.k@outlook.com>
* Update v0.12.0 to v0.12.1
* Add compute_subnet_for_attestation
* Replace CommitteeIndex topic with Attestation
* Fix warnings
* Fix attestation service tests
* fmt
* Appease clippy
* return error from validator_subscriptions
* move state out of loop
* Fix early break on error
* Get state from slot clock
* Fix beacon state in attestation tests
* Add failing test for lookahead > 1
* Minor change
* Address some review comments
* Add subnet verification to beacon chain
* Move subnet verification to processor
* Pass committee_count_at_slot to ValidatorDuty and ValidatorSubscription
* Pass subnet id for publishing attestations
* Fix attestation service tests
* Fix more tests
* Fix fork choice test
* Remove unused code
* Remove more unused and expensive code
Co-authored-by: Kirk Baird <baird.k@outlook.com>
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
Co-authored-by: Age Manning <Age@AgeManning.com>
Co-authored-by: Paul Hauner <paul@paulhauner.com>
* Layer do_atomically() abstractions properly
* Reduce allocs and DRY get_key_for_col()
* Parameterize HotColdDB with hot and cold item stores
* -impl Store for MemoryStore
* Replace Store uses with HotColdDB
* Ditch Store trait
* cargo fmt
* Style fix
* Readd missing dep that broke the build
* Add logging on shutdown
* Replace tokio::spawn with handle.spawn
* Upgrade tokio
* Add a task executor
* Beacon chain tasks use task executor
* Validator client tasks use task executor
* Rename runtime_handle to executor
* Add duration histograms; minor fixes
* Cleanup
* Fix logs
* Fix tests
* Remove random file
* Get enr dependency instead of libp2p
* Address some review comments
* Libp2p takes a TaskExecutor
* Ugly fix libp2p tests
* Move TaskExecutor to own file
* Upgrade Dockerfile rust version
* Minor fixes
* Revert "Ugly fix libp2p tests"
This reverts commit 58d4bb690f52de28d893943b7504d2d0c6621429.
* Pretty fix libp2p tests
* Add spawn_without_exit; change Counter to Gauge
* Tidy
* Move log from RuntimeContext to TaskExecutor
* Fix errors
* Replace histogram with int_gauge for async tasks
* Fix todo
* Fix memory leak in test by exiting all spawned tasks at the end
* Remove redundant method
* Pull out a method out of a struct
* More precise db access abstractions
* Move fake trait method out of it
* cargo fmt
* Fix compilation error after refactoring
* Move another fake method out the Store trait
* Get rid of superfluous method
* Fix refactoring bug
* Rename: SimpleStoreItem -> StoreItem
* Get rid of the confusing DiskStore type alias
* Get rid of SimpleDiskStore type alias
* Correction: A method took both self and a ref to Self
* Port eth1 lib to use stable futures
* Port eth1_test_rig to stable futures
* Port eth1 tests to stable futures
* Port genesis service to stable futures
* Port genesis tests to stable futures
* Port beacon_chain to stable futures
* Port lcli to stable futures
* Fix eth1_test_rig (#1014)
* Fix lcli
* Port timer to stable futures
* Fix timer
* Port websocket_server to stable futures
* Port notifier to stable futures
* Add TODOS
* Update hashmap hashset to stable futures
* Adds panic test to hashset delay
* Port remote_beacon_node to stable futures
* Fix lcli merge conflicts
* Non rpc stuff compiles
* protocol.rs compiles
* Port websockets, timer and notifier to stable futures (#1035)
* Fix lcli
* Port timer to stable futures
* Fix timer
* Port websocket_server to stable futures
* Port notifier to stable futures
* Add TODOS
* Port remote_beacon_node to stable futures
* Partial eth2-libp2p stable future upgrade
* Finished first round of fighting RPC types
* Further progress towards porting eth2-libp2p adds caching to discovery
* Update behaviour
* RPC handler to stable futures
* Update RPC to master libp2p
* Network service additions
* Fix the fallback transport construction (#1102)
* Correct warning
* Remove hashmap delay
* Compiling version of eth2-libp2p
* Update all crates versions
* Fix conversion function and add tests (#1113)
* Port validator_client to stable futures (#1114)
* Add PH & MS slot clock changes
* Account for genesis time
* Add progress on duties refactor
* Add simple is_aggregator bool to val subscription
* Start work on attestation_verification.rs
* Add progress on ObservedAttestations
* Progress with ObservedAttestations
* Fix tests
* Add observed attestations to the beacon chain
* Add attestation observation to processing code
* Add progress on attestation verification
* Add first draft of ObservedAttesters
* Add more tests
* Add observed attesters to beacon chain
* Add observers to attestation processing
* Add more attestation verification
* Create ObservedAggregators map
* Remove commented-out code
* Add observed aggregators into chain
* Add progress
* Finish adding features to attestation verification
* Ensure beacon chain compiles
* Link attn verification into chain
* Integrate new attn verification in chain
* Remove old attestation processing code
* Start trying to fix beacon_chain tests
* Split adding into pools into two functions
* Add aggregation to harness
* Get test harness working again
* Adjust the number of aggregators for test harness
* Fix edge-case in harness
* Integrate new attn processing in network
* Fix compile bug in validator_client
* Update validator API endpoints
* Fix aggreagation in test harness
* Fix enum thing
* Fix attestation observation bug:
* Patch failing API tests
* Start adding comments to attestation verification
* Remove unused attestation field
* Unify "is block known" logic
* Update comments
* Supress fork choice errors for network processing
* Add todos
* Tidy
* Add gossip attn tests
* Disallow test harness to produce old attns
* Comment out in-progress tests
* Partially address pruning tests
* Fix failing store test
* Add aggregate tests
* Add comments about which spec conditions we check
* Dont re-aggregate
* Split apart test harness attn production
* Fix compile error in network
* Make progress on commented-out test
* Fix skipping attestation test
* Add fork choice verification tests
* Tidy attn tests, remove dead code
* Remove some accidentally added code
* Fix clippy lint
* Rename test file
* Add block tests, add cheap block proposer check
* Rename block testing file
* Add observed_block_producers
* Tidy
* Switch around block signature verification
* Finish block testing
* Remove gossip from signature tests
* First pass of self review
* Fix deviation in spec
* Update test spec tags
* Start moving over to hashset
* Finish moving observed attesters to hashmap
* Move aggregation pool over to hashmap
* Make fc attn borrow again
* Fix rest_api compile error
* Fix missing comments
* Fix monster test
* Uncomment increasing slots test
* Address remaining comments
* Remove unsafe, use cfg test
* Remove cfg test flag
* Fix dodgy comment
* Revert "Update hashmap hashset to stable futures"
This reverts commit d432378a3cc5cd67fc29c0b15b96b886c1323554.
* Revert "Adds panic test to hashset delay"
This reverts commit 281502396fc5b90d9c421a309c2c056982c9525b.
* Ported attestation_service
* Ported duties_service
* Ported fork_service
* More ports
* Port block_service
* Minor fixes
* VC compiles
* Update TODOS
* Borrow self where possible
* Ignore aggregates that are already known.
* Unify aggregator modulo logic
* Fix typo in logs
* Refactor validator subscription logic
* Avoid reproducing selection proof
* Skip HTTP call if no subscriptions
* Rename DutyAndState -> DutyAndProof
* Tidy logs
* Print root as dbg
* Fix compile errors in tests
* Fix compile error in test
* Re-Fix attestation and duties service
* Minor fixes
Co-authored-by: Paul Hauner <paul@paulhauner.com>
* Network crate update to stable futures
* Port account_manager to stable futures (#1121)
* Port account_manager to stable futures
* Run async fns in tokio environment
* Port rest_api crate to stable futures (#1118)
* Port rest_api lib to stable futures
* Reduce tokio features
* Update notifier to stable futures
* Builder update
* Further updates
* Convert self referential async functions
* stable futures fixes (#1124)
* Fix eth1 update functions
* Fix genesis and client
* Fix beacon node lib
* Return appropriate runtimes from environment
* Fix test rig
* Refactor eth1 service update
* Upgrade simulator to stable futures
* Lighthouse compiles on stable futures
* Remove println debugging statement
* Update libp2p service, start rpc test upgrade
* Update network crate for new libp2p
* Update tokio::codec to futures_codec (#1128)
* Further work towards RPC corrections
* Correct http timeout and network service select
* Use tokio runtime for libp2p
* Revert "Update tokio::codec to futures_codec (#1128)"
This reverts commit e57aea924acf5cbabdcea18895ac07e38a425ed7.
* Upgrade RPC libp2p tests
* Upgrade secio fallback test
* Upgrade gossipsub examples
* Clean up RPC protocol
* Test fixes (#1133)
* Correct websocket timeout and run on os thread
* Fix network test
* Clean up PR
* Correct tokio tcp move attestation service tests
* Upgrade attestation service tests
* Correct network test
* Correct genesis test
* Test corrections
* Log info when block is received
* Modify logs and update attester service events
* Stable futures: fixes to vc, eth1 and account manager (#1142)
* Add local testnet scripts
* Remove whiteblock script
* Rename local testnet script
* Move spawns onto handle
* Fix VC panic
* Initial fix to block production issue
* Tidy block producer fix
* Tidy further
* Add local testnet clean script
* Run cargo fmt
* Tidy duties service
* Tidy fork service
* Tidy ForkService
* Tidy AttestationService
* Tidy notifier
* Ensure await is not suppressed in eth1
* Ensure await is not suppressed in account_manager
* Use .ok() instead of .unwrap_or(())
* RPC decoding test for proto
* Update discv5 and eth2-libp2p deps
* Fix lcli double runtime issue (#1144)
* Handle stream termination and dialing peer errors
* Correct peer_info variant types
* Remove unnecessary warnings
* Handle subnet unsubscription removal and improve logigng
* Add logs around ping
* Upgrade discv5 and improve logging
* Handle peer connection status for multiple connections
* Improve network service logging
* Improve logging around peer manager
* Upgrade swarm poll centralise peer management
* Identify clients on error
* Fix `remove_peer` in sync (#1150)
* remove_peer removes from all chains
* Remove logs
* Fix early return from loop
* Improved logging, fix panic
* Partially correct tests
* Stable futures: Vc sync (#1149)
* Improve syncing heuristic
* Add comments
* Use safer method for tolerance
* Fix tests
* Stable futures: Fix VC bug, update agg pool, add more metrics (#1151)
* Expose epoch processing summary
* Expose participation metrics to prometheus
* Switch to f64
* Reduce precision
* Change precision
* Expose observed attesters metrics
* Add metrics for agg/unagg attn counts
* Add metrics for gossip rx
* Add metrics for gossip tx
* Adds ignored attns to prom
* Add attestation timing
* Add timer for aggregation pool sig agg
* Add write lock timer for agg pool
* Add more metrics to agg pool
* Change map lock code
* Add extra metric to agg pool
* Change lock handling in agg pool
* Change .write() to .read()
* Add another agg pool timer
* Fix for is_aggregator
* Fix pruning bug
Co-authored-by: pawan <pawandhananjay@gmail.com>
Co-authored-by: Paul Hauner <paul@paulhauner.com>
* Improve error handling in block iteration
* Introduce atomic DB operations
* Fix race condition
An invariant was violated: For every block hash in head_tracker, that
block is accessible from the store.
* Add PH & MS slot clock changes
* Account for genesis time
* Add progress on duties refactor
* Add simple is_aggregator bool to val subscription
* Start work on attestation_verification.rs
* Add progress on ObservedAttestations
* Progress with ObservedAttestations
* Fix tests
* Add observed attestations to the beacon chain
* Add attestation observation to processing code
* Add progress on attestation verification
* Add first draft of ObservedAttesters
* Add more tests
* Add observed attesters to beacon chain
* Add observers to attestation processing
* Add more attestation verification
* Create ObservedAggregators map
* Remove commented-out code
* Add observed aggregators into chain
* Add progress
* Finish adding features to attestation verification
* Ensure beacon chain compiles
* Link attn verification into chain
* Integrate new attn verification in chain
* Remove old attestation processing code
* Start trying to fix beacon_chain tests
* Split adding into pools into two functions
* Add aggregation to harness
* Get test harness working again
* Adjust the number of aggregators for test harness
* Fix edge-case in harness
* Integrate new attn processing in network
* Fix compile bug in validator_client
* Update validator API endpoints
* Fix aggreagation in test harness
* Fix enum thing
* Fix attestation observation bug:
* Patch failing API tests
* Start adding comments to attestation verification
* Remove unused attestation field
* Unify "is block known" logic
* Update comments
* Supress fork choice errors for network processing
* Add todos
* Tidy
* Add gossip attn tests
* Disallow test harness to produce old attns
* Comment out in-progress tests
* Partially address pruning tests
* Fix failing store test
* Add aggregate tests
* Add comments about which spec conditions we check
* Dont re-aggregate
* Split apart test harness attn production
* Fix compile error in network
* Make progress on commented-out test
* Fix skipping attestation test
* Add fork choice verification tests
* Tidy attn tests, remove dead code
* Remove some accidentally added code
* Fix clippy lint
* Rename test file
* Add block tests, add cheap block proposer check
* Rename block testing file
* Add observed_block_producers
* Tidy
* Switch around block signature verification
* Finish block testing
* Remove gossip from signature tests
* First pass of self review
* Fix deviation in spec
* Update test spec tags
* Start moving over to hashset
* Finish moving observed attesters to hashmap
* Move aggregation pool over to hashmap
* Make fc attn borrow again
* Fix rest_api compile error
* Fix missing comments
* Fix monster test
* Uncomment increasing slots test
* Address remaining comments
* Remove unsafe, use cfg test
* Remove cfg test flag
* Fix dodgy comment
* Ignore aggregates that are already known.
* Unify aggregator modulo logic
* Fix typo in logs
* Refactor validator subscription logic
* Avoid reproducing selection proof
* Skip HTTP call if no subscriptions
* Rename DutyAndState -> DutyAndProof
* Tidy logs
* Print root as dbg
* Fix compile errors in tests
* Fix compile error in test