Commit Graph

1927 Commits

Author SHA1 Message Date
Age Manning
1790010260 Upgrade to latest libp2p (#2605)
This is a pre-cursor to the next libp2p upgrade. 

It is currently being used for staging a number of PR upgrades which are contingent on the latest libp2p.
2021-10-29 01:59:29 +00:00
ethDreamer
2c4413454a Fixed Gossip Topics on Fork Boundary (#2619)
## Issue Addressed

The [p2p-interface section of the `altair` spec](https://github.com/ethereum/consensus-specs/blob/dev/specs/altair/p2p-interface.md#transitioning-the-gossip) says you should subscribe to the topics for a fork "In advance of the fork" and unsubscribe from old topics `2 Epochs` after the new fork is activated. We've chosen to subscribe to new fork topics `2 slots` before the fork is initiated.

This function is supposed to return the required fork digests at any given time but as it was currently written, it doesn't return the fork digest for a previous fork if you've switched to the current fork less than 2 epoch's ago. Also this function required modification for every new fork we add.

## Proposed Changes

Make this function fork-agnostic and correctly handle the previous fork topic digests when you've only just switched to the new fork.
2021-10-29 00:05:27 +00:00
Pawan Dhananjay
88063398f6 Prevent double import of blocks (#2647)
## Issue Addressed

Resolves #2611 

## Proposed Changes

Adds a duplicate block root cache to the `BeaconProcessor`. Adds the block root to the cache before calling `process_gossip_block` and `process_rpc_block`. Since `process_rpc_block` is called only for single block lookups, we don't have to worry about batched block imports.

The block is imported from the source(gossip/rpc) that arrives first. The block that arrives second is not imported to avoid the db access issue.
There are 2 cases:
1. Block that arrives second is from rpc: In this case, we return an optimistic `BlockError::BlockIsAlreadyKnown` to sync.
2. Block that arrives second is from gossip: In this case, we only do gossip verification and forwarding but don't import the block into the the beacon chain.

## Additional info
Splits up `process_gossip_block` function to `process_gossip_unverified_block` and `process_gossip_verified_block`.
2021-10-28 03:36:14 +00:00
Michael Sproul
2dc6163043 Add API version headers and map_fork_name! (#2745)
## Proposed Changes

* Add the `Eth-Consensus-Version` header to the HTTP API for the block and state endpoints. This is part of the v2.1.0 API that was recently released: https://github.com/ethereum/beacon-APIs/pull/170
* Add tests for the above. I refactored the `eth2` crate's helper functions to make this more straight-forward, and introduced some new mixin traits that I think greatly improve readability and flexibility.
* Add a new `map_with_fork!` macro which is useful for decoding a superstruct type without naming all its variants. It is now used for SSZ-decoding `BeaconBlock` and `BeaconState`, and for JSON-decoding `SignedBeaconBlock` in the API.

## Additional Info

The `map_with_fork!` changes will conflict with the Merge changes, but when resolving the conflict the changes from this branch should be preferred (it is no longer necessary to enumerate every fork). The merge fork _will_  need to be added to `map_fork_name_with`.
2021-10-28 01:18:04 +00:00
Mac L
8edd9d45ab Fix purge-db edge case (#2747)
## Issue Addressed

Currently, if you launch the beacon node with the `--purge-db` flag and the `beacon` directory exists, but one (or both) of the `chain_db` or `freezer-db` directories are missing, it will error unnecessarily with: 
```
Failed to remove chain_db: No such file or directory (os error 2)
```

This is an edge case which can occur in cases of manual intervention (a user deleted the directory) or if you had previously run with the `--purge-db` flag and Lighthouse errored before it could initialize the db directories.

## Proposed Changes

Check if the `chain_db`/`freezer_db` exists before attempting to remove them. This prevents unnecessary errors.
2021-10-25 22:11:28 +00:00
Divma
d4819bfd42 Add a waker to the RPC handler (#2721)
## Issue Addressed

Attempts to fix #2701 but I doubt this is the reason behind that.

## Proposed Changes

maintain a waker in the rpc handler and call it if an event is received
2021-10-21 06:14:36 +00:00
Pawan Dhananjay
de34001e78 Update next_fork_subscriptions correctly (#2688)
## Issue Addressed

N/A

## Proposed Changes

Update the `next_fork_subscriptions` timer only after a fork happens.
2021-10-21 04:38:44 +00:00
divma
99f7a7db58 remove double backfill sync state (#2733)
## Issue Addressed
In the backfill sync the state was maintained twice, once locally and also in the globals. This makes it so that it's maintained only once.

The only behavioral change is that when backfill sync in paused, the global backfill state is updated. I asked @AgeManning about this and he deemed it a bug, so this solves it.
2021-10-19 22:32:25 +00:00
Michael Sproul
aad397f00a Resolve Rust 1.56 lints and warnings (#2728)
## Issue Addressed

When compiling with Rust 1.56.0 the compiler generates 3 instances of this warning:

```
warning: trailing semicolon in macro used in expression position
   --> common/eth2_network_config/src/lib.rs:181:24
    |
181 |                     })?;
    |                        ^
...
195 |         let deposit_contract_deploy_block = load_from_file!(DEPLOY_BLOCK_FILE);
    |                                             ---------------------------------- in this macro invocation
    |
    = note: `#[warn(semicolon_in_expressions_from_macros)]` on by default
    = warning: this was previously accepted by the compiler but is being phased out; it will become a hard error in a future release!
    = note: for more information, see issue #79813 <https://github.com/rust-lang/rust/issues/79813>
    = note: this warning originates in the macro `load_from_file` (in Nightly builds, run with -Z macro-backtrace for more info)
```

This warning is completely harmless, but will be visible to users compiling Lighthouse v2.0.1 (or earlier) with Rust 1.56.0 (to be released October 21st). It is **completely safe** to ignore this warning, it's just a superficial change to Rust's syntax.

## Proposed Changes

This PR removes the semi-colon as recommended, and fixes the new Clippy lints from 1.56.0
2021-10-19 00:30:42 +00:00
Akihito Nakano
efec60ee90 Tiny fix: wrong log level (#2720)
## Proposed Changes

If the `RemoveChain` is critical log level should be crit. 🙂
2021-10-19 00:30:41 +00:00
Michael Sproul
d2e3d4c6f1 Add flag to disable lock timeouts (#2714)
## Issue Addressed

Mitigates #1096

## Proposed Changes

Add a flag to the beacon node called `--disable-lock-timeouts` which allows opting out of lock timeouts.

The lock timeouts serve a dual purpose:

1. They prevent any single operation from hogging the lock for too long. When a timeout occurs it logs a nasty error which indicates that there's suboptimal lock use occurring, which we can then act on.
2. They allow deadlock detection. We're fairly sure there are no deadlocks left in Lighthouse anymore but the timeout locks offer a safeguard against that.

However, timeouts on locks are not without downsides:

They allow for the possibility of livelock, particularly on slower hardware. If lock timeouts keep failing spuriously the node can be prevented from making any progress, even if it would be able to make progress slowly without the timeout. One particularly concerning scenario which could occur would be if a DoS attack succeeded in slowing block signature verification times across the network, and all Lighthouse nodes got livelocked because they timed out repeatedly. This could also occur on just a subset of nodes (e.g. dual core VPSs or Raspberri Pis).

By making the behaviour runtime configurable this PR allows us to choose the behaviour we want depending on circumstance. I suspect that long term we could make the timeout-free approach the default (#2381 moves in this direction) and just enable the timeouts on our testnet nodes for debugging purposes. This PR conservatively leaves the default as-is so we can gain some more experience before switching the default.
2021-10-19 00:30:40 +00:00
Age Manning
df40700ddd Rename eth2_libp2p to lighthouse_network (#2702)
## Description

The `eth2_libp2p` crate was originally named and designed to incorporate a simple libp2p integration into lighthouse. Since its origins the crates purpose has expanded dramatically. It now houses a lot more sophistication that is specific to lighthouse and no longer just a libp2p integration. 

As of this writing it currently houses the following high-level lighthouse-specific logic:
- Lighthouse's implementation of the eth2 RPC protocol and specific encodings/decodings
- Integration and handling of ENRs with respect to libp2p and eth2
- Lighthouse's discovery logic, its integration with discv5 and logic about searching and handling peers. 
- Lighthouse's peer manager - This is a large module handling various aspects of Lighthouse's network, such as peer scoring, handling pings and metadata, connection maintenance and recording, etc.
- Lighthouse's peer database - This is a collection of information stored for each individual peer which is specific to lighthouse. We store connection state, sync state, last seen ips and scores etc. The data stored for each peer is designed for various elements of the lighthouse code base such as syncing and the http api.
- Gossipsub scoring - This stores a collection of gossipsub 1.1 scoring mechanisms that are continuously analyssed and updated based on the ethereum 2 networks and how Lighthouse performs on these networks.
- Lighthouse specific types for managing gossipsub topics, sync status and ENR fields
- Lighthouse's network HTTP API metrics - A collection of metrics for lighthouse network monitoring
- Lighthouse's custom configuration of all networking protocols, RPC, gossipsub, discovery, identify and libp2p. 

Therefore it makes sense to rename the crate to be more akin to its current purposes, simply that it manages the majority of Lighthouse's network stack. This PR renames this crate to `lighthouse_network`

Co-authored-by: Paul Hauner <paul@paulhauner.com>
2021-10-19 00:30:39 +00:00
Paul Hauner
fff01b24dd Release v2.0.1 (#2726)
## Issue Addressed

NA

## Proposed Changes

- Update versions to `v2.0.1` in anticipation for a release early next week.
- Add `--ignore` to `cargo audit`. See #2727.

## Additional Info

NA
2021-10-18 03:08:32 +00:00
Age Manning
180c90bf6d Correct peer connection transition logic (#2725)
## Description

This PR updates the peer connection transition logic. It is acceptable for a peer to immediately transition from a disconnected state to a disconnecting state. This can occur when we are at our peer limit and a new peer's dial us.
2021-10-17 04:04:36 +00:00
Paul Hauner
a7b675460d Add Altair tests to op pool (#2723)
## Issue Addressed

NA

## Proposed Changes

Adds some more testing for Altair to the op pool. Credits to @michaelsproul for some appropriated efforts here.

## Additional Info

NA


Co-authored-by: Michael Sproul <michael@sigmaprime.io>
2021-10-16 05:07:23 +00:00
Michael Sproul
5cde3fc4da Reduce lock contention in backfill sync (#2716)
## Proposed Changes

Clone the proposer pubkeys during backfill signature verification to reduce the time that the pubkey cache lock is held for. Cloning such a small number of pubkeys has negligible impact on the total running time, but greatly reduces lock contention.

On a Ryzen 5950X, the setup step seems to take around 180us regardless of whether the key is cloned or not, while the verification takes 7ms. When Lighthouse is limited to 10% of one core using `sudo cpulimit --pid <pid> --limit 10` the total time jumps up to 800ms, but the setup step remains only 250us. This means that under heavy load this PR could cut the time the lock is held for from 800ms to 250us, which is a huge saving of 99.97%!
2021-10-15 03:28:03 +00:00
Paul Hauner
9c5a8ab7f2 Change "too many resources" to "insufficient resources" in eth2_libp2p (#2713)
## Issue Addressed

NA

## Proposed Changes

Fixes what I assume is a typo in a log message. See the diff for details.

## Additional Info

NA
2021-10-15 00:07:12 +00:00
Age Manning
05040e68ec Update discovery (#2711)
## Issue Addressed

#2695 

## Proposed Changes

This updates discovery to the latest version which has patched a panic that occurred due to a race condition in the bucket logic.
2021-10-14 22:09:38 +00:00
Paul Hauner
e2d09bb8ac Add BeaconChainHarness::builder (#2707)
## Issue Addressed

NA

## Proposed Changes

This PR is near-identical to https://github.com/sigp/lighthouse/pull/2652, however it is to be merged into `unstable` instead of `merge-f2f`. Please see that PR for reasoning.

I'm making this duplicate PR to merge to `unstable` in an effort to shrink the diff between `unstable` and `merge-f2f` by doing smaller, lead-up PRs.

## Additional Info

NA
2021-10-14 02:58:10 +00:00
Pawan Dhananjay
34d22b5920 Reduce validator monitor logging verbosity (#2606)
## Issue Addressed

Resolves #2541

## Proposed Changes

Reduces verbosity of validator monitor per epoch logging by batching info logs for multiple validators.

Instead of a log for every validator managed by the validator monitor, we now batch logs for attestation records for previous epoch.

Before:
```log
Sep 20 06:53:08.239 INFO Previous epoch attestation success      validator: 1, epoch: 65875, matched_head: true, matched_target: true, inclusion_lag: 0 slot(s), service: val_mon
Sep 20 06:53:08.239 INFO Previous epoch attestation success      validator: 2, epoch: 65875, matched_head: true, matched_target: true, inclusion_lag: 0 slot(s), service: val_mon
Sep 20 06:53:08.239 INFO Previous epoch attestation success      validator: 3, epoch: 65875, matched_head: true, matched_target: true, inclusion_lag: 0 slot(s), service: val_mon
Sep 20 06:53:08.239 INFO Previous epoch attestation success      validator: 4, epoch: 65875, matched_head: true, matched_target: true, inclusion_lag: 0 slot(s), service: val_mon
Sep 20 06:53:08.239 INFO Previous epoch attestation success      validator: 5, epoch: 65875, matched_head: false, matched_target: true, inclusion_lag: 0 slot(s), service: val_mon
Sep 20 06:53:08.239 WARN Attestation failed to match head        validator: 5, epoch: 65875, service: val_mon
Sep 20 06:53:08.239 INFO Previous epoch attestation success      validator: 6, epoch: 65875, matched_head: false, matched_target: true, inclusion_lag: 0 slot(s), service: val_mon
Sep 20 06:53:08.239 WARN Attestation failed to match head        validator: 6, epoch: 65875, service: val_mon
Sep 20 06:53:08.239 INFO Previous epoch attestation success      validator: 7, epoch: 65875, matched_head: true, matched_target: false, inclusion_lag: 1 slot(s), service: val_mon
Sep 20 06:53:08.239 WARN Attestation failed to match target      validator: 7, epoch: 65875, service: val_mon
Sep 20 06:53:08.239 WARN Sub-optimal inclusion delay             validator: 7, epoch: 65875, optimal: 1, delay: 2, service: val_mon
Sep 20 06:53:08.239 INFO Previous epoch attestation success      validator: 8, epoch: 65875, matched_head: true, matched_target: false, inclusion_lag: 1 slot(s), service: val_mon
Sep 20 06:53:08.239 WARN Attestation failed to match target      validator: 8, epoch: 65875, service: val_mon
Sep 20 06:53:08.239 WARN Sub-optimal inclusion delay             validator: 8, epoch: 65875, optimal: 1, delay: 2, service: val_mon
Sep 20 06:53:08.239 ERRO Previous epoch attestation missing      validator: 9, epoch: 65875, service: val_mon
Sep 20 06:53:08.239 ERRO Previous epoch attestation missing      validator: 10, epoch: 65875, service: val_mon
```

after
```
Sep 20 06:53:08.239 INFO Previous epoch attestation success      validators: [1,2,3,4,5,6,7,8,9] , epoch: 65875, service: val_mon
Sep 20 06:53:08.239 WARN Previous epoch attestation failed to match head, validators: [5,6], epoch: 65875, service: val_mon
Sep 20 06:53:08.239 WARN Previous epoch attestation failed to match target, validators: [7,8], epoch: 65875, service: val_mon
Sep 20 06:53:08.239 WARN Previous epoch attestations had sub-optimal inclusion delay, validators: [7,8], epoch: 65875, service: val_mon
Sep 20 06:53:08.239 ERRO Previous epoch attestation missing      validators: [9,10], epoch: 65875, service: val_mon
```

The detailed individual logs are downgraded to debug logs.
2021-10-12 05:06:48 +00:00
Mac L
a73d698e30 Add TLS capability to the beacon node HTTP API (#2668)
Currently, the beacon node has no ability to serve the HTTP API over TLS.
Adding this functionality would be helpful for certain use cases, such as when you need a validator client to connect to a backup beacon node which is outside your local network, and the use of an SSH tunnel or reverse proxy would be inappropriate.

## Proposed Changes

- Add three new CLI flags to the beacon node
  - `--http-enable-tls`: enables TLS
  - `--http-tls-cert`: to specify the path to the certificate file
  - `--http-tls-key`: to specify the path to the key file
- Update the HTTP API to optionally use `warp`'s [`TlsServer`](https://docs.rs/warp/0.3.1/warp/struct.TlsServer.html) depending on the presence of the `--http-enable-tls` flag
- Update tests and docs
- Use a custom branch for `warp` to ensure proper error handling

## Additional Info

Serving the API over TLS should currently be considered experimental. The reason for this is that it uses code from an [unmerged PR](https://github.com/seanmonstar/warp/pull/717). This commit provides the `try_bind_with_graceful_shutdown` method to `warp`, which is helpful for controlling error flow when the TLS configuration is invalid (cert/key files don't exist, incorrect permissions, etc). 
I've implemented the same code in my [branch here](https://github.com/macladson/warp/tree/tls).

Once the code has been reviewed and merged upstream into `warp`, we can remove the dependency on my branch and the feature can be considered more stable.

Currently, the private key file must not be password-protected in order to be read into Lighthouse.
2021-10-12 03:35:49 +00:00
Age Manning
0aee7ec873 Refactor Peerdb and PeerManager (#2660)
## Proposed Changes

This is a refactor of the PeerDB and PeerManager. A number of bugs have been surfacing around the connection state of peers and their interaction with the score state. 

This refactor tightens the mutability properties of peers such that only specific modules are able to modify the state of peer information preventing inadvertant state changes that can lead to our local peer manager db being out of sync with libp2p. 

Further, the logic around connection and scoring was quite convoluted and the distinction between the PeerManager and Peerdb was not well defined. Although these issues are not fully resolved, this PR is step to cleaning up this logic. The peerdb solely manages most mutability operations of peers leaving high-order logic to the peer manager. 

A single `update_connection_state()` function has been added to the peer-db making it solely responsible for modifying the peer's connection state. The way the peer's scores can be modified have been reduced to three simple functions (`update_scores()`, `update_gossipsub_scores()` and `report_peer()`). This prevents any add-hoc modifications of scores and only natural processes of score modification is allowed which simplifies the reasoning of score and state changes.
2021-10-11 02:45:06 +00:00
Michael Sproul
708557a473 Fix cargo audit warns for nix, psutil, time (#2699)
## Issue Addressed

Fix `cargo audit` failures on `unstable`

Closes #2698

## Proposed Changes

The main culprit is `nix`, which is vulnerable for versions below v0.23.0. We can't get by with a straight-forward `cargo update` because `psutil` depends on an old version of `nix` (cf. https://github.com/rust-psutil/rust-psutil/pull/93). Hence I've temporarily forked `psutil` under the `sigp` org, where I've included the update to `nix` v0.23.0.

Additionally, I took the chance to update the `time` dependency to v0.3, which removed a bunch of stale deps including `stdweb` which is no longer maintained. Lighthouse only uses the `time` crate in the notifier to do some pretty printing, and so wasn't affected by any of the breaking changes in v0.3 ([changelog here](https://github.com/time-rs/time/blob/main/CHANGELOG.md#030-2021-07-30)).
2021-10-11 00:10:35 +00:00
Pawan Dhananjay
7c7ba770de Update broken api links (#2665)
## Issue Addressed

Resolves #2563 
Replacement for #2653 as I'm not able to reopen that PR after force pushing.

## Proposed Changes

Fixes all broken api links. Cherry picked changes in #2590 and updated a few more links.

Co-authored-by: Mason Stallmo <masonstallmo@gmail.com>
2021-10-06 00:46:09 +00:00
Pawan Dhananjay
73ec29c267 Don't log errors on resubscription of gossip topics (#2613)
## Issue Addressed

Resolves #2555

## Proposed Changes

Don't log errors on resubscribing to topics. Also don't log errors if we are setting already set attnet/syncnet bits.
2021-10-06 00:46:08 +00:00
Wink Saville
58870fc6d3 Add test_logger as feature to logging (#2586)
## Issue Addressed

Fix #2585

## Proposed Changes

Provide a canonical version of test_logger that can be used
throughout lighthouse.

## Additional Info

This allows tests to conditionally emit logging data by adding
test_logger as the default logger. And then when executing
`cargo test --features logging/test_logger` log output
will be visible:

  wink@3900x:~/lighthouse/common/logging/tests/test-feature-test_logger (Add-test_logger-as-feature-to-logging)
  $ cargo test --features logging/test_logger
      Finished test [unoptimized + debuginfo] target(s) in 0.02s
       Running unittests (target/debug/deps/test_logger-e20115db6a5e3714)

  running 1 test
  Sep 10 12:53:45.212 INFO hi, module: test_logger:8
  test tests::test_fn_with_logging ... ok

  test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Doc-tests test-logger

  running 0 tests

  test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

Or, in normal scenarios where logging isn't needed, executing
`cargo test` the log output will not be visible:

  wink@3900x:~/lighthouse/common/logging/tests/test-feature-test_logger (Add-test_logger-as-feature-to-logging)
  $ cargo test
      Finished test [unoptimized + debuginfo] target(s) in 0.02s
       Running unittests (target/debug/deps/test_logger-02e02f8d41e8cf8a)

  running 1 test
  test tests::test_fn_with_logging ... ok

  test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Doc-tests test-logger

  running 0 tests

  test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
2021-10-06 00:46:07 +00:00
Michael Sproul
7c88f582d9 Release v2.0.0 (#2673)
## Proposed Changes

* Bump version to v2.0.0
* Update dependencies (obsoletes #2670). `tokio-macros` v1.4.0 had been yanked due to a bug.
2021-10-05 03:53:18 +00:00
Michael Sproul
ed1fc7cca6 Fix I/O atomicity issues with checkpoint sync (#2671)
## Issue Addressed

This PR addresses an issue found by @YorickDowne during testing of v2.0.0-rc.0.

Due to a lack of atomic database writes on checkpoint sync start-up, it was possible for the database to get into an inconsistent state from which it couldn't recover without `--purge-db`. The core of the issue was that the store's anchor info was being stored _before_ the `PersistedBeaconChain`. If a crash occured so that anchor info was stored but _not_ the `PersistedBeaconChain`, then on restart Lighthouse would think the database was unitialized and attempt to compare-and-swap a `None` value, but would actually find the stale info from the previous run.

## Proposed Changes

The issue is fixed by writing the anchor info, the split point, and the `PersistedBeaconChain` atomically on start-up. Some type-hinting ugliness was required, which could possibly be cleaned up in future refactors.
2021-10-05 03:53:17 +00:00
Kane Wallmann
28b79084cd Fix chain_id value in config/deposit_contract RPC method (#2659)
## Issue Addressed

This PR addresses issue #2657

## Proposed Changes

Changes `/eth/v1/config/deposit_contract` endpoint to return the chain ID from the loaded chain spec instead of eth1::DEFAULT_NETWORK_ID which is the Goerli chain ID of 5.

Co-authored-by: Michael Sproul <michael@sigmaprime.io>
2021-10-01 06:32:38 +00:00
Michael Sproul
ea78315749 Release v2.0.0-rc.0 (#2634)
## Proposed Changes

Cut the first release candidate for v2.0.0, in preparation for testing and release this week

## Additional Info

Builds on #2632, which should either be merged first or in the same batch
2021-10-01 01:23:55 +00:00
Age Manning
29a8865d07 Consistent tracking of disconnected peers (#2650)
## Issue Addressed

N/A

## Proposed Changes

When peers switching to a disconnecting state, decrement the disconnected peers counter. This also downgrades some crit logs to errors. 

I've also added a re-sync point when peers get unbanned the disconnected peer count will match back to the number of disconnected peers if it has gone out of sync previously.
2021-09-30 04:31:43 +00:00
Squirrel
db4d72c4f1 Remove unused deps (#2592)
Found some deps you're possibly not using.

Please shout if you think they are indeed still needed.
2021-09-30 04:31:42 +00:00
Mac L
4c510f8f6b Add BlockTimesCache to allow additional block delay metrics (#2546)
## Issue Addressed

Closes #2528

## Proposed Changes

- Add `BlockTimesCache` to provide block timing information to `BeaconChain`. This allows additional metrics to be calculated for blocks that are set as head too late.
- Thread the `seen_timestamp` of blocks received from RPC responses (except blocks from syncing) through to the sync manager, similar to what is done for blocks from gossip.

## Additional Info

This provides the following additional metrics:
- `BEACON_BLOCK_OBSERVED_SLOT_START_DELAY_TIME`
  - The delay between the start of the slot and when the block was first observed.
- `BEACON_BLOCK_IMPORTED_OBSERVED_DELAY_TIME`
   - The delay between when the block was first observed and when the block was imported.
- `BEACON_BLOCK_HEAD_IMPORTED_DELAY_TIME`
  - The delay between when the block was imported and when the block was set as head.

The metric `BEACON_BLOCK_IMPORTED_SLOT_START_DELAY_TIME` was removed.

A log is produced when a block is set as head too late, e.g.:
```
Aug 27 03:46:39.006 DEBG Delayed head block                      set_as_head_delay: Some(21.731066ms), imported_delay: Some(119.929934ms), observed_delay: Some(3.864596988s), block_delay: 4.006257988s, slot: 1931331, proposer_index: 24294, block_root: 0x937602c89d3143afa89088a44bdf4b4d0d760dad082abacb229495c048648a9e, service: beacon
```
2021-09-30 04:31:41 +00:00
Pawan Dhananjay
70441aa554 Improve valmon inclusion delay calculation (#2618)
## Issue Addressed

Resolves #2552 

## Proposed Changes

Offers some improvement in inclusion distance calculation in the validator monitor. 

When registering an attestation from a block, instead of doing `block.slot() - attesstation.data.slot()` to get the inclusion distance, we now pass the parent block slot from the beacon chain and do `parent_slot.saturating_sub(attestation.data.slot())`. This allows us to give best effort inclusion distance in scenarios where the attestation was included right after a skip slot. Note that this does not give accurate results in scenarios where the attestation was included few blocks after the skip slot.

In this case, if the attestation slot was `b1` and was included in block `b2` with a skip slot in between, we would get the inclusion delay as 0  (by ignoring the skip slot) which is the best effort inclusion delay.
```
b1 <- missed <- b2
``` 

Here, if the attestation slot was `b1` and was included in block `b3` with a skip slot and valid block `b2` in between, then we would get the inclusion delay as 2 instead of 1 (by ignoring the skip slot).
```
b1 <- missed <- b2 <- b3 
```
A solution for the scenario 2 would be to count number of slots between included slot and attestation slot ignoring the skip slots in the beacon chain and pass the value to the validator monitor. But I'm concerned that it could potentially lead to db accesses for older blocks in extreme cases.


This PR also uses the validator monitor data for logging per epoch inclusion distance. This is useful as we won't get inclusion data in post-altair summaries.


Co-authored-by: Michael Sproul <micsproul@gmail.com>
2021-09-30 01:22:43 +00:00
realbigsean
7d13e57d9f Add interop metrics (#2645)
## Issue Addressed

Resolves: #2644

## Proposed Changes

- Adds mandatory metrics mentioned here: https://github.com/ethereum/beacon-metrics/blob/master/metrics.md#interop-metrics

## Additional Info

Couldn't figure out how to alias metrics, so I created them all as new gauges/counters.

Co-authored-by: realbigsean <seananderson33@gmail.com>
2021-09-29 23:44:24 +00:00
realbigsean
113ef74ef6 Add contribution and proof event (#2527)
## Issue Addressed

N/A

## Proposed Changes

Add the new ContributionAndProof event: https://github.com/ethereum/beacon-APIs/pull/158

## Additional Info

N/A

Co-authored-by: realbigsean <seananderson33@gmail.com>
2021-09-25 07:53:58 +00:00
Paul Hauner
fe52322088 Implement SSZ union type (#2579)
## Issue Addressed

NA

## Proposed Changes

Implements the "union" type from the SSZ spec for `ssz`, `ssz_derive`, `tree_hash` and `tree_hash_derive` so it may be derived for `enums`:

https://github.com/ethereum/consensus-specs/blob/v1.1.0-beta.3/ssz/simple-serialize.md#union

The union type is required for the merge, since the `Transaction` type is defined as a single-variant union `Union[OpaqueTransaction]`.

### Crate Updates

This PR will (hopefully) cause CI to publish new versions for the following crates:

- `eth2_ssz_derive`: `0.2.1` -> `0.3.0`
- `eth2_ssz`: `0.3.0` -> `0.4.0`
- `eth2_ssz_types`: `0.2.0` -> `0.2.1`
- `tree_hash`: `0.3.0` -> `0.4.0`
- `tree_hash_derive`: `0.3.0` -> `0.4.0`

These these crates depend on each other, I've had to add a workspace-level `[patch]` for these crates. A follow-up PR will need to remove this patch, ones the new versions are published.

### Union Behaviors

We already had SSZ `Encode` and `TreeHash` derive for enums, however it just did a "transparent" pass-through of the inner value. Since the "union" decoding from the spec is in conflict with the transparent method, I've required that all `enum` have exactly one of the following enum-level attributes:

#### SSZ

-  `#[ssz(enum_behaviour = "union")]`
    - matches the spec used for the merge
-  `#[ssz(enum_behaviour = "transparent")]`
    - maintains existing functionality
    - not supported for `Decode` (never was)
    
#### TreeHash

-  `#[tree_hash(enum_behaviour = "union")]`
    - matches the spec used for the merge
-  `#[tree_hash(enum_behaviour = "transparent")]`
    - maintains existing functionality

This means that we can maintain the existing transparent behaviour, but all existing users will get a compile-time error until they explicitly opt-in to being transparent.

### Legacy Option Encoding

Before this PR, we already had a union-esque encoding for `Option<T>`. However, this was with the *old* SSZ spec where the union selector was 4 bytes. During merge specification, the spec was changed to use 1 byte for the selector.

Whilst the 4-byte `Option` encoding was never used in the spec, we used it in our database. Writing a migrate script for all occurrences of `Option` in the database would be painful, especially since it's used in the `CommitteeCache`. To avoid the migrate script, I added a serde-esque `#[ssz(with = "module")]` field-level attribute to `ssz_derive` so that we can opt into the 4-byte encoding on a field-by-field basis.

The `ssz::legacy::four_byte_impl!` macro allows a one-liner to define the module required for the `#[ssz(with = "module")]` for some `Option<T> where T: Encode + Decode`.

Notably, **I have removed `Encode` and `Decode` impls for `Option`**. I've done this to force a break on downstream users. Like I mentioned, `Option` isn't used in the spec so I don't think it'll be *that* annoying. I think it's nicer than quietly having two different union implementations or quietly breaking the existing `Option` impl.

### Crate Publish Ordering

I've modified the order in which CI publishes crates to ensure that we don't publish a crate without ensuring we already published a crate that it depends upon.

## TODO

- [ ] Queue a follow-up `[patch]`-removing PR.
2021-09-25 05:58:36 +00:00
Age Manning
00a7ef0036 Correct bug in sync (#2615)
A bug that causes failed batches to continually download in a loop is corrected.
2021-09-23 01:32:04 +00:00
Paul Hauner
be11437c27 Batch BLS verification for attestations (#2399)
## Issue Addressed

NA

## Proposed Changes

Adds the ability to verify batches of aggregated/unaggregated attestations from the network.

When the `BeaconProcessor` finds there are messages in the aggregated or unaggregated attestation queues, it will first check the length of the queue:

- `== 1` verify the attestation individually.
- `>= 2` take up to 64 of those attestations and verify them in a batch.

Notably, we only perform batch verification if the queue has a backlog. We don't apply any artificial delays to attestations to try and force them into batches. 

### Batching Details

To assist with implementing batches we modify `beacon_chain::attestation_verification` to have two distinct categories for attestations:

- *Indexed* attestations: those which have passed initial validation and were valid enough for us to derive an `IndexedAttestation`.
- *Verified* attestations: those attestations which were indexed *and also* passed signature verification. These are well-formed, interesting messages which were signed by validators.

The batching functions accept `n` attestations and then return `n` attestation verification `Result`s, where those `Result`s can be any combination of `Ok` or `Err`. In other words, we attempt to verify as many attestations as possible and return specific per-attestation results so peer scores can be updated, if required.

When we batch verify attestations, we first try to map all those attestations to *indexed* attestations. If any of those attestations were able to be indexed, we then perform batch BLS verification on those indexed attestations. If the batch verification succeeds, we convert them into *verified* attestations, disabling individual signature checking. If the batch fails, we convert to verified attestations with individual signature checking enabled.

Ultimately, we optimistically try to do a batch verification of attestation signatures and fall-back to individual verification if it fails. This opens an attach vector for "poisoning" the attestations and causing us to waste a batch verification. I argue that peer scoring should do a good-enough job of defending against this and the typical-case gains massively outweigh the worst-case losses.

## Additional Info

Before this PR, attestation verification took the attestations by value (instead of by reference). It turns out that this was unnecessary and, in my opinion, resulted in some undesirable ergonomics (e.g., we had to pass the attestation back in the `Err` variant to avoid clones). In this PR I've modified attestation verification so that it now takes a reference.

I refactored the `beacon_chain/tests/attestation_verification.rs` tests so they use a builder-esque "tester" struct instead of a weird macro. It made it easier for me to test individual/batch with the same set of tests and I think it was a nice tidy-up. Notably, I did this last to try and make sure my new refactors to *actual* production code would pass under the existing test suite.
2021-09-22 08:49:41 +00:00
Michael Sproul
9667dc2f03 Implement checkpoint sync (#2244)
## Issue Addressed

Closes #1891
Closes #1784

## Proposed Changes

Implement checkpoint sync for Lighthouse, enabling it to start from a weak subjectivity checkpoint.

## Additional Info

- [x] Return unavailable status for out-of-range blocks requested by peers (#2561)
- [x] Implement sync daemon for fetching historical blocks (#2561)
- [x] Verify chain hashes (either in `historical_blocks.rs` or the calling module)
- [x] Consistency check for initial block + state
- [x] Fetch the initial state and block from a beacon node HTTP endpoint
- [x] Don't crash fetching beacon states by slot from the API
- [x] Background service for state reconstruction, triggered by CLI flag or API call.

Considered out of scope for this PR:

- Drop the requirement to provide the `--checkpoint-block` (this would require some pretty heavy refactoring of block verification)


Co-authored-by: Diva M <divma@protonmail.com>
2021-09-22 00:37:28 +00:00
Age Manning
280e4fe23d Increase connection limits and allow priority connections (#2604)
In previous network updates we have made our libp2p connections more lean by limiting the maximum number of connections a lighthouse node will accept before libp2p rejects new connections. 

However, we still maintain the logic that at maximum connections, we try to dial extra peers if they are needed by a validator client to publish messages on a specific subnet. The dials typically result in failures due the libp2p connection limits. 

This PR adds an extra factor, `PRIORITY_PEER_EXCESS` which sets aside a new allocation of peers we are able to dial in case we need these peers for the VC client.  This allocation sits along side the excess peer (which allows extra incoming peers on top of our target peer limit). 

The drawback here, is that libp2p now allows extra peers to connect to us (beyond the standard peer limit) which the peer manager should subsequently reject.
2021-09-21 07:45:13 +00:00
Age Manning
a73dcb7b6d Improved handling of IP Banning (#2530)
This PR in general improves the handling around peer banning. 

Specifically there were issues when multiple peers under a single IP connected to us after we banned the IP for poor behaviour.

This PR should now handle these peers gracefully as well as make some improvements around how we previously disconnected and banned peers. 

The logic now goes as follows:
- Once a peer gets banned, its gets registered with its known IP addresses
- Once enough banned peers exist under a single IP that IP is banned
- We retain connections with existing peers under this IP
- Any new connections under this IP are rejected
2021-09-17 04:02:31 +00:00
Pawan Dhananjay
64ad2af100 Subscribe to altair gossip topics 2 slots before fork (#2532)
## Issue Addressed

N/A

## Proposed Changes

Add a fork_digest to `ForkContext` only if it is set in the config.
Reject gossip messages on post fork topics before the fork happens.

Edit: Instead of rejecting gossip messages on post fork topics, we now subscribe to post fork topics 2 slots before the fork.

Co-authored-by: Age Manning <Age@AgeManning.com>
2021-09-17 01:11:16 +00:00
Age Manning
56e0615df8 Experimental discovery (#2577)
# Description

A few changes have been made to discovery. In particular a custom re-write of an LRU cache which previously was read/write O(N) for all our sessions ~5k, to a more reasonable hashmap-style O(1). 

Further there has been reported issues in the current discv5, so added error handling to help identify the issue has been added.
2021-09-16 04:45:05 +00:00
Age Manning
95b17137a8 Reduce network debug noise (#2593)
The identify network debug logs can get quite noisy and are unnecessary to print on every request/response. 

This PR reduces debug noise by only printing messages for identify messages that offer some new information.
2021-09-14 08:28:35 +00:00
Wink Saville
4755d4b236 Update sloggers to v2.0.2 (#2588)
fixes #2584
2021-09-14 06:48:26 +00:00
Paul Hauner
f9bba92db3 v1.5.2 (#2595)
## Issue Addressed

NA

## Proposed Changes

Version bump

## Additional Info

Please do not `bors` without my approval, I am still testing.
2021-09-13 23:01:19 +00:00
Paul Hauner
ddbd4e6965 v1.5.2-rc.0 (#2565)
## Issue Addressed

NA

## Proposed Changes

- Bump version
- Tidy some comments mangled by the version change regex.

## Additional Info

NA
2021-09-03 23:28:21 +00:00
Michael Sproul
9c785a9b33 Optimize process_attestation with active balance cache (#2560)
## Proposed Changes

Cache the total active balance for the current epoch in the `BeaconState`. Computing this value takes around 1ms, and this was negatively impacting block processing times on Prater, particularly when reconstructing states.

With a large number of attestations in each block, I saw the `process_attestations` function taking 150ms, which means that reconstructing hot states can take up to 4.65s (31 * 150ms), and reconstructing freezer states can take up to 307s (2047 * 150ms).

I opted to add the cache to the beacon state rather than computing the total active balance at the start of state processing and threading it through. Although this would be simpler in a way, it would waste time, particularly during block replay, as the total active balance doesn't change for the duration of an epoch. So we save ~32ms for hot states, and up to 8.1s for freezer states (using `--slots-per-restore-point 8192`).
2021-09-03 07:50:43 +00:00
realbigsean
50321c6671 Updates to make crates publishable (#2472)
## Issue Addressed

Related to: #2259

Made an attempt at all the necessary updates here to publish the crates to crates.io. I incremented the minor versions on all the crates that have been previously published. We still might run into some issues as we try to publish because I'm not able to test this out but I think it's a good starting point.

## Proposed Changes

- Add description and license to `ssz_types` and `serde_util`
- rename `serde_util` to `eth2_serde_util`
- increment minor versions
- remove path dependencies
- remove patch dependencies 

## Additional Info
Crates published: 

- [x] `tree_hash` -- need to publish `tree_hash_derive` and `eth2_hashing` first
- [x] `eth2_ssz_types` -- need to publish `eth2_serde_util` first
- [x] `tree_hash_derive`
- [x] `eth2_ssz`
- [x] `eth2_ssz_derive`
- [x] `eth2_serde_util`
- [x] `eth2_hashing`


Co-authored-by: realbigsean <seananderson33@gmail.com>
2021-09-03 01:10:25 +00:00
Pawan Dhananjay
5a3bcd2904 Validator monitor support for sync committees (#2476)
## Issue Addressed

N/A

## Proposed Changes

Add functionality in the validator monitor to provide sync committee related metrics for monitored validators.


Co-authored-by: Michael Sproul <michael@sigmaprime.io>
2021-08-31 23:31:36 +00:00
Paul Hauner
44fa54004c Persist to DB after setting canonical head (#2547)
## Issue Addressed

NA

## Proposed Changes

Missed head votes on attestations is a well-known issue. The primary cause is a block getting set as the head *after* the attestation deadline.

This PR aims to shorten the overall time between "block received" and "block set as head" by:

1. Persisting the head and fork choice *after* setting the canonical head
    - Informal measurements show this takes ~200ms
 1. Pruning the op pool *after* setting the canonical head.
 1. No longer persisting the op pool to disk during `BeaconChain::fork_choice`
     - Informal measurements show this can take up to 1.2s.
     
I also add some metrics to help measure the effect of these changes.
     
Persistence changes like this run the risk of breaking assumptions downstream. However, I have considered these risks and I think we're fine here. I will describe my reasoning for each change.

## Reasoning

### Change 1:  Persisting the head and fork choice *after* setting the canonical head

For (1), although the function is called `persist_head_and_fork_choice`, it only persists:

- Fork choice
- Head tracker
- Genesis block root

Since `BeaconChain::fork_choice_internal` does not modify these values between the original time we were persisting it and the current time, I assert that the change I've made is non-substantial in terms of what ends up on-disk. There's the possibility that some *other* thread has modified fork choice in the extra time we've given it, but that's totally fine.

Since the only time we *read* those values from disk is during startup, I assert that this has no impact during runtime. 

### Change 2: Pruning the op pool after setting the canonical head

Similar to the argument above, we don't modify the op pool during `BeaconChain::fork_choice_internal` so it shouldn't matter when we prune. This change should be non-substantial.

### Change 3: No longer persisting the op pool to disk during `BeaconChain::fork_choice`

This change *is* substantial. With the proposed changes, we'll only be persisting the op pool to disk when we shut down cleanly (i.e., the `BeaconChain` gets dropped). This means we'll save disk IO and time during usual operation, but a `kill -9` or similar "crash" will probably result in an out-of-date op pool when we reboot. An out-of-date op pool can only have an impact when producing blocks or aggregate attestations/sync committees.

I think it's pretty reasonable that a crash might result in an out-of-date op pool, since:

- Crashes are fairly rare. Practically the only time I see LH suffer a full crash is when the OOM killer shows up, and that's a very serious event.
- It's generally quite rare to produce a block/aggregate immediately after a reboot. Just a few slots of runtime is probably enough to have a decent-enough op pool again.

## Additional Info

Credits to @macladson for the timings referenced here.
2021-08-31 04:48:21 +00:00
Pawan Dhananjay
b4dd98b3c6 Shutdown after sync (#2519)
## Issue Addressed

Resolves #2033 

## Proposed Changes

Adds a flag to enable shutting down beacon node right after sync is completed.

## Additional Info

Will need modification after weak subjectivity sync is enabled to change definition of a fully synced node.
2021-08-30 13:46:13 +00:00
Michael Sproul
10945e0619 Revert bad blocks on missed fork (#2529)
## Issue Addressed

Closes #2526

## Proposed Changes

If the head block fails to decode on start up, do two things:

1. Revert all blocks between the head and the most recent hard fork (to `fork_slot - 1`).
2. Reset fork choice so that it contains the new head, and all blocks back to the new head's finalized checkpoint.

## Additional Info

I tweaked some of the beacon chain test harness stuff in order to make it generic enough to test with a non-zero slot clock on start-up. In the process I consolidated all the various `new_` methods into a single generic one which will hopefully serve all future uses 🤞
2021-08-30 06:41:31 +00:00
Mason Stallmo
bc14d1d73d Add more unix signal handlers (#2486)
## Issue Addressed

Resolves #2114 

Swapped out the ctrlc crate for tokio signals to hook register handlers for SIGPIPE and SIGHUP along with SIGTERM and SIGINT.

## Proposed Changes

- Swap out the ctrlc crate for tokio signals for unix signal handing
- Register signals for SIGPIPE and SHIGUP that trigger the same shutdown procedure as SIGTERM and SIGINT

## Additional Info

I tested these changes against the examples in the original issue and noticed some interesting behavior on my machine. When running `lighthouse bn --network pyrmont |& tee -a pyrmont_bn.log` or `lighthouse bn --network pyrmont 2>&1 | tee -a pyrmont_bn.log` none of the above signals are sent to the lighthouse program in a way I was able to observe. 

The only time it seems that the signal gets sent to the lighthouse program is if there is no redirection of stderr to stdout. I'm not as familiar with the details of how unix signals work in linux with a redirect like that so I'm not sure if this is a bug in the program or expected behavior.

Signals are correctly received without the redirection and if the above signals are sent directly to the program with something like `kill`.
2021-08-30 05:19:34 +00:00
Pawan Dhananjay
99737c551a Improve eth1 fallback logging (#2490)
## Issue Addressed

Resolves #2487 

## Proposed Changes

Logs a message once in every invocation of `Eth1Service::update` method if the primary endpoint is unavailable for some reason. 

e.g.
```log
Aug 03 00:09:53.517 WARN Error connecting to eth1 node endpoint  action: trying fallbacks, endpoint: http://localhost:8545/, service: eth1_rpc
Aug 03 00:09:56.959 INFO Fetched data from fallback              fallback_number: 1, service: eth1_rpc
```

The main aim of this PR is to have an accompanying message to the "action: trying fallbacks" error message that is returned when checking the endpoint for liveness. This is mainly to indicate to the user that the fallback was live and reachable. 

## Additional info
This PR is not meant to be a catch all for all cases where the primary endpoint failed. For instance, this won't log anything if the primary node was working fine during endpoint liveness checking and failed during deposit/block fetching. This is done intentionally to reduce number of logs while initial deposit/block sync and to avoid more complicated logic.
2021-08-30 00:51:26 +00:00
Paul Hauner
b0ac3464ca v1.5.1 (#2544)
## Issue Addressed

NA

## Proposed Changes

- Bump version

## Additional Info

NA
2021-08-27 01:58:19 +00:00
Paul Hauner
4405425726 Expand gossip duplicate cache time (#2542)
## Issue Addressed

NA

## Proposed Changes

This PR expands the time that entries exist in the gossip-sub duplicate cache. Recent investigations found that this cache is one slot (12s) shorter than the period for which an attestation is permitted to propagate on the gossip network.

Before #2540, this was causing peers to be unnecessarily down-scored for sending old attestations. Although that issue has been fixed, the duplicate cache time is increased here to avoid such messages from getting any further up the networking stack then required.

## Additional Info

NA
2021-08-26 23:25:50 +00:00
Paul Hauner
3fdad38eba Remove penality for duplicate attestation from same validator (#2540)
## Issue Addressed

NA

## Proposed Changes

A Discord user presented logs which indicated a drop in their peer count caused by a variety of peers sending attestations where we'd already seen an attestation for that validator. It's presently unclear how this case came about, but during our investigation I noticed that we are down-voting peers for sending such attestations.

There are three scenarios where we may receive duplicate unagg. attestations from the same validator:

1. The validator is committing a slashable offense.
2. The gossipsub message-deduping functionality is not working as expected.
3. We received the message via the HTTP prior to seeing it via gossip.

Scenario (1) would be so costly for an attacker that I don't think we need to add DoS protection for it.

Scenario (2) seems feasible. Our "seen message" caches in gossipsub might fill up/expire and let through these duplicates. There are also cases involving message ID mismatches with the other peers. In both these cases, I don't think we should be doing 1 attestation == -1 point down-voting.

Scenario (3) is not necessarily a fault of the peer and we shouldn't down-score them for it.

## Additional Info

NA
2021-08-26 08:00:50 +00:00
Age Manning
09545fe668 Increase maximum gossipsub subscriptions (#2531)
Due to the altair fork, in principle we can now subscribe to up to 148 topics. This bypasses our original limit and we can end up rejecting subscriptions. 

This PR increases the limit to account for the fork.
2021-08-26 02:01:10 +00:00
Pawan Dhananjay
d3b4cbed53 Packet filter cli option (#2523)
## Issue Addressed

N/A

## Proposed Changes

Adds a cli option to disable packet filter in `lighthouse bootnode`. This is useful in running local testnets as the bootnode bans requests from the same ip(localhost) if the packet filter is enabled.
2021-08-26 00:29:39 +00:00
realbigsean
5b8436e33f Fork schedule api (#2525)
## Issue Addressed

Resolves #2524

## Proposed Changes

- Return all known forks in the `/config/fork_schedule`, previously returned only the head of the chain's fork.
- Deleted the `StateId::head` method because it was only previously used in this endpoint.


Co-authored-by: realbigsean <seananderson33@gmail.com>
2021-08-24 01:36:27 +00:00
Pawan Dhananjay
02fd54bea7 Refactor discovery queries (#2518)
## Issue Addressed

N/A

## Proposed Changes

Refactor discovery queries such that only `QueryType::Subnet` queries are queued for grouping. `QueryType::FindPeers` is always made regardless of the number of active `Subnet` queries (max = 2). This prevents `FindPeers` queries from getting starved if `Subnet` queries start queuing up. 

Also removes `GroupedQueryType` struct and use `QueryType` for all queuing and processing of discovery requests.

## Additional Info

Currently, no distinction is made between subnet discovery queries for attestation and sync committee subnets when grouping queries. Potentially we could prioritise attestation queries over sync committee queries in the future.
2021-08-24 00:12:13 +00:00
Paul Hauner
90d5ab1566 v1.5.0 (#2535)
## Issue Addressed

NA

## Proposed Changes

- Version bump
- Increase queue sizes for aggregated attestations and re-queued attestations. 

## Additional Info

NA
2021-08-23 04:27:36 +00:00
Paul Hauner
f2a8c6229c Metrics and DEBG log for late gossip blocks (#2533)
## Issue Addressed

Which issue # does this PR address?

## Proposed Changes

- Add a counter metric to log when a block is received late from gossip.
- Also push a `DEBG` log for the above condition.
- Use Debug (`?`) instead of Display (`%`) for a bunch of logs in the beacon processor, so we don't have to deal with concatenated block roots.
- Add new ERRO and CRIT to HTTP API to alert users when they're publishing late blocks.

## Additional Info

NA
2021-08-23 00:59:14 +00:00
Paul Hauner
c7379836a5 v1.5.0-rc.1 (#2516)
## Issue Addressed

NA

## Proposed Changes

- Bump version

## Additional Info

NA
2021-08-17 05:34:31 +00:00
Michael Sproul
c0a2f501d9 Upgrade dependencies (#2513)
## Proposed Changes

* Consolidate Tokio versions: everything now uses the latest v1.10.0, no more `tokio-compat`.
* Many semver-compatible changes via `cargo update`. Notably this upgrades from the yanked v0.8.0 version of crossbeam-deque which is present in v1.5.0-rc.0
* Many semver incompatible upgrades via `cargo upgrades` and `cargo upgrade --workspace pkg_name`. Notable ommissions:
    - Prometheus, to be handled separately: https://github.com/sigp/lighthouse/issues/2485
    - `rand`, `rand_xorshift`: the libsecp256k1 package requires 0.7.x, so we'll stick with that for now
    - `ethereum-types` is pinned at 0.11.0 because that's what `web3` is using and it seems nice to have just a single version
    
## Additional Info

We still have two versions of `libp2p-core` due to `discv5` depending on the v0.29.0 release rather than `master`. AFAIK it should be OK to release in this state (cc @AgeManning )
2021-08-17 01:00:24 +00:00
Pawan Dhananjay
d17350c0fa Lower penalty for past/future slot errors (#2510)
## Issue Addressed

N/A

## Proposed Changes

Reduce the penalties with past/future slot errors for sync committee messages.
2021-08-16 23:30:18 +00:00
Paul Hauner
4c4ebfbaa1 v1.5.0 rc.0 (#2506)
## Issue Addressed

NA

## Proposed Changes

- Bump to `v1.5.0-rc.0`.
- Increase attestation reprocessing queue size (I saw this filling up on Prater).
- Reduce error log for full attn reprocessing queue to warn.

## TODO

- [x] Manual testing
- [x] Resolve https://github.com/sigp/lighthouse/pull/2493
- [x] Include https://github.com/sigp/lighthouse/pull/2501
2021-08-12 04:02:46 +00:00
Paul Hauner
4af6fcfafd Bump libp2p to address inconsistency in mesh peer tracking (#2493)
## Issue Addressed

- Resolves #2457
- Resolves #2443

## Proposed Changes

Target the (presently unreleased) head of `libp2p/rust-libp2p:master` in order to obtain the fix from https://github.com/libp2p/rust-libp2p/pull/2175.

Additionally:

- `libsecp256k1` needed to be upgraded to satisfy the new version of `libp2p`.
- There were also a handful of minor changes to `eth2_libp2p` to suit some interface changes.
- Two `cargo audit --ignore` flags were remove due to libp2p upgrades.

## Additional Info
 
 NA
2021-08-12 01:59:20 +00:00
Paul Hauner
ceda27371d Ensure doppelganger detects attestations in blocks (#2495)
## Issue Addressed

NA

## Proposed Changes

When testing our (not-yet-released) Doppelganger implementation, I noticed that we aren't detecting attestations included in blocks (only those on the gossip network).

This is because during [block processing](e8c0d1f19b/beacon_node/beacon_chain/src/beacon_chain.rs (L2168)) we only update the `observed_attestations` cache with each attestation, but not the `observed_attesters` cache. This is the correct behaviour when we consider the [p2p spec](https://github.com/ethereum/eth2.0-specs/blob/v1.0.1/specs/phase0/p2p-interface.md):

> [IGNORE] There has been no other valid attestation seen on an attestation subnet that has an identical attestation.data.target.epoch and participating validator index.

We're doing the right thing here and still allowing attestations on gossip that we've seen in a block. However, this doesn't work so nicely for Doppelganger.

To resolve this, I've taken the following steps:

- Add a `observed_block_attesters` cache.
- Rename `observed_attesters` to `observed_gossip_attesters`.

## TODO

- [x] Add a test to ensure a validator that's been seen in a block attestation (but not a gossip attestation) returns `true` for `BeaconChain::validator_seen_at_epoch`.
- [x] Add a test to ensure `observed_block_attesters` isn't polluted via gossip attestations and vice versa. 


Co-authored-by: realbigsean <seananderson33@gmail.com>
2021-08-09 02:43:03 +00:00
Michael Sproul
17a2c778e3 Altair validator client and HTTP API (#2404)
## Proposed Changes

* Implement the validator client and HTTP API changes necessary to support Altair


Co-authored-by: realbigsean <seananderson33@gmail.com>
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
2021-08-06 00:47:31 +00:00
Pawan Dhananjay
e8c0d1f19b Altair networking (#2300)
## Issue Addressed

Resolves #2278 

## Proposed Changes

Implements the networking components for the Altair hard fork https://github.com/ethereum/eth2.0-specs/blob/dev/specs/altair/p2p-interface.md

## Additional Info

This PR acts as the base branch for networking changes and tracks https://github.com/sigp/lighthouse/pull/2279 . Changes to gossip, rpc and discovery can be separate PRs to be merged here for ease of review.

Co-authored-by: realbigsean <seananderson33@gmail.com>
2021-08-04 01:44:57 +00:00
Michael Sproul
187425cdc1 Bump discv5 to v0.1.0-beta.9 (#2479)
Bump discv5 to fix the issues with IP filters and removing nodes.

~~Blocked on an upstream release, and more testnet data.~~
2021-08-03 01:05:06 +00:00
realbigsean
c5786a8821 Doppelganger detection (#2230)
## Issue Addressed

Resolves #2069 

## Proposed Changes

- Adds a `--doppelganger-detection` flag
- Adds a `lighthouse/seen_validators` endpoint, which will make it so the lighthouse VC is not interopable with other client beacon nodes if the `--doppelganger-detection` flag is used, but hopefully this will become standardized. Relevant Eth2 API repo issue: https://github.com/ethereum/eth2.0-APIs/issues/64
- If the `--doppelganger-detection` flag is used, the VC will wait until the beacon node is synced, and then wait an additional 2 epochs. The reason for this is to make sure the beacon node is able to subscribe to the subnets our validators should be attesting on. I think an alternative would be to have the beacon node subscribe to all subnets for 2+ epochs on startup by default.

## Additional Info

I'd like to add tests and would appreciate feedback. 

TODO:  handle validators started via the API, potentially make this default behavior

Co-authored-by: realbigsean <seananderson33@gmail.com>
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
Co-authored-by: Paul Hauner <paul@paulhauner.com>
2021-07-31 03:50:52 +00:00
realbigsean
303deb9969 Rust 1.54.0 lints (#2483)
## Issue Addressed

N/A

## Proposed Changes

- Removing a bunch of unnecessary references
- Updated `Error::VariantError` to `Error::Variant`
- There were additional enum variant lints that I ignored, because I thought our variant names were fine
- removed `MonitoredValidator`'s `pubkey` field, because I couldn't find it used anywhere. It looks like we just use the string version of the pubkey (the `id` field) if there is no index

## Additional Info



Co-authored-by: realbigsean <seananderson33@gmail.com>
2021-07-30 01:11:47 +00:00
Paul Hauner
8efd9fc324 Add AttesterCache for attestation production (#2478)
## Issue Addressed

- Resolves #2169

## Proposed Changes

Adds the `AttesterCache` to allow validators to produce attestations for older slots. Presently, some arbitrary restrictions can force validators to receive an error when attesting to a slot earlier than the present one. This can cause attestation misses when there is excessive load on the validator client or time sync issues between the VC and BN.

## Additional Info

NA
2021-07-29 04:38:26 +00:00
Michael Sproul
923486f34c Use bulk verification for sync_aggregate signature (#2415)
## Proposed Changes

Add the `sync_aggregate` from `BeaconBlock` to the bulk signature verifier for blocks. This necessitates a new signature set constructor for the sync aggregate, which is different from the others due to the use of [`eth2_fast_aggregate_verify`](https://github.com/ethereum/eth2.0-specs/blob/v1.1.0-alpha.7/specs/altair/bls.md#eth2_fast_aggregate_verify) for sync aggregates, per [`process_sync_aggregate`](https://github.com/ethereum/eth2.0-specs/blob/v1.1.0-alpha.7/specs/altair/beacon-chain.md#sync-aggregate-processing). I made the choice to return an optional signature set, with `None` representing the case where the signature is valid on account of being the point at infinity (requires no further checking).

To "dogfood" the changes and prevent duplication, the consensus logic now uses the signature set approach as well whenever it is required to verify signatures (which should only be in testing AFAIK). The EF tests pass with the code as it exists currently, but failed before I adapted the `eth2_fast_aggregate_verify` changes (which is good).

As a result of this change Altair block processing should be a little faster, and importantly, we will no longer accidentally verify signatures when replaying blocks, e.g. when replaying blocks from the database.
2021-07-28 05:40:21 +00:00
Paul Hauner
6e3ca48cb9 Cache participating indices for Altair epoch processing (#2416)
## Issue Addressed

NA

## Proposed Changes

This PR addresses two things:

1. Allows the `ValidatorMonitor` to work with Altair states.
1. Optimizes `altair::process_epoch` (see [code](https://github.com/paulhauner/lighthouse/blob/participation-cache/consensus/state_processing/src/per_epoch_processing/altair/participation_cache.rs) for description)

## Breaking Changes

The breaking changes in this PR revolve around one premise:

*After the Altair fork, it's not longer possible (given only a `BeaconState`) to identify if a validator had *any* attestation included during some epoch. The best we can do is see if that validator made the "timely" source/target/head flags.*

Whilst this seems annoying, it's not actually too bad. Finalization is based upon "timely target" attestations, so that's really the most important thing. Although there's *some* value in knowing if a validator had *any* attestation included, it's far more important to know about "timely target" participation, since this is what affects finality and justification.

For simplicity and consistency, I've also removed the ability to determine if *any* attestation was included from metrics and API endpoints. Now, all Altair and non-Altair states will simply report on the head/target attestations.

The following section details where we've removed fields and provides replacement values.

### Breaking Changes: Prometheus Metrics

Some participation metrics have been removed and replaced. Some were removed since they are no longer relevant to Altair (e.g., total attesting balance) and others replaced with gwei values instead of pre-computed values. This provides more flexibility at display-time (e.g., Grafana).

The following metrics were added as replacements:

- `beacon_participation_prev_epoch_head_attesting_gwei_total`
- `beacon_participation_prev_epoch_target_attesting_gwei_total`
- `beacon_participation_prev_epoch_source_attesting_gwei_total`
- `beacon_participation_prev_epoch_active_gwei_total`

The following metrics were removed:

- `beacon_participation_prev_epoch_attester`
   - instead use `beacon_participation_prev_epoch_source_attesting_gwei_total / beacon_participation_prev_epoch_active_gwei_total`.
- `beacon_participation_prev_epoch_target_attester`
   - instead use `beacon_participation_prev_epoch_target_attesting_gwei_total / beacon_participation_prev_epoch_active_gwei_total`.
- `beacon_participation_prev_epoch_head_attester`
   - instead use `beacon_participation_prev_epoch_head_attesting_gwei_total / beacon_participation_prev_epoch_active_gwei_total`.

The `beacon_participation_prev_epoch_attester` endpoint has been removed. Users should instead use the pre-existing `beacon_participation_prev_epoch_target_attester`. 

### Breaking Changes: HTTP API

The `/lighthouse/validator_inclusion/{epoch}/{validator_id}` endpoint loses the following fields:

- `current_epoch_attesting_gwei` (use `current_epoch_target_attesting_gwei` instead)
- `previous_epoch_attesting_gwei` (use `previous_epoch_target_attesting_gwei` instead)

The `/lighthouse/validator_inclusion/{epoch}/{validator_id}` endpoint lose the following fields:

- `is_current_epoch_attester` (use `is_current_epoch_target_attester` instead)
- `is_previous_epoch_attester` (use `is_previous_epoch_target_attester` instead)
- `is_active_in_current_epoch` becomes `is_active_unslashed_in_current_epoch`.
- `is_active_in_previous_epoch` becomes `is_active_unslashed_in_previous_epoch`.

## Additional Info

NA

## TODO

- [x] Deal with total balances
- [x] Update validator_inclusion API
- [ ] Ensure `beacon_participation_prev_epoch_target_attester` and `beacon_participation_prev_epoch_head_attester` work before Altair

Co-authored-by: realbigsean <seananderson33@gmail.com>
2021-07-27 07:01:01 +00:00
Michael Sproul
f5bdca09ff Update to spec v1.1.0-beta.1 (#2460)
## Proposed Changes

Update to the latest version of the Altair spec, which includes new tests and a tweak to the target sync aggregators.

## Additional Info

This change is _not_ required for the imminent Altair devnet, and is waiting on the merge of #2321 to unstable.


Co-authored-by: Paul Hauner <paul@paulhauner.com>
2021-07-27 05:43:35 +00:00
Michael Sproul
84e6d71950 Tree hash caching and optimisations for Altair (#2459)
## Proposed Changes

Remove the remaining Altair `FIXME`s from consensus land.

1. Implement tree hash caching for the participation lists. This required some light type manipulation, including removing the `TreeHash` bound from `CachedTreeHash` which was purely descriptive.
2. Plumb the proposer index through Altair attestation processing, to avoid calculating it for _every_ attestation (potentially 128ms on large networks). This duplicates some work from #2431, but with the aim of getting it in sooner, particularly for the Altair devnets.
3. Removes two FIXMEs related to `superstruct` and cloning, which are unlikely to be particularly detrimental and will be tracked here instead: https://github.com/sigp/superstruct/issues/5
2021-07-23 00:23:53 +00:00
Michael Sproul
63923eaa29 Bump discv5 to v0.1.0-beta.8 (#2471)
## Proposed Changes

Update discv5 to fix bugs seen on `altair-devnet-1`
2021-07-21 07:10:52 +00:00
Age Manning
08fedbfcba
Libp2p Connection Limit (#2455)
* Get libp2p to handle connection limits

* fmt
2021-07-15 16:43:18 +10:00
Age Manning
6818a94171
Discovery update (#2458) 2021-07-15 16:43:18 +10:00
Age Manning
381befbf82
Ensure disconnecting peers are added to the peerdb (#2451) 2021-07-15 16:43:18 +10:00
Age Manning
059d9ec1b1
Gossipsub scoring improvements (#2391)
* Tweak gossipsub parameters for improved scoring

* Modify gossip history

* Update settings

* Make mesh window constant

* Decrease the mesh message deliveries weight

* Fmt
2021-07-15 16:43:18 +10:00
Age Manning
c62810b408
Update to Libp2p to 39.1 (#2448)
* Adjust beacon node timeouts for validator client HTTP requests (#2352)

Resolves #2313

Provide `BeaconNodeHttpClient` with a dedicated `Timeouts` struct.
This will allow granular adjustment of the timeout duration for different calls made from the VC to the BN. These can either be a constant value, or as a ratio of the slot duration.

Improve timeout performance by using these adjusted timeout duration's only whenever a fallback endpoint is available.

Add a CLI flag called `use-long-timeouts` to revert to the old behavior.

Additionally set the default `BeaconNodeHttpClient` timeouts to the be the slot duration of the network, rather than a constant 12 seconds. This will allow it to adjust to different network specifications.

Co-authored-by: Paul Hauner <paul@paulhauner.com>

* Use read_recursive locks in database (#2417)

Closes #2245

Replace all calls to `RwLock::read` in the `store` crate with `RwLock::read_recursive`.

* Unfortunately we can't run the deadlock detector on CI because it's pinned to an old Rust 1.51.0 nightly which cannot compile Lighthouse (one of our deps uses `ptr::addr_of!` which is too new). A fun side-project at some point might be to update the deadlock detector.
* The reason I think we haven't seen this deadlock (at all?) in practice is that _writes_ to the database's split point are quite infrequent, and a concurrent write is required to trigger the deadlock. The split point is only written when finalization advances, which is once per epoch (every ~6 minutes), and state reads are also quite sporadic. Perhaps we've just been incredibly lucky, or there's something about the timing of state reads vs database migration that protects us.
* I wrote a few small programs to demo the deadlock, and the effectiveness of the `read_recursive` fix: https://github.com/michaelsproul/relock_deadlock_mvp
* [The docs for `read_recursive`](https://docs.rs/lock_api/0.4.2/lock_api/struct.RwLock.html#method.read_recursive) warn of starvation for writers. I think in order for starvation to occur the database would have to be spammed with so many state reads that it's unable to ever clear them all and find time for a write, in which case migration of states to the freezer would cease. If an attack could be performed to trigger this starvation then it would likely trigger a deadlock in the current code, and I think ceasing migration is preferable to deadlocking in this extreme situation. In practice neither should occur due to protection from spammy peers at the network layer. Nevertheless, it would be prudent to run this change on the testnet nodes to check that it doesn't cause accidental starvation.

* Return more detail when invalid data is found in the DB during startup (#2445)

- Resolves #2444

Adds some more detail to the error message returned when the `BeaconChainBuilder` is unable to access or decode block/state objects during startup.

NA

* Use hardware acceleration for SHA256 (#2426)

Modify the SHA256 implementation in `eth2_hashing` so that it switches between `ring` and `sha2` to take advantage of [x86_64 SHA extensions](https://en.wikipedia.org/wiki/Intel_SHA_extensions). The extensions are available on modern Intel and AMD CPUs, and seem to provide a considerable speed-up: on my Ryzen 5950X it dropped state tree hashing times by about 30% from 35ms to 25ms (on Prater).

The extensions became available in the `sha2` crate [last year](https://www.reddit.com/r/rust/comments/hf2vcx/ann_rustcryptos_sha1_and_sha2_now_support/), and are not available in Ring, which uses a [pure Rust implementation of sha2](https://github.com/briansmith/ring/blob/main/src/digest/sha2.rs). Ring is faster on CPUs that lack the extensions so I've implemented a runtime switch to use `sha2` only when the extensions are available. The runtime switching seems to impose a miniscule penalty (see the benchmarks linked below).

* Start a release checklist (#2270)

NA

Add a checklist to the release draft created by CI. I know @michaelsproul was also working on this and I suspect @realbigsean also might have useful input.

NA

* Serious banning

* fmt

Co-authored-by: Mac L <mjladson@pm.me>
Co-authored-by: Paul Hauner <paul@paulhauner.com>
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
2021-07-15 16:43:18 +10:00
Age Manning
3c0d3227ab
Global Network Behaviour Refactor (#2442)
* Network upgrades (#2345)

* Discovery patch (#2382)

* Upgrade libp2p and unstable gossip

* Network protocol upgrades

* Correct dependencies, reduce incoming bucket limit

* Clean up dirty DHT entries before repopulating

* Update cargo lock

* Update lockfile

* Update ENR dep

* Update deps to specific versions

* Update test dependencies

* Update docker rust, and remote signer tests

* More remote signer test fixes

* Temp commit

* Update discovery

* Remove cached enrs after dialing

* Increase the session capacity, for improved efficiency

* Bleeding edge discovery (#2435)

* Update discovery banning logic and tokio

* Update to latest discovery

* Shift to latest discovery

* Fmt

* Initial re-factor of the behaviour

* More progress

* Missed changes

* First draft

* Discovery as a behaviour

* Adding back event waker (not convinced its neccessary, but have made this many changes already)

* Corrections

* Speed up discovery

* Remove double log

* Fmt

* After disconnect inform swarm about ban

* More fmt

* Appease clippy

* Improve ban handling

* Update tests

* Update cargo.lock

* Correct tests

* Downgrade log
2021-07-15 16:43:17 +10:00
Pawan Dhananjay
64226321b3
Relax requirement for enr fork digest predicate (#2433) 2021-07-15 16:43:17 +10:00
Age Manning
c1d2e35c9e
Bleeding edge discovery (#2435)
* Update discovery banning logic and tokio

* Update to latest discovery

* Shift to latest discovery

* Fmt
2021-07-15 16:43:17 +10:00
Age Manning
f4bc9db16d
Change the window mode of yamux (#2390) 2021-07-15 16:43:17 +10:00
Age Manning
6fb48b45fa
Discovery patch (#2382)
* Upgrade libp2p and unstable gossip

* Network protocol upgrades

* Correct dependencies, reduce incoming bucket limit

* Clean up dirty DHT entries before repopulating

* Update cargo lock

* Update lockfile

* Update ENR dep

* Update deps to specific versions

* Update test dependencies

* Update docker rust, and remote signer tests

* More remote signer test fixes

* Temp commit

* Update discovery

* Remove cached enrs after dialing

* Increase the session capacity, for improved efficiency
2021-07-15 16:43:17 +10:00
Age Manning
4aa06c9555
Network upgrades (#2345) 2021-07-15 16:43:10 +10:00
Paul Hauner
b0f5c4c776 Clarify eth1 error message (#2461)
## Issue Addressed

- Closes #2452

## Proposed Changes

Addresses: https://github.com/sigp/lighthouse/issues/2452#issuecomment-879873511

## Additional Info

NA
2021-07-15 04:22:06 +00:00
realbigsean
a3a7f39b0d [Altair] Sync committee pools (#2321)
Add pools supporting sync committees:
- naive sync aggregation pool
- observed sync contributions pool
- observed sync contributors pool
- observed sync aggregators pool

Add SSZ types and tests related to sync committee signatures.

Co-authored-by: Michael Sproul <michael@sigmaprime.io>
Co-authored-by: realbigsean <seananderson33@gmail.com>
2021-07-15 00:52:02 +00:00
Paul Hauner
fc4c611476 Remove msg about longer sync with remote eth1 nodes (#2453)
## Issue Addressed

- Resolves #2452

## Proposed Changes

I've seen a few people confused by this and I don't think the message is really worth it.

## Additional Info

NA
2021-07-14 05:24:09 +00:00
divma
304fb05e44 Maintain attestations that reference unknown blocks (#2319)
## Issue Addressed

#635 

## Proposed Changes
- Keep attestations that reference a block we have not seen for 30secs before being re processed
- If we do import the block before that time elapses, it is reprocessed in that moment
- The first time it fails, do nothing wrt to gossipsub propagation or peer downscoring. If after being re processed it fails, downscore with a `LowToleranceError` and ignore the message.
2021-07-14 05:24:08 +00:00
Paul Hauner
9656ffee7c Metrics for sync aggregate fullness (#2439)
## Issue Addressed

NA

## Proposed Changes

Adds a metric to see how many set bits are in the sync aggregate for each beacon block being imported.

## Additional Info

NA
2021-07-13 02:22:55 +00:00
Paul Hauner
27aec1962c Add more detail to "Prior attestation known" log (#2447)
## Issue Addressed

NA

## Proposed Changes

Adds more detail to the log when an attestation is ignored due to a prior one being known. This will help identify which validators are causing the issue.

## Additional Info

NA
2021-07-13 01:02:03 +00:00
Paul Hauner
a7b7134abb Return more detail when invalid data is found in the DB during startup (#2445)
## Issue Addressed

- Resolves #2444

## Proposed Changes

Adds some more detail to the error message returned when the `BeaconChainBuilder` is unable to access or decode block/state objects during startup.

## Additional Info

NA
2021-07-12 07:31:27 +00:00
Michael Sproul
371c216ac3 Use read_recursive locks in database (#2417)
## Issue Addressed

Closes #2245

## Proposed Changes

Replace all calls to `RwLock::read` in the `store` crate with `RwLock::read_recursive`.

## Additional Info

* Unfortunately we can't run the deadlock detector on CI because it's pinned to an old Rust 1.51.0 nightly which cannot compile Lighthouse (one of our deps uses `ptr::addr_of!` which is too new). A fun side-project at some point might be to update the deadlock detector.
* The reason I think we haven't seen this deadlock (at all?) in practice is that _writes_ to the database's split point are quite infrequent, and a concurrent write is required to trigger the deadlock. The split point is only written when finalization advances, which is once per epoch (every ~6 minutes), and state reads are also quite sporadic. Perhaps we've just been incredibly lucky, or there's something about the timing of state reads vs database migration that protects us.
* I wrote a few small programs to demo the deadlock, and the effectiveness of the `read_recursive` fix: https://github.com/michaelsproul/relock_deadlock_mvp
* [The docs for `read_recursive`](https://docs.rs/lock_api/0.4.2/lock_api/struct.RwLock.html#method.read_recursive) warn of starvation for writers. I think in order for starvation to occur the database would have to be spammed with so many state reads that it's unable to ever clear them all and find time for a write, in which case migration of states to the freezer would cease. If an attack could be performed to trigger this starvation then it would likely trigger a deadlock in the current code, and I think ceasing migration is preferable to deadlocking in this extreme situation. In practice neither should occur due to protection from spammy peers at the network layer. Nevertheless, it would be prudent to run this change on the testnet nodes to check that it doesn't cause accidental starvation.
2021-07-12 07:31:26 +00:00
Mac L
b3c7e59a5b Adjust beacon node timeouts for validator client HTTP requests (#2352)
## Issue Addressed

Resolves #2313 

## Proposed Changes

Provide `BeaconNodeHttpClient` with a dedicated `Timeouts` struct.
This will allow granular adjustment of the timeout duration for different calls made from the VC to the BN. These can either be a constant value, or as a ratio of the slot duration.

Improve timeout performance by using these adjusted timeout duration's only whenever a fallback endpoint is available.

Add a CLI flag called `use-long-timeouts` to revert to the old behavior.

## Additional Info

Additionally set the default `BeaconNodeHttpClient` timeouts to the be the slot duration of the network, rather than a constant 12 seconds. This will allow it to adjust to different network specifications.


Co-authored-by: Paul Hauner <paul@paulhauner.com>
2021-07-12 01:47:48 +00:00
Michael Sproul
b4689e20c6 Altair consensus changes and refactors (#2279)
## Proposed Changes

Implement the consensus changes necessary for the upcoming Altair hard fork.

## Additional Info

This is quite a heavy refactor, with pivotal types like the `BeaconState` and `BeaconBlock` changing from structs to enums. This ripples through the whole codebase with field accesses changing to methods, e.g. `state.slot` => `state.slot()`.


Co-authored-by: realbigsean <seananderson33@gmail.com>
2021-07-09 06:15:32 +00:00
Paul Hauner
78e5c0c157 Capture a missed VC error (#2436)
## Issue Addressed

Related to #2430, #2394

## Proposed Changes

As per https://github.com/sigp/lighthouse/issues/2430#issuecomment-875323615, ensure that the `ProductionValidatorClient::new` error raises a log and shuts down the VC. Also, I implemened `spawn_ignoring_error`, as per @michaelsproul's suggestion in https://github.com/sigp/lighthouse/pull/2436#issuecomment-876084419.

I got unlucky and CI picked up a [new rustsec vuln](https://rustsec.org/advisories/RUSTSEC-2021-0072). To fix this, I had to update the following crates:

- `tokio`
- `web3`
- `tokio-compat-02`

## Additional Info

NA
2021-07-09 03:20:24 +00:00
Mac L
406e3921d9 Use forwards iterator for state root lookups (#2422)
## Issue Addressed

#2377 

## Proposed Changes

Implement the same code used for block root lookups (from #2376) to state root lookups in order to improve performance and reduce associated memory spikes (e.g. from certain HTTP API requests).

## Additional Changes

- Tests using `rev_iter_state_roots` and `rev_iter_block_roots` have been refactored to use their `forwards` versions instead.
- The `rev_iter_state_roots` and `rev_iter_block_roots` functions are now unused and have been removed.
- The `state_at_slot` function has been changed to use the `forwards` iterator.

## Additional Info

- Some tests still need to be refactored to use their `forwards_iter` versions. These tests start their iteration from a specific beacon state and thus use the `rev_iter_state_roots_from` and `rev_iter_block_roots_from` functions. If they can be refactored, those functions can also be removed.
2021-07-06 02:38:53 +00:00
Age Manning
73d002ef92 Update outdated dependencies (#2425)
This updates some older dependencies to address a few cargo audit warnings.

The majority of warnings come from network dependencies which will be addressed in #2389. 

This PR contains some minor dep updates that are not network related.

Co-authored-by: Michael Sproul <michael@sigmaprime.io>
2021-07-05 00:54:17 +00:00
realbigsean
b84ff9f793 rust 1.53.0 updates (#2411)
## Issue Addressed

`make lint` failing on rust 1.53.0.

## Proposed Changes

1.53.0 updates

## Additional Info

I haven't figure out why yet, we were now hitting the recursion limit in a few crates. So I had to add `#![recursion_limit = "256"]` in a few places


Co-authored-by: realbigsean <seananderson33@gmail.com>
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
2021-06-18 05:58:01 +00:00
Michael Sproul
3dc1eb5eb6 Ignore inactive validators in validator monitor (#2396)
## Proposed Changes

A user on Discord (`@ChewsMacRibs`) reported that the validator monitor was logging `WARN Attested to an incorrect head` for their validator while it was awaiting activation.

This PR modifies the monitor so that it ignores inactive validators, by the logic that they are either awaiting activation, or have already exited. Either way, there's no way for an inactive validator to have their attestations included on chain, so no need for the monitor to report on them.

## Additional Info

To reproduce the bug requires registering validator keys manually with `--validator-monitor-pubkeys`. I don't think the bug will present itself with `--validator-monitor-auto`.
2021-06-17 02:10:48 +00:00
Jack
98ab00cc52 Handle Geth pre-EIP-155 block sync error condition (#2304)
## Issue Addressed

#2293 

## Proposed Changes

 - Modify the handler for the `eth_chainId` RPC (i.e., `get_chain_id`) to explicitly match against the Geth error string returned for pre-EIP-155 synced Geth nodes
 - ~~Add a new helper function, `rpc_error_msg`, to aid in the above point~~
 - Refactor `response_result` into `response_result_or_error` and patch reliant RPC handlers accordingly (thanks to @pawanjay176)

## Additional Info

Geth, as of Pangaea Expanse (v1.10.0), returns an explicit error when it is not synced past the EIP-155 block (2675000). Previously, Geth simply returned a chain ID of 0 (which was obviously much easier to handle on Lighthouse's part).


Co-authored-by: Paul Hauner <paul@paulhauner.com>
2021-06-17 02:10:47 +00:00
realbigsean
b1657a60e9 Reorg events (#2090)
## Issue Addressed

Resolves #2088

## Proposed Changes

Add the `chain_reorg` SSE event topic

## Additional Info


Co-authored-by: realbigsean <seananderson33@gmail.com>
Co-authored-by: Paul Hauner <paul@paulhauner.com>
2021-06-17 02:10:46 +00:00
divma
3261eff0bf split outbound and inbound codecs encoded types (#2410)
Splits the inbound and outbound requests, for maintainability.
2021-06-17 00:40:16 +00:00
Paul Hauner
3b600acdc5 v1.4.0 (#2402)
## Issue Addressed

NA

## Proposed Changes

- Bump versions and update `Cargo.lock`

## Additional Info

NA

## TODO

- [x] Ensure #2398 gets merged succesfully
2021-06-10 01:44:49 +00:00
Paul Hauner
93100f221f Make less logs for attn with unknown head (#2395)
## Issue Addressed

NA

## Proposed Changes

I am starting to see a lot of slog-async overflows (i.e., too many logs) on Prater whenever we see attestations for an unknown block. Since these logs are identical (except for peer id) and we expose volume/count of these errors via `metrics::GOSSIP_ATTESTATION_ERRORS_PER_TYPE`, I took the following actions to remove them from `DEBUG` logs:

- Push the "Attestation for unknown block" log to trace.
- Add a debug log in `search_for_block`. In effect, this should serve as a de-duped version of the previous, downgraded log.

## Additional Info

TBC
2021-06-07 02:34:09 +00:00
Pawan Dhananjay
502402c6b9 Fix options for --eth1-endpoints flag (#2392)
## Issue Addressed

N/A

## Proposed Changes

Set `config.sync_eth1_chain` to true when using just the  `--eth1-endpoints` flag (without `--eth1`).
2021-06-04 00:10:59 +00:00
Paul Hauner
f6280aa663 v1.4.0-rc.0 (#2379)
## Issue Addressed

NA

## Proposed Changes

Bump versions.

## Additional Info

This is not exactly the v1.4.0 release described in [Lighthouse Update #36](https://lighthouse.sigmaprime.io/update-36.html).

Whilst it contains:

- Beta Windows support
- A reduction in Eth1 queries
- A reduction in memory footprint

It does not contain:

- Altair
- Doppelganger Protection
- The remote signer

We have decided to release some features early. This is primarily due to the desire to allow users to benefit from the memory saving improvements as soon as possible.

## TODO

- [x] Wait for #2340, #2356 and #2376 to merge and then rebase on `unstable`. 
- [x] Ensure discovery issues are fixed (see #2388)
- [x] Ensure https://github.com/sigp/lighthouse/pull/2382 is merged/removed.
- [x] Ensure https://github.com/sigp/lighthouse/pull/2383 is merged/removed.
- [x] Ensure https://github.com/sigp/lighthouse/pull/2384 is merged/removed.
- [ ] Double-check eth1 cache is carried between boots
2021-06-03 00:13:02 +00:00
Paul Hauner
90ea075c62 Revert "Network protocol upgrades (#2345)" (#2388)
## Issue Addressed

NA

## Proposed Changes

Reverts #2345 in the interests of getting v1.4.0 out this week. Once we have released that, we can go back to testing this again.

## Additional Info

NA
2021-06-02 01:07:28 +00:00
Paul Hauner
d34f922c1d Add early check for RPC block relevancy (#2289)
## Issue Addressed

NA

## Proposed Changes

When observing `jemallocator` heap profiles and Grafana, it became clear that Lighthouse is spending significant RAM/CPU on processing blocks from the RPC. On investigation, it seems that we are loading the parent of the block *before* we check to see if the block is already known. This is a big waste of resources.

This PR adds an additional `check_block_relevancy` call as the first thing we do when we try to process a `SignedBeaconBlock` via the RPC (or other similar methods). Ultimately, `check_block_relevancy` will be called again later in the block processing flow. It's a very light function and I don't think trying to optimize it out is worth the risk of a bad block slipping through. 

Also adds a `New RPC block received` info log when we process a new RPC block. This seems like interesting and infrequent info.

## Additional Info

NA
2021-06-02 01:07:27 +00:00
Paul Hauner
bf4e02e2cc Return a specific error for frozen attn states (#2384)
## Issue Addressed

NA

## Proposed Changes

Return a very specific error when at attestation reads shuffling from a frozen `BeaconState`. Previously, this was returning `MissingBeaconState` which indicates a much more serious issue.

## Additional Info

Since `get_inconsistent_state_for_attestation_verification_only` is only called once in `BeaconChain::with_committee_cache`, it is quite easy to reason about the impact of this change.
2021-06-01 06:59:43 +00:00
Paul Hauner
ba9c4c5eea Return more detail in Eth1 HTTP errors (#2383)
## Issue Addressed

NA

## Proposed Changes

Whilst investigating #2372, I [learned](https://github.com/sigp/lighthouse/issues/2372#issuecomment-851725049) that the error message returned from some failed Eth1 requests are always `NotReachable`. This makes debugging quite painful.

This PR adds more detail to these errors. For example:

- Bad infura key: `ERRO Failed to update eth1 cache             error: Failed to update Eth1 service: "All fallback errored: https://mainnet.infura.io/ => EndpointError(RequestFailed(\"Response HTTP status was not 200 OK:  401 Unauthorized.\"))", retry_millis: 60000, service: eth1_rpc`
- Unreachable server: `ERRO Failed to update eth1 cache             error: Failed to update Eth1 service: "All fallback errored: http://127.0.0.1:8545/ => EndpointError(RequestFailed(\"Request failed: reqwest::Error { kind: Request, url: Url { scheme: \\\"http\\\", cannot_be_a_base: false, username: \\\"\\\", password: None, host: Some(Ipv4(127.0.0.1)), port: Some(8545), path: \\\"/\\\", query: None, fragment: None }, source: hyper::Error(Connect, ConnectError(\\\"tcp connect error\\\", Os { code: 111, kind: ConnectionRefused, message: \\\"Connection refused\\\" })) }\"))", retry_millis: 60000, service: eth1_rpc`
- Bad server: `ERRO Failed to update eth1 cache             error: Failed to update Eth1 service: "All fallback errored: http://127.0.0.1:8545/ => EndpointError(RequestFailed(\"Response HTTP status was not 200 OK:  501 Not Implemented.\"))", retry_millis: 60000, service: eth1_rpc`

## Additional Info

NA
2021-06-01 06:59:41 +00:00
Paul Hauner
4c7bb4984c Use the forwards iterator more often (#2376)
## Issue Addressed

NA

## Primary Change

When investigating memory usage, I noticed that retrieving a block from an early slot (e.g., slot 900) would cause a sharp increase in the memory footprint (from 400mb to 800mb+) which seemed to be ever-lasting.

After some investigation, I found that the reverse iteration from the head back to that slot was the likely culprit. To counter this, I've switched the `BeaconChain::block_root_at_slot` to use the forwards iterator, instead of the reverse one.

I also noticed that the networking stack is using `BeaconChain::root_at_slot` to check if a peer is relevant (`check_peer_relevance`). Perhaps the steep, seemingly-random-but-consistent increases in memory usage are caused by the use of this function.

Using the forwards iterator with the HTTP API alleviated the sharp increases in memory usage. It also made the response much faster (before it felt like to took 1-2s, now it feels instant).

## Additional Changes

In the process I also noticed that we have two functions for getting block roots:

- `BeaconChain::block_root_at_slot`: returns `None` for a skip slot.
- `BeaconChain::root_at_slot`: returns the previous root for a skip slot.

I unified these two functions into `block_root_at_slot` and added the `WhenSlotSkipped` enum. Now, the caller must be explicit about the skip-slot behaviour when requesting a root. 

Additionally, I replaced `vec![]` with `Vec::with_capacity` in `store::chunked_vector::range_query`. I stumbled across this whilst debugging and made this modification to see what effect it would have (not much). It seems like a decent change to keep around, but I'm not concerned either way.

Also, `BeaconChain::get_ancestor_block_root` is unused, so I got rid of it 🗑️.

## Additional Info

I haven't also done the same for state roots here. Whilst it's possible and a good idea, it's more work since the fwds iterators are presently block-roots-specific.

Whilst there's a few places a reverse iteration of state roots could be triggered (e.g., attestation production, HTTP API), they're no where near as common as the `check_peer_relevance` call. As such, I think we should get this PR merged first, then come back for the state root iters. I made an issue here https://github.com/sigp/lighthouse/issues/2377.
2021-05-31 04:18:20 +00:00
Kevin Lu
320a683e72 Minimum Outbound-Only Peers Requirement (#2356)
## Issue Addressed

#2325 

## Proposed Changes

This pull request changes the behavior of the Peer Manager by including a minimum outbound-only peers requirement. The peer manager will continue querying for peers if this outbound-only target number hasn't been met. Additionally, when peers are being removed, an outbound-only peer will not be disconnected if doing so brings us below the minimum. 

## Additional Info

Unit test for heartbeat function tests that disconnection behavior is correct. Continual querying for peers if outbound-only hasn't been met is not directly tested, but indirectly through unit testing of the helper function that counts the number of outbound-only peers.

EDIT: Am concerned about the behavior of ```update_peer_scores```. If we have connected to a peer with a score below the disconnection threshold (-20), then its connection status will remain connected, while its score state will change to disconnected. 

```rust
let previous_state = info.score_state();            
// Update scores            
info.score_update();
Self::handle_score_transitions(                
               previous_state,
                peer_id,
                info, 
               &mut to_ban_peers,
               &mut to_unban_peers,
               &mut self.events,
               &self.log,
);
```

```previous_state``` will be set to Disconnected, and then because ```handle_score_transitions``` only changes connection status for a peer if the state changed, the peer remains connected. Then in the heartbeat code, because we only disconnect healthy peers if we have too many peers, these peers don't get disconnected. I'm not sure realistically how often this scenario would occur, but it might be better to adjust the logic to account for scenarios where the score state implies a connection status different from the current connection status. 

Co-authored-by: Kevin Lu <kevlu93@gmail.com>
2021-05-31 04:18:19 +00:00
Mac L
0847986936 Reduce outbound requests to eth1 endpoints (#2340)
## Issue Addressed

#2282 

## Proposed Changes

Reduce the outbound requests made to eth1 endpoints by caching the results from `eth_chainId` and `net_version`.
Further reduce the overall request count by increasing `auto_update_interval_millis` from `7_000` (7 seconds) to `60_000` (1 minute). 
This will result in a reduction from ~2000 requests per hour to 360 requests per hour (during normal operation). A reduction of 82%.

## Additional Info

If an endpoint fails, its state is dropped from the cache and the `eth_chainId` and `net_version` calls will be made for that endpoint again during the regular update cycle (once per minute) until it is back online.


Co-authored-by: Paul Hauner <paul@paulhauner.com>
2021-05-31 04:18:18 +00:00
Age Manning
ec5cceba50 Correct issue with dialing peers (#2375)
The ordering of adding new peers to the peerdb and deciding when to dial them was not considered in a previous update.

This adds the condition that if a peer is not in the peer-db then it is an acceptable peer to dial.

This makes #2374 obsolete.
2021-05-29 07:25:06 +00:00
Age Manning
d12e746b50 Network protocol upgrades (#2345)
This provides a number of upgrades to gossipsub and discovery. 

The updates are extensive and this needs thorough testing.
2021-05-28 22:02:10 +00:00
Paul Hauner
456b313665 Tune GNU malloc (#2299)
## Issue Addressed

NA

## Proposed Changes

Modify the configuration of [GNU malloc](https://www.gnu.org/software/libc/manual/html_node/The-GNU-Allocator.html) to reduce memory footprint.

- Set `M_ARENA_MAX` to 4.
    - This reduces memory fragmentation at the cost of contention between threads.
- Set `M_MMAP_THRESHOLD` to 2mb
    - This means that any allocation >= 2mb is allocated via an anonymous mmap, instead of on the heap/arena. This reduces memory fragmentation since we don't need to keep growing the heap to find big contiguous slabs of free memory.
- ~~Run `malloc_trim` every 60 seconds.~~
    - ~~This shaves unused memory from the top of the heap, preventing the heap from constantly growing.~~
    - Removed, see: https://github.com/sigp/lighthouse/pull/2299#issuecomment-825322646

*Note: this only provides memory savings on the Linux (glibc) platform.*
    
## Additional Info

I'm going to close #2288 in favor of this for the following reasons:

- I've managed to get the memory footprint *smaller* here than with jemalloc.
- This PR seems to be less of a dramatic change than bringing in the jemalloc dep.
- The changes in this PR are strictly runtime changes, so we can create CLI flags which disable them completely. Since this change is wide-reaching and complex, it's nice to have an easy "escape hatch" if there are undesired consequences.

## TODO

- [x] Allow configuration via CLI flags
- [x] Test on Mac
- [x] Test on RasPi.
- [x] Determine if GNU malloc is present?
    - I'm not quite sure how to detect for glibc.. This issue suggests we can't really: https://github.com/rust-lang/rust/issues/33244
- [x] Make a clear argument regarding the affect of this on CPU utilization.
- [x] Test with higher `M_ARENA_MAX` values.
- [x] Test with longer trim intervals
- [x] Add some stats about memory savings
- [x] Remove `malloc_trim` calls & code
2021-05-28 05:59:45 +00:00
Pawan Dhananjay
fdaeec631b Monitoring service api (#2251)
## Issue Addressed

N/A

## Proposed Changes

Adds a client side api for collecting system and process metrics and pushing it to a monitoring service.
2021-05-26 05:58:41 +00:00
Age Manning
55aada006f
More stringent dialing (#2363)
* More stringent dialing

* Cover cached enr dialing
2021-05-26 14:21:44 +10:00
ethDreamer
ba55e140ae Enable Compatibility with Windows (#2333)
## Issue Addressed

Windows incompatibility.

## Proposed Changes

On windows, lighthouse needs to default to STDIN as tty doesn't exist. Also Windows uses ACLs for file permissions. So to mirror chmod 600, we will remove every entry in a file's ACL and add only a single SID that is an alias for the file owner.

Beyond that, there were several changes made to different unit tests because windows has slightly different error messages as well as frustrating nuances around killing a process :/

## Additional Info

Tested on my Windows VM and it appears to work, also compiled & tested on Linux with these changes. Permissions look correct on both platforms now. Just waiting for my validator to activate on Prater so I can test running full validator client on windows.

Co-authored-by: ethDreamer <37123614+ethDreamer@users.noreply.github.com>
Co-authored-by: Michael Sproul <micsproul@gmail.com>
2021-05-19 23:05:16 +00:00
ethDreamer
cb47388ad7 Updated to comply with new clippy formatting rules (#2336)
## Issue Addressed

The latest version of Rust has new clippy rules & the codebase isn't up to date with them.

## Proposed Changes

Small formatting changes that clippy tells me are functionally equivalent
2021-05-10 00:53:09 +00:00
Mac L
bacc38c3da Add testing for beacon node and validator client CLI flags (#2311)
## Issue Addressed

N/A

## Proposed Changes

Add unit tests for the various CLI flags associated with the beacon node and validator client. These changes require the addition of two new flags: `dump-config` and `immediate-shutdown`.

## Additional Info

Both `dump-config` and `immediate-shutdown` are marked as hidden since they should only be used in testing and other advanced use cases.
**Note:** This requires changing `main.rs` so that the flags can adjust the program behavior as necessary.

Co-authored-by: Paul Hauner <paul@paulhauner.com>
2021-05-06 00:36:22 +00:00
Mac L
4cc613d644 Add SensitiveUrl to redact user secrets from endpoints (#2326)
## Issue Addressed

#2276 

## Proposed Changes

Add the `SensitiveUrl` struct which wraps `Url` and implements custom `Display` and `Debug` traits to redact user secrets from being logged in eth1 endpoints, beacon node endpoints and metrics.

## Additional Info

This also includes a small rewrite of the eth1 crate to make requests using `Url` instead of `&str`. 
Some error messages have also been changed to remove `Url` data.
2021-05-04 01:59:51 +00:00
ethDreamer
0aa8509525 Filter Disconnected Peers from Discv5 DHT (#2219)
## Issue Addressed
#2107

## Proposed Change
The peer manager will mark peers as disconnected in the discv5 DHT when they disconnect or dial fails

## Additional Info
Rationale for this particular change is explained in my comment on #2107
2021-04-28 04:07:37 +00:00
realbigsean
2c2c443718 404's on API requests for slots that have been skipped or orphaned (#2272)
## Issue Addressed

Resolves #2186

## Proposed Changes

404 for any block-related information on a slot that was skipped or orphaned

Affected endpoints:
- `/eth/v1/beacon/blocks/{block_id}`
- `/eth/v1/beacon/blocks/{block_id}/root`
- `/eth/v1/beacon/blocks/{block_id}/attestations`
- `/eth/v1/beacon/headers/{block_id}`

## Additional Info



Co-authored-by: realbigsean <seananderson33@gmail.com>
2021-04-25 03:59:59 +00:00
Paul Hauner
3a24ca5f14 v1.3.0 (#2310)
## Issue Addressed

NA

## Proposed Changes

Bump versions.

## Additional Info

This is a minor release (not patch) due to the very slight change introduced by #2291.
2021-04-13 22:46:34 +00:00
Michael Sproul
3b901dc5ec Pack attestations into blocks in parallel (#2307)
## Proposed Changes

Use two instances of max cover when packing attestations into blocks: one for the previous epoch, and one for the current epoch. This reduces the amount of computation done by roughly half due to the `O(n^2)` running time of max cover (`2 * (n/2)^2 = n^2/2`). This should help alleviate some load on block proposal, particularly on Prater.
2021-04-13 05:27:42 +00:00
Paul Hauner
c1203f5e52 Add specific log and metric for delayed blocks (#2308)
## Issue Addressed

NA

## Proposed Changes

- Adds a specific log and metric for when a block is enshrined as head with a delay that will caused bad attestations
    - We *technically* already expose this information, but it's a little tricky to determine during debugging. This makes it nice and explicit.
- Fixes a minor reporting bug with the validator monitor where it was expecting agg. attestations too early (at half-slot rather than two-thirds-slot).

## Additional Info

NA
2021-04-13 02:16:59 +00:00
Paul Hauner
0df7be1814 Add check for aggregate target (#2306)
## Issue Addressed
NA

## Proposed Changes

- Ensure that the [target consistency check](b356f52c5c) is always performed on aggregates.
- Add a regression test.

## Additional Info

NA
2021-04-13 00:24:39 +00:00
Age Manning
aaa14073ff Clean up warnings (#2240)
This is a small PR that cleans up compiler warnings. 

The most controversial change is removing the `data_dir` field from the `BeaconChainBuilder`. 

It was removed because it was never read.


Co-authored-by: Paul Hauner <paul@paulhauner.com>
Co-authored-by: Herman Junge <hermanjunge@protonmail.com>
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
2021-04-12 00:57:43 +00:00
Mac L
f6f64cf0f5 Correcting disable-enr-auto-update flag definition (#2303)
## Issue Addressed

N/A

## Proposed Changes

Correct the `disable-enr-auto-update` boolean flag so that it no longer requires a value.
Previously it would require a value which was never used.

## Additional Info

Flag is read here: https://github.com/sigp/lighthouse/blob/unstable/beacon_node/src/config.rs#L585-L587
2021-04-11 23:52:29 +00:00
Paul Hauner
e7e5878953 Avoid BeaconState clone during metrics scrape (#2298)
## Issue Addressed

Which issue # does this PR address?

## Proposed Changes

Avoids cloning the `BeaconState` each time Prometheus scrapes our metrics (generally every 5s 😱).

I think the original motivation behind this was *"don't hold the lock on the head whilst we do computation on it"*, however I think is flawed since our computation here is so small that it'll be quicker than the clone.

The primary motivation here is to maintain a small memory footprint by holding less in memory (i.e., the cloned `BeaconState`) and to avoid the fragmentation-creep that occurs when cloning the big contiguous slabs of memory in the `BeaconState`.

I also collapsed the active/slashed/withdrawn counters into a single loop to increase efficiency.

## Additional Info

NA
2021-04-07 01:02:56 +00:00
Pawan Dhananjay
95a362213d Fix local testnet scripts (#2229)
## Issue Addressed

Resolves #2094 

## Proposed Changes

Fixes scripts for creating local testnets. Adds an option in `lighthouse boot_node` to run with a previously generated enr.
2021-03-30 05:17:58 +00:00
Paul Hauner
9eb1945136 v1.2.2 (#2287)
## Issue Addressed

NA

## Proposed Changes

- Bump versions

## Additional Info

NA
2021-03-30 04:07:03 +00:00
Paul Hauner
3d239b85ac Allow for a clock disparity on the duties endpoints (#2283)
## Issue Addressed

Resolves #2280

## Proposed Changes

Allows for API consumers to call the proposer/attester duties endpoints [`MAXIMUM_GOSSIP_CLOCK_DISPARITY`](b34a79dc0b/beacon_node/beacon_chain/src/beacon_chain.rs (L99-L102)) earlier than the current epoch. For additional reasoning, see https://github.com/sigp/lighthouse/issues/2280#issuecomment-805358897.

## Additional Info

NA
2021-03-29 23:42:35 +00:00
Paul Hauner
03cefd0065 Expand observed attestations capacity (#2266)
## Issue Addressed

NA

## Proposed Changes

I noticed the following error on one of our nodes:

```
Mar 18 00:03:35 ip-xxxx lighthouse-bn[333503]: Mar 18 00:03:35.103 ERRO Unable to validate aggregate            error: ObservedAttestersError(EpochTooLow { epoch: Epoch(23961), lowest_permissible_epoch: Epoch(23962) }), peer_id: 16Uiu2HAm5GL5KzPLhvfg9MBBFSpBqTVGRFSiTg285oezzWcZzwEv
```

The slot during this log was 766,815 (the last slot of the epoch). I believe this is due to an off-by-one error in `observed_attesters` where we were failing to provide enough capacity to store observations from the previous, current and next epochs. See code comments for further reasoning.

Here's a link to the spec: https://github.com/ethereum/eth2.0-specs/blob/v1.0.1/specs/phase0/p2p-interface.md#beacon_aggregate_and_proof

## Additional Info

NA
2021-03-29 23:42:34 +00:00
Michael Sproul
f9d60f5436 VC: accept unknown fields in chain spec (#2277)
## Issue Addressed

Closes #2274

## Proposed Changes

* Modify the `YamlConfig` to collect unknown fields into an `extra_fields` map, instead of failing hard.
* Log a debug message if there are extra fields returned to the VC from one of its BNs.

This restores Lighthouse's compatibility with Teku beacon nodes (and therefore Infura)
2021-03-26 04:53:57 +00:00
Paul Hauner
b34a79dc0b v1.2.1 (#2263)
## Issue Addressed

NA

## Proposed Changes

- Bump version.
- Add some new ENR for Prater
    - Afri: https://github.com/eth2-clients/eth2-testnets/pull/42
    - Prysm: https://github.com/eth2-clients/eth2-testnets/pull/43
- Apply the fixes from #2181 to the no-eth1-sim to try fix CI issues. 

## Additional Info

NA
2021-03-18 04:20:46 +00:00
Paul Hauner
015ab7d0a7 Optimize validator duties (#2243)
## Issue Addressed

Closes #2052

## Proposed Changes

- Refactor the attester/proposer duties endpoints in the BN
    - Performance improvements
    - Fixes some potential inconsistencies with the dependent root fields.
    - Removes `http_api::beacon_proposer_cache` and just uses the one on the `BeaconChain` instead.
    - Move the code for the proposer/attester duties endpoints into separate files, for readability.
- Refactor the `DutiesService` in the VC
    - Required to reduce the delay on broadcasting new blocks.
    - Gets rid of the `ValidatorDuty` shim struct that came about when we adopted the standard API.
    - Separate block/attestation duty tasks so that they don't block each other when one is slow.
- In the VC, use `PublicKeyBytes` to represent validators instead of `PublicKey`. `PublicKey` is a legit crypto object whilst `PublicKeyBytes` is just a byte-array, it's much faster to clone/hash `PublicKeyBytes` and this change has had a significant impact on runtimes.
    - Unfortunately this has created lots of dust changes.
 - In the BN, store `PublicKeyBytes` in the `beacon_proposer_cache` and allow access to them. The HTTP API always sends `PublicKeyBytes` over the wire and the conversion from `PublicKey` -> `PublickeyBytes` is non-trivial, especially when queries have 100s/1000s of validators (like Pyrmont).
 - Add the `state_processing::state_advance` mod which dedups a lot of the "apply `n` skip slots to the state" code.
    - This also fixes a bug with some functions which were failing to include a state root as per [this comment](072695284f/consensus/state_processing/src/state_advance.rs (L69-L74)). I couldn't find any instance of this bug that resulted in anything more severe than keying a shuffling cache by the wrong block root.
 - Swap the VC block service to use `mpsc` from `tokio` instead of `futures`. This is consistent with the rest of the code base.
    
~~This PR *reduces* the size of the codebase 🎉~~ It *used* to reduce the size of the code base before I added more comments. 

## Observations on Prymont

- Proposer duties times down from peaks of 450ms to consistent <1ms.
- Current epoch attester duties times down from >1s peaks to a consistent 20-30ms.
- Block production down from +600ms to 100-200ms.

## Additional Info

- ~~Blocked on #2241~~
- ~~Blocked on #2234~~

## TODO

- [x] ~~Refactor this into some smaller PRs?~~ Leaving this as-is for now.
- [x] Address `per_slot_processing` roots.
- [x] Investigate slow next epoch times. Not getting added to cache on block processing?
- [x] Consider [this](072695284f/beacon_node/store/src/hot_cold_store.rs (L811-L812)) in the scenario of replacing the state roots


Co-authored-by: pawan <pawandhananjay@gmail.com>
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
2021-03-17 05:09:57 +00:00
Michael Sproul
3919737978 Release v1.2.0 (#2249)
## Proposed Changes

Release v1.2.0 unchanged from the release candidate.
2021-03-10 01:28:32 +00:00
Michael Sproul
770a2ca030 Fix proposer cache priming upon state advance (#2252)
## Proposed Changes

While investigating an incorrect head + target vote for the epoch boundary block 708544, I noticed that the state advance failed to prime the proposer cache, as per these logs:

```
Mar 09 21:42:47.448 DEBG Subscribing to subnet                   target_slot: 708544, subnet: Y, service: attestation_service
Mar 09 21:49:08.063 DEBG Advanced head state one slot            current_slot: 708543, state_slot: 708544, head_root: 0xaf5e69de09f384ee3b4fb501458b7000c53bb6758a48817894ec3d2b030e3e6f, service: state_advance
Mar 09 21:49:08.063 DEBG Completed state advance                 initial_slot: 708543, advanced_slot: 708544, head_root: 0xaf5e69de09f384ee3b4fb501458b7000c53bb6758a48817894ec3d2b030e3e6f, service: state_advance
Mar 09 21:49:14.787 DEBG Proposer shuffling cache miss           block_slot: 708544, block_root: 0x9b14bf68667ab1d9c35e6fd2c95ff5d609aa9e8cf08e0071988ae4aa00b9f9fe, parent_slot: 708543, parent_root: 0xaf5e69de09f384ee3b4fb501458b7000c53bb6758a48817894ec3d2b030e3e6f, service: beacon
Mar 09 21:49:14.800 DEBG Successfully processed gossip block     root: 0x9b14bf68667ab1d9c35e6fd2c95ff5d609aa9e8cf08e0071988ae4aa00b9f9fe, slot: 708544, graffiti: , service: beacon
Mar 09 21:49:14.800 INFO New block received                      hash: 0x9b14…f9fe, slot: 708544
Mar 09 21:49:14.984 DEBG Head beacon block                       slot: 708544, root: 0x9b14…f9fe, finalized_epoch: 22140, finalized_root: 0x28ec…29a7, justified_epoch: 22141, justified_root: 0x59db…e451, service: beacon
Mar 09 21:49:15.055 INFO Unaggregated attestation                validator: XXXXX, src: api, slot: 708544, epoch: 22142, delay_ms: 53, index: Y, head: 0xaf5e69de09f384ee3b4fb501458b7000c53bb6758a48817894ec3d2b030e3e6f, service: val_mon
Mar 09 21:49:17.001 DEBG Slot timer                              sync_state: Synced, current_slot: 708544, head_slot: 708544, head_block: 0x9b14…f9fe, finalized_epoch: 22140, finalized_root: 0x28ec…29a7, peers: 55, service: slot_notifier
```

The reason for this is that the condition was backwards, so that whole block of code was unreachable.

Looking at the attestations for the block included in the block after, we can see that lots of validators missed it. Some of them may be Lighthouse v1.1.1-v1.2.0-rc.0, but it's probable that they would have missed even with the proposer cache primed, given how late the block 708544 arrived (the cache miss occurred 3.787s after the slot start): https://beaconcha.in/block/708545#attestations
2021-03-10 00:20:50 +00:00
Michael Sproul
786e25ea08 Release candidate v1.2.0-rc.0 (#2248)
Prepare for v1.2.0 with this release candidate.

To be merged after #2247 and #2246

Co-authored-by: Age Manning <Age@AgeManning.com>
2021-03-08 06:27:50 +00:00
Age Manning
babd153352 Prevent adding and dialing bootnodes when discovery is disabled (#2247)
This is a small PR which prevents unwanted bootnodes from being added to the DHT and being dialed when the `--disable-discovery` flag is set. 

The main reason one would want to disable discovery is to connect to a fix set of peers. Currently, regardless of what the user does, Lighthouse will populate its DHT with previously known peers and also fill it with the spec's bootnodes. It will then dial the bootnodes that are capable of being dialed. This prevents testing with a fixed peer list.

This PR prevents these excess nodes from being added and dialed if the user has set `--disable-discovery`.
2021-03-08 06:27:49 +00:00
Paul Hauner
e4eb0eb168 Use advanced state for block production (#2241)
## Issue Addressed

NA

## Proposed Changes

- Use the pre-states from #2174 during block production.
    - Running this on Pyrmont shows block production times dropping from ~550ms to ~150ms.
- Create `crit` and `warn` logs when a block is published to the API later than we expect.
    - On mainnet we are issuing a warn if the block is published more than 1s later than the slot start and a crit for more than 3s.
- Rename some methods on the `SnapshotCache` for clarity.
- Add the ability to pass the state root to `BeaconChain::produce_block_on_state` to avoid computing a state root. This is a very common LH optimization.
- Add a metric that tracks how late we broadcast blocks received from the HTTP API. This is *technically* a duplicate of a `ValidatorMonitor` log, but I wanted to have it for the case where we aren't monitoring validators too.
2021-03-04 04:43:31 +00:00
Michael Sproul
363f15f362 Use the database to persist the pubkey cache (#2234)
## Issue Addressed

Closes #1787

## Proposed Changes

* Abstract the `ValidatorPubkeyCache` over a "backing" which is either a file (legacy), or the database.
* Implement a migration from schema v2 to schema v3, whereby the contents of the cache file are copied to the DB, and then the file is deleted. The next release to include this change must be a minor version bump, and we will need to warn users of the inability to downgrade (this is our first DB schema change since mainnet genesis).
* Move the schema migration code from the `store` crate into the `beacon_chain` crate so that it can access the datadir and the `ValidatorPubkeyCache`, etc. It gets injected back into the `store` via a closure (similar to what we do in fork choice).
2021-03-04 01:25:12 +00:00
Age Manning
1c507c588e Update to the latest libp2p (#2239)
Updates to the latest libp2p and ignores RUSTSEC-2020-0146 from cargo-audit


Co-authored-by: Michael Sproul <michael@sigmaprime.io>
2021-03-02 05:59:49 +00:00
realbigsean
ed9b245de0 update tokio-stream to 0.1.3 and use BroadcastStream (#2212)
## Issue Addressed

Resolves #2189 

## Proposed Changes

use tokio's `BroadcastStream`

## Additional Info

N/A


Co-authored-by: realbigsean <seananderson33@gmail.com>
2021-03-01 01:58:05 +00:00
Michael Sproul
2f077b11fe Allow HTTP API to return SSZ blocks (#2209)
## Issue Addressed

Implements https://github.com/ethereum/eth2.0-APIs/pull/125

## Proposed Changes

Optionally return SSZ bytes from the `beacon/blocks` endpoint.
2021-02-24 04:15:14 +00:00
realbigsean
5bc93869c8 Update ValidatorStatus to match the v1 API (#2149)
## Issue Addressed

N/A

## Proposed Changes

We are currently a bit off of the standard API spec because we have [this](https://hackmd.io/bQxMDRt1RbS1TLno8K4NPg?view) proposal implemented for validator status.  Based on discussion [here](https://github.com/ethereum/eth2.0-APIs/pull/94), it looks like this won't be added to the spec until v2, so this PR implements [this](https://hackmd.io/ofFJ5gOmQpu1jjHilHbdQQ) validator status logic instead

## Additional Info

N/A


Co-authored-by: realbigsean <seananderson33@gmail.com>
2021-02-24 04:15:13 +00:00
Paul Hauner
a764c3b247 Handle early blocks (#2155)
## Issue Addressed

NA

## Problem this PR addresses

There's an issue where Lighthouse is banning a lot of peers due to the following sequence of events:

1. Gossip block 0xabc arrives ~200ms early
    - It is propagated across the network, with respect to [`MAXIMUM_GOSSIP_CLOCK_DISPARITY`](https://github.com/ethereum/eth2.0-specs/blob/v1.0.0/specs/phase0/p2p-interface.md#why-is-there-maximum_gossip_clock_disparity-when-validating-slot-ranges-of-messages-in-gossip-subnets).
    - However, it is not imported to our database since the block is early.
2. Attestations for 0xabc arrive, but the block was not imported.
    - The peer that sent the attestation is down-voted.
        - Each unknown-block attestation causes a score loss of 1, the peer is banned at -100.
        - When the peer is on an attestation subnet there can be hundreds of attestations, so the peer is banned quickly (before the missed block can be obtained via rpc).

## Potential solutions

I can think of three solutions to this:

1. Wait for attestation-queuing (#635) to arrive and solve this.
    - Easy
    - Not immediate fix.
    - Whilst this would work, I don't think it's a perfect solution for this particular issue, rather (3) is better.
1. Allow importing blocks with a tolerance of `MAXIMUM_GOSSIP_CLOCK_DISPARITY`.
    - Easy
    - ~~I have implemented this, for now.~~
1. If a block is verified for gossip propagation (i.e., signature verified) and it's within `MAXIMUM_GOSSIP_CLOCK_DISPARITY`, then queue it to be processed at the start of the appropriate slot.
    - More difficult
    - Feels like the best solution, I will try to implement this.
    
    
**This PR takes approach (3).**

## Changes included

- Implement the `block_delay_queue`, based upon a [`DelayQueue`](https://docs.rs/tokio-util/0.6.3/tokio_util/time/delay_queue/struct.DelayQueue.html) which can store blocks until it's time to import them.
- Add a new `DelayedImportBlock` variant to the `beacon_processor::WorkEvent` enum to handle this new event.
- In the `BeaconProcessor`, refactor a `tokio::select!` to a struct with an explicit `Stream` implementation. I experienced some issues with `tokio::select!` in the block delay queue and I also found it hard to debug. I think this explicit implementation is nicer and functionally equivalent (apart from the fact that `tokio::select!` randomly chooses futures to poll, whereas now we're deterministic).
- Add a testing framework to the `beacon_processor` module that tests this new block delay logic. I also tested a handful of other operations in the beacon processor (attns, slashings, exits) since it was super easy to copy-pasta the code from the `http_api` tester.
    - To implement these tests I added the concept of an optional `work_journal_tx` to the `BeaconProcessor` which will spit out a log of events. I used this in the tests to ensure that things were happening as I expect.
    - The tests are a little racey, but it's hard to avoid that when testing timing-based code. If we see CI failures I can revise. I haven't observed *any* failures due to races on my machine or on CI yet.
    - To assist with testing I allowed for directly setting the time on the `ManualSlotClock`.
- I gave the `beacon_processor::Worker` a `Toolbox` for two reasons; (a) it avoids changing tons of function sigs when you want to pass a new object to the worker and (b) it seemed cute.
2021-02-24 03:08:52 +00:00
Paul Hauner
46920a84e8 v1.1.3 (#2217)
## Issue Addressed

NA

## Proposed Changes

Bump versions

## Additional Info

NA
2021-02-22 06:21:38 +00:00
Paul Hauner
4362ea4f98 Fix false positive "State advance too slow" logs (#2218)
## Issue Addressed

- Resolves #2214

## Proposed Changes

Fix the false positive warning log described in #2214.

## Additional Info

NA
2021-02-21 23:47:53 +00:00
Paul Hauner
8949ae7c4e Address ENR update loop (#2216)
## Issue Addressed

- Resolves #2215

## Proposed Changes

Addresses a potential loop when the majority of peers indicate that we are contactable via an IPv6 address.

See https://github.com/sigp/discv5/pull/62 for further rationale.

## Additional Info

The alternative to this PR is to use `--disable-enr-auto-update` and then manually supply an `--enr-address` and `--enr-upd-port`. However, that requires the user to know their IP addresses in order for discovery to work properly. This might not be practical/achievable for some users, hence this hotfix.
2021-02-21 23:47:52 +00:00
Paul Hauner
8c6537e71d v1.1.2 (#2213)
## Issue Addressed

NA

## Proposed Changes

Bump versions

## Additional Info

NA
2021-02-19 00:49:32 +00:00
Paul Hauner
f8cc82f2b1 Switch back to warp with cors wildcard support (#2211)
## Issue Addressed

- Resolves #2204
- Resolves #2205

## Proposed Changes

Switches to my fork of `warp` which contains support for cors wildcards: https://github.com/paulhauner/warp/tree/cors-wildcard

I have a PR open on the `warp` repo but it hasn't had any interest from the maintainers as of yet: https://github.com/seanmonstar/warp/pull/726. I think running from a fork is the best we can do for now.

## Additional Info

NA
2021-02-18 22:33:12 +00:00
Lion - dapplion
613382f304 Add slot offset computing to be downloaded slot (#2198)
The current implementation assumes the range offset of slots downloaded on a batch to equal zero. This conflicts with the condition to consider this chain as sync. For finalized sync, it results in one extra batch being downloaded which can't be processed.

CC @wemeetagain
2021-02-18 08:24:46 +00:00
Paul Hauner
f819ba5414 v1.1.1 (#2202)
## Issue Addressed

NA

## Proposed Changes

Bump versions
2021-02-16 00:09:02 +00:00
Pawan Dhananjay
4a357c9947 Upgrade rand_core (#2201)
## Issue Addressed

N/A

## Proposed Changes

Upgrade `rand_core` to latest version to fix https://rustsec.org/advisories/RUSTSEC-2021-0023
2021-02-15 20:34:49 +00:00
Paul Hauner
88cc222204 Advance state to next slot after importing block (#2174)
## Issue Addressed

NA

## Proposed Changes

Add an optimization to perform `per_slot_processing` from the *leading-edge* of block processing to the *trailing-edge*. Ultimately, this allows us to import the block at slot `n` faster because we used the tail-end of slot `n - 1` to perform `per_slot_processing`.

Additionally, add a "block proposer cache" which allows us to cache the block proposer for some epoch. Since we're now doing trailing-edge `per_slot_processing`, we can prime this cache with the values for the next epoch before those blocks arrive (assuming those blocks don't have some weird forking).

There were several ancillary changes required to achieve this: 

- Remove the `state_root` field  of `BeaconSnapshot`, since there's no need to know it on a `pre_state` and in all other cases we can just read it from `block.state_root()`.
    - This caused some "dust" changes of `snapshot.beacon_state_root` to `snapshot.beacon_state_root()`, where the `BeaconSnapshot::beacon_state_root()` func just reads the state root from the block.
- Rename `types::ShuffingId` to `AttestationShufflingId`. I originally did this because I added a `ProposerShufflingId` struct which turned out to be not so useful. I thought this new name was more descriptive so I kept it.
- Address https://github.com/ethereum/eth2.0-specs/pull/2196
- Add a debug log when we get a block with an unknown parent. There was previously no logging around this case.
- Add a function to `BeaconState` to compute all proposers for an epoch without re-computing the active indices for each slot.

## Additional Info

- ~~Blocked on #2173~~
- ~~Blocked on #2179~~ That PR was wrapped into this PR.
- There's potentially some places where we could avoid computing the proposer indices in `per_block_processing` but I haven't done this here. These would be an optimization beyond the issue at hand (improving block propagation times) and I think this PR is already doing enough. We can come back for that later.

## TODO

- [x] Tidy, improve comments.
- [x] ~~Try avoid computing proposer index in `per_block_processing`?~~
2021-02-15 07:17:52 +00:00
Paul Hauner
3000f3e5da Dht persistence on drop (v2) (#2200)
## Issue Addressed

NA

## Proposed Changes

This is simply #2177 with a merge conflict fixed.

Co-authored-by: realbigsean <seananderson33@gmail.com>
2021-02-15 06:09:55 +00:00
Paul Hauner
8e5c20b6d1 Update for clippy 1.50 (#2193)
## Issue Addressed

NA

## Proposed Changes

Rust 1.50 has landed 🎉

The shiny new `clippy` peers down upon us mere mortals with disgust. Brutish peasants wrapping our `usize`s in superfluous `Option`s... tsk tsk.

I've performed the goat sacrifice and corrected our evil ways in this PR. Tonight we shall pray that Github Actions bestows the almighty green tick upon us.

## Additional Info

NA


Co-authored-by: realbigsean <seananderson33@gmail.com>
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
2021-02-15 00:09:12 +00:00
realbigsean
e20f64b21a Update to tokio 1.1 (#2172)
## Issue Addressed

resolves #2129
resolves #2099 
addresses some of #1712
unblocks #2076
unblocks #2153 

## Proposed Changes

- Updates all the dependencies mentioned in #2129, except for web3. They haven't merged their tokio 1.0 update because they are waiting on some dependencies of their own. Since we only use web3 in tests, I think updating it in a separate issue is fine. If they are able to merge soon though, I can update in this PR. 

- Updates `tokio_util` to 0.6.2 and `bytes` to 1.0.1.

- We haven't made a discv5 release since merging tokio 1.0 updates so I'm using a commit rather than release atm. **Edit:** I think we should merge an update of `tokio_util` to 0.6.2 into discv5 before this release because it has panic fixes in `DelayQueue`  --> PR in discv5:  https://github.com/sigp/discv5/pull/58

## Additional Info

tokio 1.0 changes that required some changes in lighthouse:

- `interval.next().await.is_some()` -> `interval.tick().await`
- `sleep` future is now `!Unpin` -> https://github.com/tokio-rs/tokio/issues/3028
- `try_recv` has been temporarily removed from `mpsc` -> https://github.com/tokio-rs/tokio/issues/3350
- stream features have moved to `tokio-stream` and `broadcast::Receiver::into_stream()` has been temporarily removed -> `https://github.com/tokio-rs/tokio/issues/2870
- I've copied over the `BroadcastStream` wrapper from this PR, but can update to use `tokio-stream` once it's merged https://github.com/tokio-rs/tokio/pull/3384

Co-authored-by: realbigsean <seananderson33@gmail.com>
2021-02-10 23:29:49 +00:00
Paul Hauner
e383ef3e91 Avoid temp allocations with slog (#2183)
## Issue Addressed

Which issue # does this PR address?

## Proposed Changes

Replaces use of `format!` in `slog` logging with it's special no-allocation `?` and `%` shortcuts. According to a `heaptrack` analysis today over about a period of an hour, this will reduce temporary allocations by at least 4%.

## Additional Info

NA
2021-02-04 07:31:47 +00:00
Paul Hauner
ff35fbb121 Add metrics for beacon block propagation (#2173)
## Issue Addressed

NA

## Proposed Changes

Adds some metrics to track delays regarding:

- LH processing of blocks
- delays receiving blocks from other nodes.

## Additional Info

NA
2021-02-04 05:33:56 +00:00
Akihito Nakano
1a22a096c6 Fix clippy errors on tests (#2160)
## Issue Addressed

There are some clippy error on tests.


## Proposed Changes

Enable clippy check on tests and fix the errors. 💪
2021-01-28 23:31:06 +00:00
Paul Hauner
e4b62139d7 v1.1.0 (#2168)
## Issue Addressed

NA

## Proposed Changes

- Bump version
- ~~Run `cargo update`~~

## Additional Info

NA
2021-01-21 02:37:08 +00:00
Paul Hauner
2b2a358522 Detailed validator monitoring (#2151)
## Issue Addressed

- Resolves #2064

## Proposed Changes

Adds a `ValidatorMonitor` struct which provides additional logging and Grafana metrics for specific validators.

Use `lighthouse bn --validator-monitor` to automatically enable monitoring for any validator that hits the [subnet subscription](https://ethereum.github.io/eth2.0-APIs/#/Validator/prepareBeaconCommitteeSubnet) HTTP API endpoint.

Also, use `lighthouse bn --validator-monitor-pubkeys` to supply a list of validators which will always be monitored.

See the new docs included in this PR for more info.

## TODO

- [x] Track validator balance, `slashed` status, etc.
- [x] ~~Register slashings in current epoch, not offense epoch~~
- [ ] Publish Grafana dashboard, update TODO link in docs
- [x] ~~#2130 is merged into this branch, resolve that~~
2021-01-20 19:19:38 +00:00
Paul Hauner
1eb0915301 Fix bug from #2163 (#2165)
## Issue Addressed

NA

## Proposed Changes

Fixes a bug that I missed during a review in #2163. I found this bug by observing that nodes were receiving far less attestations (~1/2 of previous).

I'm not certain on *exactly* how this mistake manifested in a reduction in attestations, but the mistake touches so much code that I think it's reasonable to declare that this it the cause of the observed issue (drop in attestations).

## Additional Info

NA
2021-01-20 10:28:12 +00:00
Paul Hauner
b06559ae97 Disallow attestation production earlier than head (#2130)
## Issue Addressed

The non-finality period on Pyrmont between epochs [`9114`](https://pyrmont.beaconcha.in/epoch/9114) and [`9182`](https://pyrmont.beaconcha.in/epoch/9182) was contributed to by all the `lighthouse_team` validators going down. The nodes saw excessive CPU and RAM usage, resulting in the system to kill the `lighthouse bn` process. The `Restart=on-failure` directive for `systemd` caused the process to bounce in ~10-30m intervals.

Diagnosis with `heaptrack` showed that the `BeaconChain::produce_unaggregated_attestation` function was calling `store::beacon_state::get_full_state` and sometimes resulting in a tree hash cache allocation. These allocations were approximately the size of the hosts physical memory and still allocated when `lighthouse bn` was killed by the OS.

There was no CPU analysis (e.g., `perf`), but the `BeaconChain::produce_unaggregated_attestation` is very CPU-heavy so it is reasonable to assume it is the cause of the excessive CPU usage, too.

## Proposed Changes

`BeaconChain::produce_unaggregated_attestation` has two paths:

1. Fast path: attesting to the head slot or later.
2. Slow path: attesting to a slot earlier than the head block.

Path (2) is the only path that calls `store::beacon_state::get_full_state`, therefore it is the path causing this excessive CPU/RAM usage.

This PR removes the current functionality of path (2) and replaces it with a static error (`BeaconChainError::AttestingPriorToHead`).

This change reduces the generality of `BeaconChain::produce_unaggregated_attestation` (and therefore [`/eth/v1/validator/attestation_data`](https://ethereum.github.io/eth2.0-APIs/#/Validator/produceAttestationData)), but I argue that this functionality is an edge-case and arguably a violation of the [Honest Validator spec](https://github.com/ethereum/eth2.0-specs/blob/dev/specs/phase0/validator.md).

It's possible that a validator goes back to a prior slot to "catch up" and submit some missed attestations. This change would prevent such behaviour, returning an error. My concerns with this catch-up behaviour is that it is:

- Not specified as "honest validator" attesting behaviour.
- Is behaviour that is risky for slashing (although, all validator clients *should* have slashing protection and will eventually fail if they do not).
- It disguises clock-sync issues between a BN and VC.

## Additional Info

It's likely feasible to implement path (2) if we implement some sort of caching mechanism. This would be a multi-week task and this PR gets the issue patched in the short term. I haven't created an issue to add path (2), instead I think we should implement it if we get user-demand.
2021-01-20 06:52:37 +00:00
Paul Hauner
d9f940613f Represent slots in secs instead of millisecs (#2163)
## Issue Addressed

NA

## Proposed Changes

Copied from #2083, changes the config milliseconds_per_slot to seconds_per_slot to avoid errors when slot duration is not a multiple of a second. To avoid deserializing old serialized data (with milliseconds instead of seconds) the Serialize and Deserialize derive got removed from the Spec struct (isn't currently used anyway).

This PR replaces #2083 for the purpose of fixing a merge conflict without requiring the input of @blacktemplar.

## Additional Info

NA


Co-authored-by: blacktemplar <blacktemplar@a1.net>
2021-01-19 09:39:51 +00:00
Paul Hauner
805e152f66 Simplify enum -> str with strum (#2164)
## Issue Addressed

NA

## Proposed Changes

As per #2100, uses derives from the sturm library to implement AsRef<str> and AsStaticRef to easily get str values from enums without creating new Strings. Furthermore unifies all attestation error counter into one IntCounterVec vector.

These works are originally by @blacktemplar, I've just created this PR so I can resolve some merge conflicts.

## Additional Info

NA


Co-authored-by: blacktemplar <blacktemplar@a1.net>
2021-01-19 06:33:58 +00:00
realbigsean
7a71977987 Clippy 1.49.0 updates and dht persistence test fix (#2156)
## Issue Addressed

`test_dht_persistence` failing

## Proposed Changes

Bind `NetworkService::start` to an underscore prefixed variable rather than `_`.  `_` was causing it to be dropped immediately

This was failing 5/100 times before this update, but I haven't been able to get it to fail after updating it

Co-authored-by: realbigsean <seananderson33@gmail.com>
2021-01-19 00:34:28 +00:00
Pawan Dhananjay
28238d97b1 Disconnect from peers quicker on internet issues (#2147)
## Issue Addressed

Fixes #2146 

## Proposed Changes

Change ping timeout errors to return `LowToleranceErrors` so that we disconnect faster on internet failures/changes.
2021-01-13 08:09:10 +00:00
realbigsean
423dea169c update smallvec (#2152)
## Issue Addressed

`cargo audit` is failing because of a potential for an overflow in the version of `smallvec` we're using

## Proposed Changes

Update to the latest version of `smallvec`, which has the fix


Co-authored-by: realbigsean <seananderson33@gmail.com>
2021-01-11 23:32:11 +00:00
Arthur Woimbée
851a4dca3c replace tempdir by tempfile (#2143)
## Issue Addressed

Fixes #2141 
Remove [tempdir](https://docs.rs/tempdir/0.3.7/tempdir/) in favor of [tempfile](https://docs.rs/tempfile/3.1.0/tempfile/).

## Proposed Changes

`tempfile` has a slightly different api that makes creating temp folders with a name prefix a chore (`tempdir::TempDir::new("toto")` => `tempfile::Builder::new().prefix("toto").tempdir()`).

So I removed temp folder name prefix where I deemed it not useful.

Otherwise, the functionality is the same.
2021-01-06 06:36:11 +00:00
Age Manning
7e4b190df0 Reduce ping interval (#2132)
## Issue Addressed

#2123

## Description

Reduces the TCP ping interval to increase our responsiveness to peer liveness changes.
2021-01-06 04:35:52 +00:00
realbigsean
588b90157d Ssz state api endpoint (#2111)
## Issue Addressed

Catching up to a recently merged API spec PR: https://github.com/ethereum/eth2.0-APIs/pull/119

## Proposed Changes

- Return an SSZ beacon state on `/eth/v1/debug/beacon/states/{stateId}` when passed this header: `accept: application/octet-stream`.
- requests to this endpoint with no  `accept` header or an `accept` header and a value of `application/json` or `*/*` , or will result in a JSON response

## Additional Info


Co-authored-by: realbigsean <seananderson33@gmail.com>
2021-01-06 03:01:46 +00:00
Samuel E. Moelius
939fa717fd test_decode_malicious_status_message improvements (#2104)
## Issue Addressed

None

## Proposed Changes

* Correct typo in one comment, elaborate some others.
* Add asserts to ensure comments match code.
* Eliminate one unnecessary `clone`.

## Additional Info

None
2021-01-06 01:10:26 +00:00
Samuel E. Moelius
0245ddd37b Fix typo in ssz_snappy.rs comment (#2103)
## Issue Addressed

None

## Proposed Changes

Correct a typo in `ssz_snappy.rs`.

## Additional Info

Pedantry at it finest.
2021-01-06 01:10:24 +00:00
Paul Hauner
f183af20e3 Version v1.0.6 (#2126)
## Issue Addressed

NA

## Proposed Changes

- Bump versions
- Run `cargo update`

## Additional Info

NA
2020-12-28 23:38:02 +00:00
Akihito Nakano
78d17c3255 Tweak error messages for ease of investigation (#2122)
## Proposed Changes

<!-- Please list or describe the changes introduced by this PR. -->

Tweaked the error message for ease of investigation as `Failed to update eth1 cache` is used in multiple places. 😃
2020-12-28 01:25:33 +00:00
Paul Hauner
9ed65a64f8 Version v1.0.5 (#2117)
## Issue Addressed

NA

## Proposed Changes

- Bump versions to `v1.0.5`
- Run `cargo update`

## Additional Info

NA
2020-12-23 18:52:48 +00:00
Age Manning
2931b05582 Update libp2p (#2101)
This is a little bit of a tip-of-the-iceberg PR. It houses a lot of code changes in the libp2p dependency. 

This needs a bit of thorough testing before merging. 

The primary code changes are:
- General libp2p dependency update
- Gossipsub refactor to shift compression into gossipsub providing performance improvements and improved API for handling compression



Co-authored-by: Paul Hauner <paul@paulhauner.com>
2020-12-23 07:53:36 +00:00
Samuel E. Moelius
3381266998 Eliminate uses of expect in ssz_snappy.rs (#2105)
## Issue Addressed

None

## Proposed Changes

Eliminate three uses of `expect` in `ssz_snappy.rs`.

## Additional Info

None
2020-12-22 02:28:37 +00:00
Michael Sproul
e5bf2576f1 Optimise tree hash caching for block production (#2106)
## Proposed Changes

`@potuz` on the Eth R&D Discord observed that Lighthouse blocks on Pyrmont were always arriving at other nodes after at least 1 second. Part of this could be due to processing and slow propagation, but metrics also revealed that the Lighthouse nodes were usually taking 400-600ms to even just produce a block before broadcasting it.

I tracked the slowness down to the lack of a pre-built tree hash cache (THC) on the states being used for block production. This was due to using the head state for block production, which lacks a THC in order to keep fork choice fast (cloning a THC takes at least 30ms for 100k validators). This PR modifies block production to clone a state from the snapshot cache rather than the head, which speeds things up by 200-400ms by avoiding the tree hash cache rebuild. In practice this seems to have cut block production time down to 300ms or less. Ideally we could _remove_ the snapshot from the cache (and save the 30ms), but it is required for when we re-process the block after signing it with the validator client.

## Alternatives

I experimented with 2 alternatives to this approach, before deciding on it:

* Alternative 1: ensure the `head` has a tree hash cache. This is too slow, as it imposes a +30ms hit on fork choice, which currently takes ~5ms (with occasional spikes).
* Alternative 2: use `Arc<BeaconSnapshot>` in the snapshot cache and share snapshots between the cache and the `head`. This made fork choice blazing fast (1ms), and block production the same as in this PR, but had a negative impact on block processing which I don't think is worth it. It ended up being necessary to clone the full state from the snapshot cache during block production, imposing the +30ms penalty there _as well_ as in block production.

In contract, the approach in this PR should only impact block production, and it improves it! Yay for pareto improvements 🎉

## Additional Info

This commit (ac59dfa) is currently running on all the Lighthouse Pyrmont nodes, and I've added a dashboard to the Pyrmont grafana instance with the metrics.

In future work we should optimise the attestation packing, which consumes around 30-60ms and is now a substantial contributor to the total.
2020-12-21 06:29:39 +00:00
Paul Hauner
a62dc65ca4 BN Fallback v2 (#2080)
## Issue Addressed

- Resolves #1883

## Proposed Changes

This follows on from @blacktemplar's work in #2018.

- Allows the VC to connect to multiple BN for redundancy.
  - Update the simulator so some nodes always need to rely on their fallback.
- Adds some extra deprecation warnings for `--eth1-endpoint`
- Pass `SignatureBytes` as a reference instead of by value.

## Additional Info

NA

Co-authored-by: blacktemplar <blacktemplar@a1.net>
2020-12-18 09:17:03 +00:00
Pawan Dhananjay
f998eff7ce Subnet discovery fixes (#2095)
## Issue Addressed

N/A

## Proposed Changes

Fixes multiple issues related to discovering of subnet peers.
1. Subnet discovery retries after yielding no results
2. Metadata updates if peer send older metadata
3. peerdb stores the peer subscriptions from gossipsub
2020-12-17 00:39:15 +00:00
blacktemplar
3fcc517993 Fix Syncing Simulator (#2049)
## Issue Addressed

NA

## Proposed Changes

Fixes problems with slot times below 1 second which got revealed by running the syncing simulator with the default speedup time.
2020-12-16 05:37:38 +00:00
Michael Sproul
0c529b8d52 Add slasher broadcast (#2079)
## Issue Addressed

Closes #2048

## Proposed Changes

* Broadcast slashings when the `--slasher-broadcast` flag is provided.
* In the process of implementing this I refactored the slasher service into its own crate so that it could access the network code without creating a circular dependency. I moved the responsibility for putting slashings into the op pool into the service as well, as it makes sense for it to handle the whole slashing lifecycle.
2020-12-16 03:44:01 +00:00
Pawan Dhananjay
63eeb14a81 Improve eth1 fallback logging (#2096)
## Issue Addressed

N/A

## Proposed Changes

There seemed to be confusion among discord users on the eth1 fallback logging
```
WARN Error connecting to eth1 node. Trying fallback ..., endpoint: http://127.0.0.1:8545/, service: eth1_rpc
```
The assumption users seem to be making here is that it is trying the fallback and fallback=endpoint in the log.

This PR improves the logging to be like
```
WARN Error connecting to eth1 node endpoint, endpoint: http://127.0.0.1:8545/, action: trying fallbacks, service: eth1_rpc
```

I think this is a bit more clear that the endpoint that failed is the one in the log.
2020-12-16 02:39:09 +00:00
divma
11c299cbf6 impl Resource Unavailable RPC error (#2072)
## Issue Addressed

Related to #1891, The error is not in the spec yet (see ethereum/eth2.0-specs#2131)

## Proposed Changes

Implement the proposed error, banning peers that send it

## Additional Info

NA
2020-12-15 00:17:32 +00:00
blacktemplar
701843aaa0 Update dependencies (#2084)
## Issue Addressed

Partially addresses dependencies mentioned in issue #1712.

## Proposed Changes

Updates dependencies (including an update avoiding a vulnerability) + add tokio compatibility to `remote_signer_test`
2020-12-14 02:28:19 +00:00
Michael Sproul
1abc70e815 Version v1.0.4 (#2073)
## Proposed Changes

Run cargo update and bump version in prep for v1.0.4 release

## Additional Info

Planning to merge this commit to `unstable`, test on Pyrmont and canary nodes, then push to `stable`.
2020-12-10 04:01:40 +00:00
Age Manning
dfb588e521 Softer penalties for missing blocks (#2075)
## Issue Addressed

Users are reporting errors for sending attestations to peers. If the clock sync is a little out or we receive attestations before blocks, peers are being too harshly penalized. They can get scored many times per missing block and we typically need these peers on subnets. 


## Proposed Changes

This removes the penalization for missing blocks with attestations. The penalty should be handled when #635 gets built as it will allow us to group attestations per missing block and penalize once.
2020-12-10 00:40:12 +00:00
Michael Sproul
aa45fa3ff7 Revert fork choice if disk write fails (#2068)
## Issue Addressed

Closes #2028
Replaces #2059

## Proposed Changes

If writing to the database fails while importing a block, revert fork choice to the last version stored on disk. This prevents fork choice from being ahead of the blocks on disk. Having fork choice ahead is particularly bad if it is later successfully written to disk, because it renders the database corrupt (see #2028).

## Additional Info

* This mitigation might fail if the head+fork choice haven't been persisted yet, which can only happen at first startup (see #2067)
* This relies on it being OK for the head tracker to be ahead of fork choice. I figure this is tolerable because blocks only get added to the head tracker after successfully being written on disk _and_ to fork choice, so even if fork choice reverts a little bit, when the pruning algorithm runs, those blocks will still be on disk and OK to prune. The pruning algorithm also doesn't rely on heads being unique, technically it's OK for multiple blocks from the same linear chain segment to be present in the head tracker. This begs the question of #1785 (i.e. things would be simpler with the head tracker out of the way). Alternatively, this PR could just revert the head tracker as well (I'll look into this tomorrow).
2020-12-09 05:10:34 +00:00
Michael Sproul
82753f842d Improve compile time (#1989)
## Issue Addressed

Closes #1264

## Proposed Changes

* Milagro BLS: tweak the feature flags so that Milagro doesn't get compiled if we're using BLST. Profiling showed that it was consuming about 1 minute of CPU time out of 60 minutes of CPU time (real time ~15 mins). A 1.6% saving.
* Reduce monomorphization: compiling for 3 different `EthSpec` types causes a heck of a lot of generic functions to be instantiated (monomorphized). Removing 2 of 3 cuts the LLVM+linking step from around 250 seconds to 180 seconds, a saving of 70 seconds (real time!). This applies only to `make` and not the CI build, because we test with the minimal spec on CI.
* Update `web3` crate to v0.13. This is perhaps the most controversial change, because it requires axing some deposit contract tools from `lcli`. I suspect these tools weren't used much anyway, and could be maintained separately, but I'm also happy to revert this change. However, it does save us a lot of compile time. With #1839, we now have 3 versions of Tokio (and all of Tokio's deps). This change brings us down to 2 versions, but 1 should be achievable once web3 (and reqwest) move to Tokio 0.3.
* Remove `lcli` from the Docker image. It's a dev tool and can be built from the repo if required.
2020-12-09 01:34:58 +00:00
Age Manning
4f85371ce8 Downgrades a valid log (#2057)
## Issue Addressed

#2046 

## Proposed Changes

The log was originally intended to verify the correct logic and ordering of events when scoring peers. The queued tasks can be structured in such a way that peers can be banned after they are disconnected. Therefore the error log is now downgraded to  debug log.
2020-12-08 10:48:45 +00:00
divma
57489e620f fix default network handling (#2029)
## Issue Addressed
#1992 and #1987, and also to be considered a continuation of #1751

## Proposed Changes
many changed files but most are renaming to align the code with the semantics of `--network` 
- remove the `--network` default value (in clap) and instead set it after checking the `network` and `testnet-dir` flags
- move `eth2_testnet_config` crate to `eth2_network_config`
- move `Eth2TestnetConfig` to `Eth2NetworkConfig`
- move `DEFAULT_HARDCODED_TESTNET` to `DEFAULT_HARDCODED_NETWORK`
- `beacon_node`s `get_eth2_testnet_config` loads the `DEFAULT_HARDCODED_NETWORK` if there is no network nor testnet provided
- `boot_node`s config loads the config same as the `beacon_node`, it was using the configuration only for preconfigured networks (That code is ~1year old so I asume it was not intended)
- removed a one year old comment stating we should try to emulate `https://github.com/eth2-clients/eth2-testnets/tree/master/nimbus/testnet1` it looks outdated (?)
- remove `lighthouse`s `load_testnet_config` in favor of `get_eth2_network_config` to centralize that logic (It had differences)
- some spelling

## Additional Info
Both the command of #1992 and the scripts of #1987 seem to work fine, same as `bn` and `vc`
2020-12-08 05:41:10 +00:00
divma
f3200784b4 More metrics + RPC tweaks (#2041)
## Issue Addressed

NA

## Proposed Changes
This was mostly done to find the reason why LH was dropping peers from Nimbus. It proved to be useful so I think it's worth it. But there is also some functional stuff here
- Add metrics for rpc errors per client, error type and direction
- Add metrics for downscoring events per source type, client and penalty type
- Add metrics for gossip validation results per client for non-accepted messages
- Make the RPC handler return errors and requests/responses in the order we see them
- Allow a small burst for the Ping rate limit, from 1 every 5 seconds to 2 every 10 seconds
- Send rate limiting errors with a particular code and use that same code to identify them. I picked something different to 128 since that is most likely what other clients are using for their own errors
- Remove some unused code in the `PeerAction` and the rpc handler
- Remove the unused variant `RateLimited`. tTis was never produced directly, since the only way to get the request's protocol is via de handler. The handler upon receiving from LH a response with an error (rate limited in this case) emits this event with the missing info (It was always like this, just pointing out that we do downscore rate limiting errors regardless of the change)

Metrics for Nimbus looked like this:
Downscoring events: `increase(libp2p_peer_actions_per_client{client="Nimbus"}[5m])`
![image](https://user-images.githubusercontent.com/26765164/101210880-862bf280-3676-11eb-94c0-399f0bf5aa2e.png)

RPC Errors: `increase(libp2p_rpc_errors_per_client{client="Nimbus"}[5m])`
![image](https://user-images.githubusercontent.com/26765164/101210997-ba071800-3676-11eb-847a-f32405ede002.png)

Unaccepted gossip message: `increase(gossipsub_unaccepted_messages_per_client{client="Nimbus"}[5m])`
![image](https://user-images.githubusercontent.com/26765164/101211124-f470b500-3676-11eb-9459-132ecff058ec.png)
2020-12-08 03:55:50 +00:00
blacktemplar
a28e8decbf update dependencies (#2032)
## Issue Addressed

NA

## Proposed Changes

Updates out of date dependencies.

## Additional Info

See also https://github.com/sigp/lighthouse/issues/1712 for a list of dependencies that are still out of date and the resasons.
2020-12-07 08:20:33 +00:00
Michael Sproul
c1ec386d18 Pass failed gossip blocks to the slasher (#2047)
## Issue Addressed

Closes #2042

## Proposed Changes

Pass blocks that fail gossip verification to the slasher. Blocks that are successfully verified are not passed immediately, but will be passed as part of full block verification.
2020-12-04 05:03:30 +00:00
Pawan Dhananjay
7933596c89 Add a purge-eth1-cache cli option (#2039)
## Issue

Some eth1 clients are missing deposit logs on mainnet for multiple reasons (not fully synced, eth1 client issues) because of which we are getting `FailedToInsertDeposit` errors.
Ideally, LH should pick up where it left off after pointing it to a nice eth1 client endpoint (which has all deposits). 

However, I have seen instances where LH keeps getting `FailedToInsertDeposit` even after switching to a good endpoint. Only deleting the beacon directory (which also wipes the eth1 cache) and resyncing the eth1 caches seems to be the solution. This wouldn't be great for mainnet if you have to sync your beacon node again as well.

## Proposed Changes

Add a `--purge-eth1-db` option which just wipes the eth1 cache and doesn't touch the rest of the beacon db. 
Still need to investigate if and why LH isn't picking up where it left off for the deposit logs sync, but I think it would be good to have an option to just delete eth1 caches regardless.
2020-12-04 05:03:28 +00:00
realbigsean
fdfb81a74a Server sent events (#1920)
## Issue Addressed

Resolves #1434 (this is the last major feature in the standard spec. There are only a couple of places we may be off-spec due to recent spec changes or ongoing discussion)
Partly addresses #1669
 
## Proposed Changes

- remove the websocket server
- remove the `TeeEventHandler` and `NullEventHandler` 
- add server sent events according to the eth2 API spec

## Additional Info

This is according to the currently unmerged PR here: https://github.com/ethereum/eth2.0-APIs/pull/117


Co-authored-by: realbigsean <seananderson33@gmail.com>
2020-12-04 00:18:58 +00:00
realbigsean
2b5c0df9e5 Validators endpoint status code (#2040)
## Issue Addressed

Resolves #2035 

## Proposed Changes

Update 405's to 400's for failures when we are parsing path params.

## Additional Info

Haven't updated the same for non-standard endpoints

Co-authored-by: realbigsean <seananderson33@gmail.com>
2020-12-03 23:10:08 +00:00
Age Manning
2682f46025 Fingerprint new client identify agent string (#2027)
Nimbus have modified their identify agent string. 

This PR adds their new agent string to identify new nimbus peers.
2020-12-03 22:07:14 +00:00
Pawan Dhananjay
482695142a Minor fixes (#2038)
Fixes a couple of low hanging fruits.

- Fixes #2037 
- `validators-dir` and `secrets-dir` flags don't really need to depend upon each other
- Fixes #2006 and Fixes #1995
2020-12-03 01:10:28 +00:00
blacktemplar
d8cda2d86e Fix new clippy lints (#2036)
## Issue Addressed

NA

## Proposed Changes

Fixes new clippy lints in the whole project (mainly [manual_strip](https://rust-lang.github.io/rust-clippy/master/index.html#manual_strip) and [unnecessary_lazy_evaluations](https://rust-lang.github.io/rust-clippy/master/index.html#unnecessary_lazy_evaluations)). Furthermore, removes `to_string()` calls on literals when used with the `?`-operator.
2020-12-03 01:10:26 +00:00
Paul Hauner
b8bd80d2fb Add Content-Type to metrics server (#2019)
## Issue Addressed

- Resolves #2013

## Proposed Changes

Adds the `Content-Type text/plain` header as per #2013

## Additional Info

NA
2020-12-01 00:04:46 +00:00
Paul Hauner
65dcdc361b Bump version to v1.0.3 (#2024)
## Issue Addressed

NA

## Proposed Changes

- Set version to `v1.0.3`
- Run cargo update

## Additional Info

- ~~Blocked on #2008~~
2020-11-30 22:55:10 +00:00
Age Manning
c718e81eaf Add privacy option (#2016)
Adds a `--privacy` CLI flag to the beacon node that users may opt into. 

This does two things:
- Removes client identifying information from the identify libp2p protocol
- Changes the default graffiti to "" if no graffiti is set.
2020-11-30 22:55:08 +00:00
Paul Hauner
77f3539654 Improve eth1 block sync (#2008)
## Issue Addressed

NA

## Proposed Changes

- Log about eth1 whilst waiting for genesis.
- For the block and deposit caches, update them after each download instead of when *all* downloads are complete.
  - This prevents the case where a single timeout error can cause us to drop *all* previously download blocks/deposits.
- Set `max_log_requests_per_update` to avoid timeouts due to very large log counts in a response.
- Set `max_blocks_per_update` to prevent a single update of the block cache to download an unreasonable number of blocks.
  - This shouldn't have any affect in normal use, it's just a safe-guard against bugs.
- Increase the timeout for eth1 calls from 15s to 60s, as per @pawanjay176's experience with Infura.

## Additional Info

NA
2020-11-30 20:29:17 +00:00
divma
8fcd22992c No string in slog (#2017)
## Issue Addressed

Following slog's documentation, this should help a bit with string allocations. I left it run for two days and mem usage is lower. This is of course anecdotal, but shouldn't harm anyway 

## Proposed Changes

remove `String` creation in logs when possible
2020-11-30 10:33:00 +00:00
Paul Hauner
85e69249e6 Drop discovery log to trace (#2007)
## Issue Addressed

NA

## Proposed Changes

This was causing:

```
Nov 28 21:56:08.154 ERRO slog-async: logger dropped messages due to channel overflow, count: 44, service: libp2p
```

## Additional Info

NA
2020-11-29 03:02:23 +00:00
Age Manning
f7183098ee Bump to version v1.0.2 (#2001)
Update lighthouse to version `v1.0.2`. 

There are two major updates in this version:
- Updates to the task executor to tokio 0.3 and all sub-dependencies relying on core execution, including libp2p
- Update BLST
2020-11-28 13:22:37 +00:00
Age Manning
a567f788bd Upgrade to tokio 0.3 (#1839)
## Description

This PR updates Lighthouse to tokio 0.3. It includes a number of dependency updates and some structural changes as to how we create and spawn tasks.

This also brings with it a number of various improvements:

- Discv5 update
- Libp2p update
- Fix for recompilation issues
- Improved UPnP port mapping handling
- Futures dependency update
- Log downgrade to traces for rejecting peers when we've reached our max



Co-authored-by: blacktemplar <blacktemplar@a1.net>
2020-11-28 05:30:57 +00:00
Paul Hauner
5a3b94cbb4
Update to v1.0.1, run cargo update 2020-11-27 21:16:59 +11:00
blacktemplar
38b15deccb Fallback nodes for eth1 access (#1918)
## Issue Addressed

part of  #1883

## Proposed Changes

Adds a new cli argument `--eth1-endpoints` that can be used instead of `--eth1-endpoint` to specify a comma-separated list of endpoints. If the first endpoint returns an error for some request the other endpoints are tried in the given order.

## Additional Info

Currently if the first endpoint fails the fallbacks are used silently (except for `try_fallback_test_endpoint` that is used in `do_update` which logs a `WARN` for each endpoint that is not reachable). A question is if we should add more logs so that the user gets warned if his main endpoint is for example just slow and sometimes hits timeouts.
2020-11-27 08:37:44 +00:00
Michael Sproul
1312844f29 Disable snappy in LevelDB to fix build issues (#1983)
## Proposed Changes

A user on Discord reported build issues when trying to compile Lighthouse checked out to a path with spaces in it. I've fixed the issue upstream in `leveldb-sys` (https://github.com/skade/leveldb-sys/pull/22), but rather than waiting for a new release of the `leveldb` crate, we can also work around the issue by disabling Snappy in LevelDB, which we weren't using anyway.

This may also have the side-effect of slightly improving compilation times, as LevelDB+Snappy was found to be a substantial contributor to build time (although I'm not sure how much was LevelDB and how much was Snappy).
2020-11-27 03:01:57 +00:00
Pawan Dhananjay
0589a14afe Log better error message (#1981)
## Issue Addressed

Fixes #1965 

## Proposed Changes

Log an error and don't update eth1 caches if `chain_id = 0`
2020-11-26 23:13:46 +00:00
divma
fc07cc3fdf Sync metrics (#1975)
## Issue Addressed
- Add metrics to keep track of peer counts by sync type
- Add metric to keep track of the number of syncing chains in range

## Proposed Changes
Plugin to the network metrics update interval and update too the counts for peers wrt to their sync status with us

## Additional Info
For the peer counts
- By the way it is implemented the numbers won't always match to the total peer count in the `libp2p` metric.
- Updating the gauge with every change is messy because it requires to be updated on connection (in the `eth2_libp2p` crate, while metrics are defined in the `network` crate) on Goodbye sent (for an `IrrelevantPeer`) either in the `beacon_processor` or the `peer_manager`, and on disconnection. Since this is not a critical metric I think counting once every second is enough. If you think more accuracy is needed we can do it too, but it would be harder to maintain)

ATM those look like this
![image](https://user-images.githubusercontent.com/26765164/100275387-22137b00-2f60-11eb-93b9-94b0f265240c.png)
2020-11-26 05:23:17 +00:00
Paul Hauner
26741944b1 Add metrics to VC (#1954)
## Issue Addressed

NA

## Proposed Changes

- Adds a HTTP server to the VC which provides Prometheus metrics.
- Moves the health metrics into the `lighthouse_metrics` crate so it can be shared between BN/VC.
- Sprinkle some metrics around the VC.
- Update the book to indicate that we now have VC metrics.
- Shifts the "waiting for genesis" logic later in the `ProductionValidatorClient::new_from_cli`
  - This is worth attention during the review.

## Additional Info

- ~~`clippy` has some new lints that are failing. I'll deal with that in another PR.~~
2020-11-26 01:10:51 +00:00
divma
3b4afc27bf Status race condition (#1967)
## Issue Addressed

Sync stalls due to race conditions between dc notifications and status processing
2020-11-25 02:15:38 +00:00
Paul Hauner
c6baa0eed1
Bump to v1.0.0, run cargo update 2020-11-25 02:02:19 +11:00
Age Manning
a96893744c
Update bootnodes and boot_node cli (#1961) 2020-11-25 02:01:37 +11:00
divma
6f890c398e Sync Bug fixes (#1950)
## Issue Addressed

Two issues related to empty batches
- Chain target's was not being advanced when the batch was successful, empty and the chain didn't have an optimistic batch
- Not switching finalized chains. We now switch finalized chains requiring a minimum work first
2020-11-24 02:11:31 +00:00
Paul Hauner
21617aa87f Change --testnet flag to --network (#1751)
## Issue Addressed

- Resolves #1689

## Proposed Changes

TBC

## Additional Info

NA
2020-11-23 23:54:03 +00:00
Michael Sproul
7d644103c6 Tweak slasher DB schema and pruning (#1948)
## Issue Addressed

Resolves #1890

## Proposed Changes

Change the slasher database schema to key indexed attestations by `(target_epoch, indexed_attestation_root)` instead of just `indexed_attestation_root`. This allows more straight-forward pruning (linear scan), that is also "re-entrant". By re-entrant, we mean that a pruning pass that gets stuck because of a `MapFull` error can attempt to commit midway, and be resumed later without issue. The previous pruning strategy for indexed attestations did not have this property. There was also a flaw in the previous pruning that could leave "zombie" indexed attestations in the database (ones not referenced by any attester record), which could build up and contribute to bloat (although in practice I think they occur quite infrequently).

## Additional Info

During testing I noticed that a `MapFull` error can still occur during the commit of the transaction itself, which is irritating, but not unbearable. This PR should at least reduce the frequency with which users need to manually resize their DB, and if the `MapFull` on commit rears its ugly head too often we could use a dynamic strategy (temporarily increase the size of the map until the transaction commits).

The extra bytes for the epoch make the database a bit heavier, so the size estimate docs have been updated to reflect this. This is also a breaking schema change, so anyone using a v0 database from a few hours ago will need to drop it and update 😅
2020-11-23 21:33:51 +00:00
Michael Sproul
5828ff1204 Implement slasher (#1567)
This is an implementation of a slasher that lives inside the BN and can be enabled via `lighthouse bn --slasher`.

Features included in this PR:

- [x] Detection of attester slashing conditions (double votes, surrounds existing, surrounded by existing)
- [x] Integration into Lighthouse's attestation verification flow
- [x] Detection of proposer slashing conditions
- [x] Extraction of attestations from blocks as they are verified
- [x] Compression of chunks
- [x] Configurable history length
- [x] Pruning of old attestations and blocks
- [x] More tests

Future work:

* Focus on a slice of history separate from the most recent N epochs (e.g. epochs `current - K` to `current - M`)
* Run out-of-process
* Ingest attestations from the chain without a resync

Design notes are here https://hackmd.io/@sproul/HJSEklmPL
2020-11-23 03:43:22 +00:00
Paul Hauner
59b2247ab8 Improve UX whilst VC is waiting for genesis (#1915)
## Issue Addressed

- Resolves #1424

## Proposed Changes

Add a `GET lighthouse/staking` that returns 200 if the node is ready to stake (i.e., `--eth1` flag is present) or a 404 otherwise.

Whilst the VC is waiting for the genesis time to start (i.e., when the genesis state is known), check the `lighthouse/staking` endpoint and log an error if the node isn't configured for staking.

## Additional Info

NA
2020-11-23 01:00:22 +00:00
Paul Hauner
65b1cf2af1 Add flag to import all attestations (#1941)
## Issue Addressed

NA

## Proposed Changes

Adds the `--import-all-attestations` flag which tells the `network::AttestationService` to import/aggregate all attestations after verification (instead of only ones for subnets that are relevant to local validators).

This is useful for testing/debugging and also for creating back-up nodes that should be all cached up and ready for any validator.

## Additional Info

NA
2020-11-22 23:58:25 +00:00
divma
d0cbf3111a move sync state to the chains KV (#1940)
## Issue Addressed
we have a log saying we add a peer to a chain, and an another one in case the chain is not syncing. To avoid needing to peer there two (and reduce log entries) simply log the chain's syncing state in the chain's KV
2020-11-22 23:58:23 +00:00
Michael Sproul
426b3001e0 Fix race condition in seen caches (#1937)
## Issue Addressed

Closes #1719

## Proposed Changes

Lift the internal `RwLock`s and `Mutex`es from the `Observed*` data structures to resolve the race conditions described in #1719.

Most of this work was done by @paulhauner on his `lift-locks` branch, I merely updated it for the current `master` and checked over it.

## Additional Info

I think it would be prudent to test this on a testnet or two before mainnet launch, just to be sure that the extra lock contention doesn't negatively impact performance.
2020-11-22 23:02:51 +00:00
Paul Hauner
0b556c4405 Fix metrics http server error messages (#1946)
## Issue Addressed

- Resolves #1945

## Proposed Changes

- As per #1945, fix a log message from the metrics server that was falsely claiming to be from the api server.
- Ensure successful api request logs are published to debug, not trace. This is something I've wanted to do for a while.

## Additional Info

NA
2020-11-22 03:39:13 +00:00
Paul Hauner
48f73b21e6 Expand eth1 block cache, add more logs (#1938)
## Issue Addressed

NA

## Proposed Changes

- Caches later blocks than is required by `ETH1_FOLLOW_DISTANCE`.
- Adds logging to `warn` if the eth1 cache is insufficiently primed.
- Use `max_by_key` instead of `max_by` in `BeaconChain::Eth1Chain` since it's simpler.
- Rename `voting_period_start_timestamp` to `voting_target_timestamp` for accuracy.

## Additional Info

The reason for eating into the `ETH1_FOLLOW_DISTANCE` and caching blocks that are closer to the head is due to possibility for `SECONDS_PER_ETH1_BLOCK` to be incorrect (as is the case for the Pyrmont testnet on Goerli).

If `SECONDS_PER_ETH1_BLOCK` is too short, we'll skip back too far from the head and skip over blocks that would be valid [`is_candidate_block`](https://github.com/ethereum/eth2.0-specs/blob/v1.0.0/specs/phase0/validator.md#eth1-data) blocks. This was the case on the Pyrmont testnet and resulted in Lighthouse choosing blocks that were about 30 minutes older than is ideal.
2020-11-21 00:26:15 +00:00
Kirk Baird
3b405f10ea Ensure deposit signatures do not use aggregate functions (#1935)
## Issue Addressed

Resolves #1333 

## Proposed Changes

- Remove `deposit_signature_set()` function
- Prevent deposits from being in `SignatureSets`
- User `Signature.verify()` to verify deposit signatures rather than a signature set which uses `fast_aggregate_verify()`

## Additional Info

n/a
2020-11-20 03:37:20 +00:00
divma
d727e55abe Move some rpc processing to the beacon_processor (#1936)
## Issue Addressed
`BlocksByRange` requests were the main culprit of a series of timeouts to peer's requests in general because they produce build up in the router's processor. Those were moved to the blocking executor but a task is being spawned for each; also not ideal since the amount of resources we give to those is not controlled

## Proposed Changes
- Move `BlocksByRange` and `BlocksByRoots` to the `beacon_processor`. The processor crafts the responses and sends them.
- Move too the processing of `StatusMessage`s from other peers. This is a fast operation but it can also build up and won't scale if we keep it in the router (processing one at the time). These don't need to send an answer, so there is no harm in processing them "later" if that were to happen. Sending responses to status requests is still in the router, so we answer as soon as we see them.
- Some "extras" that are basically clean up:
  - Split the `Worker` logic in sync methods (chain processing and rpc blocks), gossip methods (the majority of methods) and rpc methods (the new ones)
  - Move the `status_message` function previously provided by the router's processor to a more central place since it is used by the router, sync, network_context and beacon_processor
 - Some spelling

## Additional Info
What's left to decide/test more thoroughly is the length of the queues and the priority rules. @paulhauner suggested at some point to put status above attestations, and @AgeManning had described an importance of "protecting gossipsub" so my solution is leaving status requests in the router and RPC methods below attestations. Slashings and Exits are at the end.
2020-11-19 23:33:44 +00:00
Pawan Dhananjay
e47739047d Add additional libp2p tests (#1867)
## Issue Addressed

N/A

## Proposed Changes

Adds tests for the eth2_libp2p crate.
2020-11-19 22:32:09 +00:00
realbigsean
79fd9b32b9 Update pool/attestations and committees endpoints (#1899)
## Issue Addressed

Catching up on a few eth2 spec updates:

## Proposed Changes

- adding query params to the `GET pool/attestations` endpoint
- allowing the `POST pool/attestations` endpoint to accept an array of attestations
    - batching attestation submission
- moving `epoch` from a path param to a query param in the `committees` endpoint

## Additional Info


Co-authored-by: realbigsean <seananderson33@gmail.com>
2020-11-18 23:31:39 +00:00
blacktemplar
3408de8151 Avoid string initialization in network metrics and replace by &str where possible (#1898)
## Issue Addressed

NA

## Proposed Changes

Removes most of the temporary string initializations in network metrics and replaces them by directly using `&str`. This further improves on PR https://github.com/sigp/lighthouse/pull/1895.

For the subnet id handling the current approach uses a build script to create a static map. This has the disadvantage that the build script hardcodes the number of subnets. If we want to use more than 64 subnets we need to adjust this in the build script.

## Additional Info

We still have some string initializations for the enum `PeerKind`. To also replace that by `&str` I created a PR in the libp2p dependency: https://github.com/sigp/rust-libp2p/pull/91. Either we wait with merging until this dependency PR is merged (and all conflicts with the newest libp2p version are resolved) or we just merge as is and I will create another PR when the dependency is ready.
2020-11-18 23:31:37 +00:00
Paul Hauner
bcc7f6b143 Add new flag to set blocks per eth1 query (#1931)
## Issue Addressed

NA

## Proposed Changes

Users on Discord (and @protolambda) have experienced this error (or variants of it):

```
Failed to update eth1 cache: GetDepositLogsFailed("Eth1 node returned error: {\"code\":-32005,\"message\":\"query returned more than 10000 results\"}")
```

This PR allows users to reduce the span of blocks searched for deposit logs and therefore reduce the size of the return result. Hopefully experimentation with this flag can lead to finding a better default value.


## Additional Info

NA
2020-11-18 22:18:59 +00:00
Paul Hauner
7e4ee58729 Bump to v0.3.5 (#1927)
## Issue Addressed

NA

## Proposed Changes

- Bump version to `v0.3.5`
- Run `cargo update`

## Additional Info

NA
2020-11-18 00:44:28 +00:00
Paul Hauner
103103e72e Address queue congestion in migrator (#1923)
## Issue Addressed

*Should* address #1917

## Proposed Changes

Stops the `BackgroupMigrator` rx channel from backing up with big `BeaconState` messages.

Looking at some logs from my Medalla node, we can see a discrepancy between the head finalized epoch and the migrator finalized epoch:

```
Nov 17 16:50:21.606 DEBG Head beacon block                       slot: 129214, root: 0xbc7a…0b99, finalized_epoch: 4033, finalized_root: 0xf930…6562, justified_epoch: 4035, justified_root: 0x206b…9321, service: beacon
Nov 17 16:50:21.626 DEBG Batch processed                         service: sync, processed_blocks: 43, last_block_slot: 129214, chain: 8274002112260436595, first_block_slot: 129153, batch_epoch: 4036
Nov 17 16:50:21.626 DEBG Chain advanced                          processing_target: 4036, new_start: 4036, previous_start: 4034, chain: 8274002112260436595, service: sync
Nov 17 16:50:22.162 DEBG Completed batch received                awaiting_batches: 5, blocks: 47, epoch: 4048, chain: 8274002112260436595, service: sync
Nov 17 16:50:22.162 DEBG Requesting batch                        start_slot: 129601, end_slot: 129664, downloaded: 0, processed: 0, state: Downloading(16Uiu2HAmG3C3t1McaseReECjAF694tjVVjkDoneZEbxNhWm1nZaT, 0 blocks, 1273), epoch: 4050, chain: 8274002112260436595, service: sync
Nov 17 16:50:22.654 DEBG Database compaction complete            service: beacon
Nov 17 16:50:22.655 INFO Starting database pruning               new_finalized_epoch: 2193, old_finalized_epoch: 2192, service: beacon
```

I believe this indicates that the migrator rx has a backed-up queue of `MigrationNotification` items which each contain a `BeaconState`.

## TODO

- [x] Remove finalized state requirement for op-pool
2020-11-17 23:11:26 +00:00
Michael Sproul
a60ab4eff2 Refine compaction (#1916)
## Proposed Changes

In an attempt to fix OOM issues and database consistency issues observed by some users after the introduction of compaction in v0.3.4, this PR makes the following changes:

* Run compaction less often: roughly every 1024 epochs, including after long periods of non-finality. I think the division check proposed by Paul is pretty solid, and ensures we don't miss any events where we should be compacting. LevelDB lacks an easy way to check the size of the DB, which would be another good trigger.
* Make it possible to disable the compaction on finalization using `--auto-compact-db=false`
* Make it possible to trigger a manual, single-threaded foreground compaction on start-up using `--compact-db`
* Downgrade the pruning log to `DEBUG`, as it's particularly noisy during sync

I would like to ship these changes to affected users ASAP, and will document them further in the Advanced Database section of the book if they prove effective.
2020-11-17 09:10:53 +00:00
divma
398919b5d4 router: drop requests from peers that have dc'd (#1919)
## Issue Addressed

A peer might send a lot of requests that comply to the rate limit and the disconnect, this humongous pr makes sure we don't process them if the peer is not connected
2020-11-17 02:06:21 +00:00
Pawan Dhananjay
280334b1b0 Validate eth1 chain id (#1877)
## Issue Addressed

Resolves #1815 

## Proposed Changes

Adds extra validation for eth1 chain id apart from the existing check for eth1 network id.
2020-11-16 23:10:42 +00:00
Age Manning
49c4630045 Performance improvement for db reads (#1909)
This PR adds a number of improvements:
- Downgrade a warning log when we ignore blocks for gossipsub processing
- Revert a a correction to improve logging of peer score changes
- Shift syncing DB reads off the core-executor allowing parallel processing of large sync messages
- Correct the timeout logic of RPC chunk sends, giving more time before timing out RPC outbound messages.
2020-11-16 07:28:30 +00:00
divma
eb56140582 Update logs + do not downscore peers if WE time out (#1901)
## Issue Addressed

- RPC Errors were being logged twice: first in the peer manager and then again in the router, so leave just the peer manager's one 
- The "reduce peer count" warn message gets thrown to the user for every missed chunk, so instead print it when the request times out and also do not include there info that is not relevant to the user
- The processor didn't have the service tag so add it
- Impl `KV` for status message
- Do not downscore peers if we are the ones that timed out

Other small improvements
2020-11-16 04:06:14 +00:00
realbigsean
6a7d221f72 add slot validation to attestation_data endpoint (#1888)
## Issue Addressed

Resolves #1801

## Proposed Changes

Verify queries to `attestation_data` are for no later than `current_slot + 1`. If they are later than this, return a 400.


Co-authored-by: realbigsean <seananderson33@gmail.com>
2020-11-16 02:59:35 +00:00
divma
8a16548715 Misc Peer sync info adjustments (#1896)
## Issue Addressed
#1856 

## Proposed Changes
- For clarity, the router's processor now only decides if a peer is compatible and it disconnects it or sends it to sync accordingly. No logic here regarding how useful is the peer. 
- Update peer_sync_info's rules
- Add an `IrrelevantPeer` sync status to account for incompatible peers (maybe this should be "IncompatiblePeer" now that I think about it?) this state is update upon receiving an internal goodbye in the peer manager
- Misc code cleanups
- Reduce the need to create `StatusMessage`s (and thus, `Arc` accesses )
- Add missing calls to update the global sync state

The overall effect should be:
- More peers recognized as Behind, and less as Unknown
- Peers identified as incompatible
2020-11-13 09:00:10 +00:00
Michael Sproul
46a06069c6 Release v0.3.4 (#1894)
## Proposed Changes

Bump version to v0.3.4 and update dependencies with `cargo update`.


Co-authored-by: Michael Sproul <michael@sigmaprime.io>
2020-11-13 06:06:35 +00:00
Age Manning
c00e6c2c6f Small network adjustments (#1884)
## Issue Addressed

- Asymmetric pings - Currently with symmetric ping intervals, lighthouse nodes race each other to ping often ending in simultaneous ping connections. This shifts the ping interval to be asymmetric based on inbound/outbound connections
- Correct inbound/outbound peer-db registering - It appears we were accounting inbound as outbound and vice versa in the peerdb, this has been corrected
- Improved logging

There is likely more to come - I'll leave this open as we investigate further testnets
2020-11-13 06:06:33 +00:00
Paul Hauner
8772c02fa0 Reduce temp allocations in network metrics (#1895)
## Issue Addressed

Using `heaptrack` I could see that ~75% of Lighthouse temporary allocations are caused by temporary string allocations here.

## Proposed Changes

Reduces temporary `String` allocations when updating metrics in the `network` crate. The solution isn't perfect since we rebuild our caches with each call, but it's a significant improvement.

## Additional Info

NA
2020-11-13 04:19:38 +00:00
blacktemplar
c7ac967d5a handle peer state transitions on gossipsub score changes + refactoring (#1892)
## Issue Addressed

NA

## Proposed Changes

Correctly handles peer state transitions on gossipsub changes + refactors handling of peer state transitions into one function used for lighthouse score changes and gossipsub score changes.


Co-authored-by: Age Manning <Age@AgeManning.com>
2020-11-13 03:15:03 +00:00
realbigsean
cb26c15eb6 Peer endpoint updates (#1893)
## Issue Addressed

N/A

## Proposed Changes

- rename `address` -> `last_seen_p2p_address`
- state and direction filters for `peers` endpoint
- metadata count addition to `peers` endpoint
- add `peer_count` endpoint


Co-authored-by: realbigsean <seananderson33@gmail.com>
2020-11-13 02:02:41 +00:00
blacktemplar
fcb4893f72 do subnet discoveries until we have MESH_N_LOW many peers (#1886)
## Issue Addressed

NA

## Proposed Changes

Increases the target peers for a subnet, so that subnet queries are executed until we have at least the minimum required peers for a mesh (`MESH_N_LOW`). We keep the limit of `6` target peers for aggregated subnet discovery queries, therefore the size (and the time needed) for a query doesn't change.
2020-11-13 00:56:05 +00:00
blacktemplar
7404f1ce54 Gossipsub scoring (#1668)
## Issue Addressed

#1606 

## Proposed Changes

Uses dynamic gossipsub scoring parameters depending on the number of active validators as specified in https://gist.github.com/blacktemplar/5c1862cb3f0e32a1a7fb0b25e79e6e2c.

## Additional Info

Although the parameters got tested on Medalla, extensive testing using simulations on larger networks is still to be done and we expect that we need to change the parameters, although this might only affect constants within the dynamic parameter framework.
2020-11-12 01:48:28 +00:00
realbigsean
f8da151b0b Standard beacon api updates (#1831)
## Issue Addressed

Resolves #1809 
Resolves #1824
Resolves #1818
Resolves #1828 (hopefully)

## Proposed Changes

- add `validator_index` to the proposer duties endpoint
- add the ability to query for historical proposer duties
- `StateId` deserialization now fails with a 400 warp rejection
- add the `validator_balances` endpoint
- update the `aggregate_and_proofs` endpoint to accept an array
- updates the attester duties endpoint from a `GET` to a `POST`
- reduces the number of times we query for proposer duties from once per slot per validator to only once per slot 


Co-authored-by: realbigsean <seananderson33@gmail.com>
Co-authored-by: Paul Hauner <paul@paulhauner.com>
2020-11-09 23:13:56 +00:00
Michael Sproul
556190ff46 Compact database on finalization (#1871)
## Issue Addressed

Closes #1866

## Proposed Changes

* Compact the database on finalization. This removes the deleted states from disk completely. Because it happens in the background migrator, it doesn't block other database operations while it runs. On my Medalla node it took about 1 minute and shrank the database from 90GB to 9GB.
* Fix an inefficiency in the pruning algorithm where it would always use the genesis checkpoint as the `old_finalized_checkpoint` when running for the first time after start-up. This would result in loading lots of states one-at-a-time back to genesis, and storing a lot of block roots in memory. The new code stores the old finalized checkpoint on disk and only uses genesis if no checkpoint is already stored. This makes it both backwards compatible _and_ forwards compatible -- no schema change required!
* Introduce two new `INFO` logs to indicate when pruning has started and completed. Users seem to want to know this information without enabling debug logs!
2020-11-09 07:02:21 +00:00
Paul Hauner
2f9999752e Add --testnet mainnet and start HTTP server before genesis (#1862)
## Issue Addressed

NA

## Proposed Changes

- Adds support for `--testnet mainnet`
- Start HTTP server prior to genesis

## Additional Info

**Note: This is an incomplete work-in-progress. Use Lighthouse for mainnet at your own risk.**

With this PR, you can check the deposits:

```bash
lighthouse --testnet mainnet bn --http
```
```bash
curl localhost:5052/lighthouse/eth1/deposit_cache | jq
```

```json
{
  "data": [
    {
      "deposit_data": {
        "pubkey": "0x854980aa9bf2e84723e1fa6ef682e3537257984cc9cb1daea2ce6b268084b414f0bb43206e9fa6fd7a202357d6eb2b0d",
        "withdrawal_credentials": "0x00cacf703c658b802d55baa2a5c1777500ef5051fc084330d2761bcb6ab6182b",
        "amount": "32000000000",
        "signature": "0xace226cdfd9da6b1d827c3a6ab93f91f53e8e090eb6ca5ee7c7c5fe3acc75558240ca9291684a2a7af5cac67f0558d1109cc95309f5cdf8c125185ec9dcd22635f900d791316924aed7c40cff2ffccdac0d44cf496853db678c8c53745b3545b"
      },
      "block_number": 3492981,
      "index": 0,
      "signature_is_valid": true
    },
    {
      "deposit_data": {
        "pubkey": "0x93da03a71bc4ed163c2f91c8a54ea3ba2461383dd615388fd494670f8ce571b46e698fc8d04b49e4a8ffe653f581806b",
        "withdrawal_credentials": "0x006ebfbb7c8269a78018c8b810492979561d0404d74ba9c234650baa7524dcc4",
        "amount": "32000000000",
        "signature": "0x8d1f4a1683f798a76effcc6e2cdb8c3eed5a79123d201c5ecd4ab91f768a03c30885455b8a952aeec3c02110457f97ae0a60724187b6d4129d7c352f2e1ac19b4210daacd892fe4629ad3260ce2911dceae3890b04ed28267b2d8cb831f6a92d"
      },
      "block_number": 3493427,
      "index": 1,
      "signature_is_valid": true
    },
```
2020-11-09 05:04:03 +00:00
divma
b0e9e3dcef Seen addresses store port (#1841)
## Issue Addressed
#1764
2020-11-09 04:01:03 +00:00
Age Manning
e2ae5010a6 Update libp2p (#1865)
Updates libp2p to the latest version. 

This adds tokio 0.3 support and brings back yamux support. 

This also updates some discv5 configuration parameters for leaner discovery queries
2020-11-06 04:14:14 +00:00
blacktemplar
7e7fad5734 Ignore RPC messages of disconnected peers and remove old peers based on disconnection time (#1854)
## Issue Addressed

NA

## Proposed Changes

Lets the networking behavior ignore messages of peers that are not connected. Furthermore, old peers are not removed from the peerdb based on score anymore but based on the disconnection time.
2020-11-03 23:43:10 +00:00
Age Manning
0a0f4daf9d Prevent errors for stream termination race (#1853)
Prevents an error being propagated on a race condition for RPC stream termination
2020-11-03 10:37:00 +00:00
Paul Hauner
0cde4e285c Bump version to v0.3.3 (#1850)
## Issue Addressed

NA

## Proposed Changes

- Update versions
- Run `cargo update`

## Additional Info

- Blocked on #1846
2020-11-02 23:55:15 +00:00
Paul Hauner
7afbaa807e Return eth1-related data via the API (#1797)
## Issue Addressed

- Related to #1691

## Proposed Changes

Adds the following API endpoints:

- `GET lighthouse/eth1/syncing`: status about how synced we are with Eth1.
- `GET lighthouse/eth1/block_cache`: all locally cached eth1 blocks.
- `GET lighthouse/eth1/deposit_cache`: all locally cached eth1 deposits.

Additionally:

- Moves some types from the `beacon_node/eth1` to the `common/eth2` crate, so they can be used in the API without duplication.
- Allow `update_deposit_cache` and `update_block_cache` to take an optional head block number to avoid duplicate requests.

## Additional Info

TBC
2020-11-02 00:37:30 +00:00
divma
6c0c050fbb Tweak head syncing (#1845)
## Issue Addressed

Fixes head syncing

## Proposed Changes

- Get back to statusing peers after removing chain segments and making the peer manager deal with status according to the Sync status, preventing an old known deadlock
- Also a bug where a chain would get removed if the optimistic batch succeeds being empty

## Additional Info

Tested on Medalla and looking good
2020-11-01 23:37:39 +00:00
Paul Hauner
f64f8246db Only run http_api tests in release (#1827)
## Issue Addressed

NA

## Proposed Changes

As raised by @hermanjunge in a DM, the `http_api` tests have been observed taking 100+ minutes on debug. This PR:

- Moves the `http_api` tests to only run in release.
- Groups some `http_api` tests to reduce test-setup overhead.

## Additional Info

NA
2020-10-29 22:25:20 +00:00
realbigsean
ae0f025375 Beacon state validator id filter (#1803)
## Issue Addressed

Michael's comment here: https://github.com/sigp/lighthouse/issues/1434#issuecomment-708834079
Resolves #1808

## Proposed Changes

- Add query param `id` and `status` to the `validators` endpoint
- Add string serialization and deserialization for `ValidatorStatus`
- Drop `Epoch` from `ValidatorStatus` variants

## Additional Info

Please provide any additional information. For example, future considerations
or information useful for reviewers.
2020-10-29 05:13:04 +00:00
divma
9f45ac2f5e More sync edge cases + prettify range (#1834)
## Issue Addressed
Sync edge case when we get an empty optimistic batch that passes validation and is inside the download buffer. Eventually the chain would reach the batch and treat it as an ugly state. 

## Proposed Changes
- Handle the edge case advancing the chain's target + code clarification
- Some largey changes for readability + ergonomics since rust has try ops
- Better handling of bad batch and chain states
2020-10-29 02:29:24 +00:00
blacktemplar
2bd5b9182f fix unbanning of peers (#1838)
## Issue Addressed

NA

## Proposed Changes

Currently a banned peer will remain banned indefinitely as long as update is called on the score struct regularly. This fixes this bug and the score decay starts after `BANNED_BEFORE_DECAY` seconds after banning.
2020-10-29 01:25:02 +00:00
Michael Sproul
36bd4d87f0 Update to spec v1.0.0-rc.0 and BLSv4 (#1765)
## Issue Addressed

Closes #1504 
Closes #1505
Replaces #1703
Closes #1707

## Proposed Changes

* Update BLST and Milagro to versions compatible with BLSv4 spec
* Update Lighthouse to spec v1.0.0-rc.0, and update EF test vectors
* Use the v1.0.0 constants for `MainnetEthSpec`.
* Rename `InteropEthSpec` -> `V012LegacyEthSpec`
    * Change all constants to suit the mainnet `v0.12.3` specification (i.e., Medalla).
* Deprecate the `--spec` flag for the `lighthouse` binary
    * This value is now obtained from the `config_name` field of the `YamlConfig`.
        * Built in testnet YAML files have been updated.
    * Ignore the `--spec` value, if supplied, log a warning that it will be deprecated
    * `lcli` still has the spec flag, that's fine because it's dev tooling.
* Remove the `E: EthSpec` from `YamlConfig`
    * This means we need to deser the genesis `BeaconState` on-demand, but this is fine.
* Swap the old "minimal", "mainnet" strings over to the new `EthSpecId` enum.
* Always require a `CONFIG_NAME` field in `YamlConfig` (it used to have a default).

## Additional Info

Lots of breaking changes, do not merge! ~~We will likely need a Lighthouse v0.4.0 branch, and possibly a long-term v0.3.0 branch to keep Medalla alive~~.

Co-authored-by: Kirk Baird <baird.k@outlook.com>
Co-authored-by: Paul Hauner <paul@paulhauner.com>
2020-10-28 22:19:38 +00:00
divma
ad846ad280 Inform peers of requests that exceed the maximum rate limit + log downgrade (#1830)
## Issue Addressed

#1825 

## Proposed Changes

Since we penalize more blocks by range requests that have large steps, it is possible to get requests that will never be processed. We were not informing peers about this requests and also logging CRIT that is no longer relevant. Later we should check if more sophisticated handling for those requests is needed
2020-10-27 11:46:38 +00:00
Paul Hauner
92c8eba8ca Ensure eth1 deposit/chain IDs are used from YamlConfig (#1829)
## Issue Addressed

 NA

## Proposed Changes

Fixes a bug which causes the node to reject valid eth1 nodes.

- Fix core bug: failure to apply `YamlConfig` values to `ChainSpec`.
- Add a test to prevent regression in this specific case.
- Fix an invalid log message

## Additional Info

NA
2020-10-26 03:34:14 +00:00
Paul Hauner
f157d61cc7 Address clippy lints, panic in ssz_derive on overflow (#1714)
## Issue Addressed

NA

## Proposed Changes

- Panic or return error if we overflow `usize` in SSZ decoding/encoding derive macros.
  - I claim that the panics can only be triggered by a faulty type definition in lighthouse, they cannot be triggered externally on a validly defined struct.
- Use `Ordering` instead of some `if` statements, as demanded by clippy.
- Remove some old clippy `allow` that seem to no longer be required.
- Add comments to interesting clippy statements that we're going to continue to ignore.
- Create #1713

## Additional Info

NA
2020-10-25 23:27:39 +00:00
Paul Hauner
eba51f0973 Update testnet configs, change on-disk format (#1799)
## Issue Addressed

- Related to #1691

## Proposed Changes

- Add `DEPOSIT_CHAIN_ID` and `DEPOSIT_NETWORK_ID` to `config.yaml`.
    - Pass the `DEPOSIT_NETWORK_ID` to the `eth1::Service`.
- Remove the unused `MAX_EPOCHS_PER_CROSSLINK` from the `altona` and `medalla` configs (see [spec commit](2befe90032 (diff-efb845ac2ebd4aafbc23df40f47ce25699255064e99d36d0406d0a14ca7953ec))).
- Change from compressing the whole testnet directory, to only compressing the genesis state file. This is the only file we need to compress and *not* compressing the others makes them work nicely with git.
    - We can modify the boot nodes, configs, etc. without incurring an eternal binary-blob cost on our git history.
    - This change is backwards compatible (i.e., non-breaking).

## Additional Info

NA
2020-10-25 22:15:46 +00:00
Age Manning
7453f39d68 Prevent unbanning of disconnected peers (#1822)
## Issue Addressed

Further testing revealed another edge case where we attempt to unban a peer that can be in a disconnected start. Although this causes no real issue, it does log an error to the user. 

This PR adds a check to prevent this edge case and prevents the error being logged to the user.
2020-10-24 05:24:20 +00:00
Age Manning
a3cc1a1e0f Call unban only when necessary (#1821)
This PR prevents a user-facing error. 

It prevents optimistically unbanning a peer and instead checks the state of the peer before requesting the peers state to be unbanned.
2020-10-24 03:24:19 +00:00
blacktemplar
1644289a08 Updates the libp2p to the second newest commit => Allow only one topic per message (#1819)
As @AgeManning mentioned the newest libp2p version had some problems and got downgraded again on lighthouse master. This is an intermediate version that makes no problems and only adds a small change of allowing only one topic per message.
2020-10-24 01:05:37 +00:00
Age Manning
7870b81ade Downgrade libp2p (#1817)
## Description

This downgrades the recent libp2p upgrade. 

There were issues with the RPC which prevented syncing of the chain and this upgrade needs to be further investigated.
2020-10-23 09:33:59 +00:00
Age Manning
55eee18ebb Version bump to 0.3.1 (#1813)
## Description

Bumps Lighthouse to version 0.3.1.
2020-10-23 04:16:36 +00:00
Age Manning
64c5899d25 Adds colour help to bn and vc subcommands (#1811)
Adds coloured help to the bn and vc subcommands
2020-10-23 04:16:34 +00:00
Age Manning
2c7f362908 Discovery v5.1 (#1786)
## Overview 

This updates lighthouse to discovery v5.1

Note: This makes lighthouse's discovery not compatible with any previous version. Lighthouse cannot discover peers or send/receive ENR's from any previous version. This is a breaking change. 

This resolves #1605
2020-10-23 04:16:33 +00:00
Age Manning
ae96dab5d2 Increase UPnP logging and decrease batch sizes (#1812)
## Description

This increases the logging of the underlying UPnP tasks to inform the user of UPnP error/success. 

This also decreases the batch syncing size to two epochs per batch.
2020-10-23 03:01:33 +00:00
Age Manning
c49dd94e20 Update to latest libp2p (#1810)
## Description

Updates to the latest libp2p and includes gossipsub updates. 

Of particular note is the limitation of a single topic per gossipsub message.

Co-authored-by: blacktemplar <blacktemplar@a1.net>
2020-10-23 03:01:31 +00:00
Michael Sproul
acd49d988d Implement database temp states to reduce memory usage (#1798)
## Issue Addressed

Closes #800
Closes #1713

## Proposed Changes

Implement the temporary state storage algorithm described in #800. Specifically:

* Add `DBColumn::BeaconStateTemporary`, for storing 0-length temporary marker values.
* Store intermediate states immediately as they are created, marked temporary. Delete the temporary flag if the block is processed successfully.
* Add a garbage collection process to delete leftover temporary states on start-up.
* Bump the database schema version to 2 so that a DB with temporary states can't accidentally be used with older versions of the software. The auto-migration is a no-op, but puts in place some infra that we can use for future migrations (e.g. #1784)

## Additional Info

There are two known race conditions, one potentially causing permanent faults (hopefully rare), and the other insignificant.

### Race 1: Permanent state marked temporary

EDIT: this has been fixed by the addition of a lock around the relevant critical section

There are 2 threads that are trying to store 2 different blocks that share some intermediate states (e.g. they both skip some slots from the current head). Consider this sequence of events:

1. Thread 1 checks if state `s` already exists, and seeing that it doesn't, prepares an atomic commit of `(s, s_temporary_flag)`.
2. Thread 2 does the same, but also gets as far as committing the state txn, finishing the processing of its block, and _deleting_ the temporary flag.
3. Thread 1 is (finally) scheduled again, and marks `s` as temporary with its transaction.
4.
    a) The process is killed, or thread 1's block fails verification and the temp flag is not deleted. This is a permanent failure! Any attempt to load state `s` will fail... hope it isn't on the main chain! Alternatively (4b) happens...
    b) Thread 1 finishes, and re-deletes the temporary flag. In this case the failure is transient, state `s` will disappear temporarily, but will come back once thread 1 finishes running.

I _hope_ that steps 1-3 only happen very rarely, and 4a even more rarely. It's hard to know

This once again begs the question of why we're using LevelDB (#483), when it clearly doesn't care about atomicity! A ham-fisted fix would be to wrap the hot and cold DBs in locks, which would bring us closer to how other DBs handle read-write transactions. E.g. [LMDB only allows one R/W transaction at a time](https://docs.rs/lmdb/0.8.0/lmdb/struct.Environment.html#method.begin_rw_txn).

### Race 2: Temporary state returned from `get_state`

I don't think this race really matters, but in `load_hot_state`, if another thread stores a state between when we call `load_state_temporary_flag` and when we call `load_hot_state_summary`, then we could end up returning that state even though it's only a temporary state. I can't think of any case where this would be relevant, and I suspect if it did come up, it would be safe/recoverable (having data is safer than _not_ having data).

This could be fixed by using a LevelDB read snapshot, but that would require substantial changes to how we read all our values, so I don't think it's worth it right now.
2020-10-23 01:27:51 +00:00
Age Manning
66f0cf4430 Improve peer handling (#1796)
## Issue Addressed

Potentially resolves #1647 and sync stalls. 

## Proposed Changes

The handling of the state of banned peers was inadequate for the complex peerdb data structure. We store a limited number of disconnected and banned peers in the db. We were not tracking intermediate "disconnecting" states and the in some circumstances we were updating the peer state without informing the peerdb. This lead to a number of inconsistencies in the peer state. 

Further, the peer manager could ban a peer changing a peer's state from being connected to banned. In this circumstance, if the peer then disconnected, we didn't inform the application layer, which lead to applications like sync not being informed of a peers disconnection. This could lead to sync stalling and having to require a lighthouse restart. 

Improved handling for peer states and interactions with the peerdb is made in this PR.
2020-10-23 01:27:48 +00:00
Paul Hauner
b829257cca Ssz state (#1749)
## Issue Addressed

NA

## Proposed Changes

Adds a `lighthouse/beacon/states/:state_id/ssz` endpoint to allow us to pull the genesis state from the API.

## Additional Info

NA
2020-10-22 06:05:49 +00:00
Michael Sproul
7f73dccebc Refine op pool pruning (#1805)
## Issue Addressed

Closes #1769
Closes #1708

## Proposed Changes

Tweaks the op pool pruning so that the attestation pool is pruned against the wall-clock epoch instead of the finalized state's epoch. This should reduce the unbounded growth that we've seen during periods without finality.

Also fixes up the voluntary exit pruning as raised in #1708.
2020-10-22 04:47:29 +00:00
Paul Hauner
a3704b971e Support pre-flight CORS check (#1772)
## Issue Addressed

- Resolves #1766 

## Proposed Changes

- Use the `warp::filters::cors` filter instead of our work-around.

## Additional Info

It's not trivial to enable/disable `cors` using `warp`, since using `routes.with(cors)` changes the type of `routes`.  This makes it difficult to apply/not apply cors at runtime. My solution has been to *always* use the `warp::filters::cors` wrapper but when cors should be disabled, just pass the HTTP server listen address as the only permissible origin.
2020-10-22 04:47:27 +00:00
realbigsean
a3552a4b70 Node endpoints (#1778)
## Issue Addressed

`node` endpoints in #1434

## Proposed Changes

Implement these:
```
 /eth/v1/node/health
 /eth/v1/node/peers/{peer_id}
 /eth/v1/node/peers
```
- Add an `Option<Enr>` to `PeerInfo`
- Finish implementation of `/eth/v1/node/identity`

## Additional Info
- should update the `peers` endpoints when #1764 is resolved



Co-authored-by: realbigsean <seananderson33@gmail.com>
2020-10-22 02:59:42 +00:00
Daniel Schonfeld
8f86baa48d Optimize attester slashing (#1745)
## Issue Addressed

Closes #1548 

## Proposed Changes

Optimizes attester slashing choice by choosing the ones that cover the most amount of validators slashed, with the highest effective balances 

## Additional Info

Initial pass, need to write a test for it
2020-10-22 01:43:54 +00:00
divma
668513b67e Sync state adjustments (#1804)
check for advanced peers and the state of the chain wrt the clock slot to decide if a chain is or not synced /transitioning to a head sync. Also a fix that prevented getting the right state while syncing heads
2020-10-22 00:26:06 +00:00