Commit Graph

1432 Commits

Author SHA1 Message Date
realbigsean
79fd9b32b9 Update pool/attestations and committees endpoints (#1899)
## Issue Addressed

Catching up on a few eth2 spec updates:

## Proposed Changes

- adding query params to the `GET pool/attestations` endpoint
- allowing the `POST pool/attestations` endpoint to accept an array of attestations
    - batching attestation submission
- moving `epoch` from a path param to a query param in the `committees` endpoint

## Additional Info


Co-authored-by: realbigsean <seananderson33@gmail.com>
2020-11-18 23:31:39 +00:00
blacktemplar
3408de8151 Avoid string initialization in network metrics and replace by &str where possible (#1898)
## Issue Addressed

NA

## Proposed Changes

Removes most of the temporary string initializations in network metrics and replaces them by directly using `&str`. This further improves on PR https://github.com/sigp/lighthouse/pull/1895.

For the subnet id handling the current approach uses a build script to create a static map. This has the disadvantage that the build script hardcodes the number of subnets. If we want to use more than 64 subnets we need to adjust this in the build script.

## Additional Info

We still have some string initializations for the enum `PeerKind`. To also replace that by `&str` I created a PR in the libp2p dependency: https://github.com/sigp/rust-libp2p/pull/91. Either we wait with merging until this dependency PR is merged (and all conflicts with the newest libp2p version are resolved) or we just merge as is and I will create another PR when the dependency is ready.
2020-11-18 23:31:37 +00:00
Paul Hauner
bcc7f6b143 Add new flag to set blocks per eth1 query (#1931)
## Issue Addressed

NA

## Proposed Changes

Users on Discord (and @protolambda) have experienced this error (or variants of it):

```
Failed to update eth1 cache: GetDepositLogsFailed("Eth1 node returned error: {\"code\":-32005,\"message\":\"query returned more than 10000 results\"}")
```

This PR allows users to reduce the span of blocks searched for deposit logs and therefore reduce the size of the return result. Hopefully experimentation with this flag can lead to finding a better default value.


## Additional Info

NA
2020-11-18 22:18:59 +00:00
Paul Hauner
7e4ee58729 Bump to v0.3.5 (#1927)
## Issue Addressed

NA

## Proposed Changes

- Bump version to `v0.3.5`
- Run `cargo update`

## Additional Info

NA
2020-11-18 00:44:28 +00:00
Paul Hauner
103103e72e Address queue congestion in migrator (#1923)
## Issue Addressed

*Should* address #1917

## Proposed Changes

Stops the `BackgroupMigrator` rx channel from backing up with big `BeaconState` messages.

Looking at some logs from my Medalla node, we can see a discrepancy between the head finalized epoch and the migrator finalized epoch:

```
Nov 17 16:50:21.606 DEBG Head beacon block                       slot: 129214, root: 0xbc7a…0b99, finalized_epoch: 4033, finalized_root: 0xf930…6562, justified_epoch: 4035, justified_root: 0x206b…9321, service: beacon
Nov 17 16:50:21.626 DEBG Batch processed                         service: sync, processed_blocks: 43, last_block_slot: 129214, chain: 8274002112260436595, first_block_slot: 129153, batch_epoch: 4036
Nov 17 16:50:21.626 DEBG Chain advanced                          processing_target: 4036, new_start: 4036, previous_start: 4034, chain: 8274002112260436595, service: sync
Nov 17 16:50:22.162 DEBG Completed batch received                awaiting_batches: 5, blocks: 47, epoch: 4048, chain: 8274002112260436595, service: sync
Nov 17 16:50:22.162 DEBG Requesting batch                        start_slot: 129601, end_slot: 129664, downloaded: 0, processed: 0, state: Downloading(16Uiu2HAmG3C3t1McaseReECjAF694tjVVjkDoneZEbxNhWm1nZaT, 0 blocks, 1273), epoch: 4050, chain: 8274002112260436595, service: sync
Nov 17 16:50:22.654 DEBG Database compaction complete            service: beacon
Nov 17 16:50:22.655 INFO Starting database pruning               new_finalized_epoch: 2193, old_finalized_epoch: 2192, service: beacon
```

I believe this indicates that the migrator rx has a backed-up queue of `MigrationNotification` items which each contain a `BeaconState`.

## TODO

- [x] Remove finalized state requirement for op-pool
2020-11-17 23:11:26 +00:00
Michael Sproul
a60ab4eff2 Refine compaction (#1916)
## Proposed Changes

In an attempt to fix OOM issues and database consistency issues observed by some users after the introduction of compaction in v0.3.4, this PR makes the following changes:

* Run compaction less often: roughly every 1024 epochs, including after long periods of non-finality. I think the division check proposed by Paul is pretty solid, and ensures we don't miss any events where we should be compacting. LevelDB lacks an easy way to check the size of the DB, which would be another good trigger.
* Make it possible to disable the compaction on finalization using `--auto-compact-db=false`
* Make it possible to trigger a manual, single-threaded foreground compaction on start-up using `--compact-db`
* Downgrade the pruning log to `DEBUG`, as it's particularly noisy during sync

I would like to ship these changes to affected users ASAP, and will document them further in the Advanced Database section of the book if they prove effective.
2020-11-17 09:10:53 +00:00
divma
398919b5d4 router: drop requests from peers that have dc'd (#1919)
## Issue Addressed

A peer might send a lot of requests that comply to the rate limit and the disconnect, this humongous pr makes sure we don't process them if the peer is not connected
2020-11-17 02:06:21 +00:00
Pawan Dhananjay
280334b1b0 Validate eth1 chain id (#1877)
## Issue Addressed

Resolves #1815 

## Proposed Changes

Adds extra validation for eth1 chain id apart from the existing check for eth1 network id.
2020-11-16 23:10:42 +00:00
Age Manning
49c4630045 Performance improvement for db reads (#1909)
This PR adds a number of improvements:
- Downgrade a warning log when we ignore blocks for gossipsub processing
- Revert a a correction to improve logging of peer score changes
- Shift syncing DB reads off the core-executor allowing parallel processing of large sync messages
- Correct the timeout logic of RPC chunk sends, giving more time before timing out RPC outbound messages.
2020-11-16 07:28:30 +00:00
divma
eb56140582 Update logs + do not downscore peers if WE time out (#1901)
## Issue Addressed

- RPC Errors were being logged twice: first in the peer manager and then again in the router, so leave just the peer manager's one 
- The "reduce peer count" warn message gets thrown to the user for every missed chunk, so instead print it when the request times out and also do not include there info that is not relevant to the user
- The processor didn't have the service tag so add it
- Impl `KV` for status message
- Do not downscore peers if we are the ones that timed out

Other small improvements
2020-11-16 04:06:14 +00:00
realbigsean
6a7d221f72 add slot validation to attestation_data endpoint (#1888)
## Issue Addressed

Resolves #1801

## Proposed Changes

Verify queries to `attestation_data` are for no later than `current_slot + 1`. If they are later than this, return a 400.


Co-authored-by: realbigsean <seananderson33@gmail.com>
2020-11-16 02:59:35 +00:00
divma
8a16548715 Misc Peer sync info adjustments (#1896)
## Issue Addressed
#1856 

## Proposed Changes
- For clarity, the router's processor now only decides if a peer is compatible and it disconnects it or sends it to sync accordingly. No logic here regarding how useful is the peer. 
- Update peer_sync_info's rules
- Add an `IrrelevantPeer` sync status to account for incompatible peers (maybe this should be "IncompatiblePeer" now that I think about it?) this state is update upon receiving an internal goodbye in the peer manager
- Misc code cleanups
- Reduce the need to create `StatusMessage`s (and thus, `Arc` accesses )
- Add missing calls to update the global sync state

The overall effect should be:
- More peers recognized as Behind, and less as Unknown
- Peers identified as incompatible
2020-11-13 09:00:10 +00:00
Michael Sproul
46a06069c6 Release v0.3.4 (#1894)
## Proposed Changes

Bump version to v0.3.4 and update dependencies with `cargo update`.


Co-authored-by: Michael Sproul <michael@sigmaprime.io>
2020-11-13 06:06:35 +00:00
Age Manning
c00e6c2c6f Small network adjustments (#1884)
## Issue Addressed

- Asymmetric pings - Currently with symmetric ping intervals, lighthouse nodes race each other to ping often ending in simultaneous ping connections. This shifts the ping interval to be asymmetric based on inbound/outbound connections
- Correct inbound/outbound peer-db registering - It appears we were accounting inbound as outbound and vice versa in the peerdb, this has been corrected
- Improved logging

There is likely more to come - I'll leave this open as we investigate further testnets
2020-11-13 06:06:33 +00:00
Paul Hauner
8772c02fa0 Reduce temp allocations in network metrics (#1895)
## Issue Addressed

Using `heaptrack` I could see that ~75% of Lighthouse temporary allocations are caused by temporary string allocations here.

## Proposed Changes

Reduces temporary `String` allocations when updating metrics in the `network` crate. The solution isn't perfect since we rebuild our caches with each call, but it's a significant improvement.

## Additional Info

NA
2020-11-13 04:19:38 +00:00
blacktemplar
c7ac967d5a handle peer state transitions on gossipsub score changes + refactoring (#1892)
## Issue Addressed

NA

## Proposed Changes

Correctly handles peer state transitions on gossipsub changes + refactors handling of peer state transitions into one function used for lighthouse score changes and gossipsub score changes.


Co-authored-by: Age Manning <Age@AgeManning.com>
2020-11-13 03:15:03 +00:00
realbigsean
cb26c15eb6 Peer endpoint updates (#1893)
## Issue Addressed

N/A

## Proposed Changes

- rename `address` -> `last_seen_p2p_address`
- state and direction filters for `peers` endpoint
- metadata count addition to `peers` endpoint
- add `peer_count` endpoint


Co-authored-by: realbigsean <seananderson33@gmail.com>
2020-11-13 02:02:41 +00:00
blacktemplar
fcb4893f72 do subnet discoveries until we have MESH_N_LOW many peers (#1886)
## Issue Addressed

NA

## Proposed Changes

Increases the target peers for a subnet, so that subnet queries are executed until we have at least the minimum required peers for a mesh (`MESH_N_LOW`). We keep the limit of `6` target peers for aggregated subnet discovery queries, therefore the size (and the time needed) for a query doesn't change.
2020-11-13 00:56:05 +00:00
blacktemplar
7404f1ce54 Gossipsub scoring (#1668)
## Issue Addressed

#1606 

## Proposed Changes

Uses dynamic gossipsub scoring parameters depending on the number of active validators as specified in https://gist.github.com/blacktemplar/5c1862cb3f0e32a1a7fb0b25e79e6e2c.

## Additional Info

Although the parameters got tested on Medalla, extensive testing using simulations on larger networks is still to be done and we expect that we need to change the parameters, although this might only affect constants within the dynamic parameter framework.
2020-11-12 01:48:28 +00:00
realbigsean
f8da151b0b Standard beacon api updates (#1831)
## Issue Addressed

Resolves #1809 
Resolves #1824
Resolves #1818
Resolves #1828 (hopefully)

## Proposed Changes

- add `validator_index` to the proposer duties endpoint
- add the ability to query for historical proposer duties
- `StateId` deserialization now fails with a 400 warp rejection
- add the `validator_balances` endpoint
- update the `aggregate_and_proofs` endpoint to accept an array
- updates the attester duties endpoint from a `GET` to a `POST`
- reduces the number of times we query for proposer duties from once per slot per validator to only once per slot 


Co-authored-by: realbigsean <seananderson33@gmail.com>
Co-authored-by: Paul Hauner <paul@paulhauner.com>
2020-11-09 23:13:56 +00:00
Michael Sproul
556190ff46 Compact database on finalization (#1871)
## Issue Addressed

Closes #1866

## Proposed Changes

* Compact the database on finalization. This removes the deleted states from disk completely. Because it happens in the background migrator, it doesn't block other database operations while it runs. On my Medalla node it took about 1 minute and shrank the database from 90GB to 9GB.
* Fix an inefficiency in the pruning algorithm where it would always use the genesis checkpoint as the `old_finalized_checkpoint` when running for the first time after start-up. This would result in loading lots of states one-at-a-time back to genesis, and storing a lot of block roots in memory. The new code stores the old finalized checkpoint on disk and only uses genesis if no checkpoint is already stored. This makes it both backwards compatible _and_ forwards compatible -- no schema change required!
* Introduce two new `INFO` logs to indicate when pruning has started and completed. Users seem to want to know this information without enabling debug logs!
2020-11-09 07:02:21 +00:00
Paul Hauner
2f9999752e Add --testnet mainnet and start HTTP server before genesis (#1862)
## Issue Addressed

NA

## Proposed Changes

- Adds support for `--testnet mainnet`
- Start HTTP server prior to genesis

## Additional Info

**Note: This is an incomplete work-in-progress. Use Lighthouse for mainnet at your own risk.**

With this PR, you can check the deposits:

```bash
lighthouse --testnet mainnet bn --http
```
```bash
curl localhost:5052/lighthouse/eth1/deposit_cache | jq
```

```json
{
  "data": [
    {
      "deposit_data": {
        "pubkey": "0x854980aa9bf2e84723e1fa6ef682e3537257984cc9cb1daea2ce6b268084b414f0bb43206e9fa6fd7a202357d6eb2b0d",
        "withdrawal_credentials": "0x00cacf703c658b802d55baa2a5c1777500ef5051fc084330d2761bcb6ab6182b",
        "amount": "32000000000",
        "signature": "0xace226cdfd9da6b1d827c3a6ab93f91f53e8e090eb6ca5ee7c7c5fe3acc75558240ca9291684a2a7af5cac67f0558d1109cc95309f5cdf8c125185ec9dcd22635f900d791316924aed7c40cff2ffccdac0d44cf496853db678c8c53745b3545b"
      },
      "block_number": 3492981,
      "index": 0,
      "signature_is_valid": true
    },
    {
      "deposit_data": {
        "pubkey": "0x93da03a71bc4ed163c2f91c8a54ea3ba2461383dd615388fd494670f8ce571b46e698fc8d04b49e4a8ffe653f581806b",
        "withdrawal_credentials": "0x006ebfbb7c8269a78018c8b810492979561d0404d74ba9c234650baa7524dcc4",
        "amount": "32000000000",
        "signature": "0x8d1f4a1683f798a76effcc6e2cdb8c3eed5a79123d201c5ecd4ab91f768a03c30885455b8a952aeec3c02110457f97ae0a60724187b6d4129d7c352f2e1ac19b4210daacd892fe4629ad3260ce2911dceae3890b04ed28267b2d8cb831f6a92d"
      },
      "block_number": 3493427,
      "index": 1,
      "signature_is_valid": true
    },
```
2020-11-09 05:04:03 +00:00
divma
b0e9e3dcef Seen addresses store port (#1841)
## Issue Addressed
#1764
2020-11-09 04:01:03 +00:00
Age Manning
e2ae5010a6 Update libp2p (#1865)
Updates libp2p to the latest version. 

This adds tokio 0.3 support and brings back yamux support. 

This also updates some discv5 configuration parameters for leaner discovery queries
2020-11-06 04:14:14 +00:00
blacktemplar
7e7fad5734 Ignore RPC messages of disconnected peers and remove old peers based on disconnection time (#1854)
## Issue Addressed

NA

## Proposed Changes

Lets the networking behavior ignore messages of peers that are not connected. Furthermore, old peers are not removed from the peerdb based on score anymore but based on the disconnection time.
2020-11-03 23:43:10 +00:00
Age Manning
0a0f4daf9d Prevent errors for stream termination race (#1853)
Prevents an error being propagated on a race condition for RPC stream termination
2020-11-03 10:37:00 +00:00
Paul Hauner
0cde4e285c Bump version to v0.3.3 (#1850)
## Issue Addressed

NA

## Proposed Changes

- Update versions
- Run `cargo update`

## Additional Info

- Blocked on #1846
2020-11-02 23:55:15 +00:00
Paul Hauner
7afbaa807e Return eth1-related data via the API (#1797)
## Issue Addressed

- Related to #1691

## Proposed Changes

Adds the following API endpoints:

- `GET lighthouse/eth1/syncing`: status about how synced we are with Eth1.
- `GET lighthouse/eth1/block_cache`: all locally cached eth1 blocks.
- `GET lighthouse/eth1/deposit_cache`: all locally cached eth1 deposits.

Additionally:

- Moves some types from the `beacon_node/eth1` to the `common/eth2` crate, so they can be used in the API without duplication.
- Allow `update_deposit_cache` and `update_block_cache` to take an optional head block number to avoid duplicate requests.

## Additional Info

TBC
2020-11-02 00:37:30 +00:00
divma
6c0c050fbb Tweak head syncing (#1845)
## Issue Addressed

Fixes head syncing

## Proposed Changes

- Get back to statusing peers after removing chain segments and making the peer manager deal with status according to the Sync status, preventing an old known deadlock
- Also a bug where a chain would get removed if the optimistic batch succeeds being empty

## Additional Info

Tested on Medalla and looking good
2020-11-01 23:37:39 +00:00
Paul Hauner
f64f8246db Only run http_api tests in release (#1827)
## Issue Addressed

NA

## Proposed Changes

As raised by @hermanjunge in a DM, the `http_api` tests have been observed taking 100+ minutes on debug. This PR:

- Moves the `http_api` tests to only run in release.
- Groups some `http_api` tests to reduce test-setup overhead.

## Additional Info

NA
2020-10-29 22:25:20 +00:00
realbigsean
ae0f025375 Beacon state validator id filter (#1803)
## Issue Addressed

Michael's comment here: https://github.com/sigp/lighthouse/issues/1434#issuecomment-708834079
Resolves #1808

## Proposed Changes

- Add query param `id` and `status` to the `validators` endpoint
- Add string serialization and deserialization for `ValidatorStatus`
- Drop `Epoch` from `ValidatorStatus` variants

## Additional Info

Please provide any additional information. For example, future considerations
or information useful for reviewers.
2020-10-29 05:13:04 +00:00
divma
9f45ac2f5e More sync edge cases + prettify range (#1834)
## Issue Addressed
Sync edge case when we get an empty optimistic batch that passes validation and is inside the download buffer. Eventually the chain would reach the batch and treat it as an ugly state. 

## Proposed Changes
- Handle the edge case advancing the chain's target + code clarification
- Some largey changes for readability + ergonomics since rust has try ops
- Better handling of bad batch and chain states
2020-10-29 02:29:24 +00:00
blacktemplar
2bd5b9182f fix unbanning of peers (#1838)
## Issue Addressed

NA

## Proposed Changes

Currently a banned peer will remain banned indefinitely as long as update is called on the score struct regularly. This fixes this bug and the score decay starts after `BANNED_BEFORE_DECAY` seconds after banning.
2020-10-29 01:25:02 +00:00
Michael Sproul
36bd4d87f0 Update to spec v1.0.0-rc.0 and BLSv4 (#1765)
## Issue Addressed

Closes #1504 
Closes #1505
Replaces #1703
Closes #1707

## Proposed Changes

* Update BLST and Milagro to versions compatible with BLSv4 spec
* Update Lighthouse to spec v1.0.0-rc.0, and update EF test vectors
* Use the v1.0.0 constants for `MainnetEthSpec`.
* Rename `InteropEthSpec` -> `V012LegacyEthSpec`
    * Change all constants to suit the mainnet `v0.12.3` specification (i.e., Medalla).
* Deprecate the `--spec` flag for the `lighthouse` binary
    * This value is now obtained from the `config_name` field of the `YamlConfig`.
        * Built in testnet YAML files have been updated.
    * Ignore the `--spec` value, if supplied, log a warning that it will be deprecated
    * `lcli` still has the spec flag, that's fine because it's dev tooling.
* Remove the `E: EthSpec` from `YamlConfig`
    * This means we need to deser the genesis `BeaconState` on-demand, but this is fine.
* Swap the old "minimal", "mainnet" strings over to the new `EthSpecId` enum.
* Always require a `CONFIG_NAME` field in `YamlConfig` (it used to have a default).

## Additional Info

Lots of breaking changes, do not merge! ~~We will likely need a Lighthouse v0.4.0 branch, and possibly a long-term v0.3.0 branch to keep Medalla alive~~.

Co-authored-by: Kirk Baird <baird.k@outlook.com>
Co-authored-by: Paul Hauner <paul@paulhauner.com>
2020-10-28 22:19:38 +00:00
divma
ad846ad280 Inform peers of requests that exceed the maximum rate limit + log downgrade (#1830)
## Issue Addressed

#1825 

## Proposed Changes

Since we penalize more blocks by range requests that have large steps, it is possible to get requests that will never be processed. We were not informing peers about this requests and also logging CRIT that is no longer relevant. Later we should check if more sophisticated handling for those requests is needed
2020-10-27 11:46:38 +00:00
Paul Hauner
92c8eba8ca Ensure eth1 deposit/chain IDs are used from YamlConfig (#1829)
## Issue Addressed

 NA

## Proposed Changes

Fixes a bug which causes the node to reject valid eth1 nodes.

- Fix core bug: failure to apply `YamlConfig` values to `ChainSpec`.
- Add a test to prevent regression in this specific case.
- Fix an invalid log message

## Additional Info

NA
2020-10-26 03:34:14 +00:00
Paul Hauner
f157d61cc7 Address clippy lints, panic in ssz_derive on overflow (#1714)
## Issue Addressed

NA

## Proposed Changes

- Panic or return error if we overflow `usize` in SSZ decoding/encoding derive macros.
  - I claim that the panics can only be triggered by a faulty type definition in lighthouse, they cannot be triggered externally on a validly defined struct.
- Use `Ordering` instead of some `if` statements, as demanded by clippy.
- Remove some old clippy `allow` that seem to no longer be required.
- Add comments to interesting clippy statements that we're going to continue to ignore.
- Create #1713

## Additional Info

NA
2020-10-25 23:27:39 +00:00
Paul Hauner
eba51f0973 Update testnet configs, change on-disk format (#1799)
## Issue Addressed

- Related to #1691

## Proposed Changes

- Add `DEPOSIT_CHAIN_ID` and `DEPOSIT_NETWORK_ID` to `config.yaml`.
    - Pass the `DEPOSIT_NETWORK_ID` to the `eth1::Service`.
- Remove the unused `MAX_EPOCHS_PER_CROSSLINK` from the `altona` and `medalla` configs (see [spec commit](2befe90032 (diff-efb845ac2ebd4aafbc23df40f47ce25699255064e99d36d0406d0a14ca7953ec))).
- Change from compressing the whole testnet directory, to only compressing the genesis state file. This is the only file we need to compress and *not* compressing the others makes them work nicely with git.
    - We can modify the boot nodes, configs, etc. without incurring an eternal binary-blob cost on our git history.
    - This change is backwards compatible (i.e., non-breaking).

## Additional Info

NA
2020-10-25 22:15:46 +00:00
Age Manning
7453f39d68 Prevent unbanning of disconnected peers (#1822)
## Issue Addressed

Further testing revealed another edge case where we attempt to unban a peer that can be in a disconnected start. Although this causes no real issue, it does log an error to the user. 

This PR adds a check to prevent this edge case and prevents the error being logged to the user.
2020-10-24 05:24:20 +00:00
Age Manning
a3cc1a1e0f Call unban only when necessary (#1821)
This PR prevents a user-facing error. 

It prevents optimistically unbanning a peer and instead checks the state of the peer before requesting the peers state to be unbanned.
2020-10-24 03:24:19 +00:00
blacktemplar
1644289a08 Updates the libp2p to the second newest commit => Allow only one topic per message (#1819)
As @AgeManning mentioned the newest libp2p version had some problems and got downgraded again on lighthouse master. This is an intermediate version that makes no problems and only adds a small change of allowing only one topic per message.
2020-10-24 01:05:37 +00:00
Age Manning
7870b81ade Downgrade libp2p (#1817)
## Description

This downgrades the recent libp2p upgrade. 

There were issues with the RPC which prevented syncing of the chain and this upgrade needs to be further investigated.
2020-10-23 09:33:59 +00:00
Age Manning
55eee18ebb Version bump to 0.3.1 (#1813)
## Description

Bumps Lighthouse to version 0.3.1.
2020-10-23 04:16:36 +00:00
Age Manning
64c5899d25 Adds colour help to bn and vc subcommands (#1811)
Adds coloured help to the bn and vc subcommands
2020-10-23 04:16:34 +00:00
Age Manning
2c7f362908 Discovery v5.1 (#1786)
## Overview 

This updates lighthouse to discovery v5.1

Note: This makes lighthouse's discovery not compatible with any previous version. Lighthouse cannot discover peers or send/receive ENR's from any previous version. This is a breaking change. 

This resolves #1605
2020-10-23 04:16:33 +00:00
Age Manning
ae96dab5d2 Increase UPnP logging and decrease batch sizes (#1812)
## Description

This increases the logging of the underlying UPnP tasks to inform the user of UPnP error/success. 

This also decreases the batch syncing size to two epochs per batch.
2020-10-23 03:01:33 +00:00
Age Manning
c49dd94e20 Update to latest libp2p (#1810)
## Description

Updates to the latest libp2p and includes gossipsub updates. 

Of particular note is the limitation of a single topic per gossipsub message.

Co-authored-by: blacktemplar <blacktemplar@a1.net>
2020-10-23 03:01:31 +00:00
Michael Sproul
acd49d988d Implement database temp states to reduce memory usage (#1798)
## Issue Addressed

Closes #800
Closes #1713

## Proposed Changes

Implement the temporary state storage algorithm described in #800. Specifically:

* Add `DBColumn::BeaconStateTemporary`, for storing 0-length temporary marker values.
* Store intermediate states immediately as they are created, marked temporary. Delete the temporary flag if the block is processed successfully.
* Add a garbage collection process to delete leftover temporary states on start-up.
* Bump the database schema version to 2 so that a DB with temporary states can't accidentally be used with older versions of the software. The auto-migration is a no-op, but puts in place some infra that we can use for future migrations (e.g. #1784)

## Additional Info

There are two known race conditions, one potentially causing permanent faults (hopefully rare), and the other insignificant.

### Race 1: Permanent state marked temporary

EDIT: this has been fixed by the addition of a lock around the relevant critical section

There are 2 threads that are trying to store 2 different blocks that share some intermediate states (e.g. they both skip some slots from the current head). Consider this sequence of events:

1. Thread 1 checks if state `s` already exists, and seeing that it doesn't, prepares an atomic commit of `(s, s_temporary_flag)`.
2. Thread 2 does the same, but also gets as far as committing the state txn, finishing the processing of its block, and _deleting_ the temporary flag.
3. Thread 1 is (finally) scheduled again, and marks `s` as temporary with its transaction.
4.
    a) The process is killed, or thread 1's block fails verification and the temp flag is not deleted. This is a permanent failure! Any attempt to load state `s` will fail... hope it isn't on the main chain! Alternatively (4b) happens...
    b) Thread 1 finishes, and re-deletes the temporary flag. In this case the failure is transient, state `s` will disappear temporarily, but will come back once thread 1 finishes running.

I _hope_ that steps 1-3 only happen very rarely, and 4a even more rarely. It's hard to know

This once again begs the question of why we're using LevelDB (#483), when it clearly doesn't care about atomicity! A ham-fisted fix would be to wrap the hot and cold DBs in locks, which would bring us closer to how other DBs handle read-write transactions. E.g. [LMDB only allows one R/W transaction at a time](https://docs.rs/lmdb/0.8.0/lmdb/struct.Environment.html#method.begin_rw_txn).

### Race 2: Temporary state returned from `get_state`

I don't think this race really matters, but in `load_hot_state`, if another thread stores a state between when we call `load_state_temporary_flag` and when we call `load_hot_state_summary`, then we could end up returning that state even though it's only a temporary state. I can't think of any case where this would be relevant, and I suspect if it did come up, it would be safe/recoverable (having data is safer than _not_ having data).

This could be fixed by using a LevelDB read snapshot, but that would require substantial changes to how we read all our values, so I don't think it's worth it right now.
2020-10-23 01:27:51 +00:00
Age Manning
66f0cf4430 Improve peer handling (#1796)
## Issue Addressed

Potentially resolves #1647 and sync stalls. 

## Proposed Changes

The handling of the state of banned peers was inadequate for the complex peerdb data structure. We store a limited number of disconnected and banned peers in the db. We were not tracking intermediate "disconnecting" states and the in some circumstances we were updating the peer state without informing the peerdb. This lead to a number of inconsistencies in the peer state. 

Further, the peer manager could ban a peer changing a peer's state from being connected to banned. In this circumstance, if the peer then disconnected, we didn't inform the application layer, which lead to applications like sync not being informed of a peers disconnection. This could lead to sync stalling and having to require a lighthouse restart. 

Improved handling for peer states and interactions with the peerdb is made in this PR.
2020-10-23 01:27:48 +00:00
Paul Hauner
b829257cca Ssz state (#1749)
## Issue Addressed

NA

## Proposed Changes

Adds a `lighthouse/beacon/states/:state_id/ssz` endpoint to allow us to pull the genesis state from the API.

## Additional Info

NA
2020-10-22 06:05:49 +00:00