Commit Graph

53 Commits

Author SHA1 Message Date
Michael Sproul
bcdd960ab1 Separate execution payloads in the DB (#3157)
## Proposed Changes

Reduce post-merge disk usage by not storing finalized execution payloads in Lighthouse's database.

⚠️ **This is achieved in a backwards-incompatible way for networks that have already merged** ⚠️. Kiln users and shadow fork enjoyers will be unable to downgrade after running the code from this PR. The upgrade migration may take several minutes to run, and can't be aborted after it begins.

The main changes are:

- New column in the database called `ExecPayload`, keyed by beacon block root.
- The `BeaconBlock` column now stores blinded blocks only.
- Lots of places that previously used full blocks now use blinded blocks, e.g. analytics APIs, block replay in the DB, etc.
- On finalization:
    - `prune_abanonded_forks` deletes non-canonical payloads whilst deleting non-canonical blocks.
    - `migrate_db` deletes finalized canonical payloads whilst deleting finalized states.
- Conversions between blinded and full blocks are implemented in a compositional way, duplicating some work from Sean's PR #3134.
- The execution layer has a new `get_payload_by_block_hash` method that reconstructs a payload using the EE's `eth_getBlockByHash` call.
   - I've tested manually that it works on Kiln, using Geth and Nethermind.
   - This isn't necessarily the most efficient method, and new engine APIs are being discussed to improve this: https://github.com/ethereum/execution-apis/pull/146.
   - We're depending on the `ethers` master branch, due to lots of recent changes. We're also using a workaround for https://github.com/gakonst/ethers-rs/issues/1134.
- Payload reconstruction is used in the HTTP API via `BeaconChain::get_block`, which is now `async`. Due to the `async` fn, the `blocking_json` wrapper has been removed.
- Payload reconstruction is used in network RPC to serve blocks-by-{root,range} responses. Here the `async` adjustment is messier, although I think I've managed to come up with a reasonable compromise: the handlers take the `SendOnDrop` by value so that they can drop it on _task completion_ (after the `fn` returns). Still, this is introducing disk reads onto core executor threads, which may have a negative performance impact (thoughts appreciated).

## Additional Info

- [x] For performance it would be great to remove the cloning of full blocks when converting them to blinded blocks to write to disk. I'm going to experiment with a `put_block` API that takes the block by value, breaks it into a blinded block and a payload, stores the blinded block, and then re-assembles the full block for the caller.
- [x] We should measure the latency of blocks-by-root and blocks-by-range responses.
- [x] We should add integration tests that stress the payload reconstruction (basic tests done, issue for more extensive tests: https://github.com/sigp/lighthouse/issues/3159)
- [x] We should (manually) test the schema v9 migration from several prior versions, particularly as blocks have changed on disk and some migrations rely on being able to load blocks.

Co-authored-by: Paul Hauner <paul@paulhauner.com>
2022-05-12 00:42:17 +00:00
Michael Sproul
375e2b49b3 Conserve disk space by raising default SPRP (#3137)
## Proposed Changes

Increase the default `--slots-per-restore-point` to 8192 for a 4x reduction in freezer DB disk usage.

Existing nodes that use the previous default of 2048 will be left unchanged. Newly synced nodes (with or without checkpoint sync) will use the new 8192 default. 

Long-term we could do away with the freezer DB entirely for validator-only nodes, but this change is much simpler and grants us some extra space in the short term. We can also roll it out gradually across our nodes by purging databases one by one, while keeping the Ansible config the same.

## Additional Info

We ignore a change from 2048 to 8192 if the user hasn't set the 8192 explicitly. We fire a debug log in the case where we do ignore:

```
DEBG Ignoring slots-per-restore-point config in favour of on-disk value, on_disk: 2048, config: 8192
```
2022-04-01 07:16:25 +00:00
Michael Sproul
41e7a07c51 Add lighthouse db command (#3129)
## Proposed Changes

Add a `lighthouse db` command with three initial subcommands:

- `lighthouse db version`: print the database schema version.
- `lighthouse db migrate --to N`: manually upgrade (or downgrade!) the database to a different version.
- `lighthouse db inspect --column C`: log the key and size in bytes of every value in a given `DBColumn`.

This PR lays the groundwork for other changes, namely:

- Mark's fast-deposit sync (https://github.com/sigp/lighthouse/pull/2915), for which I think we should implement a database downgrade (from v9 to v8).
- My `tree-states` work, which already implements a downgrade (v10 to v8).
- Standalone purge commands like `lighthouse db purge-dht` per https://github.com/sigp/lighthouse/issues/2824.

## Additional Info

I updated the `strum` crate to 0.24.0, which necessitated some changes in the network code to remove calls to deprecated methods.

Thanks to @winksaville for the motivation, and implementation work that I used as a source of inspiration (https://github.com/sigp/lighthouse/pull/2685).
2022-04-01 00:58:59 +00:00
Michael Sproul
5e1f8a8480 Update to Rust 1.59 and 2021 edition (#3038)
## Proposed Changes

Lots of lint updates related to `flat_map`, `unwrap_or_else` and string patterns. I did a little more creative refactoring in the op pool, but otherwise followed Clippy's suggestions.

## Additional Info

We need this PR to unblock CI.
2022-02-25 00:10:17 +00:00
Michael Sproul
a290a3c537 Add configurable block replayer (#2863)
## Issue Addressed

Successor to #2431

## Proposed Changes

* Add a `BlockReplayer` struct to abstract over the intricacies of calling `per_slot_processing` and `per_block_processing` while avoiding unnecessary tree hashing.
* Add a variant of the forwards state root iterator that does not require an `end_state`.
* Use the `BlockReplayer` when reconstructing states in the database. Use the efficient forwards iterator for frozen states.
* Refactor the iterators to remove `Arc<HotColdDB>` (this seems to be neater than making _everything_ an `Arc<HotColdDB>` as I did in #2431).

Supplying the state roots allow us to avoid building a tree hash cache at all when reconstructing historic states, which saves around 1 second flat (regardless of `slots-per-restore-point`). This is a small percentage of worst-case state load times with 200K validators and SPRP=2048 (~15s vs ~16s) but a significant speed-up for more frequent restore points: state loads with SPRP=32 should be now consistently <500ms instead of 1.5s (a ~3x speedup).

## Additional Info

Required by https://github.com/sigp/lighthouse/pull/2628
2021-12-21 06:30:52 +00:00
realbigsean
b22ac95d7f v1.1.6 Fork Choice changes (#2822)
## Issue Addressed

Resolves: https://github.com/sigp/lighthouse/issues/2741
Includes: https://github.com/sigp/lighthouse/pull/2853 so that we can get ssz static tests passing here on v1.1.6. If we want to merge that first, we can make this diff slightly smaller

## Proposed Changes

- Changes the `justified_epoch` and `finalized_epoch` in the `ProtoArrayNode` each to an `Option<Checkpoint>`. The `Option` is necessary only for the migration, so not ideal. But does allow us to add a default logic to `None` on these fields during the database migration.
- Adds a database migration from a legacy fork choice struct to the new one, search for all necessary block roots in fork choice by iterating through blocks in the db.
- updates related to https://github.com/ethereum/consensus-specs/pull/2727
  -  We will have to update the persisted forkchoice to make sure the justified checkpoint stored is correct according to the updated fork choice logic. This boils down to setting the forkchoice store's justified checkpoint to the justified checkpoint of the block that advanced the finalized checkpoint to the current one. 
  - AFAICT there's no migration steps necessary for the update to allow applying attestations from prior blocks, but would appreciate confirmation on that
- I updated the consensus spec tests to v1.1.6 here, but they will fail until we also implement the proposer score boost updates. I confirmed that the previously failing scenario `new_finalized_slot_is_justified_checkpoint_ancestor` will now pass after the boost updates, but haven't confirmed _all_ tests will pass because I just quickly stubbed out the proposer boost test scenario formatting.
- This PR now also includes proposer boosting https://github.com/ethereum/consensus-specs/pull/2730

## Additional Info
I realized checking justified and finalized roots in fork choice makes it more likely that we trigger this bug: https://github.com/ethereum/consensus-specs/pull/2727

It's possible the combination of justified checkpoint and finalized checkpoint in the forkchoice store is different from in any block in fork choice. So when trying to startup our store's justified checkpoint seems invalid to the rest of fork choice (but it should be valid). When this happens we get an `InvalidBestNode` error and fail to start up. So I'm including that bugfix in this branch.

Todo:

- [x] Fix fork choice tests
- [x] Self review
- [x] Add fix for https://github.com/ethereum/consensus-specs/pull/2727
- [x] Rebase onto Kintusgi 
- [x] Fix `num_active_validators` calculation as @michaelsproul pointed out
- [x] Clean up db migrations

Co-authored-by: realbigsean <seananderson33@gmail.com>
2021-12-13 20:43:22 +00:00
Mark Mackey
5687c56d51
Initial merge changes
Added Execution Payload from Rayonism Fork

Updated new Containers to match Merge Spec

Updated BeaconBlockBody for Merge Spec

Completed updating BeaconState and BeaconBlockBody

Modified ExecutionPayload<T> to use Transaction<T>

Mostly Finished Changes for beacon-chain.md

Added some things for fork-choice.md

Update to match new fork-choice.md/fork.md changes

ran cargo fmt

Added Missing Pieces in eth2_libp2p for Merge

fix ef test

Various Changes to Conform Closer to Merge Spec
2021-12-02 14:26:50 +11:00
Michael Sproul
ed1fc7cca6 Fix I/O atomicity issues with checkpoint sync (#2671)
## Issue Addressed

This PR addresses an issue found by @YorickDowne during testing of v2.0.0-rc.0.

Due to a lack of atomic database writes on checkpoint sync start-up, it was possible for the database to get into an inconsistent state from which it couldn't recover without `--purge-db`. The core of the issue was that the store's anchor info was being stored _before_ the `PersistedBeaconChain`. If a crash occured so that anchor info was stored but _not_ the `PersistedBeaconChain`, then on restart Lighthouse would think the database was unitialized and attempt to compare-and-swap a `None` value, but would actually find the stale info from the previous run.

## Proposed Changes

The issue is fixed by writing the anchor info, the split point, and the `PersistedBeaconChain` atomically on start-up. Some type-hinting ugliness was required, which could possibly be cleaned up in future refactors.
2021-10-05 03:53:17 +00:00
Michael Sproul
9667dc2f03 Implement checkpoint sync (#2244)
## Issue Addressed

Closes #1891
Closes #1784

## Proposed Changes

Implement checkpoint sync for Lighthouse, enabling it to start from a weak subjectivity checkpoint.

## Additional Info

- [x] Return unavailable status for out-of-range blocks requested by peers (#2561)
- [x] Implement sync daemon for fetching historical blocks (#2561)
- [x] Verify chain hashes (either in `historical_blocks.rs` or the calling module)
- [x] Consistency check for initial block + state
- [x] Fetch the initial state and block from a beacon node HTTP endpoint
- [x] Don't crash fetching beacon states by slot from the API
- [x] Background service for state reconstruction, triggered by CLI flag or API call.

Considered out of scope for this PR:

- Drop the requirement to provide the `--checkpoint-block` (this would require some pretty heavy refactoring of block verification)


Co-authored-by: Diva M <divma@protonmail.com>
2021-09-22 00:37:28 +00:00
Michael Sproul
10945e0619 Revert bad blocks on missed fork (#2529)
## Issue Addressed

Closes #2526

## Proposed Changes

If the head block fails to decode on start up, do two things:

1. Revert all blocks between the head and the most recent hard fork (to `fork_slot - 1`).
2. Reset fork choice so that it contains the new head, and all blocks back to the new head's finalized checkpoint.

## Additional Info

I tweaked some of the beacon chain test harness stuff in order to make it generic enough to test with a non-zero slot clock on start-up. In the process I consolidated all the various `new_` methods into a single generic one which will hopefully serve all future uses 🤞
2021-08-30 06:41:31 +00:00
realbigsean
303deb9969 Rust 1.54.0 lints (#2483)
## Issue Addressed

N/A

## Proposed Changes

- Removing a bunch of unnecessary references
- Updated `Error::VariantError` to `Error::Variant`
- There were additional enum variant lints that I ignored, because I thought our variant names were fine
- removed `MonitoredValidator`'s `pubkey` field, because I couldn't find it used anywhere. It looks like we just use the string version of the pubkey (the `id` field) if there is no index

## Additional Info



Co-authored-by: realbigsean <seananderson33@gmail.com>
2021-07-30 01:11:47 +00:00
Michael Sproul
371c216ac3 Use read_recursive locks in database (#2417)
## Issue Addressed

Closes #2245

## Proposed Changes

Replace all calls to `RwLock::read` in the `store` crate with `RwLock::read_recursive`.

## Additional Info

* Unfortunately we can't run the deadlock detector on CI because it's pinned to an old Rust 1.51.0 nightly which cannot compile Lighthouse (one of our deps uses `ptr::addr_of!` which is too new). A fun side-project at some point might be to update the deadlock detector.
* The reason I think we haven't seen this deadlock (at all?) in practice is that _writes_ to the database's split point are quite infrequent, and a concurrent write is required to trigger the deadlock. The split point is only written when finalization advances, which is once per epoch (every ~6 minutes), and state reads are also quite sporadic. Perhaps we've just been incredibly lucky, or there's something about the timing of state reads vs database migration that protects us.
* I wrote a few small programs to demo the deadlock, and the effectiveness of the `read_recursive` fix: https://github.com/michaelsproul/relock_deadlock_mvp
* [The docs for `read_recursive`](https://docs.rs/lock_api/0.4.2/lock_api/struct.RwLock.html#method.read_recursive) warn of starvation for writers. I think in order for starvation to occur the database would have to be spammed with so many state reads that it's unable to ever clear them all and find time for a write, in which case migration of states to the freezer would cease. If an attack could be performed to trigger this starvation then it would likely trigger a deadlock in the current code, and I think ceasing migration is preferable to deadlocking in this extreme situation. In practice neither should occur due to protection from spammy peers at the network layer. Nevertheless, it would be prudent to run this change on the testnet nodes to check that it doesn't cause accidental starvation.
2021-07-12 07:31:26 +00:00
Michael Sproul
b4689e20c6 Altair consensus changes and refactors (#2279)
## Proposed Changes

Implement the consensus changes necessary for the upcoming Altair hard fork.

## Additional Info

This is quite a heavy refactor, with pivotal types like the `BeaconState` and `BeaconBlock` changing from structs to enums. This ripples through the whole codebase with field accesses changing to methods, e.g. `state.slot` => `state.slot()`.


Co-authored-by: realbigsean <seananderson33@gmail.com>
2021-07-09 06:15:32 +00:00
Mac L
406e3921d9 Use forwards iterator for state root lookups (#2422)
## Issue Addressed

#2377 

## Proposed Changes

Implement the same code used for block root lookups (from #2376) to state root lookups in order to improve performance and reduce associated memory spikes (e.g. from certain HTTP API requests).

## Additional Changes

- Tests using `rev_iter_state_roots` and `rev_iter_block_roots` have been refactored to use their `forwards` versions instead.
- The `rev_iter_state_roots` and `rev_iter_block_roots` functions are now unused and have been removed.
- The `state_at_slot` function has been changed to use the `forwards` iterator.

## Additional Info

- Some tests still need to be refactored to use their `forwards_iter` versions. These tests start their iteration from a specific beacon state and thus use the `rev_iter_state_roots_from` and `rev_iter_block_roots_from` functions. If they can be refactored, those functions can also be removed.
2021-07-06 02:38:53 +00:00
realbigsean
b84ff9f793 rust 1.53.0 updates (#2411)
## Issue Addressed

`make lint` failing on rust 1.53.0.

## Proposed Changes

1.53.0 updates

## Additional Info

I haven't figure out why yet, we were now hitting the recursion limit in a few crates. So I had to add `#![recursion_limit = "256"]` in a few places


Co-authored-by: realbigsean <seananderson33@gmail.com>
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
2021-06-18 05:58:01 +00:00
Paul Hauner
bf4e02e2cc Return a specific error for frozen attn states (#2384)
## Issue Addressed

NA

## Proposed Changes

Return a very specific error when at attestation reads shuffling from a frozen `BeaconState`. Previously, this was returning `MissingBeaconState` which indicates a much more serious issue.

## Additional Info

Since `get_inconsistent_state_for_attestation_verification_only` is only called once in `BeaconChain::with_committee_cache`, it is quite easy to reason about the impact of this change.
2021-06-01 06:59:43 +00:00
Michael Sproul
f9d60f5436 VC: accept unknown fields in chain spec (#2277)
## Issue Addressed

Closes #2274

## Proposed Changes

* Modify the `YamlConfig` to collect unknown fields into an `extra_fields` map, instead of failing hard.
* Log a debug message if there are extra fields returned to the VC from one of its BNs.

This restores Lighthouse's compatibility with Teku beacon nodes (and therefore Infura)
2021-03-26 04:53:57 +00:00
Paul Hauner
015ab7d0a7 Optimize validator duties (#2243)
## Issue Addressed

Closes #2052

## Proposed Changes

- Refactor the attester/proposer duties endpoints in the BN
    - Performance improvements
    - Fixes some potential inconsistencies with the dependent root fields.
    - Removes `http_api::beacon_proposer_cache` and just uses the one on the `BeaconChain` instead.
    - Move the code for the proposer/attester duties endpoints into separate files, for readability.
- Refactor the `DutiesService` in the VC
    - Required to reduce the delay on broadcasting new blocks.
    - Gets rid of the `ValidatorDuty` shim struct that came about when we adopted the standard API.
    - Separate block/attestation duty tasks so that they don't block each other when one is slow.
- In the VC, use `PublicKeyBytes` to represent validators instead of `PublicKey`. `PublicKey` is a legit crypto object whilst `PublicKeyBytes` is just a byte-array, it's much faster to clone/hash `PublicKeyBytes` and this change has had a significant impact on runtimes.
    - Unfortunately this has created lots of dust changes.
 - In the BN, store `PublicKeyBytes` in the `beacon_proposer_cache` and allow access to them. The HTTP API always sends `PublicKeyBytes` over the wire and the conversion from `PublicKey` -> `PublickeyBytes` is non-trivial, especially when queries have 100s/1000s of validators (like Pyrmont).
 - Add the `state_processing::state_advance` mod which dedups a lot of the "apply `n` skip slots to the state" code.
    - This also fixes a bug with some functions which were failing to include a state root as per [this comment](072695284f/consensus/state_processing/src/state_advance.rs (L69-L74)). I couldn't find any instance of this bug that resulted in anything more severe than keying a shuffling cache by the wrong block root.
 - Swap the VC block service to use `mpsc` from `tokio` instead of `futures`. This is consistent with the rest of the code base.
    
~~This PR *reduces* the size of the codebase 🎉~~ It *used* to reduce the size of the code base before I added more comments. 

## Observations on Prymont

- Proposer duties times down from peaks of 450ms to consistent <1ms.
- Current epoch attester duties times down from >1s peaks to a consistent 20-30ms.
- Block production down from +600ms to 100-200ms.

## Additional Info

- ~~Blocked on #2241~~
- ~~Blocked on #2234~~

## TODO

- [x] ~~Refactor this into some smaller PRs?~~ Leaving this as-is for now.
- [x] Address `per_slot_processing` roots.
- [x] Investigate slow next epoch times. Not getting added to cache on block processing?
- [x] Consider [this](072695284f/beacon_node/store/src/hot_cold_store.rs (L811-L812)) in the scenario of replacing the state roots


Co-authored-by: pawan <pawandhananjay@gmail.com>
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
2021-03-17 05:09:57 +00:00
Michael Sproul
363f15f362 Use the database to persist the pubkey cache (#2234)
## Issue Addressed

Closes #1787

## Proposed Changes

* Abstract the `ValidatorPubkeyCache` over a "backing" which is either a file (legacy), or the database.
* Implement a migration from schema v2 to schema v3, whereby the contents of the cache file are copied to the DB, and then the file is deleted. The next release to include this change must be a minor version bump, and we will need to warn users of the inability to downgrade (this is our first DB schema change since mainnet genesis).
* Move the schema migration code from the `store` crate into the `beacon_chain` crate so that it can access the datadir and the `ValidatorPubkeyCache`, etc. It gets injected back into the `store` via a closure (similar to what we do in fork choice).
2021-03-04 01:25:12 +00:00
realbigsean
7a71977987 Clippy 1.49.0 updates and dht persistence test fix (#2156)
## Issue Addressed

`test_dht_persistence` failing

## Proposed Changes

Bind `NetworkService::start` to an underscore prefixed variable rather than `_`.  `_` was causing it to be dropped immediately

This was failing 5/100 times before this update, but I haven't been able to get it to fail after updating it

Co-authored-by: realbigsean <seananderson33@gmail.com>
2021-01-19 00:34:28 +00:00
blacktemplar
d8cda2d86e Fix new clippy lints (#2036)
## Issue Addressed

NA

## Proposed Changes

Fixes new clippy lints in the whole project (mainly [manual_strip](https://rust-lang.github.io/rust-clippy/master/index.html#manual_strip) and [unnecessary_lazy_evaluations](https://rust-lang.github.io/rust-clippy/master/index.html#unnecessary_lazy_evaluations)). Furthermore, removes `to_string()` calls on literals when used with the `?`-operator.
2020-12-03 01:10:26 +00:00
Michael Sproul
a60ab4eff2 Refine compaction (#1916)
## Proposed Changes

In an attempt to fix OOM issues and database consistency issues observed by some users after the introduction of compaction in v0.3.4, this PR makes the following changes:

* Run compaction less often: roughly every 1024 epochs, including after long periods of non-finality. I think the division check proposed by Paul is pretty solid, and ensures we don't miss any events where we should be compacting. LevelDB lacks an easy way to check the size of the DB, which would be another good trigger.
* Make it possible to disable the compaction on finalization using `--auto-compact-db=false`
* Make it possible to trigger a manual, single-threaded foreground compaction on start-up using `--compact-db`
* Downgrade the pruning log to `DEBUG`, as it's particularly noisy during sync

I would like to ship these changes to affected users ASAP, and will document them further in the Advanced Database section of the book if they prove effective.
2020-11-17 09:10:53 +00:00
Michael Sproul
556190ff46 Compact database on finalization (#1871)
## Issue Addressed

Closes #1866

## Proposed Changes

* Compact the database on finalization. This removes the deleted states from disk completely. Because it happens in the background migrator, it doesn't block other database operations while it runs. On my Medalla node it took about 1 minute and shrank the database from 90GB to 9GB.
* Fix an inefficiency in the pruning algorithm where it would always use the genesis checkpoint as the `old_finalized_checkpoint` when running for the first time after start-up. This would result in loading lots of states one-at-a-time back to genesis, and storing a lot of block roots in memory. The new code stores the old finalized checkpoint on disk and only uses genesis if no checkpoint is already stored. This makes it both backwards compatible _and_ forwards compatible -- no schema change required!
* Introduce two new `INFO` logs to indicate when pruning has started and completed. Users seem to want to know this information without enabling debug logs!
2020-11-09 07:02:21 +00:00
Michael Sproul
acd49d988d Implement database temp states to reduce memory usage (#1798)
## Issue Addressed

Closes #800
Closes #1713

## Proposed Changes

Implement the temporary state storage algorithm described in #800. Specifically:

* Add `DBColumn::BeaconStateTemporary`, for storing 0-length temporary marker values.
* Store intermediate states immediately as they are created, marked temporary. Delete the temporary flag if the block is processed successfully.
* Add a garbage collection process to delete leftover temporary states on start-up.
* Bump the database schema version to 2 so that a DB with temporary states can't accidentally be used with older versions of the software. The auto-migration is a no-op, but puts in place some infra that we can use for future migrations (e.g. #1784)

## Additional Info

There are two known race conditions, one potentially causing permanent faults (hopefully rare), and the other insignificant.

### Race 1: Permanent state marked temporary

EDIT: this has been fixed by the addition of a lock around the relevant critical section

There are 2 threads that are trying to store 2 different blocks that share some intermediate states (e.g. they both skip some slots from the current head). Consider this sequence of events:

1. Thread 1 checks if state `s` already exists, and seeing that it doesn't, prepares an atomic commit of `(s, s_temporary_flag)`.
2. Thread 2 does the same, but also gets as far as committing the state txn, finishing the processing of its block, and _deleting_ the temporary flag.
3. Thread 1 is (finally) scheduled again, and marks `s` as temporary with its transaction.
4.
    a) The process is killed, or thread 1's block fails verification and the temp flag is not deleted. This is a permanent failure! Any attempt to load state `s` will fail... hope it isn't on the main chain! Alternatively (4b) happens...
    b) Thread 1 finishes, and re-deletes the temporary flag. In this case the failure is transient, state `s` will disappear temporarily, but will come back once thread 1 finishes running.

I _hope_ that steps 1-3 only happen very rarely, and 4a even more rarely. It's hard to know

This once again begs the question of why we're using LevelDB (#483), when it clearly doesn't care about atomicity! A ham-fisted fix would be to wrap the hot and cold DBs in locks, which would bring us closer to how other DBs handle read-write transactions. E.g. [LMDB only allows one R/W transaction at a time](https://docs.rs/lmdb/0.8.0/lmdb/struct.Environment.html#method.begin_rw_txn).

### Race 2: Temporary state returned from `get_state`

I don't think this race really matters, but in `load_hot_state`, if another thread stores a state between when we call `load_state_temporary_flag` and when we call `load_hot_state_summary`, then we could end up returning that state even though it's only a temporary state. I can't think of any case where this would be relevant, and I suspect if it did come up, it would be safe/recoverable (having data is safer than _not_ having data).

This could be fixed by using a LevelDB read snapshot, but that would require substantial changes to how we read all our values, so I don't think it's worth it right now.
2020-10-23 01:27:51 +00:00
Michael Sproul
703c33bdc7 Fix head tracker concurrency bugs (#1771)
## Issue Addressed

Closes #1557

## Proposed Changes

Modify the pruning algorithm so that it mutates the head-tracker _before_ committing the database transaction to disk, and _only if_ all the heads to be removed are still present in the head-tracker (i.e. no concurrent mutations).

In the process of writing and testing this I also had to make a few other changes:

* Use internal mutability for all `BeaconChainHarness` functions (namely the RNG and the graffiti), in order to enable parallel calls (see testing section below).
* Disable logging in harness tests unless the `test_logger` feature is turned on

And chose to make some clean-ups:

* Delete the `NullMigrator`
* Remove type-based configuration for the migrator in favour of runtime config (simpler, less duplicated code)
* Use the non-blocking migrator unless the blocking migrator is required. In the store tests we need the blocking migrator because some tests make asserts about the state of the DB after the migration has run.
* Rename `validators_keypairs` -> `validator_keypairs` in the `BeaconChainHarness`

## Testing

To confirm that the fix worked, I wrote a test using [Hiatus](https://crates.io/crates/hiatus), which can be found here:

https://github.com/michaelsproul/lighthouse/tree/hiatus-issue-1557

That test can't be merged because it inserts random breakpoints everywhere, but if you check out that branch you can run the test with:

```
$ cd beacon_node/beacon_chain
$ cargo test --release --test parallel_tests --features test_logger
```

It should pass, and the log output should show:

```
WARN Pruning deferred because of a concurrent mutation, message: this is expected only very rarely!
```

## Additional Info

This is a backwards-compatible change with no impact on consensus.
2020-10-19 05:58:39 +00:00
Michael Sproul
22aedda1be
Add database schema versioning (#1688)
## Issue Addressed

Closes #673

## Proposed Changes

Store a schema version in the database so that future releases can check they're running against a compatible database version. This would also enable automatic migration on breaking database changes, but that's left as future work.

The database config is also stored in the database so that the `slots_per_restore_point` value can be checked for consistency, which closes #673
2020-10-01 11:12:36 +10:00
Adam Szkoda
d9f4819fe0 Alternative (to BeaconChainHarness) BeaconChain testing API (#1380)
The PR:

* Adds the ability to generate a crucial test scenario that isn't possible with `BeaconChainHarness` (i.e. two blocks occupying the same slot; previously forks necessitated skipping slots):

![image](https://user-images.githubusercontent.com/165678/88195404-4bce3580-cc40-11ea-8c08-b48d2e1d5959.png)

* New testing API: Instead of repeatedly calling add_block(), you generate a sorted `Vec<Slot>` and leave it up to the framework to generate blocks at those slots.
* Jumping backwards to an earlier epoch is a hard error, so that tests necessarily generate blocks in a epoch-by-epoch manner.
* Configures the test logger so that output is printed on the console in case a test fails.  The logger also plays well with `--nocapture`, contrary to the existing testing framework
* Rewrites existing fork pruning tests to use the new API
* Adds a tests that triggers finalization at a non epoch boundary slot
* Renamed `BeaconChainYoke` to `BeaconChainTestingRig` because the former has been too confusing
* Fixed multiple tests (e.g. `block_production_different_shuffling_long`, `delete_blocks_and_states`, `shuffling_compatible_simple_fork`) that relied on a weird (and accidental) feature of the old `BeaconChainHarness` that attestations aren't produced for epochs earlier than the current one, thus masking potential bugs in test cases.

Co-authored-by: Michael Sproul <michael@sigmaprime.io>
2020-08-26 09:24:55 +00:00
Michael Sproul
4763f03dcc Fix bug in database pruning (#1564)
## Issue Addressed

Closes #1488

## Proposed Changes

* Prevent the pruning algorithm from over-eagerly deleting states at skipped slots when they are shared with the canonical chain.
* Add `debug` logging to the pruning algorithm so we have so better chance of debugging future issues from logs.
* Modify the handling of the "finalized state" in the beacon chain, so that it's always the state at the first slot of the finalized epoch (previously it was the state at the finalized block). This gives database pruning a clearer and cleaner view of things, and will marginally impact the pruning of the op pool, observed proposers, etc (in ways that are safe as far as I can tell).
* Remove duplicated `RevertedFinalizedEpoch` check from `after_finalization`
* Delete useless and unused `max_finality_distance`
* Add tests that exercise pruning with shared states at skip slots
* Delete unnecessary `block_strategy` argument from `add_blocks` and friends in the test harness (will likely conflict with #1380 slightly, sorry @adaszko -- but we can fix that)
* Bonus: add a `BeaconChain::with_head` method. I didn't end up needing it, but it turned out quite nice, so I figured we could keep it?

## Additional Info

Any users who have experienced pruning errors on Medalla will need to resync after upgrading to a release including this change. This should end unbounded `chain_db` growth! 🎉
2020-08-26 00:01:06 +00:00
Paul Hauner
61d5b592cb Memory usage reduction (#1522)
## Issue Addressed

NA

## Proposed Changes

- Adds a new function to allow getting a state with a bad state root history for attestation verification. This reduces unnecessary tree hashing during attestation processing, which accounted for 23% of memory allocations (by bytes) in a recent `heaptrack` observation.
- Don't clone caches on intermediate epoch-boundary states during block processing.
- Reject blocks that are known to fork choice earlier during gossip processing, instead of waiting until after state has been loaded (this only happens in edge-case).
- Avoid multiple re-allocations by creating a "forced" exact size iterator.

## Additional Info

NA
2020-08-17 08:05:13 +00:00
Adam Szkoda
8a1a4051cf Fix a bug in fork pruning (#1507)
Extracted from https://github.com/sigp/lighthouse/pull/1380 because merging #1380 proves to be contentious.

Co-authored-by: Michael Sproul <michael@sigmaprime.io>
2020-08-12 07:00:00 +00:00
Adam Szkoda
c7f47af9fb
Harden the freezing procedure against failures (#1323)
* Enable logging in tests

* Migrate states to the freezer atomically
2020-07-03 09:47:31 +10:00
Adam Szkoda
536728b975
Write new blocks and states to the database atomically (#1285)
* Mostly atomic put_state()
* Reduce number of vec allocations
* Make crucial db operations atomic
* Save restore points
* Remove StateBatch
* Merge two HotColdDB impls
* Further reduce allocations
* Review feedback
* Silence clippy warning
2020-07-01 12:45:57 +10:00
Michael Sproul
81c9fe3817
Apply store refactor to new fork choice 2020-06-17 15:20:44 +10:00
Adam Szkoda
9db0c28051
Make key value storage abstractions more accurate (#1267)
* Layer do_atomically() abstractions properly

* Reduce allocs and DRY get_key_for_col()

* Parameterize HotColdDB with hot and cold item stores

* -impl Store for MemoryStore

* Replace Store uses with HotColdDB

* Ditch Store trait

* cargo fmt

* Style fix

* Readd missing dep that broke the build
2020-06-16 11:34:04 +10:00
Adam Szkoda
7f036a6e95
Add error handling to iterators (#1243)
* Add error handling to iterators

* Review feedback

* Leverage itertools::process_results() in few places
2020-06-10 09:55:44 +10:00
Adam Szkoda
91cb14ac41
Clean up database abstractions (#1200)
* Remove redundant method

* Pull out a method out of a struct

* More precise db access abstractions

* Move fake trait method out of it

* cargo fmt

* Fix compilation error after refactoring

* Move another fake method out the Store trait

* Get rid of superfluous method

* Fix refactoring bug

* Rename: SimpleStoreItem -> StoreItem

* Get rid of the confusing DiskStore type alias

* Get rid of SimpleDiskStore type alias

* Correction: A method took both self and a ref to Self
2020-06-01 08:13:49 +10:00
Adam Szkoda
919c81fe7d
Ditch StoreItem trait (#1185) 2020-05-25 10:26:54 +10:00
Adam Szkoda
59ead67f76
Race condition fix + Reliability improvements around forks pruning (#1132)
* Improve error handling in block iteration

* Introduce atomic DB operations

* Fix race condition

An invariant was violated:  For every block hash in head_tracker, that
block is accessible from the store.
2020-05-16 13:23:32 +10:00
Adam Szkoda
9c3f76a33b
Prune abandoned forks (#916)
* Address compiler warning

* Prune abandoned fork choice forks

* New approach to pruning

* Wrap some block hashes in a newtype pattern

For increased type safety.

* Add Graphviz chain dump emitter for debugging

* Fix broken test case

* Make prunes_abandoned_forks use real DiskStore

* Mark finalized blocks in the GraphViz output

* Refine debug stringification of Slot and Epoch

Before this commit: print!("{:?}", Slot(123)) == "Slot(\n123\n)".
After this commit: print!("{:?", Slot(123)) == "Slot(123)".

* Simplify build_block()

* Rewrite test case using more composable test primitives

* Working rewritten test case

* Tighten fork prunning test checks

* Add another pruning test case

* Bugfix: Finalized blocks weren't always properly detected

* Pruning: Add pruning_does_not_touch_blocks_prior_to_finalization test case

* Tighten pruning tests: check if heads are tracked properly

* Add a failing test case for a buggy scenario

* Change name of function to a more accurate one

* Fix failing test case

* Test case: Were skipped slots' states pruned?

* Style fix: Simplify dereferencing

* Tighten pruning tests: check if abandoned states are deleted

* Towards atomicity of db ops

* Correct typo

* Prune also skipped slots' states

* New logic for handling skipped states

* Make skipped slots test pass

* Post conflict resolution fixes

* Formatting fixes

* Tests passing

* Block hashes in Graphviz node labels

* Removed unused changes

* Fix bug with states having < SlotsPerHistoricalRoot roots

* Consolidate State/BlockRootsIterator for pruning

* Address review feedback

* Fix a bug in pruning tests

* Detach prune_abandoned_forks() from its object

* Move migrate.rs from store to beacon_chain

* Move forks pruning onto a background thread

* Bugfix: Heads weren't pruned when prune set contained only the head

* Rename: freeze_to_state() -> process_finalization()

* Eliminate redundant function parameter

Co-authored-by: Michael Sproul <michael@sigmaprime.io>
2020-04-20 19:59:56 +10:00
Paul Hauner
2fb6b7c793
Add no-copy block processing cache (#863)
* Add state cache, remove store cache

* Only build the head committee cache

* Fix compile error

* Fix compile error from merge

* Rename state_cache -> checkpoint_cache

* Rename Checkpoint -> Snapshot

* Tidy, add comments

* Tidy up find_head function

* Change some checkpoint -> snapshot

* Add tests

* Expose max_len

* Remove dead code

* Tidy

* Fix bug
2020-04-06 10:53:33 +10:00
Michael Sproul
1f16d8fe4d
Add methods to delete blocks and states from disk (#843)
Closes #833
2020-03-04 16:48:35 +11:00
Paul Hauner
fbb630793e
Attempt to remove a tree hash from block replaying (#862)
* Attempt to remove a tree hash from block replaying

* Add missed thing
2020-03-02 13:40:58 +11:00
Michael Sproul
371e5adcf8
Update to Spec v0.10 (#817)
* Start updating types

* WIP

* Signature hacking

* Existing EF tests passing with fake_crypto

* Updates

* Delete outdated API spec

* The refactor continues

* It compiles

* WIP test fixes

* All release tests passing bar genesis state parsing

* Update and test YamlConfig

* Update to spec v0.10 compatible BLS

* Updates to BLS EF tests

* Add EF test for AggregateVerify

And delete unused hash2curve tests for uncompressed points

* Update EF tests to v0.10.1

* Use optional block root correctly in block proc

* Use genesis fork in deposit domain. All tests pass

* Cargo fmt

* Fast aggregate verify test

* Update REST API docs

* Cargo fmt

* Fix unused import

* Bump spec tags to v0.10.1

* Add `seconds_per_eth1_block` to chainspec

* Update to timestamp based eth1 voting scheme

* Return None from `get_votes_to_consider` if block cache is empty

* Handle overflows in `is_candidate_block`

* Revert to failing tests

* Fix eth1 data sets test

* Choose default vote according to spec

* Fix collect_valid_votes tests

* Fix `get_votes_to_consider` to choose all eligible blocks

* Uncomment winning_vote tests

* Add comments; remove unused code

* Reduce seconds_per_eth1_block for simulation

* Addressed review comments

* Add test for default vote case

* Fix logs

* Remove unused functions

* Meter default eth1 votes

* Fix comments

* Address review comments; remove unused dependency

* Disable/delete two outdated tests

* Bump eth1 default vote warn to error

* Delete outdated eth1 test

Co-authored-by: Pawan Dhananjay <pawandhananjay@gmail.com>
2020-02-11 10:19:36 +11:00
Michael Sproul
e0b9fa599f
Add LRU cache to database (#837)
* Add LRU caches to store

* Improvements to LRU caches

* Take state by value in `Store::put_state`

* Store blocks by value, configurable cache sizes

* Use a StateBatch to efficiently store skip states

* Fix store tests

* Add CloneConfig test, remove unused metrics

* Use Mutexes instead of RwLocks for LRU caches
2020-02-10 11:30:21 +11:00
pscott
7396cd2cab Fix clippy warnings (#813)
* Clippy account manager

* Clippy account_manager

* Clippy beacon_node/beacon_chain

* Clippy beacon_node/client

* Clippy beacon_node/eth1

* Clippy beacon_node/eth2-libp2p

* Clippy beacon_node/genesis

* Clippy beacon_node/network

* Clippy beacon_node/rest_api

* Clippy beacon_node/src

* Clippy beacon_node/store

* Clippy eth2/lmd_ghost

* Clippy eth2/operation_pool

* Clippy eth2/state_processing

* Clippy eth2/types

* Clippy eth2/utils/bls

* Clippy eth2/utils/cahced_tree_hash

* Clippy eth2/utils/deposit_contract

* Clippy eth2/utils/eth2_interop_keypairs

* Clippy eth2/utils/eth2_testnet_config

* Clippy eth2/utils/lighthouse_metrics

* Clippy eth2/utils/ssz

* Clippy eth2/utils/ssz_types

* Clippy eth2/utils/tree_hash_derive

* Clippy lcli

* Clippy tests/beacon_chain_sim

* Clippy validator_client

* Cargo fmt
2020-01-21 18:38:56 +11:00
Michael Sproul
95fc840e2c
Fix off-by-one error in get_latest_restore_point (#787)
* Fix off-by-one error in get_latest_restore_point

* Tighten SPRP checks for succinct hot DB change
2020-01-09 21:05:56 +11:00
Michael Sproul
d9e9c17d3b
Avoid building caches during block replay (#783)
Also, make the ExitCache safe.
2020-01-09 11:43:11 +11:00
Michael Sproul
f36a5a15d6
Store states efficiently in the hot database (#746)
* Sparse hot DB and block root tree

* Fix store_tests

* Ensure loads of hot states on boundaries are fast

* Milder error for unaligned finalized blocks
2020-01-08 13:58:01 +11:00
Paul Hauner
647034b637
Optimization: avoid recomputing known state roots (#762)
* Start adding optimization

* Add temp fix for protobuf issue

* Fix compile errors

* Fix protobuf import
2020-01-03 15:09:00 +11:00
Paul Hauner
d4b28d48f8
Remove some dbg, println (#675) 2019-12-07 07:29:20 +11:00