## Proposed Changes
Improve the payload pruning feature in several ways:
- Payload pruning is now entirely optional. It is enabled by default but can be disabled with `--prune-payloads false`. The previous `--prune-payloads-on-startup` flag from #3565 is removed.
- Initial payload pruning on startup now runs in a background thread. This thread will always load the split state, which is a small fraction of its total work (up to ~300ms) and then backtrack from that state. This pruning process ran in 2m5s on one Prater node with good I/O and 16m on a node with slower I/O.
- To work with the optional payload pruning the database function `try_load_full_block` will now attempt to load execution payloads for finalized slots _if_ pruning is currently disabled. This gives users an opt-out for the extensive traffic between the CL and EL for reconstructing payloads.
## Additional Info
If the `prune-payloads` flag is toggled on and off then the on-startup check may not see any payloads to delete and fail to clean them up. In this case the `lighthouse db prune_payloads` command should be used to force a manual sweep of the database.
## Issue Addressed
Closes https://github.com/sigp/lighthouse/issues/3556
## Proposed Changes
Delete finalized execution payloads from the database in two places:
1. When running the finalization migration in `migrate_database`. We delete the finalized payloads between the last split point and the new updated split point. _If_ payloads are already pruned prior to this then this is sufficient to prune _all_ payloads as non-canonical payloads are already deleted by the head pruner, and all canonical payloads prior to the previous split will already have been pruned.
2. To address the fact that users will update to this code _after_ the merge on mainnet (and testnets), we need a one-off scan to delete the finalized payloads from the canonical chain. This is implemented in `try_prune_execution_payloads` which runs on startup and scans the chain back to the Bellatrix fork or the anchor slot (if checkpoint synced after Bellatrix). In the case where payloads are already pruned this check only imposes a single state load for the split state, which shouldn't be _too slow_. Even so, a flag `--prepare-payloads-on-startup=false` is provided to turn this off after it has run the first time, which provides faster start-up times.
There is also a new `lighthouse db prune_payloads` subcommand for users who prefer to run the pruning manually.
## Additional Info
The tests have been updated to not rely on finalized payloads in the database, instead using the `MockExecutionLayer` to reconstruct them. Additionally a check was added to `check_chain_dump` which asserts the non-existence or existence of payloads on disk depending on their slot.
## Proposed Changes
Increase the default `--slots-per-restore-point` to 8192 for a 4x reduction in freezer DB disk usage.
Existing nodes that use the previous default of 2048 will be left unchanged. Newly synced nodes (with or without checkpoint sync) will use the new 8192 default.
Long-term we could do away with the freezer DB entirely for validator-only nodes, but this change is much simpler and grants us some extra space in the short term. We can also roll it out gradually across our nodes by purging databases one by one, while keeping the Ansible config the same.
## Additional Info
We ignore a change from 2048 to 8192 if the user hasn't set the 8192 explicitly. We fire a debug log in the case where we do ignore:
```
DEBG Ignoring slots-per-restore-point config in favour of on-disk value, on_disk: 2048, config: 8192
```
## Issue Addressed
Closes#1891Closes#1784
## Proposed Changes
Implement checkpoint sync for Lighthouse, enabling it to start from a weak subjectivity checkpoint.
## Additional Info
- [x] Return unavailable status for out-of-range blocks requested by peers (#2561)
- [x] Implement sync daemon for fetching historical blocks (#2561)
- [x] Verify chain hashes (either in `historical_blocks.rs` or the calling module)
- [x] Consistency check for initial block + state
- [x] Fetch the initial state and block from a beacon node HTTP endpoint
- [x] Don't crash fetching beacon states by slot from the API
- [x] Background service for state reconstruction, triggered by CLI flag or API call.
Considered out of scope for this PR:
- Drop the requirement to provide the `--checkpoint-block` (this would require some pretty heavy refactoring of block verification)
Co-authored-by: Diva M <divma@protonmail.com>
## Proposed Changes
In an attempt to fix OOM issues and database consistency issues observed by some users after the introduction of compaction in v0.3.4, this PR makes the following changes:
* Run compaction less often: roughly every 1024 epochs, including after long periods of non-finality. I think the division check proposed by Paul is pretty solid, and ensures we don't miss any events where we should be compacting. LevelDB lacks an easy way to check the size of the DB, which would be another good trigger.
* Make it possible to disable the compaction on finalization using `--auto-compact-db=false`
* Make it possible to trigger a manual, single-threaded foreground compaction on start-up using `--compact-db`
* Downgrade the pruning log to `DEBUG`, as it's particularly noisy during sync
I would like to ship these changes to affected users ASAP, and will document them further in the Advanced Database section of the book if they prove effective.
## Issue Addressed
Closes#673
## Proposed Changes
Store a schema version in the database so that future releases can check they're running against a compatible database version. This would also enable automatic migration on breaking database changes, but that's left as future work.
The database config is also stored in the database so that the `slots_per_restore_point` value can be checked for consistency, which closes#673
* Add LRU caches to store
* Improvements to LRU caches
* Take state by value in `Store::put_state`
* Store blocks by value, configurable cache sizes
* Use a StateBatch to efficiently store skip states
* Fix store tests
* Add CloneConfig test, remove unused metrics
* Use Mutexes instead of RwLocks for LRU caches
This should reduce disk usage by 32x while keeping historical state queries to
less than 10s. If historical states are required quickly, the minimum SPRP of 32
can be set on the CLI.
* Draft of checkpoint freezer DB
* Fix bugs
* Adjust root iterators for checkpoint database
* Fix freezer state lookups with no slot hint
* Fix split comment
* Use "restore point" to refer to frozen states
* Resolve some FIXMEs
* Configurable slots per restore point
* Document new freezer DB functions
* Fix up StoreConfig
* Fix new test for merge
* Document SPRP default CLI flag, clarify tests