## Issue Addressed
Fixes a bug in the handling of `--beacon-process-max-workers` which caused it to have no effect.
## Proposed Changes
For this PR I channeled @ethDreamer and saw deep into the faulty CLI config -- this bug is almost identical to the one Mark found and fixed in #4622.
`parent_finalized.epoch + 1 > block_epoch` will never be `true` since as the comment says:
```
A block in epoch `N` cannot contain attestations which would finalize an epoch higher than `N - 1`.
```
## Issue Addressed
Closes#3210Closes#3211
## Proposed Changes
- Checkpoint sync from the latest finalized state regardless of its alignment.
- Add the `block_root` to the database's split point. This is _only_ added to the in-memory split in order to avoid a schema migration. See `load_split`.
- Add a new method to the DB called `get_advanced_state`, which looks up a state _by block root_, with a `state_root` as fallback. Using this method prevents accidental accesses of the split's unadvanced state, which does not exist in the hot DB and is not guaranteed to exist in the freezer DB at all. Previously Lighthouse would look up this state _from the freezer DB_, even if it was required for block/attestation processing, which was suboptimal.
- Replace several state look-ups in block and attestation processing with `get_advanced_state` so that they can't hit the split block's unadvanced state.
- Do not store any states in the freezer database by default. All states will be deleted upon being evicted from the hot database unless `--reconstruct-historic-states` is set. The anchor info which was previously used for checkpoint sync is used to implement this, including when syncing from genesis.
## Additional Info
Needs further testing. I want to stress-test the pruned database under Hydra.
The `get_advanced_state` method is intended to become more relevant over time: `tree-states` includes an identically named method that returns advanced states from its in-memory cache.
Co-authored-by: realbigsean <seananderson33@gmail.com>
## Proposed Changes
This PR updates `blst` to 0.3.11, which gives us _runtime detection of CPU features_ 🎉
Although [performance benchmarks](https://gist.github.com/michaelsproul/f759fa28dfa4003962507db34b439d6c) don't show a substantial detriment to running the `portable` build vs `modern`, in order to take things slowly I propose the following roll-out strategy:
- Keep both `modern` and `portable` builds for releases/Docker images.
- Run the `portable` build on half of SigP's infrastructure to monitor for performance deficits.
- Listen out for user issues with the `portable` builds (e.g. SIGILLs from misdetected hardware).
- Make the `portable` build the default and remove the `modern` build from our release binaries & Docker images.
It seems `post_validator_duties_sync` is the only api which doesn't have its own metric in `duties_service`, this PR adds `metrics::VALIDATOR_DUTIES_SYNC_HTTP_POST` for completeness.
Since `tolerant_current_epoch` is expected to be either `current_epoch` or `current_epoch+1`, we can eliminate a case here.
And added a comment about `compute_historic_attester_duties` , since `RelativeEpoch::from_epoch` will only allow `request_epoch == current_epoch-1` when `request_epoch < current_epoch`.
## Issue Addressed
Closes#4245
## Proposed Changes
- If an SSE channel fills up, send a comment instead of terminating the stream.
- Add a CLI flag for scaling up the SSE buffer: `--http-sse-capacity-multiplier N`.
## Additional Info
~~Blocked on #4462. I haven't rebased on that PR yet for initial testing, because it still needs some more work to handle long-running HTTP threads.~~
- [x] Add CLI flag tests.
## Issue Addressed
The feature flag used to control this feature is `disable_backfill` instead of `disable-backfill`.
kudos to @michaelsproul for discovering this bug!
## Issue Addressed
Closes#3404 (mostly)
## Proposed Changes
- Remove all uses of Warp's `and_then` (which backtracks) in favour of `then` (which doesn't).
- Bump the priority of the `POST` method for `v2/blocks` to `P0`. Publishing a block needs to happen quickly.
- Run the new SSZ POST endpoints on the beacon processor. I think this was missed in between merging #4462 and #4504/#4479.
- Fix a minor issue in the validator registrations endpoint whereby an error from spawning the task on the beacon processor would be dropped.
## Additional Info
I've tested this manually and can confirm that we no longer get the dreaded `Unsupported endpoint version` errors for queries like:
```
$ curl -X POST -H "Content-Type: application/json" --data @block.json "http://localhost:5052/eth/v2/beacon/blocks" | jq
{
"code": 400,
"message": "BAD_REQUEST: WeakSubjectivityConflict",
"stacktraces": []
}
```
```
$ curl -X POST -H "Content-Type: application/octet-stream" --data @block.json "http://localhost:5052/eth/v2/beacon/blocks" | jq
{
"code": 400,
"message": "BAD_REQUEST: invalid SSZ: OffsetOutOfBounds(572530811)",
"stacktraces": []
}
```
```
$ curl "http://localhost:5052/eth/v2/validator/blocks/7067595"
{"code":400,"message":"BAD_REQUEST: invalid query: Invalid query string","stacktraces":[]}
```
However, I can still trigger it by leaving off the `Content-Type`. We can re-test this aspect with #4575.
## Issue Addressed
#4538
## Proposed Changes
add newtype wrapper around DialError that extracts error messages and logs them in a more readable format
## Additional Info
I was able to test Transport Dial Errors in the situation where a libp2p instance attempts to ping a nonexistent peer. That error message should look something like
`A transport level error has ocurred: Connection refused (os error 61)`
AgeManning mentioned we should try fetching only the most inner error (in situations where theres a nested error). I took a stab at implementing that
For non transport DialErrors, I wrote out the error messages explicitly (as per the docs). Could potentially clean things up here if thats not necessary
Co-authored-by: Age Manning <Age@AgeManning.com>
## Issue Addressed
NA
## Proposed Changes
Rather than spawning new tasks on the tokio executor to process each HTTP API request, send the tasks to the `BeaconProcessor`. This achieves:
1. Places a bound on how many concurrent requests are being served (i.e., how many we are actually trying to compute at one time).
1. Places a bound on how many requests can be awaiting a response at one time (i.e., starts dropping requests when we have too many queued).
1. Allows the BN prioritise HTTP requests with respect to messages coming from the P2P network (i.e., proiritise importing gossip blocks rather than serving API requests).
Presently there are two levels of priorities:
- `Priority::P0`
- The beacon processor will prioritise these above everything other than importing new blocks.
- Roughly all validator-sensitive endpoints.
- `Priority::P1`
- The beacon processor will prioritise practically all other P2P messages over these, except for historical backfill things.
- Everything that's not `Priority::P0`
The `--http-enable-beacon-processor false` flag can be supplied to revert back to the old behaviour of spawning new `tokio` tasks for each request:
```
--http-enable-beacon-processor <BOOLEAN>
The beacon processor is a scheduler which provides quality-of-service and DoS protection. When set to
"true", HTTP API requests will queued and scheduled alongside other tasks. When set to "false", HTTP API
responses will be executed immediately. [default: true]
```
## New CLI Flags
I added some other new CLI flags:
```
--beacon-processor-aggregate-batch-size <INTEGER>
Specifies the number of gossip aggregate attestations in a signature verification batch. Higher values may
reduce CPU usage in a healthy network while lower values may increase CPU usage in an unhealthy or hostile
network. [default: 64]
--beacon-processor-attestation-batch-size <INTEGER>
Specifies the number of gossip attestations in a signature verification batch. Higher values may reduce CPU
usage in a healthy network whilst lower values may increase CPU usage in an unhealthy or hostile network.
[default: 64]
--beacon-processor-max-workers <INTEGER>
Specifies the maximum concurrent tasks for the task scheduler. Increasing this value may increase resource
consumption. Reducing the value may result in decreased resource usage and diminished performance. The
default value is the number of logical CPU cores on the host.
--beacon-processor-reprocess-queue-len <INTEGER>
Specifies the length of the queue for messages requiring delayed processing. Higher values may prevent
messages from being dropped while lower values may help protect the node from becoming overwhelmed.
[default: 12288]
```
I needed to add the max-workers flag since the "simulator" flavor tests started failing with HTTP timeouts on the test assertions. I believe they were failing because the Github runners only have 2 cores and there just weren't enough workers available to process our requests in time. I added the other flags since they seem fun to fiddle with.
## Additional Info
I bumped the timeouts on the "simulator" flavor test from 4s to 8s. The prioritisation of consensus messages seems to be causing slower responses, I guess this is what we signed up for 🤷
The `validator/register` validator has some special handling because the relays have a bad habit of timing out on these calls. It seems like a waste of a `BeaconProcessor` worker to just wait for the builder API HTTP response, so we spawn a new `tokio` task to wait for a builder response.
I've added an optimisation for the `GET beacon/states/{state_id}/validators/{validator_id}` endpoint in [efbabe3](efbabe3252). That's the endpoint the VC uses to resolve pubkeys to validator indices, and it's the endpoint that was causing us grief. Perhaps I should move that into a new PR, not sure.
## Issue Addressed
Addresses #2557
## Proposed Changes
Adds the `lighthouse validator-manager` command, which provides:
- `lighthouse validator-manager create`
- Creates a `validators.json` file and a `deposits.json` (same format as https://github.com/ethereum/staking-deposit-cli)
- `lighthouse validator-manager import`
- Imports validators from a `validators.json` file to the VC via the HTTP API.
- `lighthouse validator-manager move`
- Moves validators from one VC to the other, utilizing only the VC API.
## Additional Info
In 98bcb947c I've reduced some VC `ERRO` and `CRIT` warnings to `WARN` or `DEBG` for the case where a pubkey is missing from the validator store. These were being triggered when we removed a validator but still had it in caches. It seems to me that `UnknownPubkey` will only happen in the case where we've removed a validator, so downgrading the logs is prudent. All the logs are `DEBG` apart from attestations and blocks which are `WARN`. I thought having *some* logging about this condition might help us down the track.
In 856cd7e37d I've made the VC delete the corresponding password file when it's deleting a keystore. This seemed like nice hygiene. Notably, it'll only delete that password file after it scans the validator definitions and finds that no other validator is also using that password file.
## Issue Addressed
NA
## Proposed Changes
We've been seeing a lot of [CI failures](https://github.com/sigp/lighthouse/actions/runs/5781296217/job/15666209142) with errors like this:
```
---- extra_interchange_tests::export_same_key_twice stdout ----
thread 'extra_interchange_tests::export_same_key_twice' panicked at 'called `Result::unwrap()` on an `Err` value: SQLError("Unable to open database: Error(None)")', validator_client/slashing_protection/src/extra_interchange_tests.rs:48:67
```
I'm assuming they're timeouts. I noticed that tests have a 0.1s timeout. Perhaps this just doesn't cut it when our new runners are overloaded.
## Additional Info
NA
**Motivation**
As clarified [on discord](https://discord.com/channels/605577013327167508/605577013331361793/1121246688183603240), sync committee contributions are not delayed if DP is enabled.
**Description**
This PR updates doppelganger note about sync committee contributions. Based on the current docs, a user might assume that DP is not working as expected.
## Issue Addressed
Addresses [#4401](https://github.com/sigp/lighthouse/issues/4401)
## Proposed Changes
Shift some constants into ```ChainSpec``` and remove the constant values from code space.
## Additional Info
I mostly used ```MainnetEthSpec::default_spec()``` for getting ```ChainSpec```. I wonder Did I make a mistake about that.
Co-authored-by: armaganyildirak <armaganyildirak@gmail.com>
Co-authored-by: Paul Hauner <paul@paulhauner.com>
Co-authored-by: Age Manning <Age@AgeManning.com>
Co-authored-by: Diva M <divma@protonmail.com>
## Issue Addressed
Upgrade libp2p to v0.52
## Proposed Changes
- **Workflows**: remove installation of `protoc`
- **Book**: remove installation of `protoc`
- **`Dockerfile`s and `cross`**: remove custom base `Dockerfile` for cross since it's no longer needed. Remove `protoc` from remaining `Dockerfiles`s
- **Upgrade `discv5` to `v0.3.1`:** we have some cool stuff in there: no longer needs `protoc` and faster ip updates on cold start
- **Upgrade `prometheus` to `0.21.0`**, now it no longer needs encoding checks
- **things that look like refactors:** bunch of api types were renamed and need to be accessed in a different (clearer) way
- **Lighthouse network**
- connection limits is now a behaviour
- banned peers no longer exist on the swarm level, but at the behaviour level
- `connection_event_buffer_size` now is handled per connection with a buffer size of 4
- `mplex` is deprecated and was removed
- rpc handler now logs the peer to which it belongs
## Additional Info
Tried to keep as much behaviour unchanged as possible. However, there is a great deal of improvements we can do _after_ this upgrade:
- Smart connection limits: Connection limits have been checked only based on numbers, we can now use information about the incoming peer to decide if we want it
- More powerful peer management: Dial attempts from other behaviours can be rejected early
- Incoming connections can be rejected early
- Banning can be returned exclusively to the peer management: We should not get connections to banned peers anymore making use of this
- TCP Nat updates: We might be able to take advantage of confirmed external addresses to check out tcp ports/ips
Co-authored-by: Age Manning <Age@AgeManning.com>
Co-authored-by: Akihito Nakano <sora.akatsuki@gmail.com>
## Issue Addressed
Solves #4442
## Proposed Changes
EL clients log errors if we don't query this endpoint, but they are making releases that remove this error logging. After those are out we can stop calling it, after which point EL teams will remove the endpoint entirely.
Refer https://hackmd.io/@n0ble/deprecate-exchgTC
Some updates in the FAQ based on issues seen on Discord. Additionally, corrected the disk usage on the default SPRP as the previously provided value is not correct.
Co-authored-by: chonghe <44791194+chong-he@users.noreply.github.com>
Often when testing I have to create a hack which is annoying to maintain.
I think it might be handy to add a custom compile-time flag that developers can use if they want to test things locally without having to backfill a bunch of blocks.
There is probably an argument to have a feature called "backfill" which is enabled by default and can be disabled. I didn't go this route because I think it's counter-intuitive to have a feature that enables a core and necessary behaviour.
## Issue Addressed
The PR fixes a bug where the the ideal rewards for source and head were incorrectly set.
Output from testing a validator that performed optimally in a Phase 0 epoch , note the `source` and `target` under ideal rewards is incorrect (compared to the actual `total_rewards` below):
```json
{
"ideal_rewards": [
...
{
"effective_balance": "32000000000",
"head": "18771",
"target": "18770",
"source": "18729",
"inclusion_delay": "17083",
"inactivity": "0"
}
],
"total_rewards": [
{
"validator_index": "0",
"head": "18729",
"target": "18770",
"source": "18771",
"inclusion_delay": "17083",
"inactivity": "0"
}
]
```
## Proposed Changes
- Fix bad `state_root` reuse in `lcli transition-blocks` that resulted in invalid results at skipped slots.
- Modernise `lcli pretty-ssz` to include fork-generic decoders for `SignedBeaconBlock` and `BeaconState` which respect the `--network`/`--testnet-dir` flag.
## Additional Info
Breaking change: the underscore names like `signed_block_merge` are removed in favour of the fork-generic name `SignedBeaconBlock`, and fork-specific names which match the superstruct variants, e.g. `SignedBeaconBlockMerge`.
## Proposed Changes
Remove patch for `arbitrary` in favour of upstream, now that the `arithmetic_side_effects` lint no longer triggers in derive macro code.
## Additional Info
~~Blocked on Rust 1.71.0, to be released 13 July 23~~
## Issue Addressed
NA
## Proposed Changes
Carries on from #4115, with the following modifications:
1. Self-hosted runners are only enabled if `github.repository == sigp/lighthouse`.
- This allows forks to still have Github-hosted CI.
- This gives us a method to switch back to Github-runners if we have extended downtime on self-hosted.
1. Does not remove any existing dependency builds for Github-hosted runners (e.g., installing the latest Rust).
1. Adds the `WATCH_HOST` environment variable which defines where we expect to find the postgres db in the `watch` tests. This should be set to `host.docker.internal` for the tests to pass on self-hosted runners.
## Additional Info
NA
Co-authored-by: antondlr <anton@delaruelle.net>
## Issue Addressed
n/a Noticed this while working on something else
## Proposed Changes
- leverage the appropriate types to avoid a bunch of `unwrap` and errors
## Additional Info
n/a
## Issue Addressed
Speed up CI by installing foundry with Github action instead of building Anvil from source.
Building anvil from source on GItHub hosted runners currently takes about 10 mins. Using the `foundry-toolchain` action to install only takes about 2 seconds.
## Issue Addressed
N/A
## Proposed Changes
Add lints for rust 1.71
[3789134](3789134ae2) is probably the one that needs most attention as it changes beacon state code. I changed the `is_in_inactivity_leak ` function to return a `ArithError` as not all consumers of that function work well with a `BeaconState::Error`.
## Issue Addressed
This PR attempts to workaround the recent frequent eth1 simulator failures caused by missing eth logs from Anvil.
> FailedToInsertDeposit(NonConsecutive { log_index: 1, expected: 0 })
This usually occurs at the beginning of the tests, and it guarantees a timeout after a few hours if this log shows up, and this is currently causing our CIs to fail quite frequently.
Example failure here: https://github.com/sigp/lighthouse/actions/runs/5525760195/jobs/10079736914
## Proposed Changes
The quick fix applied here adds a timeout to node startup and restarts the node again.
- Add a 60 seconds timeout to beacon node startup in eth1 simulator tests. It takes ~10 seconds on my machine, but could take longer on CI runners.
- Wrap the startup code in a retry function, that allows for 3 retries before returning an error.
## Additional Info
We should probably raise an issue under the Anvil GitHub repo there so this can be further investigated.
## Issue Addressed
Addresses an issue where CI could fail due to an nonexisting file error:
```
Run ./clean.sh
rm: cannot remove '/home/runner/.lighthouse/local-testnet/geth_datadir4/geth/fastcache.tmp.1549331618': No such file or directory
Error: Process completed with exit code 1.
```
This seems to happen quite frequently now, I'm not sure exactly why but perhaps worth trying suppressing the prompt?
https://github.com/sigp/lighthouse/actions/runs/5455027574/jobs/9925916159?pr=4463
## Proposed Changes
Replace `wget` in the EF-tests makefile with `curl`.
On macOS `curl` is pre-installed, and I found myself making this change to avoid installing `wget`.
The `-L` flag is used to follow redirects which is useful if a repo gets renamed, and more similar to `wget`'s default behaviour.