lighthouse/beacon_node/beacon_chain/src
Paul Hauner 830efdb5c2 Improve validator monitor experience for high validator counts (#3728)
## Issue Addressed

NA

## Proposed Changes

Others (#3678) and I have observed that when running with lots of validators (e.g., 1000s), the metric cardinality is too much for Prometheus. I've seen Prometheus instances just grind to a halt when we turn the validator monitor on for our testnet validators (we have 10,000s of Goerli validators). Additionally, the debug log volume can get very high with one log per validator, per attestation.

To address this, the `bn --validator-monitor-individual-tracking-threshold <INTEGER>` flag has been added to *disable* per-validator (i.e., non-aggregated) metrics/logging once the validator monitor exceeds the threshold of validators. The default value is `64`, which is a finger-to-the-wind value. I don't actually know the value at which Prometheus starts to become overwhelmed, but I've seen it work with ~64 validators and I've seen it *not* work with 1000s of validators. A default of `64` seems like it will result in a breaking change to users who are running millions of dollars worth of validators whilst resulting in a no-op for low-validator-count users. I'm open to changing this number, though.

Additionally, this PR starts collecting aggregated Prometheus metrics (e.g., total count of head hits across all validators), so that high-validator-count users still have some interesting metrics. We already had logging for aggregated values, so nothing has been added there.

I've opted to make this a breaking change since it can be rather damaging to your Prometheus instance to accidentally enable the validator monitor with large numbers of validators. I've crashed a Prometheus instance myself and had a report from another user who's done the same thing.

## Additional Info

NA

## Breaking Changes Note

A new label has been added to the validator monitor Prometheus metrics: `total`. This label tracks the aggregated metrics of all validators in the validator monitor (as opposed to each validator being tracked individually using its pubkey as the label).

Additionally, a new flag has been added to the Beacon Node: `--validator-monitor-individual-tracking-threshold`. The default value is `64`, which means that when the validator monitor is tracking more than 64 validators it will stop tracking per-validator metrics and only track the aggregated metrics under the `total` label. It will also stop logging per-validator logs and only emit aggregated logs (the exception being that exit and slashing logs are always emitted).
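
The threshold rule described above can be sketched roughly as follows. This is an illustrative sketch only, not the code from `validator_monitor.rs`: the `MonitorSketch` type, its fields, and the `metric_labels` helper are hypothetical names, and it assumes that the aggregated `total` label is always recorded while the per-pubkey label is recorded only when the monitored count does not exceed the threshold.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the validator monitor (not Lighthouse's actual type).
pub struct MonitorSketch {
    /// Pubkeys of monitored validators, kept as plain strings for simplicity.
    validators: HashSet<String>,
    /// Value passed via `--validator-monitor-individual-tracking-threshold`.
    individual_tracking_threshold: usize,
}

impl MonitorSketch {
    pub fn new(individual_tracking_threshold: usize) -> Self {
        Self {
            validators: HashSet::new(),
            individual_tracking_threshold,
        }
    }

    pub fn add_validator(&mut self, pubkey: &str) {
        self.validators.insert(pubkey.to_string());
    }

    /// Per-validator metrics and logs stay enabled until the number of
    /// monitored validators exceeds the threshold.
    pub fn individual_tracking(&self) -> bool {
        self.validators.len() <= self.individual_tracking_threshold
    }

    /// Labels under which an event for `pubkey` would be recorded
    /// (assumption: the aggregate `total` label is always present).
    pub fn metric_labels(&self, pubkey: &str) -> Vec<String> {
        let mut labels = vec!["total".to_string()];
        if self.individual_tracking() {
            labels.push(pubkey.to_string());
        }
        labels
    }
}
```

With a threshold of `64`, a monitor holding exactly 64 validators would still report per-pubkey labels in this sketch; adding a 65th would leave only `total`.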

These changes were introduced in #3728 to address issues with untenable Prometheus cardinality and log volume when using the validator monitor with high validator counts (e.g., 1000s of validators). Users with 64 or fewer validators will see no change in behavior (apart from the added `total` label). Users with more than 64 validators who wish to maintain the previous behavior can set something like `--validator-monitor-individual-tracking-threshold 999999`.
2023-01-09 08:18:55 +00:00
..
attestation_verification Use async code when interacting with EL (#3244) 2022-07-03 05:36:50 +00:00
schema_change Delete DB schema migrations for v11 and earlier (#3761) 2022-12-02 00:07:43 +00:00
attestation_verification.rs Refactor op pool for speed and correctness (#3312) 2022-08-29 09:10:26 +00:00
attester_cache.rs Add early attester cache (#2872) 2022-01-11 01:35:55 +00:00
beacon_chain.rs Enable proposer boost re-orging (#2860) 2022-12-13 09:57:26 +00:00
beacon_fork_choice_store.rs Enable proposer boost re-orging (#2860) 2022-12-13 09:57:26 +00:00
beacon_proposer_cache.rs Use async code when interacting with EL (#3244) 2022-07-03 05:36:50 +00:00
beacon_snapshot.rs Use async code when interacting with EL (#3244) 2022-07-03 05:36:50 +00:00
block_reward.rs Refactor op pool for speed and correctness (#3312) 2022-08-29 09:10:26 +00:00
block_times_cache.rs Add BlockTimesCache to allow additional block delay metrics (#2546) 2021-09-30 04:31:41 +00:00
block_verification.rs Prioritise important parts of block processing (#3696) 2022-11-30 05:22:58 +00:00
builder.rs Improve validator monitor experience for high validator counts (#3728) 2023-01-09 08:18:55 +00:00
canonical_head.rs Enable proposer boost re-orging (#2860) 2022-12-13 09:57:26 +00:00
chain_config.rs Verify execution block hashes during finalized sync (#3794) 2023-01-09 03:11:59 +00:00
early_attester_cache.rs Fix some typos (#3376) 2022-07-27 00:51:06 +00:00
errors.rs Enable proposer boost re-orging (#2860) 2022-12-13 09:57:26 +00:00
eth1_chain.rs Deposit Cache Finalization & Fast WS Sync (#2915) 2022-10-30 04:04:24 +00:00
eth1_finalization_cache.rs Deposit Cache Finalization & Fast WS Sync (#2915) 2022-10-30 04:04:24 +00:00
events.rs Implement API for block rewards (#2628) 2022-01-27 01:06:02 +00:00
execution_payload.rs Verify execution block hashes during finalized sync (#3794) 2023-01-09 03:11:59 +00:00
fork_choice_signal.rs Run fork choice before block proposal (#3168) 2022-05-20 05:02:11 +00:00
fork_revert.rs Enable proposer boost re-orging (#2860) 2022-12-13 09:57:26 +00:00
head_tracker.rs Fix rust 1.65 lints (#3682) 2022-11-04 07:43:43 +00:00
historical_blocks.rs Use async code when interacting with EL (#3244) 2022-07-03 05:36:50 +00:00
lib.rs Enable proposer boost re-orging (#2860) 2022-12-13 09:57:26 +00:00
light_client_finality_update_verification.rs Adding light_client gossip topics (#3693) 2022-12-13 06:24:51 +00:00
light_client_optimistic_update_verification.rs Adding light_client gossip topics (#3693) 2022-12-13 06:24:51 +00:00
merge_readiness.rs Increase merge-readiness lookhead (#3463) 2022-08-15 01:30:59 +00:00
metrics.rs Enable proposer boost re-orging (#2860) 2022-12-13 09:57:26 +00:00
migrate.rs Fix rust 1.65 lints (#3682) 2022-11-04 07:43:43 +00:00
naive_aggregation_pool.rs Clippy lints for rust 1.66 (#3810) 2022-12-16 04:04:00 +00:00
observed_aggregates.rs v2.2.0 (#3139) 2022-04-05 02:53:09 +00:00
observed_attesters.rs Ensure doppelganger detects attestations in blocks (#2495) 2021-08-09 02:43:03 +00:00
observed_block_producers.rs Doppelganger detection (#2230) 2021-07-31 03:50:52 +00:00
observed_operations.rs Refactor op pool for speed and correctness (#3312) 2022-08-29 09:10:26 +00:00
otb_verification_service.rs Initial Commit of Retrospective OTB Verification (#3372) 2022-07-30 00:22:38 +00:00
persisted_beacon_chain.rs Fix head tracker concurrency bugs (#1771) 2020-10-19 05:58:39 +00:00
persisted_fork_choice.rs Delete DB schema migrations for v11 and earlier (#3761) 2022-12-02 00:07:43 +00:00
pre_finalization_cache.rs Separate execution payloads in the DB (#3157) 2022-05-12 00:42:17 +00:00
proposer_prep_service.rs Enable proposer boost re-orging (#2860) 2022-12-13 09:57:26 +00:00
schema_change.rs Delete DB schema migrations for v11 and earlier (#3761) 2022-12-02 00:07:43 +00:00
shuffling_cache.rs Impl oneshot_broadcast for committee promises (#3595) 2022-09-21 01:01:50 +00:00
snapshot_cache.rs Enable proposer boost re-orging (#2860) 2022-12-13 09:57:26 +00:00
state_advance_timer.rs Enable proposer boost re-orging (#2860) 2022-12-13 09:57:26 +00:00
sync_committee_verification.rs New rust lints for rustc 1.64.0 (#3602) 2022-09-23 03:52:46 +00:00
test_utils.rs Improve validator monitor experience for high validator counts (#3728) 2023-01-09 08:18:55 +00:00
timeout_rw_lock.rs Add flag to disable lock timeouts (#2714) 2021-10-19 00:30:40 +00:00
validator_monitor.rs Improve validator monitor experience for high validator counts (#3728) 2023-01-09 08:18:55 +00:00
validator_pubkey_cache.rs Prioritise important parts of block processing (#3696) 2022-11-30 05:22:58 +00:00