lighthouse/beacon_node/lighthouse_network/src/metrics.rs

147 lines
5.9 KiB
Rust
Raw Normal View History

Fix block processing blowup, upgrade metrics (#500) * Renamed fork_choice::process_attestation_from_block * Processing attestation in fork choice * Retrieving state from store and checking signature * Looser check on beacon state validity. * Cleaned up get_attestation_state * Expanded fork choice api to provide latest validator message. * Checking if the an attestation contains a latest message * Correct process_attestation error handling. * Copy paste error in comment fixed. * Tidy ancestor iterators * Getting attestation slot via helper method * Refactored attestation creation in test utils * Revert "Refactored attestation creation in test utils" This reverts commit 4d277fe4239a7194758b18fb5c00dfe0b8231306. * Integration tests for free attestation processing * Implicit conflicts resolved. * formatting * Do first pass on Grants code * Add another attestation processing test * Tidy attestation processing * Remove old code fragment * Add non-compiling half finished changes * Simplify, fix bugs, add tests for chain iters * Remove attestation processing from op pool * Fix bug with fork choice, tidy * Fix overly restrictive check in fork choice. * Ensure committee cache is build during attn proc * Ignore unknown blocks at fork choice * Various minor fixes * Make fork choice write lock in to read lock * Remove unused method * Tidy comments * Fix attestation prod. target roots change * Fix compile error in store iters * Reject any attestation prior to finalization * Begin metrics refactor * Move beacon_chain to new metrics structure. * Make metrics not panic if already defined * Use global prometheus gather at rest api * Unify common metric fns into a crate * Add heavy metering to block processing * Remove hypen from prometheus metric name * Add more beacon chain metrics * Add beacon chain persistence metric * Prune op pool on finalization * Add extra prom beacon chain metrics * Prefix BeaconChain metrics with "beacon_" * Add more store metrics * Add basic metrics to libp2p * Add metrics to HTTP server * Remove old `http_server` crate * Update metrics names to be more like standard * Fix broken beacon chain metrics, add slot clock metrics * Add lighthouse_metrics gather fn * Remove http args * Fix wrong state given to op pool prune * Make prom metric names more consistent * Add more metrics, tidy existing metrics * Fix store block read metrics * Tidy attestation metrics * Fix minor PR comments * Allow travis failures on beta (see desc) There's a non-backward compatible change in `cargo fmt`. Stable and beta do not agree. * Tidy `lighthouse_metrics` docs * Fix typo
2019-08-19 11:02:34 +00:00
pub use lighthouse_metrics::*;
lazy_static! {
pub static ref NAT_OPEN: Result<IntCounter> = try_create_int_counter(
"nat_open",
"An estimate indicating if the local node is exposed to the internet."
);
Fix block processing blowup, upgrade metrics (#500) * Renamed fork_choice::process_attestation_from_block * Processing attestation in fork choice * Retrieving state from store and checking signature * Looser check on beacon state validity. * Cleaned up get_attestation_state * Expanded fork choice api to provide latest validator message. * Checking if the an attestation contains a latest message * Correct process_attestation error handling. * Copy paste error in comment fixed. * Tidy ancestor iterators * Getting attestation slot via helper method * Refactored attestation creation in test utils * Revert "Refactored attestation creation in test utils" This reverts commit 4d277fe4239a7194758b18fb5c00dfe0b8231306. * Integration tests for free attestation processing * Implicit conflicts resolved. * formatting * Do first pass on Grants code * Add another attestation processing test * Tidy attestation processing * Remove old code fragment * Add non-compiling half finished changes * Simplify, fix bugs, add tests for chain iters * Remove attestation processing from op pool * Fix bug with fork choice, tidy * Fix overly restrictive check in fork choice. * Ensure committee cache is build during attn proc * Ignore unknown blocks at fork choice * Various minor fixes * Make fork choice write lock in to read lock * Remove unused method * Tidy comments * Fix attestation prod. target roots change * Fix compile error in store iters * Reject any attestation prior to finalization * Begin metrics refactor * Move beacon_chain to new metrics structure. * Make metrics not panic if already defined * Use global prometheus gather at rest api * Unify common metric fns into a crate * Add heavy metering to block processing * Remove hypen from prometheus metric name * Add more beacon chain metrics * Add beacon chain persistence metric * Prune op pool on finalization * Add extra prom beacon chain metrics * Prefix BeaconChain metrics with "beacon_" * Add more store metrics * Add basic metrics to libp2p * Add metrics to HTTP server * Remove old `http_server` crate * Update metrics names to be more like standard * Fix broken beacon chain metrics, add slot clock metrics * Add lighthouse_metrics gather fn * Remove http args * Fix wrong state given to op pool prune * Make prom metric names more consistent * Add more metrics, tidy existing metrics * Fix store block read metrics * Tidy attestation metrics * Fix minor PR comments * Allow travis failures on beta (see desc) There's a non-backward compatible change in `cargo fmt`. Stable and beta do not agree. * Tidy `lighthouse_metrics` docs * Fix typo
2019-08-19 11:02:34 +00:00
pub static ref ADDRESS_UPDATE_COUNT: Result<IntCounter> = try_create_int_counter(
"libp2p_address_update_total",
"Count of libp2p socked updated events (when our view of our IP address has changed)"
);
pub static ref PEERS_CONNECTED: Result<IntGauge> = try_create_int_gauge(
"libp2p_peers",
Fix block processing blowup, upgrade metrics (#500) * Renamed fork_choice::process_attestation_from_block * Processing attestation in fork choice * Retrieving state from store and checking signature * Looser check on beacon state validity. * Cleaned up get_attestation_state * Expanded fork choice api to provide latest validator message. * Checking if the an attestation contains a latest message * Correct process_attestation error handling. * Copy paste error in comment fixed. * Tidy ancestor iterators * Getting attestation slot via helper method * Refactored attestation creation in test utils * Revert "Refactored attestation creation in test utils" This reverts commit 4d277fe4239a7194758b18fb5c00dfe0b8231306. * Integration tests for free attestation processing * Implicit conflicts resolved. * formatting * Do first pass on Grants code * Add another attestation processing test * Tidy attestation processing * Remove old code fragment * Add non-compiling half finished changes * Simplify, fix bugs, add tests for chain iters * Remove attestation processing from op pool * Fix bug with fork choice, tidy * Fix overly restrictive check in fork choice. * Ensure committee cache is build during attn proc * Ignore unknown blocks at fork choice * Various minor fixes * Make fork choice write lock in to read lock * Remove unused method * Tidy comments * Fix attestation prod. target roots change * Fix compile error in store iters * Reject any attestation prior to finalization * Begin metrics refactor * Move beacon_chain to new metrics structure. * Make metrics not panic if already defined * Use global prometheus gather at rest api * Unify common metric fns into a crate * Add heavy metering to block processing * Remove hypen from prometheus metric name * Add more beacon chain metrics * Add beacon chain persistence metric * Prune op pool on finalization * Add extra prom beacon chain metrics * Prefix BeaconChain metrics with "beacon_" * Add more store metrics * Add basic metrics to libp2p * Add metrics to HTTP server * Remove old `http_server` crate * Update metrics names to be more like standard * Fix broken beacon chain metrics, add slot clock metrics * Add lighthouse_metrics gather fn * Remove http args * Fix wrong state given to op pool prune * Make prom metric names more consistent * Add more metrics, tidy existing metrics * Fix store block read metrics * Tidy attestation metrics * Fix minor PR comments * Allow travis failures on beta (see desc) There's a non-backward compatible change in `cargo fmt`. Stable and beta do not agree. * Tidy `lighthouse_metrics` docs * Fix typo
2019-08-19 11:02:34 +00:00
"Count of libp2p peers currently connected"
);
Fix block processing blowup, upgrade metrics (#500) * Renamed fork_choice::process_attestation_from_block * Processing attestation in fork choice * Retrieving state from store and checking signature * Looser check on beacon state validity. * Cleaned up get_attestation_state * Expanded fork choice api to provide latest validator message. * Checking if the an attestation contains a latest message * Correct process_attestation error handling. * Copy paste error in comment fixed. * Tidy ancestor iterators * Getting attestation slot via helper method * Refactored attestation creation in test utils * Revert "Refactored attestation creation in test utils" This reverts commit 4d277fe4239a7194758b18fb5c00dfe0b8231306. * Integration tests for free attestation processing * Implicit conflicts resolved. * formatting * Do first pass on Grants code * Add another attestation processing test * Tidy attestation processing * Remove old code fragment * Add non-compiling half finished changes * Simplify, fix bugs, add tests for chain iters * Remove attestation processing from op pool * Fix bug with fork choice, tidy * Fix overly restrictive check in fork choice. * Ensure committee cache is build during attn proc * Ignore unknown blocks at fork choice * Various minor fixes * Make fork choice write lock in to read lock * Remove unused method * Tidy comments * Fix attestation prod. target roots change * Fix compile error in store iters * Reject any attestation prior to finalization * Begin metrics refactor * Move beacon_chain to new metrics structure. * Make metrics not panic if already defined * Use global prometheus gather at rest api * Unify common metric fns into a crate * Add heavy metering to block processing * Remove hypen from prometheus metric name * Add more beacon chain metrics * Add beacon chain persistence metric * Prune op pool on finalization * Add extra prom beacon chain metrics * Prefix BeaconChain metrics with "beacon_" * Add more store metrics * Add basic metrics to libp2p * Add metrics to HTTP server * Remove old `http_server` crate * Update metrics names to be more like standard * Fix broken beacon chain metrics, add slot clock metrics * Add lighthouse_metrics gather fn * Remove http args * Fix wrong state given to op pool prune * Make prom metric names more consistent * Add more metrics, tidy existing metrics * Fix store block read metrics * Tidy attestation metrics * Fix minor PR comments * Allow travis failures on beta (see desc) There's a non-backward compatible change in `cargo fmt`. Stable and beta do not agree. * Tidy `lighthouse_metrics` docs * Fix typo
2019-08-19 11:02:34 +00:00
pub static ref PEER_CONNECT_EVENT_COUNT: Result<IntCounter> = try_create_int_counter(
"libp2p_peer_connect_event_total",
"Count of libp2p peer connect events (not the current number of connected peers)"
);
pub static ref PEER_DISCONNECT_EVENT_COUNT: Result<IntCounter> = try_create_int_counter(
"libp2p_peer_disconnect_event_total",
"Count of libp2p peer disconnect events"
);
pub static ref DISCOVERY_SENT_BYTES: Result<IntGauge> = try_create_int_gauge(
"discovery_sent_bytes",
"The number of bytes sent in discovery"
);
pub static ref DISCOVERY_RECV_BYTES: Result<IntGauge> = try_create_int_gauge(
"discovery_recv_bytes",
"The number of bytes received in discovery"
);
pub static ref DISCOVERY_QUEUE: Result<IntGauge> = try_create_int_gauge(
"discovery_queue_size",
"The number of discovery queries awaiting execution"
);
pub static ref DISCOVERY_REQS: Result<Gauge> = try_create_float_gauge(
"discovery_requests",
"The number of unsolicited discovery requests per second"
);
pub static ref DISCOVERY_SESSIONS: Result<IntGauge> = try_create_int_gauge(
"discovery_sessions",
"The number of active discovery sessions with peers"
);
pub static ref PEERS_PER_CLIENT: Result<IntGaugeVec> = try_create_int_gauge_vec(
"libp2p_peers_per_client",
"The connected peers via client implementation",
&["Client"]
);
pub static ref FAILED_ATTESTATION_PUBLISHES_PER_SUBNET: Result<IntGaugeVec> =
try_create_int_gauge_vec(
"gossipsub_failed_attestation_publishes_per_subnet",
"Failed attestation publishes per subnet",
&["subnet"]
);
pub static ref FAILED_PUBLISHES_PER_MAIN_TOPIC: Result<IntGaugeVec> = try_create_int_gauge_vec(
"gossipsub_failed_publishes_per_main_topic",
"Failed gossip publishes",
&["topic_hash"]
);
More metrics + RPC tweaks (#2041) ## Issue Addressed NA ## Proposed Changes This was mostly done to find the reason why LH was dropping peers from Nimbus. It proved to be useful so I think it's worth it. But there is also some functional stuff here - Add metrics for rpc errors per client, error type and direction - Add metrics for downscoring events per source type, client and penalty type - Add metrics for gossip validation results per client for non-accepted messages - Make the RPC handler return errors and requests/responses in the order we see them - Allow a small burst for the Ping rate limit, from 1 every 5 seconds to 2 every 10 seconds - Send rate limiting errors with a particular code and use that same code to identify them. I picked something different to 128 since that is most likely what other clients are using for their own errors - Remove some unused code in the `PeerAction` and the rpc handler - Remove the unused variant `RateLimited`. tTis was never produced directly, since the only way to get the request's protocol is via de handler. The handler upon receiving from LH a response with an error (rate limited in this case) emits this event with the missing info (It was always like this, just pointing out that we do downscore rate limiting errors regardless of the change) Metrics for Nimbus looked like this: Downscoring events: `increase(libp2p_peer_actions_per_client{client="Nimbus"}[5m])` ![image](https://user-images.githubusercontent.com/26765164/101210880-862bf280-3676-11eb-94c0-399f0bf5aa2e.png) RPC Errors: `increase(libp2p_rpc_errors_per_client{client="Nimbus"}[5m])` ![image](https://user-images.githubusercontent.com/26765164/101210997-ba071800-3676-11eb-847a-f32405ede002.png) Unaccepted gossip message: `increase(gossipsub_unaccepted_messages_per_client{client="Nimbus"}[5m])` ![image](https://user-images.githubusercontent.com/26765164/101211124-f470b500-3676-11eb-9459-132ecff058ec.png)
2020-12-08 03:55:50 +00:00
pub static ref TOTAL_RPC_ERRORS_PER_CLIENT: Result<IntCounterVec> = try_create_int_counter_vec(
"libp2p_rpc_errors_per_client",
"RPC errors per client",
&["client", "rpc_error", "direction"]
);
pub static ref TOTAL_RPC_REQUESTS: Result<IntCounterVec> = try_create_int_counter_vec(
"libp2p_rpc_requests_total",
"RPC requests total",
&["type"]
);
More metrics + RPC tweaks (#2041) ## Issue Addressed NA ## Proposed Changes This was mostly done to find the reason why LH was dropping peers from Nimbus. It proved to be useful so I think it's worth it. But there is also some functional stuff here - Add metrics for rpc errors per client, error type and direction - Add metrics for downscoring events per source type, client and penalty type - Add metrics for gossip validation results per client for non-accepted messages - Make the RPC handler return errors and requests/responses in the order we see them - Allow a small burst for the Ping rate limit, from 1 every 5 seconds to 2 every 10 seconds - Send rate limiting errors with a particular code and use that same code to identify them. I picked something different to 128 since that is most likely what other clients are using for their own errors - Remove some unused code in the `PeerAction` and the rpc handler - Remove the unused variant `RateLimited`. tTis was never produced directly, since the only way to get the request's protocol is via de handler. The handler upon receiving from LH a response with an error (rate limited in this case) emits this event with the missing info (It was always like this, just pointing out that we do downscore rate limiting errors regardless of the change) Metrics for Nimbus looked like this: Downscoring events: `increase(libp2p_peer_actions_per_client{client="Nimbus"}[5m])` ![image](https://user-images.githubusercontent.com/26765164/101210880-862bf280-3676-11eb-94c0-399f0bf5aa2e.png) RPC Errors: `increase(libp2p_rpc_errors_per_client{client="Nimbus"}[5m])` ![image](https://user-images.githubusercontent.com/26765164/101210997-ba071800-3676-11eb-847a-f32405ede002.png) Unaccepted gossip message: `increase(gossipsub_unaccepted_messages_per_client{client="Nimbus"}[5m])` ![image](https://user-images.githubusercontent.com/26765164/101211124-f470b500-3676-11eb-9459-132ecff058ec.png)
2020-12-08 03:55:50 +00:00
pub static ref PEER_ACTION_EVENTS_PER_CLIENT: Result<IntCounterVec> =
try_create_int_counter_vec(
"libp2p_peer_actions_per_client",
"Score reports per client",
&["client", "action", "source"]
);
pub static ref GOSSIP_UNACCEPTED_MESSAGES_PER_CLIENT: Result<IntCounterVec> =
try_create_int_counter_vec(
"gossipsub_unaccepted_messages_per_client",
"Gossipsub messages that we did not accept, per client",
&["client", "validation_result"]
);
pub static ref PEER_SCORE_DISTRIBUTION: Result<IntGaugeVec> =
try_create_int_gauge_vec(
"peer_score_distribution",
"The distribution of connected peer scores",
&["position"]
);
pub static ref PEER_SCORE_PER_CLIENT: Result<GaugeVec> =
try_create_float_gauge_vec(
"peer_score_per_client",
"Average score per client",
&["client"]
);
/*
* Inbound/Outbound peers
*/
/// The number of peers that dialed us.
pub static ref NETWORK_INBOUND_PEERS: Result<IntGauge> =
try_create_int_gauge("network_inbound_peers","The number of peers that are currently connected that have dialed us.");
/// The number of peers that we dialed us.
pub static ref NETWORK_OUTBOUND_PEERS: Result<IntGauge> =
try_create_int_gauge("network_outbound_peers","The number of peers that are currently connected that we dialed.");
/*
* Peer Reporting
*/
pub static ref REPORT_PEER_MSGS: Result<IntCounterVec> = try_create_int_counter_vec(
"libp2p_report_peer_msgs_total",
"Number of peer reports per msg",
&["msg"]
);
}
/// Checks if we consider the NAT open.
///
/// Conditions for an open NAT:
/// 1. We have 1 or more SOCKET_UPDATED messages. This occurs when discovery has a majority of
/// users reporting an external port and our ENR gets updated.
/// 2. We have 0 SOCKET_UPDATED messages (can be true if the port was correct on boot), then we
/// rely on whether we have any inbound messages. If we have no socket update messages, but
/// manage to get at least one inbound peer, we are exposed correctly.
pub fn check_nat() {
// NAT is already deemed open.
if NAT_OPEN.as_ref().map(|v| v.get()).unwrap_or(0) != 0 {
return;
}
if ADDRESS_UPDATE_COUNT.as_ref().map(|v| v.get()).unwrap_or(0) == 0
|| NETWORK_INBOUND_PEERS.as_ref().map(|v| v.get()).unwrap_or(0) != 0_i64
{
inc_counter(&NAT_OPEN);
}
}
pub fn scrape_discovery_metrics() {
let metrics = discv5::metrics::Metrics::from(discv5::Discv5::raw_metrics());
set_float_gauge(&DISCOVERY_REQS, metrics.unsolicited_requests_per_second);
set_gauge(&DISCOVERY_SESSIONS, metrics.active_sessions as i64);
set_gauge(&DISCOVERY_SENT_BYTES, metrics.bytes_sent as i64);
set_gauge(&DISCOVERY_RECV_BYTES, metrics.bytes_recv as i64);
Fix block processing blowup, upgrade metrics (#500) * Renamed fork_choice::process_attestation_from_block * Processing attestation in fork choice * Retrieving state from store and checking signature * Looser check on beacon state validity. * Cleaned up get_attestation_state * Expanded fork choice api to provide latest validator message. * Checking if the an attestation contains a latest message * Correct process_attestation error handling. * Copy paste error in comment fixed. * Tidy ancestor iterators * Getting attestation slot via helper method * Refactored attestation creation in test utils * Revert "Refactored attestation creation in test utils" This reverts commit 4d277fe4239a7194758b18fb5c00dfe0b8231306. * Integration tests for free attestation processing * Implicit conflicts resolved. * formatting * Do first pass on Grants code * Add another attestation processing test * Tidy attestation processing * Remove old code fragment * Add non-compiling half finished changes * Simplify, fix bugs, add tests for chain iters * Remove attestation processing from op pool * Fix bug with fork choice, tidy * Fix overly restrictive check in fork choice. * Ensure committee cache is build during attn proc * Ignore unknown blocks at fork choice * Various minor fixes * Make fork choice write lock in to read lock * Remove unused method * Tidy comments * Fix attestation prod. target roots change * Fix compile error in store iters * Reject any attestation prior to finalization * Begin metrics refactor * Move beacon_chain to new metrics structure. * Make metrics not panic if already defined * Use global prometheus gather at rest api * Unify common metric fns into a crate * Add heavy metering to block processing * Remove hypen from prometheus metric name * Add more beacon chain metrics * Add beacon chain persistence metric * Prune op pool on finalization * Add extra prom beacon chain metrics * Prefix BeaconChain metrics with "beacon_" * Add more store metrics * Add basic metrics to libp2p * Add metrics to HTTP server * Remove old `http_server` crate * Update metrics names to be more like standard * Fix broken beacon chain metrics, add slot clock metrics * Add lighthouse_metrics gather fn * Remove http args * Fix wrong state given to op pool prune * Make prom metric names more consistent * Add more metrics, tidy existing metrics * Fix store block read metrics * Tidy attestation metrics * Fix minor PR comments * Allow travis failures on beta (see desc) There's a non-backward compatible change in `cargo fmt`. Stable and beta do not agree. * Tidy `lighthouse_metrics` docs * Fix typo
2019-08-19 11:02:34 +00:00
}