# ADR 013: Observability

## Changelog

* 20-01-2020: Initial Draft

## Status

Proposed

## Context

Telemetry is paramount to debugging and understanding what the application is doing and how it is performing. We aim to expose metrics from modules and other core parts of the Cosmos SDK.

In addition, we should aim to support multiple configurable sinks that an operator may choose from. By default, when telemetry is enabled, the application should track and expose metrics that are stored in-memory. The operator may choose to enable additional sinks; for now we support only Prometheus, as it's battle-tested, simple to set up, open source, and rich with ecosystem tooling.

We must also aim to integrate metrics into the Cosmos SDK in the most seamless way possible, such that metrics may be added or removed at will and without much friction. To do this, we will use the [go-metrics](https://github.com/armon/go-metrics) library.
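
For illustration, the core pattern go-metrics follows is a global registry backed by one or more sinks. A minimal, self-contained sketch (the service name and metric key below are assumptions for the example, not part of this ADR):

```go
package main

import (
	"time"

	metrics "github.com/armon/go-metrics"
)

func main() {
	// In-memory sink: aggregate over 10s intervals, retain one minute of data.
	sink := metrics.NewInmemSink(10*time.Second, time.Minute)

	// Register a global metrics instance so helpers such as IncrCounter,
	// SetGauge, and MeasureSince can be called from anywhere in the process.
	if _, err := metrics.NewGlobal(metrics.DefaultConfig("simapp"), sink); err != nil {
		panic(err)
	}

	// Record a sample counter metric.
	metrics.IncrCounter([]string{"tx", "count"}, 1)
}
```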

Finally, operators may enable telemetry along with specific configuration options. If enabled, metrics will be exposed via `/metrics?format={text|prometheus}` on the API server.

## Decision

We will add an additional configuration block to `app.toml` that defines telemetry settings:

```toml
###############################################################################
###                         Telemetry Configuration                         ###
###############################################################################

[telemetry]

# Prefixed with keys to separate services
service-name = {{ .Telemetry.ServiceName }}

# Enabled enables the application telemetry functionality. When enabled,
# an in-memory sink is also enabled by default. Operators may also enable
# other sinks such as Prometheus.
enabled = {{ .Telemetry.Enabled }}

# Enable prefixing gauge values with hostname
enable-hostname = {{ .Telemetry.EnableHostname }}

# Enable adding hostname to labels
enable-hostname-label = {{ .Telemetry.EnableHostnameLabel }}

# Enable adding service to labels
enable-service-label = {{ .Telemetry.EnableServiceLabel }}

# PrometheusRetentionTime, when positive, enables a Prometheus metrics sink.
prometheus-retention-time = {{ .Telemetry.PrometheusRetentionTime }}
```
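
The template fields above map onto a telemetry configuration struct. A rough sketch of its shape, assuming the field names simply mirror the `app.toml` keys (illustrative only, not a confirmed API):

```go
// Config is a hypothetical telemetry configuration type; field names are
// assumed from the app.toml template above.
type Config struct {
	ServiceName             string `mapstructure:"service-name"`
	Enabled                 bool   `mapstructure:"enabled"`
	EnableHostname          bool   `mapstructure:"enable-hostname"`
	EnableHostnameLabel     bool   `mapstructure:"enable-hostname-label"`
	EnableServiceLabel      bool   `mapstructure:"enable-service-label"`
	PrometheusRetentionTime int64  `mapstructure:"prometheus-retention-time"`
}
```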

The given configuration allows for two sinks -- in-memory and Prometheus. We create a `Metrics` type that performs all the bootstrapping for the operator, so capturing metrics becomes seamless.

```go
// Metrics defines a wrapper around application telemetry functionality. It allows
// metrics to be gathered at any point in time. When creating a Metrics object,
// internally, a global metrics is registered with a set of sinks as configured
// by the operator. In addition to the sinks, when a process gets a SIGUSR1, a
// dump of formatted recent metrics will be sent to STDERR.
type Metrics struct {
  memSink           *metrics.InmemSink
  prometheusEnabled bool
}

// Gather collects all registered metrics and returns a GatherResponse where the
// metrics are encoded depending on the type. Metrics are either encoded via
// Prometheus or JSON if in-memory.
func (m *Metrics) Gather(format string) (GatherResponse, error) {
  switch format {
  case FormatPrometheus:
    return m.gatherPrometheus()

  case FormatText:
    return m.gatherGeneric()

  case FormatDefault:
    return m.gatherGeneric()

  default:
    return GatherResponse{}, fmt.Errorf("unsupported metrics format: %s", format)
  }
}
```
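
`GatherResponse` is not defined in this ADR. Based on how the metrics handler below consumes it, a plausible shape (an assumption, not a finalized type) would be:

```go
// GatherResponse is a hypothetical response type for gathered metrics: the
// encoded payload plus the content type the API server should return.
type GatherResponse struct {
	Metrics     []byte
	ContentType string
}
```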

In addition, `Metrics` allows us to gather the current set of metrics at any given point in time. An operator may also choose to send a signal, `SIGUSR1`, to dump and print formatted metrics to `STDERR`.

During an application's bootstrapping and construction phase, if `Telemetry.Enabled` is `true`, the API server will create an instance of the `Metrics` object and register the metrics handler accordingly.

```go
func (s *Server) Start(cfg config.Config) error {
  // ...

  if cfg.Telemetry.Enabled {
    m, err := telemetry.New(cfg.Telemetry)
    if err != nil {
      return err
    }

    s.metrics = m
    s.registerMetrics()
  }

  // ...
}

func (s *Server) registerMetrics() {
  metricsHandler := func(w http.ResponseWriter, r *http.Request) {
    format := strings.TrimSpace(r.FormValue("format"))

    gr, err := s.metrics.Gather(format)
    if err != nil {
      rest.WriteErrorResponse(w, http.StatusBadRequest, fmt.Sprintf("failed to gather metrics: %s", err))
      return
    }

    w.Header().Set("Content-Type", gr.ContentType)
    _, _ = w.Write(gr.Metrics)
  }

  s.Router.HandleFunc("/metrics", metricsHandler).Methods("GET")
}
```

Application developers may track counters, gauges, summaries, and key/value metrics. No additional lifting is required by modules to leverage profiling metrics; it is as simple as:

```go
func (k BaseKeeper) MintCoins(ctx sdk.Context, moduleName string, amt sdk.Coins) error {
  defer metrics.MeasureSince(time.Now(), "MintCoins")
  // ...
}
```
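
Counters and gauges follow the same pattern. For example, using go-metrics' global helpers directly (the key names and values below are illustrative only):

```go
// Illustrative only: bump a counter and set a gauge via go-metrics' globals.
metrics.IncrCounter([]string{"bank", "mint", "minted_coins"}, 1)
metrics.SetGauge([]string{"bank", "total_supply"}, 1000000)
```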

## Consequences

### Positive

* Exposure into the performance and behavior of an application

### Negative

### Neutral

## References