Monitoring and Alerting

Vinay Rawal answered on March 8, 2023 Popularity 6/10 Helpfulness 1/10

To collect metrics, we use Prometheus, an open-source tool originally developed at SoundCloud. It uses a pull model: it scrapes metrics from many endpoints (normally specific pods in the cluster), stores them as time series, and lets you query that data, generally through a frontend such as Grafana.

Our cluster-level monitoring system provides basic metrics for all pods, and developers are encouraged to expose custom business-logic metrics, which are scraped automatically as well. For the basic metrics, we deploy the node-exporter DaemonSet, which runs on every node and exposes metrics for each pod on that node: CPU and memory usage, disk and network I/O, etc.

Alerts are handled by the Prometheus Alertmanager and are triggered once specific conditions have been met, e.g. if the number of replicas backing a given service drops below a certain threshold or if the latency of a given SQL query is too high.
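To make the pull model and the custom business-logic metrics mentioned above more concrete, here is a minimal sketch of a service that exposes a counter in the Prometheus text exposition format on a `/metrics` path. It uses only the Python standard library so the wire format is visible. In practice a client library would normally do this work. The metric name `orders_processed_total`, the `status` label, and port 8000 are assumptions made up for illustration, not something taken from the setup described here.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class Counter:
    """A monotonically increasing counter with optional labels."""

    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        self._values = {}  # maps a sorted tuple of (label, value) pairs to a float
        self._lock = threading.Lock()

    def inc(self, amount=1.0, **labels):
        key = tuple(sorted(labels.items()))
        with self._lock:
            self._values[key] = self._values.get(key, 0.0) + amount

    def render(self):
        """Return the counter in the Prometheus text exposition format."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        with self._lock:
            for key, value in sorted(self._values.items()):
                label_str = ",".join(f'{k}="{v}"' for k, v in key)
                suffix = f"{{{label_str}}}" if label_str else ""
                lines.append(f"{self.name}{suffix} {value}")
        return "\n".join(lines) + "\n"


# Hypothetical business metric; the name and label are illustrative only.
ORDERS = Counter("orders_processed_total", "Orders processed, by status.")


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Prometheus pulls from this path on its own schedule;
        # the service never pushes anything.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = ORDERS.render().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, answering scrapes on the given port (assumed value)."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Application code would call something like `ORDERS.inc(status="ok")` wherever an order completes and run `serve()` in a background thread. A Prometheus instance configured to scrape the pod then collects the values without the service knowing who is reading them.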

We quickly reached the limits of a single-instance Prometheus setup and have since federated metrics across multiple instances. This has allowed us to scale to many thousands of metrics, capturing everything from CPU usage to individual SQL query latencies and inter-service RPCs.
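As a rough illustration of what federation involves, a higher-level Prometheus instance scrapes the `/federate` endpoint of lower-level instances and passes one or more `match[]` selectors to choose which series to pull. The sketch below only builds such a scrape URL. The host name `prometheus-shard-0` and the selectors are hypothetical and are not taken from the setup described here.

```python
from urllib.parse import urlencode


def federate_url(base_url, selectors):
    """Build a /federate URL that requests the series matching each selector."""
    # doseq=True emits one match[] parameter per selector instead of one joined value.
    query = urlencode({"match[]": selectors}, doseq=True)
    return f"{base_url.rstrip('/')}/federate?{query}"


# Hypothetical shard address and selectors, for illustration only.
url = federate_url(
    "http://prometheus-shard-0:9090",
    ['{job="node-exporter"}', '{__name__=~"job:.*"}'],
)
print(url)
```

In a real deployment this URL is usually not built by hand: the same path and `match[]` parameters go into a scrape job on the federating instance. Pulling only pre-aggregated series upward, rather than every raw series, is one common way to keep the top-level instance small.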


