
Problem Statement—Metrics and Logs

  • Metrics and logs—not enough visibility to diagnose issues
    • Metrics
      • High-level view of system health
      • Show when behaviour changes, e.g. CPU usage of database
      • Do not reveal root cause
      • High-level data point
    • Logs
      • “Breadcrumb trail” showing application behaviour
      • Distributed systems—distributed logs, difficult to follow
      • Low-level data point
  • Metrics too high-level, logs too low-level to show the full picture

Distributed Tracing

  • Shows relationships between components and services
    • Communication between distributed components
    • How requests are propagated
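
One way to picture request propagation is as a small amount of trace context travelling with each call. The sketch below is illustrative only: the header names `x-trace-id` and `x-parent-span-id`, the two services, and the helper functions are all hypothetical, and real tracing libraries handle this step for you.

```python
import uuid

# Hypothetical header names, chosen for illustration only.
TRACE_HEADER = "x-trace-id"
PARENT_HEADER = "x-parent-span-id"

recorded = []  # stands in for a tracing backend


def start_span(service, headers):
    """Record a span, continuing the caller's trace if one was propagated."""
    span = {
        # Reuse the incoming trace id, or start a new trace at the edge.
        "trace_id": headers.get(TRACE_HEADER) or uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex[:16],
        # The caller's span becomes this span's parent (None at the root).
        "parent_id": headers.get(PARENT_HEADER),
        "service": service,
    }
    recorded.append(span)
    return span


def outgoing_headers(span):
    """Context to attach to any downstream call made while in this span."""
    return {TRACE_HEADER: span["trace_id"], PARENT_HEADER: span["span_id"]}


def inventory_service(headers):
    start_span("inventory", headers)


def order_service(headers):
    span = start_span("orders", headers)
    # The downstream call carries the trace context with it.
    inventory_service(outgoing_headers(span))


order_service({})  # request arrives with no trace context
```

Both recorded spans end up sharing one trace id, and the `inventory` span names the `orders` span as its parent, which is what lets a backend reconstruct the relationship between the two services.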


  • Span—description of event in system
    • May have a parent span and child spans
  • Trace—tree/list of spans
    • cf. call stacks
    • How long each request took
    • Services interacted with
    • Latency
    • Visualise synchronous/asynchronous calls
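
The span and trace ideas above can be sketched as a small data structure. This is a minimal illustration, not any particular tracing library's model: the `Span` fields, the helper functions, and the example services and timings are all made up for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None for the root span
    service: str
    name: str
    start_ms: int
    end_ms: int
    children: List["Span"] = field(default_factory=list)

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def build_trace(spans: List[Span]) -> Span:
    """Link spans into a tree via parent_id and return the root span."""
    by_id = {s.span_id: s for s in spans}
    root = None
    for s in spans:
        if s.parent_id is None:
            root = s
        else:
            by_id[s.parent_id].children.append(s)
    if root is None:
        raise ValueError("trace has no root span")
    return root


def services(span: Span) -> Set[str]:
    """Every service touched by this span and its descendants."""
    found = {span.service}
    for child in span.children:
        found |= services(child)
    return found


def render(span: Span, depth: int = 0) -> List[str]:
    """Indented, call-stack-like view of the trace with per-span duration."""
    lines = [f"{'  ' * depth}{span.service}:{span.name} {span.duration_ms}ms"]
    for child in sorted(span.children, key=lambda s: s.start_ms):
        lines.extend(render(child, depth + 1))
    return lines


# Made-up spans for a single request.
spans = [
    Span("a", None, "frontend", "GET /checkout", 0, 120),
    Span("b", "a", "cart", "load_cart", 10, 40),
    Span("c", "a", "payments", "charge", 45, 110),
    Span("d", "c", "database", "INSERT payment", 60, 90),
]
root = build_trace(spans)
print("\n".join(render(root)))
```

The indented output reads much like a call stack spread across services: each line shows which service did the work and how long it took, so the slowest hop in a request is visible at a glance.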


Graph View