Use Case 1: Investigate a Critical Alert with Full Context Analysis

User Goal

Receive a critical alert and conduct a comprehensive root cause investigation by analyzing alert details, affected resources, topology relationships, metric patterns, and traces.

When to Use This

Use this workflow when:

  • A critical alert fires and you need to understand the underlying cause
  • You want a comprehensive investigation that considers multiple data sources
  • You need to understand not just what broke, but why it broke

How to Start

Launch Copilot → Switch to Root Cause channel

How to Ask Copilot

You can start broad and let the agent gather context:

  • “Investigate alert 121898533 and determine the probable root cause”
  • “What are the affected resources and their current state?”
  • “Show me the topology around this resource — are there upstream dependencies that might be causing this?”
  • “What do the metrics look like before and during the alert window?”
  • “Have there been similar alerts on this resource or related components recently?”

What Copilot Provides

  • Alert Context: Full alert details including severity, subject, resource, and metric threshold
  • Resource Analysis: Current state and health of affected resources
  • Topology Insights: Infrastructure relationships showing dependencies and potential impact chains
  • Metric Correlation: Time-series data showing patterns before/during the alert
  • Historical Context: Similar past alerts and their resolutions
  • Root Cause Hypothesis: Probable root cause based on all gathered evidence, with a confidence level
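The "Metric Correlation" step can be understood as a before/during comparison: does the metric's behavior in the alert window deviate significantly from its baseline? The sketch below illustrates that idea with a simple z-score test on hypothetical CPU-utilization samples; it is a generic illustration of the technique, not Copilot's actual implementation.

```python
from statistics import mean, stdev

def flag_metric_spike(baseline, alert_window, z_threshold=3.0):
    """Return True if the alert-window average deviates strongly
    from the baseline mean (a simple z-score test)."""
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return mean(alert_window) != mu
    z = (mean(alert_window) - mu) / sigma
    return abs(z) >= z_threshold

# Hypothetical CPU-utilization samples (percent)
baseline = [22, 25, 24, 23, 26, 24, 25, 23]
during_alert = [78, 92, 88, 95]
print(flag_metric_spike(baseline, during_alert))  # True: clear deviation
```

A real investigation would look at full time series and multiple metrics, but the core question Copilot answers is the same: did this metric change meaningfully when the alert fired?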

What to Ask Next (If Needed)

  • “Can you check if the host this VM is running on also has issues?”
  • “Show me other resources in this cluster — are they affected too?”
  • “Are there any trace errors correlating with this metric spike?”

Actions You Can Take

  • Review the evidence chain and validate the hypothesis
  • Drill into specific metrics or traces mentioned
  • Request investigation of related resources
  • Use findings to guide remediation

Outcome

You have a clear understanding of what triggered the alert, which components are involved, and a data-backed hypothesis of the probable root cause to guide remediation.

Use Case 2: Analyze Correlated Alerts (Inference Investigation)

User Goal

Multiple related alerts have been correlated into an inference. Investigate the inference to understand the single underlying issue causing all the alerts.

When to Use This

Use this workflow when:

  • Multiple alerts on the same or related resources have been grouped as an inference
  • You need to understand the relationship between correlated alerts
  • You want to find the single root cause affecting multiple components

How to Start

Launch Copilot → Switch to Root Cause channel. NOTE: The Probable Root Cause Agent automatically fetches the insights generated for the inference alerts (if available) when you start the investigation.

How to Ask Copilot

Start with the inference and narrow down systematically:

  • “Analyze alert 121898533 and determine the root cause”
  • “What are all the alerts included in this inference?”
  • “Show me the timeline — which alert fired first and what followed?”
  • “What resources are affected across these alerts?”
  • “Is there a common component or dependency causing all these issues?”
  • “Show me the topology view of all affected resources”

What Copilot Provides

  • Inference Summary: All correlated alerts with their relationships
  • Timeline Analysis: Chronological ordering of alert firing to identify the origin
  • Resource Mapping: All affected resources and their interdependencies
  • Topology Visualization: Infrastructure view showing how resources relate
  • Common Patterns: Shared metrics, components, or events across alerts
  • Root Cause Analysis: The underlying issue triggering the cascade of alerts
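The "Timeline Analysis" step rests on a simple principle: in a cascading failure, the alert that fired first is a strong (though not guaranteed) candidate for the origin. The sketch below shows that ordering on a hypothetical inference; the alert IDs, resources, and timestamps are invented for illustration.

```python
from datetime import datetime

# Hypothetical correlated alerts in one inference
alerts = [
    {"id": "A-103", "resource": "web-frontend", "fired": "2024-05-01T10:04:30Z"},
    {"id": "A-101", "resource": "db-primary",   "fired": "2024-05-01T10:01:05Z"},
    {"id": "A-102", "resource": "api-gateway",  "fired": "2024-05-01T10:02:48Z"},
]

def firing_timeline(alerts):
    """Order correlated alerts chronologically; the earliest is a
    likely origin of the cascade."""
    key = lambda a: datetime.strptime(a["fired"], "%Y-%m-%dT%H:%M:%SZ")
    return sorted(alerts, key=key)

timeline = firing_timeline(alerts)
print([a["id"] for a in timeline])                 # ['A-101', 'A-102', 'A-103']
print("Likely origin:", timeline[0]["resource"])   # db-primary
```

Firing order alone is not proof of causality, which is why Copilot also checks topology and shared dependencies before naming a root cause.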

What to Ask Next (If Needed)

  • “Which alert in this inference represents the actual root cause?”
  • “Are there metric patterns showing this started from a specific component?”
  • “Have we seen this pattern of alerts together before?”
  • “Could this be a cascading failure from a single point?”

Actions You Can Take

  • Identify the primary alert representing the root cause
  • Investigate the origin component more deeply
  • Understand blast radius and affected services
  • Plan remediation targeting the root cause

Outcome

You understand how multiple alerts relate to each other, identify the single underlying issue, and can focus remediation on the actual root cause rather than symptoms.

Use Case 3: Service-Level Root Cause with Trace Analysis

User Goal

A service is experiencing errors or latency issues. Use distributed traces, service maps, and dependencies to pinpoint the failing component, slow operation, or infrastructure bottleneck.

When to Use This

Use this workflow when:

  • Application/service alerts fire (errors, latency, throughput drops)
  • The issue is suspected to span multiple services in a distributed architecture
  • Traces and eBPF data are available
  • You need to identify application-level or network-level issues

How to Start

Launch Copilot → Switch to Root Cause channel

How to Ask Copilot

Provide service context and let the agent map dependencies:

  • “My payment-api-service had high errors in the last hour — investigate the root cause”
  • “Get the service overview and show me the full service map”
  • “Which downstream services or databases is payment-api-service calling?”
  • “Analyze traces for payment-api-service — which operations are failing or slow?”
  • “Show me error stack traces and identify the bottleneck”
  • “Are there network latency issues between services based on eBPF data?”

What Copilot Provides

  • Service Overview: Health, throughput, error rate, latency percentiles
  • Service Map: Full dependency graph showing upstream and downstream services
  • Trace Analysis:
    • Slow spans and operations
    • Error hotspots with stack traces
    • Request paths showing latency breakdown
  • Infrastructure Correlation: Host, network, or database issues affecting the service
  • Root Cause Identification: Whether issue is application, infrastructure, network, or external dependency
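The "slow spans" analysis in trace data typically means computing latency percentiles and surfacing the operations that sit in the tail. The sketch below illustrates this with hypothetical span durations and a nearest-rank p95; the operation names and values are invented, and real tooling would work over full trace datasets.

```python
# Hypothetical span durations (ms) from traces of payment-api-service
spans = [
    {"operation": "POST /charge", "duration_ms": 45},
    {"operation": "db.query",     "duration_ms": 890},
    {"operation": "POST /charge", "duration_ms": 52},
    {"operation": "auth.verify",  "duration_ms": 30},
    {"operation": "db.query",     "duration_ms": 940},
    {"operation": "auth.verify",  "duration_ms": 28},
]

def percentile(values, p):
    """Nearest-rank percentile over a sorted copy of the values."""
    ordered = sorted(values)
    idx = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[idx]

durations = [s["duration_ms"] for s in spans]
p95 = percentile(durations, 95)
slow = sorted({s["operation"] for s in spans if s["duration_ms"] >= p95})
print("p95:", p95, "ms; slow operations:", slow)
```

Here the tail is dominated by db.query, which is exactly the kind of signal that would prompt the follow-up question "Is the database service this service calls also showing high latency?"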

What to Ask Next (If Needed)

  • “Is the database service this service calls also showing high latency?”
  • “Show me the specific stack trace for the most common error”
  • “Are there any failed network calls or timeouts in the traces?”
  • “Compare current trace patterns to baseline — what changed?”
  • “Is this affecting only certain operations or all traffic?”

Actions You Can Take

  • Drill into specific failing operations or endpoints
  • Investigate identified bottleneck services or infrastructure
  • Check deployment or config changes around error spike time
  • Correlate with infrastructure metrics (CPU, memory, network)

Outcome

You have a clear identification of which service, operation, or infrastructure component is the root cause, with trace-level evidence showing exactly where failures or slowdowns occur.

Example Use Cases

The same investigation patterns apply to related scenarios, such as:

  • Identifying the most impacted resource and correlating alerts
  • Analyzing an alert to understand why it fired
  • Analyzing and resolving a ticket
  • Understanding alert trends over time
  • Investigating network device issues
  • Understanding policy violations