Introduction
Trace proxy metrics provide insight into the performance, behavior, and health of a system's distributed tracing component. This document describes the trace metrics monitored and reported by the tracing system.
Trace Proxy Metrics
Application Trace-Related Metrics
Metric | Description |
---|---|
trace_operations_latency | Captures span latency in microseconds (µs), categorized by service, operation, and app. It is exposed as a Prometheus histogram with predefined buckets of {100, 200, 300, 400, 500, 600, 700, 800, 900, 1000}, allowing a distribution analysis of latency values and providing insight into the performance characteristics of the services, operations, and applications within the traced system. |
trace_root_operation_latency | Measures the latency in microseconds (µs) of root spans, categorized by service, operation, and app. It is exposed as a Prometheus histogram with predefined buckets of {100, 200, 300, 400, 500, 600, 700, 800, 900, 1000}, enabling a detailed analysis of the latency distribution of root spans and providing insight into the overall performance of the system at the root-span level. |
trace_accepted | Indicates that a new trace has been successfully added to the collector's cache. Useful for monitoring the acceptance or ingestion of traces within the system. |
trace_operations_latency_ms | Denotes the time difference, in milliseconds (ms), between the start and end times of a span for each trace operation, providing insight into the duration or latency of individual trace operations within a trace. |
trace_operations_failed | Counts the error events occurring in spans for each trace operation within a trace, i.e., the instances where an error was identified or logged during the execution of individual trace operations. |
trace_operations_succeeded | Counts the succeeded events in spans for each trace operation within a trace, i.e., the instances where individual trace operations completed without encountering errors. |
trace_spans_count_total | Represents the total count of spans within a trace. This metric provides a numerical value indicating the overall number of spans that make up a particular trace. |
trace_root_operation_latency_ms | Refers to the time difference, measured in milliseconds (ms), between the start and end times of a root span for each trace operation. This metric focuses specifically on the latency associated with the root span of a trace, providing insights into the duration of the entire trace operation. |
trace_root_span | Represents the number of root spans within a particular operation. Monitoring the "trace_root_span" metric can provide insights into the number of distinct operations or workflows initiated within a system, as each root span often corresponds to an independent unit of work or transaction. |
trace_spans_count | Indicates the count of total spans within a trace for each operation. Monitoring the "trace_spans_count" metric for each operation provides information on the total number of spans associated with individual operations. |
trace_root_operations_failed | Represents the number of error events occurring in root spans for each trace operation within a trace. This metric specifically focuses on errors encountered at the root span level, providing insights into the health and reliability of the initial or parent spans within traced operations. |
trace_operation_error | Defined as the ratio of failed trace operations (trace_operations_failed) to the total count of spans within a trace (trace_spans_count). This ratio measures the proportion of trace operations that encountered errors relative to the total number of spans in the trace. |
trace_response_http_status | Represents the total count of requests categorized by HTTP status code within a traced system, providing a breakdown of requests per status code and allowing the distribution of responses to be monitored and analyzed. |
trace_response_grpc_status | Represents the total count of requests categorized by gRPC status code within a traced system. In gRPC, status codes indicate the success or failure of a remote procedure call (RPC). |
trace_apdex_latency | Establishes buckets according to the configured Apdex threshold for latency in milliseconds. As traces are ingested, they are categorized into these buckets based on their latency values, incrementing the counter associated with each bucket. The counter for a specific bucket can be accessed through the auto-generated metric trace_apdex_latency_bucket{le="<latency>"}. |
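The histogram behavior described for the latency metrics above can be sketched as follows. This is a minimal, illustrative model of Prometheus-style cumulative bucketing using the bucket boundaries listed for trace_operations_latency; it is not the trace proxy's actual implementation.

```python
# Minimal sketch of Prometheus-style cumulative histogram bucketing,
# using the bucket boundaries listed for trace_operations_latency.
# Illustrative only -- not the trace proxy's actual implementation.

BUCKETS = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]

def observe(counters, latency_us):
    """Increment every cumulative bucket whose upper bound covers the value."""
    for le in BUCKETS:
        if latency_us <= le:
            counters[le] = counters.get(le, 0) + 1
    counters["+Inf"] = counters.get("+Inf", 0) + 1  # +Inf counts all observations

counters = {}
for latency in [150, 450, 450, 2000]:
    observe(counters, latency)

# Cumulative counts: le=200 covers 150; le=500 covers 150, 450, and 450.
print(counters[200], counters[500], counters["+Inf"])  # 1 3 4
```

Because the buckets are cumulative, querying `trace_operations_latency_bucket{le="500"}` returns the count of all observations at or below 500 µs, which is what makes quantile estimation possible.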
Trace Metrics
Metric | Description |
---|---|
trace_duration_ms | Represents the processing time spent by a span in the trace proxy, measured in milliseconds. |
trace_send_dropped | Indicates the number of traces dropped by the sampler. When dry run mode is enabled, "trace_send_kept" increments for each trace that is sent while "trace_send_dropped" remains 0, reflecting that all traces are sent to OpsRamp during the dry run and none are dropped by the sampler. |
trace_send_kept | Indicates the number of traces sent after the sampling rules are applied. With dry run mode enabled, "trace_send_kept" increments for each trace sent and "trace_send_dropped" remains 0, since all traces are sent to OpsRamp and none are dropped by the sampler. |
trace_send_ejected_full | Indicates the number of traces sent because the number of cached traces exceeded the cache capacity and the traces were ejected to make room. |
trace_send_ejected_memsize | Indicates the number of traces that could not be retained in the existing cache due to memory size constraints; such traces are moved into a new cache and subsequently sent. |
trace_send_expired | Indicates the number of traces sent because their trace timeout elapsed. |
trace_send_got_root | Indicates the number of traces sent because their root span arrived. |
trace_send_has_root | Represents the count of spans that are identified as root spans within a trace. This count indicates how many spans within a set of traces have been designated as root spans. |
trace_send_no_root | Represents the count of spans within a trace that are not identified as root spans. |
trace_sent_cache_hit | Indicates that the trace proxy received a span belonging to a trace that has already been sent. In this case the trace proxy checks the sampling decision for the trace: if the trace was sent, the proxy either forwards the span immediately to OpsRamp or drops it, depending on the implemented sampling strategy. |
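The dry run behavior described for trace_send_kept and trace_send_dropped can be sketched as follows. The names `dry_run_mode` and `should_keep` are assumptions for illustration, not the proxy's actual configuration keys or functions.

```python
# Illustrative sketch of how dry run mode affects the kept/dropped counters.
# The names (dry_run_mode, should_keep) are assumptions for illustration,
# not the trace proxy's actual configuration keys or functions.

counters = {"trace_send_kept": 0, "trace_send_dropped": 0}

def process_trace(trace, should_keep, dry_run_mode):
    """Apply the sampling decision; in dry run mode every trace is sent."""
    keep = should_keep(trace)
    if dry_run_mode or keep:
        counters["trace_send_kept"] += 1
        return "sent"
    counters["trace_send_dropped"] += 1
    return "dropped"

# A sampler that keeps only traces containing an error, for illustration.
sampler = lambda t: t.get("has_error", False)

for t in [{"has_error": False}, {"has_error": True}, {"has_error": False}]:
    process_trace(t, sampler, dry_run_mode=True)

print(counters)  # {'trace_send_kept': 3, 'trace_send_dropped': 0}
```

With `dry_run_mode=False`, the same three traces would yield one kept and two dropped, which is how the dry run lets you preview a sampling rule's impact before enforcing it.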
Collector Metrics
Metric | Description |
---|---|
collector_cache_buffer_overrun | This value should ideally remain zero. An increase might indicate a potential issue and could suggest the need to increase the size of the collector's circular buffer, which is typically configured using the "CacheCapacity" field. An increasing value does not necessarily mean the cache is full: it can occur even when "collector_cache_entries" (the number of entries in the cache) remains low in comparison to the configured "collector_cache_capacity". |
collector_cache_capacity | Represents the configured capacity of the collector's cache, i.e., the total size of the circular buffer used to temporarily store traces before they are processed or sent. You can use "collector_cache_capacity" together with "collector_cache_entries" to assess how full the cache is getting over time. |
collector_cache_entries | Provides various statistical measures (average, maximum, minimum, and percentile values such as p50, p95, and p99) that collectively indicate how full the collector's cache is over time. This metric reflects the number of records or traces present in the cache at different points in time. |
collector_cache_size | Represents the length or size of a circular buffer that currently stores traces in a tracing system. This circular buffer serves as a temporary storage mechanism for traces before they are further processed, analyzed, or sent to a destination. |
collector_incoming_queue | Records various statistical measures (average, maximum, minimum, and percentile values such as p50, p95, and p99) indicating how full the queue of spans received from outside the trace proxy and awaiting processing by the collector is. |
collector_peer_queue | Records various statistical measures (average, maximum, minimum, and percentile values such as p50, p95, and p99) indicating how full the queue of spans received from other trace proxy peers and awaiting processing by the collector is. |
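One common use of these metrics, as suggested for collector_cache_capacity, is computing cache utilization over time. A minimal sketch, assuming you have already scraped the two gauge values (the numbers below are made-up sample data):

```python
# Sketch: estimating how full the collector cache is from two scraped gauges.
# The scraped values here are made-up sample data for illustration.

def cache_utilization(entries, capacity):
    """Return cache fullness as a fraction of the configured capacity."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return entries / capacity

# Example: a cache configured for 10,000 traces currently holding 8,500.
util = cache_utilization(entries=8500, capacity=10000)
print(f"{util:.0%}")  # 85%

# A simple alerting-style check: warn when the cache is over 90% full.
if util > 0.9:
    print("collector cache nearly full; consider raising CacheCapacity")
```

In Prometheus itself the equivalent would be a ratio query over the two series; the threshold of 90% is an illustrative choice, not a recommendation from the proxy's documentation.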
Routing Metrics
Metric | Description |
---|---|
incoming_router_batch | Represents the number of times the trace proxy's batch event processing endpoint is hit by the master instance of the trace proxy. |
peer_router_batch | Represents the number of times the trace proxy's batch event processing endpoint is hit by a peer instance of the trace proxy. |
incoming_router_dropped | Represents the number of times the trace proxy fails to add new spans to a receive buffer when processing new events sent from the application to the master instance of the trace proxy. |
peer_router_dropped | Represents the number of times the trace proxy fails to add new spans to a receive buffer when processing new events from the master instance of the trace proxy. |
incoming_router_event | Represents the number of times the trace proxy's single event processing endpoint is hit by the master instance of the trace proxy. |
peer_router_event | Represents the number of times the trace proxy's single event processing endpoint is hit by a peer instance of the trace proxy. |
incoming_router_nonspan | Represents the number of times the trace proxy's router accepts non-span events (events that are not part of a trace) sent from the application to the master instance of the trace proxy. |
peer_router_nonspan | Represents the number of times the trace proxy's router accepts non-span events (events that are not part of a trace) from the master instance of the trace proxy. |
incoming_router_peer | Represents the count of traces routed into the master instance of the trace proxy from the application. |
peer_router_peer | Represents the count of traces routed into another instance of the trace proxy from the master instance of the trace proxy. |
incoming_router_proxied | Represents the count of traces routed into the master instance of the trace proxy from the application that have successfully reached the proxy. |
peer_router_proxied | Represents the count of traces routed into another instance of the trace proxy from the master instance that have successfully reached the proxy. |
incoming_router_span | Represents the count of events that the trace proxy accepts from applications and identifies as part of a trace, commonly referred to as spans. |
peer_router_span | Represents the count of events that the trace proxy accepts from the master instance of the trace proxy and identifies as part of a trace, commonly referred to as spans. |
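The span versus non-span distinction counted by incoming_router_span and incoming_router_nonspan can be sketched as follows. Using a `trace_id` field to classify events is an assumption for illustration, not the proxy's actual routing logic.

```python
# Sketch: classifying incoming events as spans vs. non-span events,
# as counted by incoming_router_span and incoming_router_nonspan.
# Using a trace_id field to classify events is an assumption for illustration.

counters = {"incoming_router_span": 0, "incoming_router_nonspan": 0}

def route_event(event):
    """Count an event as a span if it carries a trace identifier."""
    if event.get("trace_id"):
        counters["incoming_router_span"] += 1
    else:
        counters["incoming_router_nonspan"] += 1

events = [
    {"trace_id": "abc123", "name": "GET /orders"},   # span
    {"name": "heartbeat"},                           # non-span event
    {"trace_id": "def456", "name": "db.query"},      # span
]
for e in events:
    route_event(e)

print(counters)  # {'incoming_router_span': 2, 'incoming_router_nonspan': 1}
```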
Transmission Metrics
Metric | Description |
---|---|
upstream_enqueue_errors | Represents the count of spans that encountered errors while being dispatched to the OpsRamp environment. |
peer_enqueue_errors | Represents the count of spans that encountered errors while being dispatched from the master instance of the trace proxy to another instance. |
upstream_response_errors | Represents the count of spans that received an error response, or a status code greater than 202, when hitting upstream addresses. |
peer_response_errors | Represents the count of spans that received an error response, or a status code greater than 202, when hitting peer addresses from the master instance of the trace proxy. |
upstream_response_20x | Represents the count of spans that received a successful response (2xx status code) without encountering errors while hitting upstream addresses. |
peer_response_20x | Represents the count of spans that received a successful response (2xx status code) without encountering errors while hitting peer addresses from the master instance of the trace proxy. |
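The status-code threshold described above (success up to 202, error above 202 or on a transport failure) can be sketched as follows. This is an illustrative model of the counting rule, not the proxy's transmission code.

```python
# Sketch of the response classification described above: status codes up to
# 202 count toward the *_response_20x counters, anything above 202 (or a
# transport error) counts toward *_response_errors. Illustrative only.

counters = {"upstream_response_20x": 0, "upstream_response_errors": 0}

def record_response(status_code, transport_error=False):
    """Classify one response according to the >202 threshold."""
    if transport_error or status_code > 202:
        counters["upstream_response_errors"] += 1
    else:
        counters["upstream_response_20x"] += 1

for code in [200, 202, 204, 500]:
    record_response(code)
record_response(0, transport_error=True)

print(counters)  # {'upstream_response_20x': 2, 'upstream_response_errors': 3}
```

Note that under the documented ">202" rule, a 204 No Content response is counted as an error even though it is formally a 2xx code.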
Sampling Metrics
Metric | Description |
---|---|
dynsampler_num_dropped | Represents the count of traces dropped by the dynamic sampler. |
rulessampler_num_dropped | Represents the count of traces dropped by the rules-based sampler. |
dynsampler_num_kept | Represents the count of traces kept (not dropped) by the dynamic sampler. |
rulessampler_num_kept | Represents the count of traces kept (not dropped) by the rules-based sampler. |
dynsampler_sample_rate | Records various statistical measures (average, maximum, minimum, and percentile values such as p50, p95, and p99) of the sample rate reported by the configured dynamic sampler. |
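A sample rate of N conventionally means "keep roughly 1 in N traces". The following sketch shows how such a rate drives the kept/dropped counters; the deterministic hash-based decision is an illustrative stand-in for a real dynamic sampler, which adjusts the rate continuously.

```python
# Sketch: how a sample rate of N ("keep roughly 1 in N traces") drives the
# kept/dropped counters. The hash-based decision is an illustrative stand-in
# for a real dynamic sampler.
import hashlib

counters = {"dynsampler_num_kept": 0, "dynsampler_num_dropped": 0}

def sample(trace_id, sample_rate):
    """Deterministically keep ~1/sample_rate of traces by hashing the ID."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    keep = int.from_bytes(digest[:8], "big") % sample_rate == 0
    counters["dynsampler_num_kept" if keep else "dynsampler_num_dropped"] += 1
    return keep

for i in range(1000):
    sample(f"trace-{i}", sample_rate=10)

# With sample_rate=10, roughly 100 of the 1000 traces are kept.
print(counters["dynsampler_num_kept"])
```

Hashing the trace ID (rather than random sampling) ensures every span of a given trace receives the same keep/drop decision, which is essential for trace-level sampling.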
Cuckoo Cache Metrics
Metric | Description |
---|---|
cuckoo_current_capacity | Represents the configured capacity of the cuckoo drop cache, as set in the configuration section. |
cuckoo_future_load_factor | Represents the fraction of slots that are occupied in the future filter of the cuckoo cache. |
cuckoo_current_load_factor | Represents the fraction of slots that are occupied in the current filter of the cuckoo cache. |
Note: Additional process and Go runtime metrics, prefixed with process_ and go_, are used to assess the health of the trace proxy.