[PROPOSAL] Introduce native jitter metrics (computed in streaming) for Component and e2e stream stability analysis #8583

@GGraziadei

Description

Currently, Apache Storm provides comprehensive metrics for throughput and average latency (execute-latency, process-latency). However, in high-precision real-time systems, averages often mask critical performance instabilities.

This proposal introduces a native Jitter Metric calculated at two levels:

  • Component level (Step Jitter): Measures the variability in execution time within individual Bolts and Spouts.
  • Topology level (Global Jitter): Measures the variability in e2e completion latency for fully acked tuples.

In deterministic real-time processing, the variance of the latency is as important as the latency itself (https://ieeexplore.ieee.org/abstract/document/10877871).

Why analysing jitter matters for real-time

Predictable latency is a precondition for building a deterministic system, and a jitter metric supports it in several ways:

  • Micro-burst detection: high jitter reveals short spikes that average latency smooths out.
  • Compliance: modern SLAs rely on percentiles (e.g., P99). Jitter is a strong leading indicator of tail-latency degradation.
  • Root Cause Analysis: high component jitter points to GC pressure or resource contention, whereas high global jitter with stable components suggests network congestion or shuffle bottlenecks.
  • Bottleneck identification: jitter enables precise identification of where bottlenecks occur in the topology and helps distinguish their underlying causes, making performance issues easier to diagnose and resolve.
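
To illustrate the micro-burst point, here is a minimal sketch comparing two made-up latency series with the same average. It uses a plain, unsmoothed mean of absolute consecutive differences as a stand-in for the proposed metric. The class and method names are hypothetical and not part of Storm.

```java
import java.util.Arrays;

/** Illustration only: two invented latency series with identical averages. */
final class MeanVsJitterDemo {

    /** Arithmetic mean of the samples (0 for an empty array). */
    static double mean(double[] xs) {
        return Arrays.stream(xs).average().orElse(0.0);
    }

    /** Mean absolute difference between consecutive samples. */
    static double meanAbsDelta(double[] xs) {
        if (xs.length < 2) {
            return 0.0;
        }
        double sum = 0.0;
        for (int i = 1; i < xs.length; i++) {
            sum += Math.abs(xs[i] - xs[i - 1]);
        }
        return sum / (xs.length - 1);
    }

    public static void main(String[] args) {
        // Latencies in ms; the second series hides a single 42 ms spike.
        double[] steady = {10, 10, 10, 10, 10, 10, 10, 10, 10};
        double[] bursty = {6, 6, 6, 6, 42, 6, 6, 6, 6};

        System.out.println(mean(steady) + " vs " + mean(bursty));                 // 10.0 vs 10.0
        System.out.println(meanAbsDelta(steady) + " vs " + meanAbsDelta(bursty)); // 0.0 vs 9.0
    }
}
```

Both series report an average latency of 10 ms, so an average-only dashboard cannot tell them apart, while the difference-based figure separates them clearly.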

Proposed model: Exponentially Weighted Moving Average (EWMA)

To ensure negligible performance impact, I propose to use an Exponentially Weighted Moving Average (EWMA), following the interarrival jitter estimator in RFC 1889, Appendix A.8: https://www.rfc-editor.org/rfc/rfc1889#appendix-A.8

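Mathematical Model:
J_new = J_old + (|D_current - D_previous| - J_old) / 16

where D is the latency (transit time) of a sample, J is the jitter estimate, and 1/16 is the RFC 1889 smoothing factor.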

CONSTANT RFC1889_ALPHA = 1 / 16
GIVEN a State {ewmaJitter = 0, lastTransit = UNINITIALIZED}
PROCEDURE addValue(transitMs)
    // Ignore invalid (negative) samples
    IF transitMs < 0 THEN
        EXIT PROCEDURE
    END IF

    IF lastTransit IS NOT UNINITIALIZED THEN
        // Calculate the absolute difference between the current and previous transit time
        deviation = ABS(transitMs - lastTransit)

        // Update the Exponentially Weighted Moving Average using the RFC 1889 smoothing factor
        ewmaJitter = ewmaJitter + (deviation - ewmaJitter) * RFC1889_ALPHA
    END IF

    // Store current transit time for the next iteration
    lastTransit = transitMs
END PROCEDURE
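
As a rough Java sketch of the procedure above: the class name `EwmaJitter`, the accessor `getJitter`, and the use of `NaN` as the "uninitialized" marker are assumptions made for illustration, not existing Storm APIs. The sketch is not thread-safe and assumes a single updating thread.

```java
/** Hypothetical helper mirroring the pseudocode; not part of the Storm API. */
final class EwmaJitter {
    // Smoothing factor of 1/16, as in RFC 1889, Appendix A.8.
    static final double RFC1889_ALPHA = 1.0 / 16.0;

    private double ewmaJitter = 0.0;
    // NaN marks "no previous sample yet" (an assumption of this sketch).
    private double lastTransit = Double.NaN;

    /** Feeds one latency sample in milliseconds; negative or NaN samples are ignored. */
    void addValue(double transitMs) {
        if (Double.isNaN(transitMs) || transitMs < 0) {
            return;
        }
        if (!Double.isNaN(lastTransit)) {
            // Absolute difference between the current and previous transit time.
            double deviation = Math.abs(transitMs - lastTransit);
            ewmaJitter += (deviation - ewmaJitter) * RFC1889_ALPHA;
        }
        // Remember the current sample for the next update.
        lastTransit = transitMs;
    }

    /** Current jitter estimate in milliseconds. */
    double getJitter() {
        return ewmaJitter;
    }

    public static void main(String[] args) {
        EwmaJitter jitter = new EwmaJitter();
        for (double transitMs : new double[] {10, 10, 26, 10}) {
            jitter.addValue(transitMs);
        }
        // Deviations are 0, 16, 16: 0 -> 1.0 -> 1.0 + (16 - 1.0) / 16
        System.out.println(jitter.getJitter()); // prints 1.9375
    }
}
```

The first sample only seeds `lastTransit`, so the estimate starts moving with the second sample.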

Performance impact

  • Minimal computational overhead: by utilizing an EWMA, we avoid the need for storing large datasets or sliding window buffers. The jitter is updated via a single linear equation, requiring only basic arithmetic.
  • Memory efficiency: the state per executor is two values, the moving average (8 bytes) and the previous latency sample.
  • System calls: to avoid additional clock reads, the metric would hook into the existing latency tracking logic. How to reuse the measurements that are already sampled there still needs discussion.

Limitations and constraints

  • Clock skew: global jitter may be affected when node clocks are unsynchronised. However, since jitter is computed from the difference between consecutive samples, a constant offset between clocks cancels out; only skew that changes between samples (drift) leaks into the metric.
  • Sampling bias: Low sampling rates may miss high-frequency jitter spikes.
  • Warm-up: as an EWMA-based metric, values may fluctuate initially before stabilizing.
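
The clock-skew and warm-up points can be checked with a small sketch. The names below are hypothetical and the sample values are invented; the update rule is the 1/16 EWMA from the model above, starting from a jitter of 0.

```java
/** Illustration only: effect of clock offset, clock drift and warm-up on the EWMA. */
final class SkewAndWarmupDemo {
    static final double ALPHA = 1.0 / 16.0;

    /** Runs the EWMA jitter update over a whole series of transit times. */
    static double jitterOf(double[] transits) {
        double jitter = 0.0;
        for (int i = 1; i < transits.length; i++) {
            double deviation = Math.abs(transits[i] - transits[i - 1]);
            jitter += (deviation - jitter) * ALPHA;
        }
        return jitter;
    }

    /** Copies xs, adding a constant offset plus a per-sample drift to each value. */
    static double[] shifted(double[] xs, double offset, double driftPerSample) {
        double[] out = new double[xs.length];
        for (int i = 0; i < xs.length; i++) {
            out[i] = xs[i] + offset + driftPerSample * i;
        }
        return out;
    }

    /** Updates needed, under a constant deviation, to reach the given fraction of it. */
    static int updatesToReach(double fraction) {
        if (!(fraction > 0.0 && fraction < 1.0)) {
            throw new IllegalArgumentException("fraction must be in (0, 1)");
        }
        double jitter = 0.0;
        int updates = 0;
        while (jitter < fraction) {
            jitter += (1.0 - jitter) * ALPHA; // constant deviation normalised to 1
            updates++;
        }
        return updates;
    }

    public static void main(String[] args) {
        double[] base = {10, 12, 9, 15, 11};
        // A constant 250 ms clock offset leaves the estimate unchanged.
        System.out.println(jitterOf(base) == jitterOf(shifted(base, 250, 0))); // true
        // A drift of 2 ms per sample changes the consecutive differences.
        System.out.println(jitterOf(base) == jitterOf(shifted(base, 0, 2)));   // false
        // Warm-up: updates until the estimate reaches 90% of a steady deviation.
        System.out.println(updatesToReach(0.9));                               // 36
    }
}
```

With a gain of 1/16, the estimate needs a few dozen samples to settle, so early readings after a topology start or rebalance should be read with that in mind.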
