Pcpu delta spike by asmacdo · Pull Request #423 · con/duct

asmacdo · 2026-04-28T19:38:14Z

Warning: PR description is currently just accumulated notes from claude-code session. Some inaccuracies to clean up, lots of cruft to remove.

Linux delta-pcpu sampler (fixes #399)

Why

ps -o pcpu on Linux reports a process's lifetime average
(cputime / realtime), not instantaneous CPU. Summing it across a
churn of short-lived high-CPU workers is unbounded — duct logged
5363% on a 20-core box (#399). See
findings/calc-sketch-linux-pcpu.md (notebook worktree) for the
full mechanism.

Changes

instantaneous-pcpu calculation For each pid, recover cumulative cputime
from pcpu × etimes by taking a delta across consecutive samples, divide
by delta etimes.
Audit-trail fields. ProcessStats gains pcpu_raw (the
unmodified ps value) and etimes (float seconds). pcpu's
semantics shift to "corrected when computable, equals
pcpu_raw at fallback." The string etime is retained.
Fallback policy. When a pid is first observed, or
Δetimes ≤ 0, pcpu falls back to pcpu_raw rather than 0.0.
Silently reporting 0% for a churning workload would mislead
worse than carrying the lifetime ratio. Tradeoff is documented
in Limitations below.
State plumbing. The previous-sample raw values live on the
Tracker (which already holds cross-sample state); Linux/Mac
sampler functions stay pure with signature
(session_id, prev) -> (Optional[Sample], prev').
Drop young pids (smon's pattern). Pids whose ps etime
reads "00:00" (too young for ps to report a non-zero elapsed
time) are skipped from the sample. Mitigates oddity from observing the plot: add max's to legend, and check what is "total" pcpu/plot average? #399's
churn-of-short-lived-workers case more aggressively than the
pcpu_raw fallback alone — inflated lifetime ratios from
brand-new pids never enter the sum at all. Toggled by a
module-level DROP_YOUNG_PIDS constant (default True); see
TODOs for follow-up.
Empty-sample handling. When every pid in a ps round is
filtered (common at process startup before etime ticks past
1s), _get_sample_linux now returns (None, new_prev) instead
of an empty Sample. Mirrors the existing _get_sample_mac
no-pids path so the monitor loop skips the round rather than
crashing on Averages.from_sample's not-None assertions.
Linux only. macOS ps -o pcpu is a 5/8-aging exponential
average, not lifetime-cumulative; the delta math is invalid
there. Darwin keeps current behavior; finding documented in
findings/darwin-pcpu.md.
Schema bump 0.2.2 → 0.3.0. Reflects the additive fields and
the pcpu/total_pcpu semantics shift.
Sub-1s sample-interval warning. cli.py::run_command emits
a one-shot lgr.warning when --sample-interval < 1.0,
flagging that phase artifacts (false pcpu spikes) are likely on
Linux. Suppressed under --quiet / --log-level NONE.

Empirical validation

Workload: 4s busy + 10s idle Python loop, --s-i 1 --r-i 1.

	peak pcpu	average pcpu
main (lifetime calc)	300%	80.1%
this PR	100%	27.7%
ground truth	~100%	28.6%

Wall-time identical → no measurable overhead. Logs:
~/devel/duct/.duct/logs/2026.04.27T20.00.28-164596_* (main),
~/devel/duct/.worktrees/pcpu-delta-spike/.duct/logs/2026.04.27T20.00.07-164502_*.

Limitations (honest)

Captured in PROBLEMS.md (notebook worktree):

Sub-interval spikes are smoothed. A 0.2s 100% burst inside a
1s sample reads as 20%. Strictly better than the lifetime
method for late-run spikes, but lower-resolution than a
hypothetical /proc/<pid>/stat source.
Short-lived workers lose their audit record entirely.
DROP_YOUNG_PIDS discards the whole record (including
cmd/rss/vsz/pmem), not just pcpu, for any pid ps
reports with etime="00:00". Tradeoff in favor of mitigating
oddity from observing the plot: add max's to legend, and check what is "total" pcpu/plot average? #399 churn over preserving full visibility on every pid. Pids
that survive long enough for etime to tick past 1s are
observed normally on subsequent samples.
First-observed pids with etime > 0 still fall back to
pcpu_raw. A pid that's been alive for > 1s before duct
first samples it has no prior to delta against, so its first
recorded pcpu is the lifetime ratio. Less common than the
etime="00:00" case but not eliminated by DROP_YOUNG_PIDS.
Bounds-style representation (pcpu_lower/pcpu_upper) is the
long-term direction; deferred to a follow-up PR.
Sub-second sample intervals produce phase artifacts. ps etime resolution is 1s, so when configured sample-interval is
below that, Δetime collapses to 0 between many sample pairs.
The corrected calc falls back to pcpu_raw (the lifetime
ratio) on those samples, which can show as false pcpu spikes —
especially for short-lived pids whose lifetime ratio is
artificially high. At default 1.0s the issue does not occur in
practice (positive overhead keeps actual interval ≥ 1s);
cli.py warns when the configured value is below 1.0s.
First-sample volatility from procps. Occasionally the very
first record for a brand-new pid reports nonsense etime
(~10⁶-year readings observed once across ~7 manual runs). The
fallback protects the calc, but the raw value persists in the
audit record. Mitigation deferred.

Schema and back-compat

New required fields on ProcessStats: pcpu_raw: float,
etimes: float. Additive in JSON.
pcpu and total_pcpu keep their names; semantics shift on
Linux (corrected) and are unchanged on Darwin (lifetime).
__schema_version__ bumped to 0.3.0.
No deserialization shim is shipped: nothing in
con-duct pp / plot / ls reconstructs ProcessStats from
parsed JSON — all consumers read dicts. Old usage.jsonl files
remain plottable; their pcpu values continue to reflect the
old lifetime semantics (correct for files written by older
duct).
Mixed-version caveat: users who have both new and old
con-duct versions in rotation should be aware that
pcpu/total_pcpu numbers from new-duct runs are not
comparable to numbers from old-duct runs on the same workload.

TODOs left for follow-up

If a future consumer ever deserializes ProcessStats from
parsed JSON, it will need defaults for pcpu_raw/etimes on
pre-0.3.0 records. None exist today; flagged so the back-compat
cost is visible if one is added.
ls.py:113-119 already branches on parse_version for
pre-0.2.1 and pre-0.2.2 info.json files. If usage.jsonl ever
needs analogous version-aware loading, the same pattern
applies.
Promote DROP_YOUNG_PIDS from a module-level constant to an
env var + CLI flag. Today the toggle lives only in
_constants.py and is overridden in tests via a session-scoped
monkeypatch in test/conftest.py. That mechanism doesn't reach
subprocesses, which is why test_e2e.py had to bump several
fixture sleeps to ≥1s instead. The proper fix is the same
pattern as DUCT_SAMPLE_INTERVAL / DUCT_REPORT_INTERVAL:
env-var-driven default in _constants.py, paired CLI flag in
cli.py, parameter threaded through duct_execute → Report
→ _get_sample_linux. Once that lands, the e2e fixture sleeps
can return to short values and the conftest override can use
the env var directly. TODO (post-merge): file the issue
using the draft at
.worktrees/resource-measurement-notebook/FOLLOWUP_ISSUE_drop_young_pids.md
— only if this PR is accepted upstream.
Consider bumping --sample-interval default from 1.0s to
2.0s. At configured 1.0s with positive overhead the corrected
calc is reliable (Δetime ≥ 1 in steady state), so the bump is
not a correctness fix — it's a conservatism call that gives
users an extra safety margin away from the 1s etime quantum.
Pushing the default also pushes the observability floor (sub-2s
pids become single-sample with pcpu_raw fallback). Worth a
separate discussion / PR.
docs/resource-statistics.md rewrite on top of the new calc.
Will supersede Document what duct's resource statistics actually measure #414. Tracked separately.

Test coverage

test/duct_main/test_duct_utils.py — parser
(etime_to_etimes) and calculator (instantaneous_pcpu)
green/red paths.
test/duct_main/test_aggregation.py and test_report.py —
fixtures updated for the new fields.
test/duct_main/test_sampling.py (new) — three-call test of
_get_sample_linux with mocked ps: first-observation
fallback, corrected pcpu via prev lookup, pid-disappearance
handling in the rebuilt new_prev, and etime="00:00"
drop assertions. Plus a dedicated all-young-dropped test
asserting (None, new_prev) is returned. Both opt into
DROP_YOUNG_PIDS=True via the enable_drop_young_pids
fixture (the conftest disables the filter session-wide for
in-process tests).
test/duct_main/test_report.py::test_collect_sample_round_trips_prev_raw
(new) — regression guard for the load-bearing decision that
the Tracker holds the cross-sample state.
test/conftest.py gains a session-autouse fixture that
flips DROP_YOUNG_PIDS=False for in-process tests (so
pre-existing fixtures that rely on observing sub-second
workloads keep working) and a function-scoped opt-in
fixture for the unit tests above.
test/duct_main/test_e2e.py — sample-asserting tests
(test_spawn_children, test_session_modes,
test_session_mode_behavior_difference) bumped from
sub-second to sleep 2 so the workload outlives ps's
etime quantum. Tests that don't assert on samples
(test_sanity, test_logging_levels) keep their
original short sleeps. The empty-sample (None, new_prev)
return in _get_sample_linux keeps duct exiting cleanly
on those short-sleep tests rather than crashing the
monitor thread.

A multi-step trajectory test was considered and dropped: the
sampler rebuilds new_prev from scratch on each call, so a
3-step series cannot fail when a 2-step one passes. The
parameterized cases in test_instantaneous_pcpu_green already
cover the math.

Commits

d6054d6 parser: etime_to_etimes
56bae66 math: instantaneous_pcpu calculator
7ec330f fallback: instantaneous_pcpu returns
curr_pcpu (not None)
ab382ad schema: persist pcpu_raw and etimes on
ProcessStats
a1e61ae sampling: wire delta-pcpu into the Linux sampler
3953290 constants: bump schema_version to 0.3.0
dd18f13 tests: cover prev-state threading in
_get_sample_linux
ba24400 tests: cover prev_raw round-trip in
Report.collect_sample
a945ea6 sampling: drop young pids (etime=="00:00") by
default (also brings the empty-sample None-return,
conftest override, and targeted e2e sleep bumps)
0d742f5 cli: warn on sub-1s sample-interval

Each commit is atomic and pre-commit-clean; the series reads as a
coherent fix unit per Operating principles.

Parses ps's elapsed-time string format ([[DD-]HH:]MM:SS) into float seconds. The downstream instantaneous-pcpu calculation needs a numeric elapsed time, but ps -o etime emits only the formatted string. A future change may switch to ps -o etimes (which ps can produce directly as seconds); the parser stays useful either way for reading existing usage.jsonl logs that persist the string form. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pure function that recovers instantaneous %CPU between two ps samples of the same pid by inverting the procps cumulative-pcpu identity: cputime = pcpu * etime / 100. Subtracting cputime at two samples and dividing by the interval gives a value in the same units as the input pcpu. Linux-only by assumption (Darwin's ps reports a 5/8-decayed EWMA, not a cumulative ratio, so the identity does not hold there). Returns None when the elapsed-time interval is non-positive -- either the pid has not aged a full ps tick, or etimes regressed (suspected pid reuse). This commit only adds the function and its tests; wiring into the sampler is a separate change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add two required fields to ProcessStats so the sampler keeps a parsed numeric elapsed time alongside ps's raw outputs: - etimes: float seconds derived from etime via etime_to_etimes. Used by the upcoming delta-pcpu calc; etime (string) is retained in this commit for back-compat with existing usage.jsonl readers. - pcpu_raw: copy of the value ps reported. Equal to pcpu in this commit; pcpu will diverge once the delta calc is wired and pcpu_raw becomes the audit trail for the derivation. Both samplers (Linux and Darwin) populate the new fields. Aggregate takes max for pcpu_raw and the later sample's etimes, mirroring existing behavior for pcpu and etime. Test fixtures across test_aggregation and test_report pick up matching values. A backwards-compat layer for existing usage.jsonl logs that pre-date these fields is intentionally deferred. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Previous semantics returned None when the elapsed-time interval was non-positive (etimes did not advance, or regressed). That forced every caller into the same fallback boilerplate. New semantics: return curr_pcpu in the non-positive case. A sophisticated consumer can still detect "fallback was used" by comparing the result against pcpu_raw post-hoc; the audit data is unchanged. The function's return type tightens from float | None to float, and the only existing caller (the test suite) is updated. Rationale: returning curr_pcpu is the conservative answer that keeps the value honest about "something is here" rather than emitting a silent zero -- avoiding the false-zero failure mode of under-reporting churning workloads as idle. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Linux _get_sample_linux now accepts a prev-state dict mapping pid -> (pcpu_raw, etimes), looks up each pid's prior raw inputs, and computes corrected pcpu via instantaneous_pcpu. First-observed pids fall back to pcpu_raw (no prior to delta against). The dict is rebuilt fresh from the current sample, so pids that disappear between samples are dropped automatically -- no unbounded growth across long runs. Mac sampler matches the new signature for uniform dispatch but returns prev unchanged; Darwin's ps reports a 5/8-decayed EWMA, not a cumulative ratio, so the delta calc does not apply there. Report holds the prev_raw dict across collect_sample calls, initialized empty in __init__. No public-API changes; tests that construct Sample/ProcessStats directly are untouched. Limitations (also captured in PROBLEMS.md in the resource-measurement-notebook worktree): - For workloads where no pid survives long enough for etimes to advance past the previous sample's value (the con#399 trigger case of sequentially-churning short-lived workers), every observation falls back to pcpu_raw and the calc provides no improvement. - Sub-second sample intervals frequently land both samples in the same etime quantum, yielding the fallback path. - The persisted pcpu_raw alongside the corrected pcpu lets a consumer detect "fallback was used" by comparing the two. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ProcessStats gained two required fields (pcpu_raw, etimes) and pcpu's semantics shifted on Linux from "lifetime average" to "corrected when computable, equals pcpu_raw at fallback." The schema is additive on the JSON wire (existing keys keep their names), but consumers reading new files with old expectations should know the meaning of pcpu/total_pcpu has changed -- hence the minor bump. ls.py's ensure_compliant_schema patches info.json fields only (working_directory at 0.2.1, message at 0.2.2). No 0.3.0 entry is needed there: the additions live on usage.jsonl ProcessStats, and no consumer in this codebase reconstructs ProcessStats from parsed JSON (all read dicts directly). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The unit tests in test_duct_utils.py exercise instantaneous_pcpu in isolation, but nothing covers the wiring layer between it and the live sampler -- the dict-keyed prev lookup, the first-observation fallback, and the rebuild-from-scratch behavior of new_prev that drops vanished pids. Three-call test with a mocked ps subprocess: - Call 1: empty prev -> pcpu falls back to pcpu_raw. - Call 2: corrected via instantaneous_pcpu (matches the constant 84% case from the parameterized green tests). - Call 3: pid disappears, new pid appears -> new_prev contains only the surviving pid. Catches regressions where the (pcpu_raw, etimes) tuple order is swapped, the prev lookup is dropped, or new_prev accumulates stale entries. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

One of the load-bearing decisions in the delta-pcpu work is that the Tracker holds the cross-sample state (rather than a module-level dict or a sampler-as-class). Without a regression test, refactoring self.prev_raw to a local would silently disable the calc on every sample after the first. Two-call test mocking _get_sample with a side_effect that inspects each invocation's input prev. Asserts the first call gets {} and the second gets the dict returned from the first; asserts report.prev_raw lands on the latest returned value after each call. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adopts smon's pattern of skipping pids whose ps elapsed-time reads "00:00" -- pids too young for ps to report a meaningful etime. Mitigates the con#399 churn-of-short-lived-workers case more aggressively than the pcpu_raw fallback alone, by preventing inflated lifetime ratios from ever entering the sample. Tradeoff: loses the entire record (cmd/rss/vsz/pmem) for any pid ps reports with etime=="00:00", not just the pcpu field. Pids that survive long enough for etime to tick past 1s are observed normally on subsequent samples. Toggled by DROP_YOUNG_PIDS in _constants.py rather than a CLI flag for now. Promoting it to an env-var + CLI flag (matching the DUCT_SAMPLE_INTERVAL / DUCT_REPORT_INTERVAL pattern) is tracked as a follow-up; see PR description. Empty-sample handling: _get_sample_linux now returns (None, new_prev) when every pid in a ps round is filtered (common at process startup before etime ticks past 1s). Mirrors the existing _get_sample_mac no-pids path so the monitor loop skips the round rather than crashing on Averages.from_sample's not-None assertions. The return type narrows from Sample to Optional[Sample] to match. Tests: - test_sampling.py grows a young-pid line in the threading test (asserts both sample and new_prev exclude it) and a dedicated all-young-dropped test (asserts the (None, new_prev) return). Both opt into DROP_YOUNG_PIDS=True via enable_drop_young_pids applied as a usefixtures mark. - test/conftest.py gains a session-autouse fixture that flips DROP_YOUNG_PIDS=False for in-process tests (so pre-existing fixtures asserting on sub-second workloads keep working) and a function-scoped enable_drop_young_pids opt-in for unit tests that specifically exercise the drop. The session override reaches in-process tests only; subprocess tests in test_e2e.py inherit the production default. - test_e2e.py: sample-asserting tests (test_spawn_children, test_session_modes, test_session_mode_behavior_difference) bumped from sub-second to >=1s sleeps. Tests that don't assert on samples (test_sanity, test_logging_levels) keep their short sleeps -- the empty-sample handling makes them exit cleanly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ps reports etime at 1-second resolution, so configured sample-interval below 1s produces phase artifacts in the corrected pcpu calc -- adjacent sample pairs frequently land in the same etime quantum, the calc falls back to pcpu_raw (lifetime ratio), and short-lived pids with high startup CPU show as false spikes. At the default 1.0s with positive sampling overhead the issue does not occur in practice (actual interval >= 1s, Δetime >= 1 in steady state). The warning fires only when the user explicitly configures a sub-1s interval; --quiet and --log-level NONE suppress it through the normal log machinery. Bumping the default to 2.0s for an extra safety margin is captured as a TODO in the PR description; not a correctness fix, just a conservatism call. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codecov · 2026-04-28T19:41:16Z

Codecov Report

❌ Patch coverage is 96.22642% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.99%. Comparing base (2162da9) to head (0d742f5).
⚠️ Report is 11 commits behind head on main.

Files with missing lines	Patch %	Lines
src/con_duct/cli.py	0.00%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #423      +/-   ##
==========================================
+ Coverage   91.87%   91.99%   +0.12%     
==========================================
  Files          15       15              
  Lines        1120     1162      +42     
  Branches      139      147       +8     
==========================================
+ Hits         1029     1069      +40     
- Misses         69       71       +2     
  Partials       22       22

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

asmacdo and others added 10 commits April 27, 2026 17:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pcpu delta spike#423

Pcpu delta spike#423
asmacdo wants to merge 10 commits intocon:mainfrom
asmacdo:pcpu-delta-spike

asmacdo commented Apr 28, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Apr 28, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

asmacdo commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Linux delta-pcpu sampler (fixes #399)

Why

Changes

Empirical validation

Limitations (honest)

Schema and back-compat

TODOs left for follow-up

Test coverage

Commits

Uh oh!

codecov Bot commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

asmacdo commented Apr 28, 2026 •

edited

Loading

codecov Bot commented Apr 28, 2026 •

edited

Loading