Conversation
Parses ps's elapsed-time string format ([[DD-]HH:]MM:SS) into float seconds. The downstream instantaneous-pcpu calculation needs a numeric elapsed time, but ps -o etime emits only the formatted string. A future change may switch to ps -o etimes (which ps can produce directly as seconds); the parser stays useful either way for reading existing usage.jsonl logs that persist the string form. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pure function that recovers instantaneous %CPU between two ps samples of the same pid by inverting the procps cumulative-pcpu identity: cputime = pcpu * etime / 100. Subtracting cputime at two samples and dividing by the interval gives a value in the same units as the input pcpu. Linux-only by assumption (Darwin's ps reports a 5/8-decayed EWMA, not a cumulative ratio, so the identity does not hold there). Returns None when the elapsed-time interval is non-positive -- either the pid has not aged a full ps tick, or etimes regressed (suspected pid reuse). This commit only adds the function and its tests; wiring into the sampler is a separate change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add two required fields to ProcessStats so the sampler keeps a parsed numeric elapsed time alongside ps's raw outputs: - etimes: float seconds derived from etime via etime_to_etimes. Used by the upcoming delta-pcpu calc; etime (string) is retained in this commit for back-compat with existing usage.jsonl readers. - pcpu_raw: copy of the value ps reported. Equal to pcpu in this commit; pcpu will diverge once the delta calc is wired and pcpu_raw becomes the audit trail for the derivation. Both samplers (Linux and Darwin) populate the new fields. Aggregate takes max for pcpu_raw and the later sample's etimes, mirroring existing behavior for pcpu and etime. Test fixtures across test_aggregation and test_report pick up matching values. A backwards-compat layer for existing usage.jsonl logs that pre-date these fields is intentionally deferred. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous semantics returned None when the elapsed-time interval was non-positive (etimes did not advance, or regressed). That forced every caller into the same fallback boilerplate. New semantics: return curr_pcpu in the non-positive case. A sophisticated consumer can still detect "fallback was used" by comparing the result against pcpu_raw post-hoc; the audit data is unchanged. The function's return type tightens from float | None to float, and the only existing caller (the test suite) is updated. Rationale: returning curr_pcpu is the conservative answer that keeps the value honest about "something is here" rather than emitting a silent zero -- avoiding the false-zero failure mode of under-reporting churning workloads as idle. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Linux _get_sample_linux now accepts a prev-state dict mapping pid -> (pcpu_raw, etimes), looks up each pid's prior raw inputs, and computes corrected pcpu via instantaneous_pcpu. First-observed pids fall back to pcpu_raw (no prior to delta against). The dict is rebuilt fresh from the current sample, so pids that disappear between samples are dropped automatically -- no unbounded growth across long runs. Mac sampler matches the new signature for uniform dispatch but returns prev unchanged; Darwin's ps reports a 5/8-decayed EWMA, not a cumulative ratio, so the delta calc does not apply there. Report holds the prev_raw dict across collect_sample calls, initialized empty in __init__. No public-API changes; tests that construct Sample/ProcessStats directly are untouched. Limitations (also captured in PROBLEMS.md in the resource-measurement-notebook worktree): - For workloads where no pid survives long enough for etimes to advance past the previous sample's value (the con#399 trigger case of sequentially-churning short-lived workers), every observation falls back to pcpu_raw and the calc provides no improvement. - Sub-second sample intervals frequently land both samples in the same etime quantum, yielding the fallback path. - The persisted pcpu_raw alongside the corrected pcpu lets a consumer detect "fallback was used" by comparing the two. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ProcessStats gained two required fields (pcpu_raw, etimes) and pcpu's semantics shifted on Linux from "lifetime average" to "corrected when computable, equals pcpu_raw at fallback." The schema is additive on the JSON wire (existing keys keep their names), but consumers reading new files with old expectations should know the meaning of pcpu/total_pcpu has changed -- hence the minor bump. ls.py's ensure_compliant_schema patches info.json fields only (working_directory at 0.2.1, message at 0.2.2). No 0.3.0 entry is needed there: the additions live on usage.jsonl ProcessStats, and no consumer in this codebase reconstructs ProcessStats from parsed JSON (all read dicts directly). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The unit tests in test_duct_utils.py exercise instantaneous_pcpu in isolation, but nothing covers the wiring layer between it and the live sampler -- the dict-keyed prev lookup, the first-observation fallback, and the rebuild-from-scratch behavior of new_prev that drops vanished pids. Three-call test with a mocked ps subprocess: - Call 1: empty prev -> pcpu falls back to pcpu_raw. - Call 2: corrected via instantaneous_pcpu (matches the constant 84% case from the parameterized green tests). - Call 3: pid disappears, new pid appears -> new_prev contains only the surviving pid. Catches regressions where the (pcpu_raw, etimes) tuple order is swapped, the prev lookup is dropped, or new_prev accumulates stale entries. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
One of the load-bearing decisions in the delta-pcpu work is that
the Tracker holds the cross-sample state (rather than a
module-level dict or a sampler-as-class). Without a regression
test, refactoring self.prev_raw to a local would silently disable
the calc on every sample after the first.
Two-call test mocking _get_sample with a side_effect that
inspects each invocation's input prev. Asserts the first call
gets {} and the second gets the dict returned from the first;
asserts report.prev_raw lands on the latest returned value after
each call.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adopts smon's pattern of skipping pids whose ps elapsed-time reads "00:00" -- pids too young for ps to report a meaningful etime. Mitigates the con#399 churn-of-short-lived-workers case more aggressively than the pcpu_raw fallback alone, by preventing inflated lifetime ratios from ever entering the sample. Tradeoff: loses the entire record (cmd/rss/vsz/pmem) for any pid ps reports with etime=="00:00", not just the pcpu field. Pids that survive long enough for etime to tick past 1s are observed normally on subsequent samples. Toggled by DROP_YOUNG_PIDS in _constants.py rather than a CLI flag for now. Promoting it to an env-var + CLI flag (matching the DUCT_SAMPLE_INTERVAL / DUCT_REPORT_INTERVAL pattern) is tracked as a follow-up; see PR description. Empty-sample handling: _get_sample_linux now returns (None, new_prev) when every pid in a ps round is filtered (common at process startup before etime ticks past 1s). Mirrors the existing _get_sample_mac no-pids path so the monitor loop skips the round rather than crashing on Averages.from_sample's not-None assertions. The return type narrows from Sample to Optional[Sample] to match. Tests: - test_sampling.py grows a young-pid line in the threading test (asserts both sample and new_prev exclude it) and a dedicated all-young-dropped test (asserts the (None, new_prev) return). Both opt into DROP_YOUNG_PIDS=True via enable_drop_young_pids applied as a usefixtures mark. - test/conftest.py gains a session-autouse fixture that flips DROP_YOUNG_PIDS=False for in-process tests (so pre-existing fixtures asserting on sub-second workloads keep working) and a function-scoped enable_drop_young_pids opt-in for unit tests that specifically exercise the drop. The session override reaches in-process tests only; subprocess tests in test_e2e.py inherit the production default. - test_e2e.py: sample-asserting tests (test_spawn_children, test_session_modes, test_session_mode_behavior_difference) bumped from sub-second to >=1s sleeps. Tests that don't assert on samples (test_sanity, test_logging_levels) keep their short sleeps -- the empty-sample handling makes them exit cleanly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ps reports etime at 1-second resolution, so configured sample-interval below 1s produces phase artifacts in the corrected pcpu calc -- adjacent sample pairs frequently land in the same etime quantum, the calc falls back to pcpu_raw (lifetime ratio), and short-lived pids with high startup CPU show as false spikes. At the default 1.0s with positive sampling overhead the issue does not occur in practice (actual interval >= 1s, Δetime >= 1 in steady state). The warning fires only when the user explicitly configures a sub-1s interval; --quiet and --log-level NONE suppress it through the normal log machinery. Bumping the default to 2.0s for an extra safety margin is captured as a TODO in the PR description; not a correctness fix, just a conservatism call. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #423 +/- ##
==========================================
+ Coverage 91.87% 91.99% +0.12%
==========================================
Files 15 15
Lines 1120 1162 +42
Branches 139 147 +8
==========================================
+ Hits 1029 1069 +40
- Misses 69 71 +2
Partials 22 22 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Warning: PR description is currently just accumulated notes from claude-code session. Some inaccuracies to clean up, lots of cruft to remove.
Linux delta-pcpu sampler (fixes #399)
Why
ps -o pcpuon Linux reports a process's lifetime average(
cputime / realtime), not instantaneous CPU. Summing it across achurn of short-lived high-CPU workers is unbounded — duct logged
5363% on a 20-core box (#399). See
findings/calc-sketch-linux-pcpu.md(notebook worktree) for thefull mechanism.
Changes
from
pcpu × etimesby taking a delta across consecutive samples, divideby delta etimes.
ProcessStatsgainspcpu_raw(theunmodified ps value) and
etimes(float seconds).pcpu'ssemantics shift to "corrected when computable, equals
pcpu_rawat fallback." The stringetimeis retained.Δetimes ≤ 0,
pcpufalls back topcpu_rawrather than 0.0.Silently reporting 0% for a churning workload would mislead
worse than carrying the lifetime ratio. Tradeoff is documented
in Limitations below.
Tracker(which already holds cross-sample state); Linux/Macsampler functions stay pure with signature
(session_id, prev) -> (Optional[Sample], prev').etimereads
"00:00"(too young for ps to report a non-zero elapsedtime) are skipped from the sample. Mitigates oddity from observing the plot: add max's to legend, and check what is "total" pcpu/plot average? #399's
churn-of-short-lived-workers case more aggressively than the
pcpu_rawfallback alone — inflated lifetime ratios frombrand-new pids never enter the sum at all. Toggled by a
module-level
DROP_YOUNG_PIDSconstant (defaultTrue); seeTODOs for follow-up.
psround isfiltered (common at process startup before
etimeticks past1s),
_get_sample_linuxnow returns(None, new_prev)insteadof an empty
Sample. Mirrors the existing_get_sample_macno-pids path so the monitor loop skips the round rather than
crashing on
Averages.from_sample's not-Noneassertions.ps -o pcpuis a 5/8-aging exponentialaverage, not lifetime-cumulative; the delta math is invalid
there. Darwin keeps current behavior; finding documented in
findings/darwin-pcpu.md.the
pcpu/total_pcpusemantics shift.cli.py::run_commandemitsa one-shot
lgr.warningwhen--sample-interval < 1.0,flagging that phase artifacts (false pcpu spikes) are likely on
Linux. Suppressed under
--quiet/--log-level NONE.Empirical validation
Workload: 4s busy + 10s idle Python loop,
--s-i 1 --r-i 1.Wall-time identical → no measurable overhead. Logs:
~/devel/duct/.duct/logs/2026.04.27T20.00.28-164596_*(main),~/devel/duct/.worktrees/pcpu-delta-spike/.duct/logs/2026.04.27T20.00.07-164502_*.Limitations (honest)
Captured in
PROBLEMS.md(notebook worktree):1s sample reads as 20%. Strictly better than the lifetime
method for late-run spikes, but lower-resolution than a
hypothetical
/proc/<pid>/statsource.DROP_YOUNG_PIDSdiscards the whole record (includingcmd/rss/vsz/pmem), not justpcpu, for any pid psreports with
etime="00:00". Tradeoff in favor of mitigatingoddity from observing the plot: add max's to legend, and check what is "total" pcpu/plot average? #399 churn over preserving full visibility on every pid. Pids
that survive long enough for
etimeto tick past 1s areobserved normally on subsequent samples.
etime > 0still fall back topcpu_raw. A pid that's been alive for > 1s before ductfirst samples it has no prior to delta against, so its first
recorded
pcpuis the lifetime ratio. Less common than theetime="00:00"case but not eliminated byDROP_YOUNG_PIDS.Bounds-style representation (
pcpu_lower/pcpu_upper) is thelong-term direction; deferred to a follow-up PR.
ps etimeresolution is 1s, so when configured sample-interval isbelow that, Δetime collapses to 0 between many sample pairs.
The corrected calc falls back to
pcpu_raw(the lifetimeratio) on those samples, which can show as false pcpu spikes —
especially for short-lived pids whose lifetime ratio is
artificially high. At default 1.0s the issue does not occur in
practice (positive overhead keeps actual interval ≥ 1s);
cli.pywarns when the configured value is below 1.0s.first record for a brand-new pid reports nonsense
etime(~10⁶-year readings observed once across ~7 manual runs). The
fallback protects the calc, but the raw value persists in the
audit record. Mitigation deferred.
Schema and back-compat
ProcessStats:pcpu_raw: float,etimes: float. Additive in JSON.pcpuandtotal_pcpukeep their names; semantics shift onLinux (corrected) and are unchanged on Darwin (lifetime).
__schema_version__bumped to0.3.0.con-duct pp / plot / lsreconstructsProcessStatsfromparsed JSON — all consumers read dicts. Old
usage.jsonlfilesremain plottable; their
pcpuvalues continue to reflect theold lifetime semantics (correct for files written by older
duct).
con-ductversions in rotation should be aware thatpcpu/total_pcpunumbers from new-duct runs are notcomparable to numbers from old-duct runs on the same workload.
TODOs left for follow-up
ProcessStatsfromparsed JSON, it will need defaults for
pcpu_raw/etimesonpre-0.3.0 records. None exist today; flagged so the back-compat
cost is visible if one is added.
ls.py:113-119already branches onparse_versionforpre-0.2.1 and pre-0.2.2 info.json files. If usage.jsonl ever
needs analogous version-aware loading, the same pattern
applies.
DROP_YOUNG_PIDSfrom a module-level constant to anenv var + CLI flag. Today the toggle lives only in
_constants.pyand is overridden in tests via a session-scopedmonkeypatch in
test/conftest.py. That mechanism doesn't reachsubprocesses, which is why
test_e2e.pyhad to bump severalfixture sleeps to ≥1s instead. The proper fix is the same
pattern as
DUCT_SAMPLE_INTERVAL/DUCT_REPORT_INTERVAL:env-var-driven default in
_constants.py, paired CLI flag incli.py, parameter threaded throughduct_execute→Report→
_get_sample_linux. Once that lands, the e2e fixture sleepscan return to short values and the conftest override can use
the env var directly. TODO (post-merge): file the issue
using the draft at
.worktrees/resource-measurement-notebook/FOLLOWUP_ISSUE_drop_young_pids.md— only if this PR is accepted upstream.
--sample-intervaldefault from 1.0s to2.0s. At configured 1.0s with positive overhead the corrected
calc is reliable (Δetime ≥ 1 in steady state), so the bump is
not a correctness fix — it's a conservatism call that gives
users an extra safety margin away from the 1s etime quantum.
Pushing the default also pushes the observability floor (sub-2s
pids become single-sample with
pcpu_rawfallback). Worth aseparate discussion / PR.
docs/resource-statistics.mdrewrite on top of the new calc.Will supersede Document what duct's resource statistics actually measure #414. Tracked separately.
Test coverage
test/duct_main/test_duct_utils.py— parser(
etime_to_etimes) and calculator (instantaneous_pcpu)green/red paths.
test/duct_main/test_aggregation.pyandtest_report.py—fixtures updated for the new fields.
test/duct_main/test_sampling.py(new) — three-call test of_get_sample_linuxwith mocked ps: first-observationfallback, corrected pcpu via prev lookup, pid-disappearance
handling in the rebuilt
new_prev, andetime="00:00"drop assertions. Plus a dedicated all-young-dropped test
asserting
(None, new_prev)is returned. Both opt intoDROP_YOUNG_PIDS=Truevia theenable_drop_young_pidsfixture (the conftest disables the filter session-wide for
in-process tests).
test/duct_main/test_report.py::test_collect_sample_round_trips_prev_raw(new) — regression guard for the load-bearing decision that
the
Trackerholds the cross-sample state.test/conftest.pygains a session-autouse fixture thatflips
DROP_YOUNG_PIDS=Falsefor in-process tests (sopre-existing fixtures that rely on observing sub-second
workloads keep working) and a function-scoped opt-in
fixture for the unit tests above.
test/duct_main/test_e2e.py— sample-asserting tests(
test_spawn_children,test_session_modes,test_session_mode_behavior_difference) bumped fromsub-second to
sleep 2so the workload outlives ps'setimequantum. Tests that don't assert on samples(
test_sanity,test_logging_levels) keep theiroriginal short sleeps. The empty-sample
(None, new_prev)return in
_get_sample_linuxkeeps duct exiting cleanlyon those short-sleep tests rather than crashing the
monitor thread.
A multi-step trajectory test was considered and dropped: the
sampler rebuilds
new_prevfrom scratch on each call, so a3-step series cannot fail when a 2-step one passes. The
parameterized cases in
test_instantaneous_pcpu_greenalreadycover the math.
Commits
d6054d6parser:etime_to_etimes56bae66math:instantaneous_pcpucalculator7ec330ffallback:instantaneous_pcpureturnscurr_pcpu(notNone)ab382adschema: persistpcpu_rawandetimesonProcessStatsa1e61aesampling: wire delta-pcpu into the Linux sampler3953290constants: bump schema_version to 0.3.0dd18f13tests: cover prev-state threading in_get_sample_linuxba24400tests: coverprev_rawround-trip inReport.collect_samplea945ea6sampling: drop young pids (etime=="00:00") bydefault (also brings the empty-sample None-return,
conftest override, and targeted e2e sleep bumps)
0d742f5cli: warn on sub-1s sample-intervalEach commit is atomic and pre-commit-clean; the series reads as a
coherent fix unit per Operating principles.