Conversation
Fixes the 5363% peak in con#399.

- Previous plot summed totals.pcpu/totals.rss across pids per interval. Both sums double-count: multi-core for pcpu, shared pages for rss.
- Replace with per-pid lines. For pcpu, compute pdcpu (delta-corrected %CPU) at plot time from consecutive (etime, pcpu) pairs in usage.jsonl.
- Detect kernel pid reuse via Δetime ≈ Δwall (2s tolerance). A strict "etime regressed" check misses cases where the new pid's etime crept above the old's (con#399 pid 3323259: 49s → 54s in 60s of wall = pid reuse).
- Clamp pdcpu < 0 to None. duct aggregates pcpu as max-across-samples but etime as the last sample, so a spike-then-idle pattern in the prior interval can push the cputime delta negative even on a continuous pid.
- Filter pids by "notable on either axis": peak pdcpu >= 0.5% OR peak rss >= 10MB. Cap the legend at a hybrid top-10 (top 5 by peak pdcpu unioned with top 5 by peak rss).
- vsz commented out by default.

Caveats:

- pcpu × etime is an upper bound on cputime under max-across-samples aggregation; pdcpu inherits the approximation.
- ~87% of pids in con#399's tox horde appear in only one record and don't get a pdcpu measurement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
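A minimal sketch of the pdcpu computation the commit describes, assuming each usage.jsonl record carries a wall-clock timestamp alongside per-pid etime and pcpu; the function name, record shape, and field names are illustrative, not duct's actual API:

```python
REUSE_TOLERANCE_S = 2.0  # Δetime ≈ Δwall tolerance from the commit message

def pdcpu(prev: dict, curr: dict) -> float | None:
    """Delta-corrected %CPU between two consecutive samples of one pid.

    prev/curr: {'wall': sample time (s), 'etime': process age (s),
    'pcpu': cumulative %CPU as ps reports it}. Returns None if unusable.
    """
    d_wall = curr["wall"] - prev["wall"]
    d_etime = curr["etime"] - prev["etime"]
    # A continuous pid ages at wall-clock rate; if Δetime and Δwall diverge,
    # the kernel reused the pid (con#399 pid 3323259: etime 49s -> 54s over
    # 60s of wall). A strict "etime regressed" check would miss that case.
    if abs(d_etime - d_wall) > REUSE_TOLERANCE_S:
        return None
    if d_etime <= 0:  # duplicate or out-of-order sample
        return None
    # Invert the procps identity pcpu = cputime/etime:
    # cputime (in %-seconds) ≈ pcpu * etime.
    d_cputime = curr["pcpu"] * curr["etime"] - prev["pcpu"] * prev["etime"]
    value = d_cputime / d_etime
    # duct keeps max-across-samples pcpu but last-sample etime, so a
    # spike-then-idle interval can drive the delta negative; clamp to None.
    return None if value < 0 else value
```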
Codecov Report

```diff
@@            Coverage Diff             @@
##             main     #424      +/-   ##
==========================================
- Coverage   91.87%   91.53%   -0.34%
==========================================
  Files          15       15
  Lines        1120     1229     +109
  Branches      139      165      +26
==========================================
+ Hits         1029     1125      +96
- Misses         69       75       +6
- Partials       22       29       +7
```
Updated requirements from @yarikoptic review:
Per Yarik's review on PR con#424, replace per-pid colored lines + top-N legend with per-metric color + max/sum envelope layout.

- All pcpu pid lines share one color (orange); all pmem lines share another (blue). Reviewers don't need to identify which pid was busy, just that something was.
- For each metric, plot two envelopes across the kept pids:
  - max-across-pids: lower bound on total resource use, solid. "If some pid was at 50%, the total was at least 50%."
  - sum-across-pids: upper bound, dashed. Can blow past 100% on multi-core (per-pid pdcpu doesn't know about cores) and overstate memory (shared pages get counted multiple times in pmem); both caveats are accepted.
- Drop rss from the default chart. peak_rss is still used as a relevance signal so memory-only pids contribute to the pmem cloud and envelope.
- Drop the top-N hybrid cap. With faint same-color dotted per-pid lines, a cloud of dozens-to-hundreds of traces reads as background texture through which the envelopes are clearly visible. The peak_pdcpu >= 0.5% OR peak_rss >= 10MB filter still suppresses noise.
- Two legends: metric color (top-left, orange/blue) and a color-agnostic linestyle key (top-right): upper bound / lower bound / per-pid.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
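A rough sketch of the envelope layout for one metric, assuming `series` maps each kept pid to a list of `(t, value)` points; the function name, styling constants, and data shape are assumptions, not the PR's actual code:

```python
from collections import defaultdict

import matplotlib.pyplot as plt

def plot_metric_envelopes(ax, series, color):
    """series: {pid: [(t, value), ...]} for one metric (e.g. pcpu or pmem).

    Pids with no sample at a given t simply don't contribute there.
    """
    values_at = defaultdict(list)
    for points in series.values():
        ts = [t for t, _ in points]
        vs = [v for _, v in points]
        # Faint dotted per-pid cloud: background texture, not identification.
        ax.plot(ts, vs, color=color, linestyle=":", linewidth=0.6, alpha=0.2)
        for t, v in points:
            values_at[t].append(v)
    grid = sorted(values_at)
    # Upper bound (dashed): sum across pids; can exceed 100% on multi-core
    # and double-counts shared pages for memory.
    ax.plot(grid, [sum(values_at[t]) for t in grid], color=color, linestyle="--")
    # Lower bound (solid): max across pids -- "if some pid was at 50%,
    # the total was at least 50%."
    ax.plot(grid, [max(values_at[t]) for t in grid], color=color, linestyle="-")
```

Usage would be along the lines of `fig, ax = plt.subplots(); plot_metric_envelopes(ax, pcpu_series, "tab:orange")`, with a second call in blue for pmem.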
I think we should consider adding a separate 0-100 scale for pmem on the right side, since pcpu is also a percentage but can go many times higher on multi-core, which effectively squashes pmem. The issue with giving the second axis to pmem is that it would replace rss as the second y-axis. We could

wdyt?
previews look much better! what about an alternative I believe I alluded to in the chat -- by default plot rss (not %mem) and state total physical ram in the legend (if we know it)? Then it would be the right "Memory" with abs sizes up and we would be all set -- and I feel that for memory an absolute value is good.
Per Yarik's second-pass review on PR con#424, replace pmem on the shared y-axis with rss (absolute bytes) on a secondary y-axis.

Reasoning: under SLURM, pmem = rss / host_total, where host_total is the whole node's physical memory rather than the cgroup's allocated memory. A job using 100% of a 4GB request on a 256GB host therefore shows ~1.5% pmem, which the shared y-axis squashes to invisibility. Absolute rss avoids this entirely.

- pcpu (orange): primary y-axis (%), unchanged.
- rss (blue): secondary y-axis (bytes), formatted via the existing HumanizedAxisFormatter + _MEMORY_UNITS.
- Legend label is best-effort: if info.json is available -- passed directly, or as a sibling to the usage file via {prefix}info.json -- read system.memory_total and render "rss (host: X.XTB)". Otherwise fall back to plain "rss".
- Plot CLI still accepts a bare usage file; info.json remains optional.
- Filter unchanged: peak_pdcpu >= 0.5% OR peak_rss >= 10MB.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
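A sketch of the best-effort legend label described above. The sibling-lookup rule and the `system.memory_total` key come from the commit message; the function name, the exact usage-file name, and the inline TB formatting are assumptions:

```python
import json
from pathlib import Path

def rss_legend_label(usage_path: str, info_path: str | None = None) -> str:
    """Render 'rss (host: ...)' when info.json is reachable, else 'rss'."""
    if info_path is None:
        # Sibling lookup: "{prefix}usage.jsonl" -> "{prefix}info.json".
        candidate = Path(str(usage_path).replace("usage.jsonl", "info.json"))
        info_path = str(candidate) if candidate.is_file() else None
    if info_path is not None:
        try:
            info = json.loads(Path(info_path).read_text())
            total = info["system"]["memory_total"]  # bytes, per commit message
            return f"rss (host: {total / 1000**4:.1f}TB)"  # crude inline format
        except (OSError, KeyError, TypeError, ValueError):
            pass  # any failure degrades to the plain label
    return "rss"
```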
Plot's _MEMORY_UNITS used 1024-base divisors with "KB/MB/GB/TB" suffixes -- correct as KiB/MiB/GiB/TiB per IEC 80000-13, but the suffixes claimed decimal. SummaryFormatter.naturalsize used a parallel FILESIZE_SUFFIXES tuple at base 1000 with "kB/MB/GB" prefixes. The data table was duplicated, the conventions disagreed, and the plot axis lied about its base.

- Add module-level FILESIZE_UNITS in _formatter.py: single source of truth for byte-suffix data. Base 1000, suffixes B/kB/MB/GB/TB/PB/EB/ZB/YB.
- Refactor naturalsize to walk FILESIZE_UNITS instead of the local FILESIZE_SUFFIXES tuple. Drop FILESIZE_SUFFIXES. Behavior preserved (covered by test_summary_formatter_S_sizes).
- Fix the broken "%.3f" docstring example: actual output is "3.000 kB", not the "2.930 kB" left over from a 1024-base era.
- plot.py imports FILESIZE_UNITS directly (no alias), drops local _MEMORY_UNITS.
- Drop _format_bytes_compact (added in the previous commit) and route the rss legend label through SummaryFormatter().naturalsize for the same reason: avoid keeping a third byte-format helper.
- test_plot._MEMORY_UNITS parametrize cases switch to 1000-multiples with kB/MB/GB suffixes; drop the test_format_bytes_compact case.

Plot axis tick numbers shift slightly (e.g. "1.5MB" was 1.5 * 1024**2 bytes; the same physical byte count now displays as "1.6 MB" since the divisor is smaller). The displayed *meaning* is now correct.

Note: this commit is separable -- it can be cherry-removed and shipped as its own PR for a distinct changelog entry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
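A minimal sketch of the single-table approach, assuming FILESIZE_UNITS pairs each divisor with its suffix; the real table's shape in _formatter.py may differ:

```python
# Base-1000 byte units, one source of truth (assumed (divisor, suffix) shape).
FILESIZE_UNITS = [
    (1000**i, suffix)
    for i, suffix in enumerate(["B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"])
]

def naturalsize(value: float, fmt: str = "%.1f") -> str:
    """Walk FILESIZE_UNITS from largest to smallest and pick the first fit.

    >>> naturalsize(3000, "%.3f")   # the corrected docstring example
    '3.000 kB'
    """
    for divisor, suffix in reversed(FILESIZE_UNITS):
        if value >= divisor or divisor == 1:  # divisor == 1 catches value < 1000
            return f"{fmt % (value / divisor)} {suffix}"
```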
Swap rss for pmem, done. Note: I fixed a bug caused by duplication. The summary formatter used decimal humanization, while the graph used binary but still used MB/GB instead of MiB/GiB. (Fixed by removing the duplication; we now report decimal everywhere.) I don't think this bugfix belongs in this PR, but it is worth including while in draft mode so we can review the plot changes together. I've noted removing it as a TODO in the description.


Summary
Fixes #399. The current `con-duct plot` reads `totals.pcpu` and `totals.rss` per record and draws a single line for each. However, both totals double-count (multi-core for pcpu, shared pages for rss), producing the 5363% spike that started this issue.

This PR:

- introduces `pdcpu`, a delta-corrected pcpu computed at plot time from consecutive `(etime, pcpu)` pairs in `usage.jsonl`;
- detects kernel pid reuse by comparing `Δetime` against `Δwall_clock`;
- clamps negative `pdcpu` to `None`;
- includes a handful of "REVERT ME" commits with temporary images/data for easier review.
Demo

See `tmp-plot-demo.md`.

TODO
- Pre-merge alignment on shape, not just correctness.
- `pdcpu` inverts the procps identity `pcpu = cputime/etime`, which assumes ps reports a cumulative lifetime ratio. macOS ps reports a 5/8-decayed EWMA, so `pcpu × etime ≠ cputime` and the delta math is meaningless (see the toy illustration after this list). Today's `info.json` does not record the host platform, so we can't tell at plot time whether a log came from Linux. We need either:
  - a `platform` field in `info.json` (sampler change, separate PR), with plot.py refusing or warning on non-Linux logs; or
  - some heuristic (e.g., detect that pcpu monotonically grows between two flat-rate samples) and warn.

  Currently the plot silently produces wrong numbers on macOS logs.
- Remove the `REVERT ME: TEMPORARY DEMO` commit (the demo script, demo doc, log files, rendered output).
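A toy numeric illustration (numbers are made up) of why inverting `pcpu = cputime/etime` only works when pcpu is a cumulative lifetime ratio:

```python
# Two samples of one pid: process age (etime) and true CPU seconds consumed.
etime1, etime2 = 100.0, 160.0
cputime1, cputime2 = 50.0, 110.0  # 60 CPU-seconds burned between samples

# Linux procps: pcpu is the cumulative ratio cputime/etime, so pcpu*etime
# recovers cputime exactly and the delta math works.
pcpu1 = 100 * cputime1 / etime1   # 50.0
pcpu2 = 100 * cputime2 / etime2   # 68.75
d_cputime = (pcpu2 * etime2 - pcpu1 * etime1) / 100
assert d_cputime == 60.0

# macOS ps: pcpu is a decayed average of *recent* usage. A currently-busy
# process reports ~100% regardless of its idle past, so pcpu*etime has no
# relation to cputime and the same delta math yields garbage.
pcpu2_macos = 100.0
bogus_cputime2 = pcpu2_macos * etime2 / 100   # 160 "CPU-seconds"; truth is 110
```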