Explain the semantics of each field (`pcpu`, `rss`, `pmem`, `vsz`)
under duct's current `ps`-based sampler: where the numbers are
reliable, and where they mislead.
Motivated by recurring confusion: `peak_pcpu` exceeding
`Ncores × 100%` (issue con#399), `peak_vsz` reaching terabytes on
workloads with only GBs of real memory, and `total_*` disagreeing
with `sum(per-pid)` in aggregated records. All are correct behavior
of `ps` semantics and duct's max-reduction aggregation -- not bugs --
but there was no reference explaining why.
Scope is current behavior only; alternative samplers (psutil, cgroup)
will be documented separately.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
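The `peak_pcpu > Ncores × 100%` confusion follows directly from `ps` reporting a *lifetime* average, not an instantaneous reading. A minimal sketch with hypothetical numbers (the `pcpu` helper and the two-worker scenario are illustrative, not duct's code):

```python
# ps-style %CPU is lifetime CPU time over lifetime wall time, so
# summing it across coexisting pids can exceed physical capacity.
def pcpu(cputime_s: float, elapsed_s: float) -> float:
    """ps-style %CPU: lifetime CPU time / lifetime elapsed time."""
    return 100.0 * cputime_s / elapsed_s

# One core. Worker A burned its 1 s of CPU during seconds 0-1, then
# idled; worker B started at t=1s and burned CPU during seconds 1-2.
# Sampled at t=2s, both lifetime averages are still high:
total = pcpu(1.0, 2.0) + pcpu(1.0, 1.0)
print(total)  # 150.0 -- "150% CPU" reported on a 1-core host
```

A max-reduction over such per-sample sums then preserves the largest overshoot as `peak_pcpu`.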
No behavior change. Existing platform dispatch (_get_sample_linux / _get_sample_mac) stays as private helpers; PsSampler.sample() delegates to them. Report takes an optional sampler kwarg that defaults to PsSampler(), so every existing caller is unaffected. First step of the multi-sampler POC: this commit only reshapes where the ps-based sampling lives. The CLI flag, env var, and alternative sampler come in follow-ups. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
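The delegation shape this commit describes can be sketched as follows. Class and method names (`PsSampler`, `Report`, `sample`, `_get_sample_linux`, `_get_sample_mac`) come from the commit message; the bodies here are placeholders, not duct's implementation:

```python
import sys

class PsSampler:
    # Sketch only: the real class wraps duct's ps invocation and state.
    def sample(self, session_id):
        # Existing platform dispatch stays in private helpers.
        if sys.platform == "darwin":
            return self._get_sample_mac(session_id)
        return self._get_sample_linux(session_id)

    def _get_sample_linux(self, session_id):
        return {"backend": "ps/linux", "sid": session_id}

    def _get_sample_mac(self, session_id):
        return {"backend": "ps/mac", "sid": session_id}

class Report:
    def __init__(self, sampler=None):
        # Defaulting to PsSampler() keeps every existing caller on the
        # unchanged ps path.
        self.sampler = sampler if sampler is not None else PsSampler()
```

Callers that never pass `sampler` construct and use the same ps path as before.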
Wire a --sampler={ps,cgroup-ps-hybrid} flag (default ps, DUCT_SAMPLER
env) through execute() to Report. The ps path is identical to before;
cgroup-ps-hybrid raises NotImplementedError (real impl comes later).
Every usage.jsonl record and info.json now carries a "sampler" field
indicating which sampler produced it. Additive only -- schema_version
bump and a pre-tag translation layer are tracked as Open Questions for
production follow-up (see docs/design/multiple-samplers.md, landing in
a later commit).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
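The flag/env-var precedence described above (CLI flag wins, `DUCT_SAMPLER` supplies the default, `ps` is the fallback) can be sketched with argparse. This is a hedged illustration of the precedence, not duct's actual parser:

```python
import argparse
import os

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--sampler",
        choices=["ps", "cgroup-ps-hybrid"],
        # An explicit CLI flag overrides this default; the env var is
        # consulted only when the flag is absent.
        default=os.environ.get("DUCT_SAMPLER", "ps"),
    )
    return parser
```

Because the default is read when the parser is built, the environment variable is re-evaluated on each invocation.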
Cherry-picks the substantive work from the upstream branch
claude/validate-resource-stats-uK9NV (commits a60a852, 0128f85) plus
its lint/type fixups (6645c90, 8a59ad7, c1c0432), squashed to one
commit since all five were Claude-authored upstream too.
Drops test_total_rss_is_sum_of_processes from the pickup: per-sample
total_rss-equals-sum-of-pid-rss is an artifact of the ps sampler and
will not hold for cgroup-based samplers (see
docs/resource-statistics.md "Peak vs. average" for the broader
invariant-rejection rationale).
Adds:
- test/data/memory_children.py (multiprocessing helper)
- test/duct_main/test_resource_validation.py (8 tests: memory
  allocation, wall-clock, CPU idle/busy, samples structure, child
  process memory, per-process tracking)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New test/data/workloads/ holds deterministic, standalone-runnable
workload scripts for the sampler matrix tests (landing in a
follow-up). Each script's docstring states its ground-truth contract
so matrix assertions can reference it directly.
- steady_cpu.py: one-core-pinned busy-loop; peak_pcpu ~= 100%
- alloc_memory.py: bytearray(N MB), hold D seconds; peak_rss >= N MB
README.md documents the catalog, points at test/data/memory_children.py
(from PR con#403) as an additional matrix workload living outside this
directory, and carries TODOs for:
- a spikey multi-core CPU-burst workload (the con#399 anchor; planned
  but deferred out of this commit)
- consolidating workload locations and overlap with
  test/data/test_script.py once the POC direction lands.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
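The alloc_memory.py contract above (allocate N MB, hold D seconds, so peak_rss >= N MB) can be sketched as a standalone function. Parameter names and the page-touching loop are illustrative, not necessarily what the committed script does:

```python
import time

def alloc_memory(size_mb: int, hold_s: float) -> None:
    """Allocate size_mb megabytes and hold them for hold_s seconds."""
    buf = bytearray(size_mb * 1024 * 1024)
    # Touch every page so resident memory actually reaches size_mb;
    # an untouched allocation may stay lazily mapped by the kernel.
    for i in range(0, len(buf), 4096):
        buf[i] = 1
    time.sleep(hold_s)

alloc_memory(size_mb=4, hold_s=0.05)
```

Holding the buffer across several sample intervals is what lets a polling sampler observe the peak.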
Plumbing for the sampler matrix -- no tests use it yet (checkpoints 6/8
populate the cells). Cells are simple pass/fail for now; richer status
vocabulary deferred.
- @pytest.mark.sampler_matrix(workload=X, sampler=Y): marks a test as
one cell of the matrix.
- test/conftest.py: session-start clears stale results; a
pytest_runtest_makereport hook appends {workload, sampler, status}
to .sampler_matrix_results.jsonl.
- scripts/gen_sampler_matrix.py: pivots the JSONL into
test/sampler_matrix.csv (rows=workload, cols=sampler, cells=pass/
fail/(not yet tested)). The CSV is committed; fixed column order
keeps diffs stable.
- test/sampler_matrix.csv committed in header-only state; checkpoints
6 and 8 fill in ps and cgroup-ps-hybrid columns respectively.
A CI step that catches drift of the committed CSV via `git diff
--exit-code` is deferred (still within this PR, just not this commit).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
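The JSONL-to-CSV pivot that scripts/gen_sampler_matrix.py performs can be sketched as a pure function. Record keys ({workload, sampler, status}) and the "(not yet tested)" placeholder come from the commit; the function name and shape here are illustrative:

```python
import csv
import io
import json

def pivot(jsonl_lines, samplers):
    # Cell lookup keyed by (workload, sampler); later records win.
    cells = {}
    for line in jsonl_lines:
        rec = json.loads(line)
        cells[(rec["workload"], rec["sampler"])] = rec["status"]
    out = io.StringIO()
    writer = csv.writer(out)
    # Fixed column order keeps diffs stable, as the commit notes.
    writer.writerow(["workload"] + list(samplers))
    for workload in sorted({w for w, _ in cells}):
        writer.writerow(
            [workload]
            + [cells.get((workload, s), "(not yet tested)")
               for s in samplers]
        )
    return out.getvalue()
```

Sorting workloads and fixing the sampler column list means regenerating from the same JSONL always produces a byte-identical CSV.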
Reshapes checkpoint 5's sampler-matrix infra around three ideas:
- Per-sampler CSVs (test/sampler_matrix_<sampler>.csv) with
  rows=workload, columns=property, cells=pass/fail/n/a. Easier to
  read as a per-sampler capability card than one pivoted CSV.
- Marker carries (sampler, workload, property, expected) -- the test
  declares what property it probes AND whether we currently expect it
  to hold.
- expected="fail" auto-dispatches to xfail(strict=True) via a
  pytest_collection_modifyitems hook. Known sampler limitations stay
  committed as expected-fail cells; the suite stays green as long as
  the matrix reflects reality; if a sampler improves, the xpass
  surfaces it noisily so we update the committed expectation.
The conftest records the *actual* assertion outcome (from
call.excinfo) rather than pytest's post-xfail interpretation, so the
CSV reflects what the sampler did, not what we hoped.
Adds test/duct_main/test_sampler_matrix.py with four ps-column tests:
- alloc_memory/peak_rss_reaches_alloc: pass
- memory_children/peak_rss_reaches_alloc: pass
- memory_children/peak_rss_no_overcount: fail [con#399 anchor]
- steady_cpu/peak_pcpu_reaches_floor: pass
The memory_children cell is the con#399 story: ps double-counts
shared library pages per PID, so peak_rss exceeds a realistic
physical-memory ceiling. cgroup-ps-hybrid (checkpoint 7/8) should
flip that cell to pass. File carries a TODO to consolidate with
test_resource_validation.py once the POC direction lands.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
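The expected="fail" auto-dispatch described above can be sketched as a conftest hook. The marker name and kwarg come from the commit; the hook body is a minimal illustration, assuming the marker is registered:

```python
import pytest

def pytest_collection_modifyitems(config, items):
    for item in items:
        marker = item.get_closest_marker("sampler_matrix")
        if marker is not None and marker.kwargs.get("expected") == "fail":
            # strict=True: if the sampler improves and the cell starts
            # passing, the xpass fails the suite so the committed
            # expectation gets updated.
            item.add_marker(pytest.mark.xfail(strict=True))
```

With this in place, flipping a cell from fail to pass in the CSV is a deliberate two-line change: update the marker's expected kwarg and regenerate.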
First working implementation of --sampler=cgroup-ps-hybrid. The
--sampler flag now actually does something in addition to tagging
records. The ps code path is unchanged.
Shape:
- Per-pid stats still come from the ps sub-sampler (gives us the
``stats: {pid: ProcessStats}`` dict users expect in usage.jsonl).
- Session totals are cgroup-sourced: ``total_rss`` from
``memory.current``, ``total_pcpu`` from a delta of
``cpu.stat.usage_usec`` over the last sample interval (stateful -
sampler remembers the previous reading).
- ``total_vsz`` and ``total_pmem`` remain ps-sourced -- cgroup v2 has
no direct analogs (memory.current is already physical memory, and
vsz is per-process by definition).
- Reader mode only. Duct does NOT create a cgroup; it reads the one
it already lives in via /proc/self/cgroup.
Constraints enforced at startup:
- cgroup v2 required (refuse on v1 with NotImplementedError pointing
at /sys/fs/cgroup/cgroup.controllers).
- --mode=current-session required; --mode=new-session + cgroup-ps-
hybrid raises ValueError before log paths are created.
Type plumbing: added a ``Sampler`` Union alias in _sampling.py so the
factory return + Report.sampler annotation can refer to either
concrete class without introducing an ABC.
TODO(poc) markers in code:
- Polling shape (sample-at-interval + Sample.aggregate max) is a ps-
shaped compromise; cgroup could emit cumulative/delta directly.
- End-of-run overwrite of full_run_stats.total_rss with memory.peak
would give a truer peak than max-of-currents.
- Assumes the tracked command stays in duct's cgroup; systemd-run or
similar migrations would silently break the measurement.
Zero schema changes: the ``sampler`` tag added in checkpoint 2 is
already the source disambiguator on each record.
Matrix tests for the cgroup column are deferred to checkpoint 8 --
those need a scoped cgroup (SLURM job, container with cgroup ns, or
systemd unit) to assert meaningful behavior; root-cgroup readings
reflect the whole host and would make assertions trivially pass or
fail by accident.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
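The reader-mode cgroup v2 reads described above can be sketched as follows. File paths (`/proc/self/cgroup`, `memory.current`, `cpu.stat`) follow the commit; function names and error handling are illustrative, and the real sampler carries more state:

```python
from pathlib import Path

def own_cgroup_dir(root: str = "/sys/fs/cgroup") -> Path:
    # cgroup v2: /proc/self/cgroup contains a single "0::<path>" line.
    line = Path("/proc/self/cgroup").read_text().strip()
    return Path(root) / line.split("::", 1)[1].lstrip("/")

def total_rss(cg: Path) -> int:
    # memory.current counts each physical page once across the cgroup,
    # unlike ps's per-pid rss sum.
    return int((cg / "memory.current").read_text())

def usage_usec(cg: Path) -> int:
    for line in (cg / "cpu.stat").read_text().splitlines():
        key, value = line.split(None, 1)
        if key == "usage_usec":
            return int(value)
    raise RuntimeError("usage_usec not found in cpu.stat")

def total_pcpu(prev_usec: int, cur_usec: int, interval_s: float) -> float:
    # Stateful delta over the last sample interval, as the commit
    # describes: the sampler remembers prev_usec between samples.
    return 100.0 * (cur_usec - prev_usec) / (interval_s * 1_000_000)
```

For example, a reading that advances from 1,000,000 to 3,000,000 usec over a 2 s interval yields 100% (one core fully busy).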
Populates the cgroup-ps-hybrid column of the sampler matrix with real
data. Each cgroup matrix test spawns duct in a transient
``systemd-run --user --scope --collect`` unit, so the cgroup
CgroupSampler reads is dedicated to just ``duct + workload``. Without
the scope, duct's ambient cgroup (a login user slice, a non-ns
container, etc.) contains megabytes to gigabytes of unrelated memory,
and ceilings become meaningless.
Four cells, same workloads/properties as the ps column:
  alloc_memory/peak_rss_reaches_alloc    = pass
  memory_children/peak_rss_reaches_alloc = pass
  memory_children/peak_rss_no_overcount  = pass  <-- the flip
  steady_cpu/peak_pcpu_reaches_floor     = pass
The memory_children cell flipping fail -> pass across the two
samplers is the con#399 story captured in the matrix: the ps per-pid
sum overcounts shared library pages; cgroup v2 memory.current counts
each physical page once.
Opt-in via new ``@pytest.mark.cgroup_matrix`` + ``--cgroup-matrix``
pytest option. Default runs skip these tests (they require
systemd-run --user and a writable /sys/fs/cgroup v2). To populate or
refresh the CSV:
  .tox/py312/bin/python -m pytest -o addopts= \
      --cgroup-matrix test/duct_main/test_sampler_matrix.py
  python3 scripts/gen_sampler_matrix.py
CI integration for the opt-in tests is TODO: likely a separate GitHub
Actions job with podman-or-systemd-run setup; flagged in the PR
description.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
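Wrapping duct in a transient scope, as described above, can be sketched like this. The systemd-run flags come from the commit; the helper names and the skip behavior are illustrative:

```python
import shutil
import subprocess

def scope_cmdline(cmd):
    # Transient unit: --scope keeps the command in the foreground so
    # exit status propagates, --collect garbage-collects the unit even
    # on failure, --user avoids needing root.
    return ["systemd-run", "--user", "--scope", "--collect"] + list(cmd)

def run_in_scope(cmd):
    if shutil.which("systemd-run") is None:
        # Matrix tests skip in this case rather than fail.
        raise RuntimeError("systemd-run unavailable; test would skip")
    return subprocess.run(scope_cmdline(cmd)).returncode
```

Everything the wrapped command forks stays inside the scope's cgroup, so the sampler's `memory.current` and `cpu.stat` readings cover exactly duct plus the workload.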
Adds the canonical regeneration recipe (pytest --cgroup-matrix + the
JSONL-to-CSV pivot) as a checked-in script, so it can be invoked
reproducibly via:
datalad run scripts/regen_matrix.sh
which records the command, inputs, and outputs in the commit
metadata. Works in plain git too (provenance is just missing).
Deletes the previously-committed test/sampler_matrix_cgroup-ps-hybrid.csv
so the next datalad-run regeneration creates it fresh (with provenance).
The ps CSV stays as-is (already committed from checkpoint 6; a
datalad-run refresh will update it in place if any ps cell changes).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
=== Do not change lines below ===
{
"chain": [],
"cmd": "scripts/regen_matrix.sh",
"exit": 0,
"extra_inputs": [],
"inputs": [],
"outputs": [
"test/sampler_matrix_ps.csv",
"test/sampler_matrix_cgroup-ps-hybrid.csv"
],
"pwd": "."
}
^^^ Do not change lines above ^^^
Schema: marker kwargs change from ``property=<name>`` to
``metric=<rss|pcpu>`` + ``direction=<underreport|overreport>``.
Generator pivots to rows=``<workload>/<metric>`` and
columns=``no_<direction>``. Each cell now cleanly reports the
sampler's ability to avoid a specific failure mode on a specific
workload. The column layout is easier to scan as a capability card
and avoids bespoke per-property column names.
New workloads (test/data/workloads/):
- ``ephemeral_cpu.py`` -- fork N short-lived parallel CPU workers. ps
  misses them (children die between samples, session empty at sample
  time). cgroup ``cpu.stat.usage_usec`` is cumulative and captures
  the work regardless.
- ``spikey_cpu.py`` -- multi-threaded ``hashlib.pbkdf2_hmac``
  (GIL-released) across N*M threads. Principled standalone equivalent
  of con#399's tox-C-extension-compile pathology. Triggers ps's
  Bug 1: lifetime cumulative ``cputime/elapsed`` inflates per worker,
  and duct's per-pid sum compounds across workers.
New matrix cells (ps / cgroup):
  ephemeral_cpu/pcpu  no_underreport  fail / pass
  spikey_cpu/pcpu     no_overreport   fail / pass  (Linux only)
Spikey is Linux-only (Bug 1 is ps-on-Linux-specific; Darwin's ps uses
a decaying 1-min average). Spikey also skips when ``cpu_count >=
workers * threads`` -- without thread oversubscription, ps reports
stay near the legitimate physical peak and the ceiling can't
discriminate. Workload + ceiling are tuned for 4-12 core hosts.
Cite: RESEARCH.md section 1.1 (pcpu fully broken);
DEEP_DIVE_PROGRESS.md section 2 (Bug 1 confirmed in
src/con_duct/_sampling.py). Distinct from Bug 2 (aggregation, xfailed
in c017800).
Matrix CSV regeneration via ``datalad run scripts/regen_matrix.sh``
lands as a follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
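The spikey_cpu.py mechanism above (GIL-released hashing across oversubscribed threads) can be sketched standalone. Iteration counts here are scaled down to keep the sketch quick; the committed workload's tuning differs:

```python
import hashlib
import threading

def burn(iterations: int) -> None:
    # pbkdf2_hmac runs inside OpenSSL with the GIL released, so
    # multiple threads genuinely occupy separate cores.
    hashlib.pbkdf2_hmac("sha256", b"duct", b"salt", iterations)

def spikey(threads: int, iterations: int) -> None:
    workers = [
        threading.Thread(target=burn, args=(iterations,))
        for _ in range(threads)
    ]
    for t in workers:
        t.start()
    for t in workers:
        t.join()

spikey(threads=4, iterations=50_000)
```

With more threads than cores, each pid's cumulative cputime grows faster than its share of wall time would suggest once some threads finish early, which is the inflation the no_overreport cell probes.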
Empirical results from a host run showed two tunings needed:
- Ephemeral cgroup floor 200% was too tight. The actual cgroup
  reading for 4 x 30ms parallel bursts over a 100ms sample window,
  with Python multiprocessing startup overhead included, comes in
  around 120-200%. Drop the floor to 80% -- still comfortably above
  ps's ~0% (which is what underreport-detection needs) while
  tolerating sample-window dilution and startup variance.
- Spikey workload demand was 16 threads (4 workers x 4 threads),
  which on hosts with >= 16 cores skips via the oversubscription
  guard. Bump to 32 threads (4 workers x 8 threads) so the test can
  run on typical 8-16 core laptops and demonstrate Bug 1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
=== Do not change lines below ===
{
"chain": [],
"cmd": "scripts/regen_matrix.sh",
"exit": 0,
"extra_inputs": [],
"inputs": [],
"outputs": [
"test/sampler_matrix_ps.csv",
"test/sampler_matrix_cgroup-ps-hybrid.csv"
],
"pwd": "."
}
^^^ Do not change lines above ^^^
Revision of the aspirational design doc (originally drafted on the
pcpu-overshoot-investigation branch) to reflect the POC as landed on
branch sampler-choice.
- Scope tightens to cgroup-ps-hybrid as the one additional sampler.
  psutil moves from a first-class backend to Future Directions
  (section 10.1), along with creator mode, memory.peak overwrite,
  pipeline-reshape for cgroup's cumulative nature, matrix
  completeness, CI wiring, and the placeholder-name decision.
- Tense flips to present-indicative for implemented pieces: the
  sampler abstraction lives in _sampling.py, CgroupSampler does X,
  etc. Out-of-scope items stay future-tense in section 10.
- New section 9 "Schema Open Questions" covers additive-tag
  trade-offs the POC deferred (schema_version bump,
  pre-tag-consumer compat, per-block source tagging,
  placeholder-name resolution).
- New section 8 "Results" ships the two matrix CSVs as a capability
  card, with the three ps->cgroup flips called out as the POC claim.
- ps pathologies in section 2 are described by mechanism rather than
  "Bug 1 / Bug 2" labels -- the cputime/elapsed lifetime ratio plus
  per-pid summing for CPU, shared-page per-pid summing for memory.
  The Linux-only caveat for the CPU pathology is called out (Darwin
  ps is a decaying 1-min average).
- Section 12 annotates the original eight open questions as
  resolved / deferred / still-open.
- Section 13 lists the landed commits so a reviewer can read the PR
  in order.
This is the doc reviewers should open first. It documents what the
POC actually demonstrates, what we deliberately left as future work,
and why.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codecov Report
@@            Coverage Diff             @@
##             main     #415      +/-  ##
==========================================
- Coverage   91.87%   88.90%   -2.97%
==========================================
  Files          15       15
  Lines        1120     1181      +61
  Branches      139      149      +10
==========================================
+ Hits         1029     1050      +21
- Misses         69      107      +38
- Partials       22       24       +2
Replaces multiprocessing.Process with bare os.fork + os._exit so child lifetime collapses to ~work_ms without interpreter-reinit or spawn-method overhead. multiprocessing bloat defeated the "children die between samples" phenomenon on PyPy and Python 3.14 (forkserver default), causing test_ps_ephemeral_cpu_pcpu_no_underreport to xpass-strict on those runners while xfailing on CPython 3.10-3.13. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
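The os.fork + os._exit shape this commit switches to can be sketched as follows (POSIX-only by construction). The worker-count and work_ms values are illustrative, not the committed workload's tuning:

```python
import os
import time

def ephemeral_worker(work_ms: int) -> int:
    """Fork a child that burns CPU for ~work_ms ms, then exits hard."""
    pid = os.fork()
    if pid == 0:
        # Child: busy-loop briefly, then os._exit skips interpreter
        # teardown so lifetime collapses to roughly work_ms.
        deadline = time.monotonic() + work_ms / 1000.0
        while time.monotonic() < deadline:
            pass
        os._exit(0)
    return pid  # parent

pids = [ephemeral_worker(work_ms=10) for _ in range(4)]
for pid in pids:
    os.waitpid(pid, 0)
```

Because the children never re-initialize the interpreter or go through a spawn/forkserver handshake, they really do die between sample intervals, which is the phenomenon the ps underreport test depends on.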
Proof of Concept
Overreporting / Underreporting tests
Some of the added tests use systemd to run in isolation so we can accurately separate the sampled signal from host noise. The ps tests don't need this because they get isolation from sessions. These results are output to CSVs, with the intention of putting them into the docs to help users pick the sampler appropriate for their use case. (Cells currently marked n/a are unwritten tests; we expect those to pass for both samplers.)
Current sampler (ps): https://github.com/asmacdo/duct/blob/dfd2702246dab94d1bf74c13589f3fc9a649b226/test/sampler_matrix_ps.csv
New optional sampler (cgroups hybrid with ps, Linux only): https://github.com/asmacdo/duct/blob/dfd2702246dab94d1bf74c13589f3fc9a649b226/test/sampler_matrix_cgroup-ps-hybrid.csv
The underlying claim is that on Linux systems that support cgroups, we can get significantly more accurate reporting, closely aligned with the monitoring SLURM does, so the readings can be used to predict the resource requests those jobs need.
TODOs: