Length-biased sampling and 5-partnership truncation bias in formation target stats #72

@smjenness

Context

Identified during PR #71 review while implementing duration.method = "weibull_strat". The issue surfaced in the duration-fitting discussion but is broader than duration estimation: it affects the formation target stats (edges, nodefactor_*, nodematch_*, concurrent, absdiff_*) that come out of build_netparams() / build_netstats(). Opening as a separate issue per PI request.

The problem

ARTnet is a cross-sectional web-based survey with two non-standard sampling features that neither the legacy univariate fits nor the joint g-computation infrastructure (#61, #63) currently corrects for:

1. Length-biased sampling

The survey asks about partners that were active within the past 12 months. For ongoing partnerships, the probability of being in the sample is proportional to partnership duration (longer partnerships intersect the recall window more often). For completed partnerships, a different bias applies: only those that ended within the past 12 months are included, which selectively drops older completed partnerships.
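
For reference, the standard length-biased sampling identity makes the size of this effect explicit. The notation here is introduced only for illustration: $f$ is the density of completed partnership durations in the population, $\mu$ and $\sigma^2$ are its mean and variance, and $f^{*}$ is the density among partnerships that are ongoing at a single survey instant, assuming a stationary formation process.

```latex
f^{*}(t) = \frac{t\, f(t)}{\mu},
\qquad
\mathbb{E}_{f^{*}}[T] = \frac{\mathbb{E}_{f}[T^{2}]}{\mu} = \mu + \frac{\sigma^{2}}{\mu}.
```

Under these assumptions the upward shift is $\sigma^{2}/\mu$, so it is largest where durations are most variable. For exponential durations ($\sigma^{2} = \mu^{2}$) the sampled mean is $2\mu$.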

This shows up downstream:

  • Respondents with longer main partnerships are overrepresented in the "has a main partner" sample, which biases md.main upward.
  • Mean degree calculations conditional on age/race carry the same bias, potentially heterogeneously.
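
A toy simulation may help gauge the mechanism. This is a minimal sketch, not ARTnet code: it assumes partnerships form uniformly over time with exponential durations, and all names and parameter values (`simulate_cross_section`, `mean_duration`, `survey_time`) are hypothetical. It compares the mean duration of all partnerships with the mean among those ongoing on a single survey date.

```python
import random


def simulate_cross_section(n_partnerships=200_000, mean_duration=1.0,
                           horizon=50.0, survey_time=40.0, seed=1):
    """Return (all_durations, active_durations) for a toy partnership process.

    Partnerships start uniformly on [0, horizon] and last an exponential
    time with the given mean (arbitrary time units). A partnership is
    'active' if it is ongoing at survey_time.
    """
    rng = random.Random(seed)
    all_durations, active_durations = [], []
    for _ in range(n_partnerships):
        start = rng.uniform(0.0, horizon)             # formation time
        dur = rng.expovariate(1.0 / mean_duration)    # full eventual duration
        all_durations.append(dur)
        if start <= survey_time < start + dur:        # ongoing on survey date
            active_durations.append(dur)
    return all_durations, active_durations


all_d, active_d = simulate_cross_section()
pop_mean = sum(all_d) / len(all_d)
active_mean = sum(active_d) / len(active_d)
print(f"mean duration, all partnerships:        {pop_mean:.2f}")
print(f"mean duration, ongoing at survey date:  {active_mean:.2f}")
```

In this setup the ongoing partnerships have roughly twice the population mean duration. A real survey observes only the elapsed duration of an ongoing partnership, not its eventual total, so the sketch illustrates who gets sampled rather than what is recorded.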

2. 5-most-recent partner truncation

Respondents are asked about up to 5 most-recent partners per layer. For respondents with more than 5, the excess partners are not observed. Two effects:

  • Right truncation of partnership count: true degree is underestimated for high-activity respondents.
  • Selection bias on retained partners: the 5 most-recent tend to skew longer (if "most recent" orders by recency of activity rather than start date), further interacting with the length-bias issue above.
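
The first effect can be sketched with a small simulation. It assumes a hypothetical two-group population with Poisson partner counts; the group sizes and rates are illustrative values, not ARTnet estimates, and `poisson_draw` is a helper written for this example.

```python
import math
import random

CAP = 5  # at most 5 most-recent partners are reported per layer


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


rng = random.Random(2)
# Hypothetical groups: name -> (number of respondents, mean partner count).
groups = {"low_activity": (40_000, 1.0), "high_activity": (10_000, 8.0)}

true_counts, seen_counts = [], []
for name, (size, rate) in groups.items():
    true_g = [poisson_draw(rng, rate) for _ in range(size)]
    seen_g = [min(d, CAP) for d in true_g]            # counts capped at CAP
    print(f"{name}: true mean {sum(true_g) / size:.2f}, "
          f"observed mean {sum(seen_g) / size:.2f}")
    true_counts += true_g
    seen_counts += seen_g

true_mean = sum(true_counts) / len(true_counts)
seen_mean = sum(seen_counts) / len(seen_counts)
print(f"overall: true mean {true_mean:.2f}, observed mean {seen_mean:.2f}")
```

Almost all of the shortfall comes from the high-activity group, which is why the bias in attribute-specific stats can differ across attribute levels.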

Scope: which target stats are affected

| Target stat | Affected by length-bias | Affected by 5-truncation |
|---|---|---|
| md.main / md.casl | yes (upward) | yes (downward cap on high-degree) |
| nf.<attr> (all) | yes (heterogeneous by attr) | yes |
| concurrent | yes | yes (especially) |
| nf.deg.{main,casl,tot} | yes | yes |
| nm.<attr> (all) | partial (mixing by duration-correlated attrs) | partial |
| absdiff_* | partial | partial |
| durations (durs.*.byage) | yes (handled in #63 phase 3 via length-biased Weibull) | tangential |

Proposed approach (open for discussion)

  1. Survey the methodology literature: check whether the ARTnet paper (Weiss et al. 2020) describes how the univariate approach was designed to be robust to these biases (if at all), and review the related literature on length-biased and egocentric network sampling corrections (Vardi 1989; Asgharian et al. 2002; Krivitsky & Morris 2017 on egocentric inference).
  2. Length-biased correction for Poisson/binomial fits — the joint Poisson fits in netparams$<layer>$joint_model are currently standard MLE. A length-biased version would weight observations by 1 / P(obs | duration). Needs partnership-duration covariates on the RHS, which may not be directly available for the ego-level fits.
  3. Truncation correction — fit a truncated Poisson or zero-inflated alternative for count-of-partners models. Would need the truncation boundary (5) explicitly in the likelihood.
  4. Reweighting — alternative: reweight observations by inverse probability of inclusion, if we can estimate that probability.
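
As a starting point for item 3, here is a minimal sketch of putting the boundary into the likelihood. It uses a right-censored (top-coded) Poisson rather than a strictly truncated one, on the assumption that respondents with more than 5 partners stay in the sample and are recorded at the cap; whether that matches the ARTnet instrument needs checking. The intercept-only model, the simulated data, and all function names are hypothetical.

```python
import math
import random
from collections import Counter

CAP = 5  # reported counts are top-coded at 5 (assumption of this sketch)


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def log_pmf(k, lam):
    """Log of the Poisson probability mass function."""
    return -lam + k * math.log(lam) - math.lgamma(k + 1)


def censored_loglik(lam, freq, cap=CAP):
    """Poisson log-likelihood where a count >= cap means 'cap or more'."""
    p_below = sum(math.exp(log_pmf(k, lam)) for k in range(cap))
    log_tail = math.log(max(1.0 - p_below, 1e-300))
    return sum(n * (log_tail if y >= cap else log_pmf(y, lam))
               for y, n in freq.items())


def golden_max(f, lo, hi, tol=1e-6):
    """Maximize a unimodal function on [lo, hi] by golden-section search."""
    g = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c, d = b - g * (b - a), a + g * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:                      # maximum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + g * (b - a)
            fd = f(d)
        else:                            # maximum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - g * (b - a)
            fc = f(c)
    return (a + b) / 2


rng = random.Random(3)
true_rate = 4.0                          # illustrative value only
reported = [min(poisson_draw(rng, true_rate), CAP) for _ in range(5_000)]
freq = Counter(reported)

naive_rate = sum(reported) / len(reported)   # ignores the cap
mle_rate = golden_max(lambda lam: censored_loglik(lam, freq), 0.05, 50.0)
print(f"naive mean of capped counts: {naive_rate:.2f}")
print(f"censored-Poisson MLE:        {mle_rate:.2f}")
```

For the covariate-adjusted fits the same likelihood would apply with the rate replaced by exp(x'beta) per respondent. The sketch does not address which partners are retained, so the selection effect on the 5 reported partners would still need separate handling.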

Priority / impact

Unknown until we compare ARTnet-derived estimates against a length-bias-corrected baseline. Magnitudes:

Related
