cluster: replay queued up events #836

Draft
dkropachev wants to merge 1 commit into master from fix/replay-up-after-down-handling
Conversation

@dkropachev
Collaborator

Summary

  • Replay queued node-up events after down handling completes instead of dropping them while the host remains down.
  • Track down-handling revisions so stale or superseded callbacks do not clear newer host state work.
  • Keep queued-up replay invalidatable until on_up() reacquires the host lock, and preserve no-retry auth failures for hosts that were never marked up.
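The revision tracking described in the second bullet can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `HostState`, `begin_down`, and `finish_down` are hypothetical names, and the real driver keeps this state elsewhere.

```python
import threading


class HostState:
    """Hypothetical per-host bookkeeping (not the driver's real class)."""

    def __init__(self):
        self.lock = threading.Lock()
        self.down_revision = 0        # bumped each time down handling starts
        self.down_in_progress = False


def begin_down(state):
    """Start down handling and return the revision this run owns."""
    with state.lock:
        state.down_revision += 1
        state.down_in_progress = True
        return state.down_revision


def finish_down(state, revision):
    """Clear the in-progress flag only if this callback is still current.

    Returns False for a stale callback, which leaves newer state untouched.
    """
    with state.lock:
        if revision != state.down_revision:
            return False              # superseded by a newer down handling run
        state.down_in_progress = False
        return True
```

With two overlapping runs, the callback of the first run returns False and leaves `down_in_progress` set, so only the latest run can clear it.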

Tests

  • uv run pytest tests/unit/test_cluster.py -q
  • git diff --check

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • I added relevant tests for new features and bug fixes.
  • All commits compile, pass static checks and pass tests.
  • PR description sums up the changes and reasons why they should be introduced.
  • I have provided docstrings for the public items that I want to introduce.
  • I have adjusted the documentation in ./docs/source/.
  • I added appropriate Fixes: annotations to PR description.

Comment thread cassandra/cluster.py Fixed
@dkropachev force-pushed the fix/replay-up-after-down-handling branch 3 times, most recently from 2094ebd to db683a0 on April 30, 2026 17:33
Comment thread cassandra/cluster.py Fixed
@dkropachev force-pushed the fix/replay-up-after-down-handling branch from db683a0 to cff85ac on April 30, 2026 17:56
Comment thread cassandra/cluster.py Fixed
Host status events can race when an UP notification arrives while DOWN handling is still running in the executor. Previously the UP path could complete first, only for the pending DOWN path to remove pools and start a reconnector afterwards, leaving host liveness state stale.

Track per-host liveness epochs and queue UP handling while DOWN handling is active. Replay the queued UP only if no newer DOWN or REMOVE event superseded it, and guard reconnection and pool cleanup against stale host objects.

Add unit coverage for superseded up/down/remove sequences, queued replay, and endpoint updates.
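
The queue-and-replay behaviour described in this commit message can be sketched as follows. This is a simplified model under stated assumptions, not the change to cassandra/cluster.py: the class `HostLiveness` and its methods are hypothetical, and pools, reconnectors, and the executor are left out.

```python
import threading


class HostLiveness:
    """Hypothetical model of per-host liveness epochs (illustrative only)."""

    def __init__(self):
        self.lock = threading.Lock()
        self.epoch = 0                # bumped by every DOWN or REMOVE event
        self.down_active = False      # True while DOWN handling is running
        self.queued_up_epoch = None   # epoch at which an UP was queued
        self.is_up = False

    def begin_down(self):
        with self.lock:
            self.epoch += 1
            self.down_active = True
            self.is_up = False

    def supersede(self):
        """A newer DOWN or REMOVE event arrived; invalidate any queued UP."""
        with self.lock:
            self.epoch += 1

    def up(self):
        with self.lock:
            if self.down_active:
                # Do not race the pending DOWN work; remember the UP instead.
                self.queued_up_epoch = self.epoch
                return "queued"
            self.is_up = True
            return "applied"

    def finish_down(self):
        with self.lock:
            self.down_active = False
            queued, self.queued_up_epoch = self.queued_up_epoch, None
            if queued is None:
                return "nothing queued"
            if queued != self.epoch:
                return "dropped"      # superseded after it was queued
            self.is_up = True
            return "replayed"
```

In this model an UP that arrives during DOWN handling is applied only once `finish_down()` runs, and only if the epoch has not moved since it was queued.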
@dkropachev force-pushed the fix/replay-up-after-down-handling branch from cff85ac to 368e7e6 on April 30, 2026 21:19
Comment thread cassandra/cluster.py
def on_up(self, host):
    return self._on_up(host)

def _on_up(self, host, expected_epoch=None):