
Add separate PVC for tiered storage cache #239

Open
eduardagarici wants to merge 12 commits into master from CORE-152486

Conversation

@eduardagarici

Add tiered storage cache as a separate PVC

Summary

  • Introduces a tieredStorageCache: true flag on StorageConfig that marks a volume as a Kafka tiered storage cache (e.g. rsm.config.fetch.chunk.cache.path). Cache volumes are provisioned as independent PVCs, separate from
    Kafka log volumes.
  • Cache volumes are excluded from log.dirs so Kafka does not treat them as data directories.
  • Cache volumes are excluded from Cruise Control capacity calculations and disk rebalance operations — CC must not account for ephemeral storage.
  • Controller-only nodes skip cache PVCs entirely since they do not serve broker traffic.
  • Shrink path: because cache data is ephemeral, koperator supports shrinking a cache PVC via a safe delete-and-recreate flow coordinated with the rolling upgrade machinery (see the sketch after this list). The old PVC is annotated pending-deletion, a replacement is created immediately (provisioning overlaps with gate evaluation), and the old PVC is deleted once the broker pod stops. A grow continues to use the standard Kubernetes in-place expansion path. Full details in docs/tiered-storage-pvc-resize.md.
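For orientation, a minimal Go-flavored sketch of the shrink sequence described above. Only CacheResizePendingDeletion and the step ordering come from this PR; the helper names (stageCacheShrink, setCacheVolumeState, brokerPodGone) are hypothetical, and docs/tiered-storage-pvc-resize.md remains the authoritative description.

// Hypothetical sketch; client is sigs.k8s.io/controller-runtime/pkg/client,
// apierrors is k8s.io/apimachinery/pkg/api/errors.
func stageCacheShrink(ctx context.Context, c client.Client,
    oldPVC, replacement *corev1.PersistentVolumeClaim) error {
    // 1. Persist intent in the CR status BEFORE creating the replacement,
    //    so a crash between steps 1 and 2 re-enters safely.
    if err := setCacheVolumeState(ctx, c, oldPVC, CacheResizePendingDeletion); err != nil {
        return err
    }
    // 2. Create the smaller replacement immediately; provisioning overlaps
    //    with rolling-upgrade gate evaluation.
    if err := c.Create(ctx, replacement); err != nil && !apierrors.IsAlreadyExists(err) {
        return err
    }
    // 3. Delete the old PVC only once the broker pod has stopped; until then
    //    the pvc-protection finalizer would pin it in Terminating.
    if brokerPodGone(ctx, c, oldPVC) {
        return client.IgnoreNotFound(c.Delete(ctx, oldPVC))
    }
    return nil
}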

Test plan

  • testTieredStorageCachePvcResize passes against a local kind cluster — verifies the full shrink lifecycle (staging, pod cycle, annotation cleanup, correct final size)
  • Cache volume does not appear in log.dirs on the broker
  • CC capacity config does not include the cache mount path
  • Normal log volume grow/shrink behaviour is unaffected

Comment thread on docs/tiered-storage-pvc-resize.md
@dobrerazvan

Pre-Flight Review — CORE-152486

Branch: CORE-152486 vs master | Scope: 20 files, tiered storage cache PVC shrink feature


Summary

The feature design is sound — CR status as state store, atomicity ordering (state before PVC creation), size-based PVC identification, and treating terminating pods as absent are all correct for a Kubernetes operator. One HIGH bug must be fixed before merging. Eight MEDIUM observations are recommended fixes; six LOW items are optional improvements.


Alignment Check

  • Crash safety / re-entrancy | Correct (state written before replacement PVC created)
  • Terminating pod = absent | Correct
  • Grow path (no special handling) | Correct
  • CC + log.dirs exclusion | Correct
  • CRD backward compatibility | omitempty on all new fields

Findings

HIGH — Fix before merging

C1 — deleteRemovedCachePVCs deletes a mounted PVC
pkg/resources/kafka/kafka.go

When a TieredStorageCache: true storage config is removed while the broker pod is still running, and no CacheVolumeStates entry exists for that mount path, handleBrokerCacheResizeCleanup returns early (len(cacheVolumeStates) == 0). deleteRemovedCachePVCs then runs unconditionally and issues a Delete against the actively-mounted PVC. The pvc-protection finalizer prevents immediate data loss, but the delete is still issued unintentionally.

cleanupOrphanedDuplicateCachePVCs (correctly) gates on !brokerPodExists. deleteRemovedCachePVCs must do the same:

if !brokerPodExists {
    // Only clean up removed cache PVCs once the broker pod is gone,
    // matching the guard already used by cleanupOrphanedDuplicateCachePVCs.
    if err := r.deleteRemovedCachePVCs(ctx, log, brokerId, pvcList, desiredMountPaths); err != nil {
        return err
    }
}

MEDIUM — Strongly recommended

C4 — Missing test: crash re-entry with no replacement PVC
pkg/resources/kafka/kafka_test.go

No test covers: initialCacheVolumeState = CacheResizePendingDeletion, only the old PVC exists (reconciler crashed between writing state and creating replacement), pod running. Expected: no Delete, no state clear, one Create (replacement re-staged). The invariant is described in comments but untested.
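A sketch of what the missing test could assert; every harness helper here (newTestReconciler, runCacheCleanup, and the recording accessors) is a hypothetical stand-in for whatever kafka_test.go actually uses:

// Hypothetical test sketch, not the suite's real harness.
func TestCacheShrinkReentryWithoutReplacement(t *testing.T) {
    // Arrange: state = CacheResizePendingDeletion, only the old PVC exists
    // (crash between writing state and creating the replacement), pod running.
    r := newTestReconciler(t,
        withCacheVolumeState(mountPath, CacheResizePendingDeletion),
        withObjects(oldCachePVC, runningBrokerPod),
    )

    runCacheCleanup(t, r)

    // Assert: no Delete, state not cleared, exactly one Create (re-staged).
    if n := len(r.deletes()); n != 0 {
        t.Fatalf("expected no PVC deletes, got %d", n)
    }
    if got := r.cacheVolumeState(mountPath); got != CacheResizePendingDeletion {
        t.Fatalf("pending-deletion state cleared unexpectedly: %q", got)
    }
    if n := len(r.creates()); n != 1 {
        t.Fatalf("expected exactly one Create (replacement), got %d", n)
    }
}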


A1 — Empty-string sentinel for map deletion is an implicit contract
pkg/k8sutil/status.go:182

CacheResizeState("") silently deletes map entries in generateBrokerState. The established pattern is DeleteVolumeStatus. An uninitialized CacheResizeState variable passed by a future caller would delete entries with no compile error. Consider a named constant (CacheResizeStateNone) or a dedicated DeleteCacheVolumeStatus helper.


A2 — "tieredStorageCache" annotation key duplicated as bare string 9+ times
pkg/resources/kafka/kafka.go, pvc.go, kafka_test.go

A typo at any read site silently treats a cache PVC as a regular data volume. Define a named constant (e.g., TieredStorageCacheAnnotation = "tieredStorageCache") following the BrokerIdLabelKey precedent.
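Following the BrokerIdLabelKey precedent, the fix is a one-line constant (package placement is an assumption):

// TieredStorageCacheAnnotation marks a PVC as a tiered-storage cache volume.
const TieredStorageCacheAnnotation = "tieredStorageCache"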


A5 — Tiered-cache PVC guard predicate duplicated at 3 sites
pkg/resources/kafka/kafka.go:1298, 1326, 1339

pvc.DeletionTimestamp != nil || pvc.Annotations["tieredStorageCache"] != annotationTrue appears three times. Extract to a helper (isTieredCachePVC).
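Combined with the A2 constant, one possible shape for the helper; note it inverts the quoted skip-predicate into a positive check, so call sites read if !isTieredCachePVC(pvc) { continue }:

// isTieredCachePVC reports whether pvc is a live (non-terminating)
// tiered-storage cache PVC.
func isTieredCachePVC(pvc *corev1.PersistentVolumeClaim) bool {
    return pvc.DeletionTimestamp == nil &&
        pvc.Annotations[TieredStorageCacheAnnotation] == annotationTrue
}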


SE1 — PVC deletion gated only on mutable annotations, no OwnerReference check
pkg/resources/kafka/kafka.go | deleteRemovedCachePVCs, cleanupOrphanedDuplicateCachePVCs

The List selector (LabelsForKafka + BrokerIdLabelKey) and annotation filter (tieredStorageCache, mountPath) have no OwnerReference verification. A principal with patch access to PVCs in the namespace could craft a matching PVC to trigger deletion. Add an ownership guard before deleting in both helpers.
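A minimal sketch of the guard, assuming cache PVCs are created with the KafkaCluster CR as their controller owner (if ownership is rooted elsewhere, compare against that object's UID instead):

// ownedByCluster verifies the controller reference before any Delete, so a
// hand-crafted PVC with matching labels/annotations is never deleted.
// metav1 is k8s.io/apimachinery/pkg/apis/meta/v1.
func ownedByCluster(pvc *corev1.PersistentVolumeClaim, cluster metav1.Object) bool {
    ref := metav1.GetControllerOf(pvc)
    return ref != nil && ref.UID == cluster.GetUID()
}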


SE2 — Annotation backfill enables two-step privilege escalation
pkg/resources/kafka/kafka.go:1608

A KafkaCluster CR writer (without direct PVC delete access) can: (1) set TieredStorageCache: true on an existing data volume's storage config — operator backfills the annotation; (2) remove the storage config entry — deleteRemovedCachePVCs deletes the now-annotated data PVC, bypassing the CC disk-removal workflow. Consider restricting backfill to ClaimPending PVCs or adding a validation webhook.
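The ClaimPending restriction can be a one-line gate around the backfill (sketch; the surrounding loop at kafka.go:1608 is paraphrased):

// Only backfill the annotation onto never-bound PVCs; an already-bound data
// PVC then cannot be reclassified as a deletable cache through CR edits alone.
if pvc.Status.Phase == corev1.ClaimPending {
    pvc.Annotations[TieredStorageCacheAnnotation] = annotationTrue
    // ...update the PVC as before...
}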


D1 — Rollback mid-resize leaves two PVCs indefinitely with no recovery path
docs/tiered-storage-pvc-resize.md

If the operator is rolled back while pending-deletion state is set, the old operator ignores the unknown field — two PVCs coexist at the same mount path with no automatic recovery and no visible signal in kubectl describe kafkacluster. Add a "Rollback safety" section with detection steps and a manual recovery procedure.


D2 — No Kubernetes Events for resize lifecycle
pkg/resources/kafka/kafka.go

The entire resize lifecycle is visible only in operator logs. Emit record.Event(...) at minimum when pending-deletion is first written and when cleanup completes. This is the standard on-call triage interface; without it, a broker restart from a resize is indistinguishable from a crash without operator log access.
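With client-go's standard record.EventRecorder (assuming the reconciler has, or gains, a Recorder field; event reasons here are illustrative), the two minimum events could look like:

// When pending-deletion state is first written:
r.Recorder.Event(cluster, corev1.EventTypeNormal, "CachePVCShrinkStaged",
    fmt.Sprintf("cache PVC %s staged for deletion; replacement created", oldPVC.Name))

// When cleanup completes:
r.Recorder.Event(cluster, corev1.EventTypeNormal, "CachePVCShrinkComplete",
    fmt.Sprintf("cache PVC shrink at %s complete", mountPath))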


LOW — Optional improvements

  • C3 | Add unit test for C1's scenario (config removed, pod running, no prior state) | kafka_test.go
  • A3 | annotationTrue used for both annotation values and Kafka config comparisons; split or inline in configmap.go | kafka.go:72
  • A4 | skipBroker bool return from handleBrokerCacheResizeCleanup couples the helper to the caller's loop structure | kafka.go:360
  • S3 | Two new //nolint:funlen directives lack rationale comments | kafka.go:1402, 1552
  • D4 | UpdateBrokerStatus(ConfigOutOfSync) called every cycle during rolling restart; the guard may miss due to stale in-memory state | kafka.go:~1653
  • D6 | Pre-cleanup in e2e uses the 10-minute tsResizePhaseTimeout for a step that should complete in seconds | test_tiered_storage_cache_resize.go:228

Rejected Findings

  • C2 | getCreatedPvcForBroker deduplication target is irrelevant; reconcileDesiredPvcsForBroker independently identifies the correct PVC by size
  • A6 | Per-PVC r.List in an inner loop is pre-existing; the PR doesn't worsen it
  • SE3 | Log verbosity preference; mount paths and PVC names are standard operational metadata
  • D3 | SETUP_ENVTEST_VERSION pin has an explicit rationale comment; intentionally excluded from renovate
  • D5 | Namespace is set explicitly on the line before requireDeleteKafkaCluster; a code comment documents the pattern

