perf(rust): reduce per-query overhead and coalesce small batches#422

Open
vikrantpuppala wants to merge 1 commit into `main` from `rust/perf-small-query-optimizations`

Conversation

@vikrantpuppala
Collaborator

Summary

Four small kernel changes that together close the per-query gap vs the existing Thrift backend on small/medium results, with no regressions on large CloudFetch queries.

  1. Skip redundant DELETE for inline-Closed statements. The SEA server returns status=Closed alongside inline result data — the statement is already cleaned up server-side, so issuing a DELETE is a wasted round-trip (~250ms). Plumbs a server_side_closed: bool through ExecuteResult; Statement::execute_single skips registering the statement_id for cleanup when set.

  2. Make Drop for Statement non-blocking. Drop previously called block_on(close_statement(...)), forcing every caller to pay a synchronous cleanup round-trip even when nothing was waiting for the result. Spawn the close on the runtime instead — best-effort fire-and-forget. Saves ~250ms on every CloudFetch query.

  3. Coalesce small batches on the inline path. InlineArrowProvider was emitting 200+ tiny RecordBatches per 100K-row result (one per IPC message). Adds a batch_merge_target_rows parameter and applies the same coalescing logic the CloudFetch download path already uses. Reduces per-batch overhead at language bindings (e.g. PyO3, ODBC).

  4. Enable batch_merge_target_rows by default (128k rows). Was 0 (disabled). All consumers now get coalesced batches by default; no API change.

Behavior change to call out

Default batch_merge_target_rows flips from 0 to 128_000. Consumers that previously saw many small batches per chunk will now see ~1 large batch per chunk (post-merge). Set the option explicitly to 0 to opt out.

Benchmark

Dogfood warehouse, randomized interleaved (Rust vs Thrift) benchmark, 20 runs per size, median wall time on fetchall_arrow path:

| size | Rust before | Rust after | Thrift | ratio (after/Thrift) |
|:--|--:|--:|--:|--:|
| SELECT 1 | 500ms | 394ms | 387ms | 1.02× |
| 10K | 950ms | 893ms | 1014ms | 0.88× |
| 100K | 1450ms | 1148ms | 1145ms | 1.00× |
| 500K | 2600ms | 2178ms | 3305ms | 0.66× |
| 1M | 3700ms | 3579ms | 3814ms | 0.94× |
| 10M | 8700ms | 8677ms | 8802ms | 0.99× |

Test plan

  • `cargo +stable fmt --all -- --check`: clean
  • `cargo clippy --all-targets -- -D warnings`: clean
  • `cargo test`: full suite (349 tests) passes
  • End-to-end smoke test against the dogfood warehouse, covering both inline and CloudFetch paths

This pull request and its description were written by Isaac.

