
docs(rust): add async public API design#398

Draft
msrathore-db wants to merge 4 commits into main from design/rust-async-interface

Conversation

@msrathore-db (Collaborator)

Summary of the design

Proposes a parallel, native-async public API on the Rust ADBC kernel. The existing synchronous ADBC trait implementations and the C FFI surface used by the ODBC driver are preserved byte-identically — ODBC keeps linking against the same symbols with the same behavior.

Motivation: the Node.js SQL driver's SEA support is being built as a napi-rs native addon that wraps this kernel. napi-rs cannot consume the current sync-over-async public surface because block_on / block_in_place inside a napi-rs tokio worker thread triggers nested-runtime panics or starves the worker pool.

Approach: publish the async infrastructure that already exists inside the kernel (e.g. StreamingCloudFetchProvider::next_batch, #[async_trait] DatabricksClient, the existing CancellationToken plumbing, memory-bounded concurrency via chunks_in_memory: AtomicUsize, and the TokenStore Fresh/Stale/Expired state machine) as an additive public contract. No new async primitives are built — this is mostly a visibility / surface-area change.

Estimated ~700 LOC added, ~0 LOC deleted across a 9-phase implementation plan.

This PR is docs-only. Once aligned, each phase in §10 becomes a separate code PR.

Key decisions and alternatives considered

  1. Inherent async methods on existing structs, parallel to sync ADBC trait impls — chosen. Both surfaces share client, session, and handle state. Rejected alternative: separate crate for the async surface (would require lifting shared types into a third core crate with arbitrary-feeling visibility boundaries).

  2. New AsyncResultReader trait next to the existing ResultReader — chosen. The existing ResultReader feeds ResultReaderAdapter → Arrow's RecordBatchReader, which is a fundamentally sync iterator and cannot be async. A parallel trait is cleaner than retrofitting.

  3. RuntimeMode enum (Owned(Runtime) | Borrowed(Handle)) — proposed. ODBC continues to get the Owned path (today's behavior, zero change). napi-rs passes a Handle and gets Borrowed mode. Alternative: always take an optional Handle. Flagged as Open Question for reviewer input.

  4. AsyncAuthProvider as a new trait, not a replacement for AuthProvider — chosen. Every concrete auth type (PAT, OAuth M2M, OAuth U2M) implements both. Sync trait stays forever as the ADBC contract. Removes the block_in_place + Handle::current().block_on(...) bridge inside get_auth_header for async callers.

  5. Gated behind an async-api Cargo feature, enabled by default — proposed. Open Question for reviewers.

  6. Drop in Borrowed mode spawns cleanup onto the Handle and returns immediately — chosen. Avoids JS-thread freezes during V8 GC of napi-rs wrappers. Callers should prefer explicit close_async().await for deterministic cleanup.
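
Taken together, decisions 2–4 suggest a surface shaped roughly like the sketch below. This is illustrative only: the tokio and Arrow types are stubbed as local placeholders so it stands alone, and the method signatures are assumptions, not the doc's final API.

```rust
use std::future::Future;

// Placeholders so the sketch is self-contained; the real types would be
// tokio::runtime::{Runtime, Handle} and Arrow's Schema/RecordBatch.
pub struct Runtime;
#[derive(Clone)]
pub struct Handle;
pub struct Schema;
pub struct RecordBatch;
pub struct Error;

// Decision 3: who owns the tokio runtime.
pub enum RuntimeMode {
    Owned(Runtime),   // ODBC path: kernel creates and drives its own runtime
    Borrowed(Handle), // napi-rs path: host passes a handle to its runtime
}

// Decision 2: async mirror of ResultReader; a parallel trait rather than
// a retrofit of the fundamentally sync RecordBatchReader pipeline.
pub trait AsyncResultReader {
    fn schema(&self) -> Schema;
    fn next_batch(
        &mut self,
    ) -> impl Future<Output = Result<Option<RecordBatch>, Error>> + Send;
    fn cancel(&mut self);
}

// Decision 4: async mirror of AuthProvider; every concrete auth type
// (PAT, OAuth M2M, OAuth U2M) would implement both traits.
pub trait AsyncAuthProvider {
    fn get_auth_header(
        &self,
    ) -> impl Future<Output = Result<String, Error>> + Send;
}
```

Note the traits use return-position `impl Future` rather than `#[async_trait]`, matching the style the later §20 revision adopts.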

Rejected alternatives (§14 of the doc):

  • Making ADBC trait implementations themselves async — breaks every ADBC consumer.
  • Removing block_on from sync methods entirely — sync-only C/C++ callers have no runtime to enter.
  • Exposing async through the C FFI — C has no async; each language binding (napi-rs, PyO3, JNI) consumes the Rust-native async API directly.

Areas needing specific review focus

  1. ADBC / ODBC byte-identical guarantee (§8 of doc). Does the plan actually leave ODBC behavior unchanged? The existing databricks-odbc integration test suite is the proposed tripwire — is that sufficient, or do we need additional gates?

  2. Runtime ownership model (§4.3, §6.5). Is RuntimeMode enum the right abstraction, or would an optional Handle (always Borrowed when provided, always Owned otherwise) be simpler?

  3. AsyncResultReader trait shape (§5.4). Does schema() / next_batch() / cancel() cover every realistic consumer need, or are we missing things like row-count hints, schema metadata propagation, or partition iteration?

  4. Drop semantics (§6.4). Is spawn-and-forget in Borrowed mode acceptable, or do we need a stronger guarantee (e.g., a reaper task that awaits all outstanding cleanups on kernel shutdown)?

  5. Feature flag default (§9). Enable async-api by default, or opt-in?

  6. Upstream coordination / rollout. This design benefits from a discussion with kernel maintainers before the first code PR lands. Who owns that conversation and on what timeline?

Notes

  • Branch was pushed directly to upstream (admin access); the fork (msrathore-db/databricks) is not GitHub-linked as a fork, so a cross-repo PR wasn't possible.
  • No Jira ticket associated yet — will backfill once we align on rollout approach.
  • Drafting implementation Jira tickets one-per-phase after this design is approved.

Co-authored-by: Isaac

This design document proposes exposing a parallel, native-async public
API on the Rust ADBC kernel. The existing synchronous ADBC traits and
the C FFI surface used by the ODBC driver are preserved byte-identically.

Motivation: the Node.js SQL driver's SEA support is being built as a
napi-rs native addon that wraps this kernel. napi-rs cannot consume the
current sync-over-async public surface because block_on / block_in_place
inside a napi-rs tokio worker thread causes nested-runtime panics.

The design covers runtime-ownership refactoring, async constructors,
async metadata/execute methods, a new AsyncResultReader trait, async
auth provider mirrors, Drop semantics for borrowed-runtime mode,
phased implementation plan, testing strategy, and compatibility
guarantees for ODBC.

Co-authored-by: Isaac
@lidavidm (Contributor) left a comment


Two things:

(1) Please see the upstream proposal: apache/arrow-adbc#3714
(2) We just released an ADBC driver manager for Node.js; is there any reason we can't use that instead?

Adds §18 to the async public API design based on a full audit of the
Node driver's existing auth/transport stack and how it interacts with
the proposed kernel async surface.

Five gaps not fully addressed in the initial design are documented:

1. Kernel's set_auth_provider uses OnceLock — panics on second call —
   which blocks any post-construction token rotation. Proposes adding
   interior mutability (Arc<RwLock<String>>) to PersonalAccessToken with
   an update_token setter. Small kernel change (~20 LOC).

2. Mid-stream token refresh strategy: TS-driven pre-emptive push based
   on JWT exp claim, plus a 401-retry safety net for idempotent metadata
   ops only.

3. TLS trust-store mismatch — Node uses its bundled Mozilla CA list
   plus NODE_EXTRA_CA_CERTS env; Rust uses the OS system store and
   ignores NODE_EXTRA_CA_CERTS. Concrete failure scenarios + proposed
   TS-side bridge (caPath option, temp-file serialization for in-memory
   ca Buffers) + flagged kernel PR for in-memory Certificate::from_pem.

4. mTLS gap — Node supports client certs, kernel doesn't. Flagged as
   known SEA v1 limitation with a separate kernel issue.

5. Proxy divergence — Rust has no SOCKS support. Node binding must
   fail-fast with a clear error for SOCKS users.

Plus addresses User-Agent masquerade (kernel defaults to a JDBC UA
string that would misattribute SEA traffic), read-timeout default
(60s is too short for real workloads), and adds five new Open
Questions + a new Phase 2b for the mutable-PAT work.
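
Gap 1's mutable-PAT proposal can be sketched as below. `PersonalAccessToken` and `update_token` are names from the change description; the field layout and the `Bearer` header shape are illustrative assumptions, not the kernel's actual code.

```rust
use std::sync::{Arc, RwLock};

// Interior mutability replaces the OnceLock: the token can be rotated
// after construction without re-creating the auth provider.
pub struct PersonalAccessToken {
    token: Arc<RwLock<String>>,
}

impl PersonalAccessToken {
    pub fn new(token: impl Into<String>) -> Self {
        Self {
            token: Arc::new(RwLock::new(token.into())),
        }
    }

    // Existing sync AuthProvider path: read the current token.
    pub fn auth_header(&self) -> String {
        format!("Bearer {}", self.token.read().unwrap())
    }

    // Proposed setter: the TS layer pushes a fresh token here pre-emptively,
    // based on the JWT `exp` claim, before the old one expires.
    pub fn update_token(&self, new_token: impl Into<String>) {
        *self.token.write().unwrap() = new_token.into();
    }
}
```

Because the token sits behind an `Arc`, clones held by in-flight requests observe the rotation on their next read, which is what makes mid-stream refresh (gap 2) possible without tearing down connections.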

Co-authored-by: Isaac

## 1. Summary

This design proposes adding a **parallel, native-async public API** to the `databricks-adbc` Rust kernel. The existing synchronous ADBC trait surface (which the ODBC C++ driver consumes via a C FFI) remains byte-identical. The new async surface unlocks a Node.js binding (and future Python/JNI bindings) that cannot tolerate the `block_on` / `block_in_place` bridges currently used everywhere sync-public-wraps-async-internal.

We already have a Node.js driver manager that could load the driver without this; it would be better for this proposal to go to the arrow-adbc repo.

@msrathore-db (Collaborator, Author)

We're yet to align on whether we want the Rust kernel to adhere to the ADBC spec. If the decision is not to follow the spec but to build an independent Databricks Rust kernel, this binding might be reimplemented. This doc PR is experimental, exploring what we can achieve; the final design doc will come later.

@msrathore-db marked this pull request as draft on April 23, 2026, 20:57
Adds §19 — a validation pass comparing this design against the actual
implementation of @apache-arrow/adbc-driver-manager (Apache's Node-side
napi-rs wrapper over the ADBC C API).

Reading Apache's implementation produced three categories of findings:

**Patterns to adopt verbatim** (§19.2):
- Capture-then-reject error pattern. Never use env.throw() inside
  AsyncTask::reject — it creates a pending V8 exception that escapes
  as uncaughtException instead of rejecting the Promise. Fix: capture
  raw error in compute() into a field on the Task, then construct a
  structured JS Error object in reject() via create_error +
  set_named_property. Non-negotiable — getting this wrong produces
  production-breaking uncaughtExceptions.
- Option<Arc<T>> state with close() = take() — uniform across
  Database/Connection/Statement/ResultIterator.
- Raw napi classes prefixed with underscore, TS wrappers mandatory
  (our translation layer).
- Symbol.asyncDispose support for using / await using ergonomics.
- String-keyed HashMap for options.
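
The capture-then-reject rule can be shown as control flow with the napi-rs types stubbed out; the real `AsyncTask`/`Env` signatures differ, so treat this purely as a sketch of where the error must travel.

```rust
// Stand-in for the kernel error; in the real binding, reject() would turn
// this into a structured JS Error via create_error + set_named_property.
pub struct KernelError {
    pub message: String,
    pub sql_state: String,
}

pub struct ExecuteTask {
    // compute() runs off the JS thread. The error is *captured* into a
    // field instead of being thrown, so no pending V8 exception exists
    // when the Promise machinery runs.
    pub captured: Option<KernelError>,
}

impl ExecuteTask {
    pub fn compute(&mut self) {
        // ...kernel call fails; capture it, never env.throw()...
        self.captured = Some(KernelError {
            message: "query failed".into(),
            sql_state: "42000".into(),
        });
    }

    // Runs on the JS thread: build the structured error and hand it to
    // the Promise rejection path.
    pub fn reject(&mut self) -> (String, String) {
        let e = self.captured.take().expect("compute captured an error");
        (e.message, e.sql_state)
    }
}
```

The point is the hand-off: the only thing crossing from `compute()` to `reject()` is plain captured data, so the Promise rejects cleanly instead of escaping as an uncaughtException.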

**Patterns to deliberately diverge from** (§19.3):
- Streaming granularity: Apache drains entire RecordBatchReader into
  one IPC buffer per query. Would OOM on our EXTERNAL_LINKS GB-scale
  results. We stay with batch-at-a-time next_batch().
- AsyncTask + libuv vs #[napi] async fn + tokio: Apache is sync-kernel
  so AsyncTask fits; our kernel is async so #[napi] async fn is right.
- Node version floor: we need Node 16+; Apache requires Node 22+. Not
  a choice for us.
- musl-linux: included in our matrix; Apache excludes.
- cancel(): Apache omits; we need it for long SEA queries.

**Validated existing design choices:**
- Arrow IPC bytes as FFI wire format.
- Two-layer TS wrapper over raw napi classes.
- Structured JS error with sqlState.
- IBackend / SeaBackend split.

Four independent reasons reaffirm we cannot adopt Apache's npm directly
(§19.7): ADBC-compliance coupling risk, alpha status, Node 22+ floor,
auth feature gap. We stay aligned with their binding-layer conventions
while building our own crate.

Adds two new open questions (#12 runtime model for Borrowed mode,
#13 capture-then-reject style guide), updates file-by-file map with
concrete version pins (napi-rs 3.0, napi-build 2, arrow-ipc/array/
schema >=53.1.0).

Co-authored-by: Isaac
Adds a short §20 to the design covering the convergence plan with
apache/arrow-adbc#3714 (active upstream PR that adds async traits to
adbc_core). Three items:

1. Adopt RPITIT style (`fn foo() -> impl Future<Output=T> + Send`)
   instead of `#[async_trait]` from day one. Supersedes prior
   references to async-trait in §5.4/§5.5. Zero-cost convergence
   when #3714 lands.

2. Convergence is a mechanical ~200 LOC refactor when upstream
   merges — no behavioral change, napi-rs binding unaffected. Our
   AsyncResultReader becomes Pin<Box<dyn RecordBatchStream + Send>>;
   our inherent methods become trait impls; our Databricks-specific
   metadata method variants go away since upstream `get_objects`
   covers them. Our AsyncAuthProvider stays — ADBC has no auth trait.

3. Deep-dive analysis of the upstream PR and the Apache Node driver
   manager (@apache-arrow/adbc-driver-manager) lives in a companion
   reference doc (adbc-upstream-references.md) to keep the design
   focused on architecture rather than investigation notes.

Why this was not folded into §19 (Apache Node binding validation):
different subject matter (runtime trait shape vs binding-layer
patterns), different convergence paths (we already implement
adbc_core sync traits; we'll add async ones; we don't adopt the
Node package).
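
Item 1's RPITIT style, sketched with a hypothetical `AsyncClient` trait (the kernel's real trait names and methods are assumptions), plus a minimal poller so the example runs without tokio:

```rust
use std::future::Future;

pub struct Session;
pub struct Error;

// RPITIT form: the method #[async_trait] would box as
// Pin<Box<dyn Future + Send>> stays an unboxed, statically dispatched
// future here, which is why convergence with #3714 is zero-cost.
pub trait AsyncClient {
    fn open_session(&self) -> impl Future<Output = Result<Session, Error>> + Send;
}

pub struct StubClient;

impl AsyncClient for StubClient {
    fn open_session(&self) -> impl Future<Output = Result<Session, Error>> + Send {
        std::future::ready(Ok(Session))
    }
}

// Minimal single-poll driver so the sketch runs without a runtime; only
// suitable for futures that are immediately ready.
pub fn poll_once<F: Future>(fut: F) -> Option<F::Output> {
    use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};
    fn raw() -> RawWaker {
        fn clone(_: *const ()) -> RawWaker {
            raw()
        }
        fn nop(_: *const ()) {}
        static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, nop, nop, nop);
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    let waker = unsafe { Waker::from_raw(raw()) };
    let mut cx = Context::from_waker(&waker);
    let fut = std::pin::pin!(fut);
    match fut.poll(&mut cx) {
        Poll::Ready(v) => Some(v),
        Poll::Pending => None,
    }
}
```

In real code the binding would `.await` these futures on the borrowed tokio handle; the hand-rolled poller exists only to keep the sketch dependency-free.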

Co-authored-by: Isaac
@msrathore-db (Collaborator, Author)

> Two things:
>
> (1) Please see the upstream proposal: apache/arrow-adbc#3714
> (2) We just released an ADBC driver manager for Node.js; is there any reason we can't use that instead?

  1. What's the timeline for the upstream proposal to land?
  2. Will it be able to support async execution? If yes, what's the timeline for that?

@lidavidm (Contributor)

> 1. What's the timeline for the upstream proposal to land?
>
> 2. Will it be able to support async execution? If yes, what's the timeline for that?

For (2), I assume you mean the Node.js side? @kentkwu can you help out here?
