docs(rust): add async public API design #398
Conversation
This design document proposes exposing a parallel, native-async public API on the Rust ADBC kernel. The existing synchronous ADBC traits and the C FFI surface used by the ODBC driver are preserved byte-identically.

Motivation: the Node.js SQL driver's SEA support is being built as a napi-rs native addon that wraps this kernel. napi-rs cannot consume the current sync-over-async public surface because `block_on` / `block_in_place` inside a napi-rs tokio worker thread causes nested-runtime panics.

The design covers runtime-ownership refactoring, async constructors, async metadata/execute methods, a new `AsyncResultReader` trait, async auth provider mirrors, Drop semantics for borrowed-runtime mode, a phased implementation plan, a testing strategy, and compatibility guarantees for ODBC.

Co-authored-by: Isaac
lidavidm
left a comment
Two things:
(1) Please see the upstream proposal: apache/arrow-adbc#3714
(2) We just released an ADBC driver manager for Node.js; is there any reason we can't use that instead?
Adds §18 to the async public API design, based on a full audit of the Node driver's existing auth/transport stack and how it interacts with the proposed kernel async surface. Five gaps not fully addressed in the initial design are documented:

1. The kernel's `set_auth_provider` uses `OnceLock` — panics on a second call — which blocks any post-construction token rotation. Proposes adding interior mutability (`Arc<RwLock<String>>`) to `PersonalAccessToken` with an `update_token` setter; a small kernel change (~20 LOC, sketched below).
2. Mid-stream token refresh strategy: TS-driven pre-emptive push based on the JWT `exp` claim, plus a 401-retry safety net for idempotent metadata ops only.
3. TLS trust-store mismatch — Node uses its bundled Mozilla CA list plus the `NODE_EXTRA_CA_CERTS` env var; Rust uses the OS system store and ignores `NODE_EXTRA_CA_CERTS`. Documents concrete failure scenarios, proposes a TS-side bridge (`caPath` option, temp-file serialization for in-memory `ca` Buffers), and flags a kernel PR for in-memory `Certificate::from_pem`.
4. mTLS gap — Node supports client certs, the kernel doesn't. Flagged as a known SEA v1 limitation with a separate kernel issue.
5. Proxy divergence — Rust has no SOCKS support. The Node binding must fail fast with a clear error for SOCKS users.

Also addresses the User-Agent masquerade (the kernel defaults to a JDBC UA string that would misattribute SEA traffic) and the read-timeout default (60s is too short for real workloads), and adds five new Open Questions plus a new Phase 2b for the mutable-PAT work.

Co-authored-by: Isaac
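A hedged sketch of the item-1 kernel change: the type and method names follow the summary above (`PersonalAccessToken`, `update_token`); the locking scheme and `auth_header` accessor are illustrative, not the kernel's actual API.

```rust
use std::sync::{Arc, RwLock};

pub struct PersonalAccessToken {
    // Interior mutability replaces the write-once OnceLock, so the Node
    // driver can rotate the token after construction.
    token: Arc<RwLock<String>>,
}

impl PersonalAccessToken {
    pub fn new(token: impl Into<String>) -> Self {
        Self { token: Arc::new(RwLock::new(token.into())) }
    }

    /// Replaces the token in place. In-flight requests keep the value they
    /// already read; subsequent requests pick up the new one.
    pub fn update_token(&self, token: impl Into<String>) {
        *self.token.write().expect("token lock poisoned") = token.into();
    }

    pub fn auth_header(&self) -> String {
        format!("Bearer {}", self.token.read().expect("token lock poisoned"))
    }
}
```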
## 1. Summary
This design proposes adding a **parallel, native-async public API** to the `databricks-adbc` Rust kernel. The existing synchronous ADBC trait surface (which the ODBC C++ driver consumes via a C FFI) remains byte-identical. The new async surface unlocks a Node.js binding (and future Python/JNI bindings) that cannot tolerate the `block_on` / `block_in_place` bridges currently used everywhere a sync public method wraps async internals.
We already have a Node.js driver manager that could load the driver without this; it would be better for this proposal to go to the arrow-adbc repo.
We have yet to align on whether we want the Rust kernel to adhere to the ADBC spec. If the decision is not to follow the spec but to have an independent Databricks Rust kernel, then this binding might be reimplemented. This doc PR is just an experiment in what we can achieve; the final design doc will be decided later.
Adds §19 — a validation pass comparing this design against the actual implementation of `@apache-arrow/adbc-driver-manager` (Apache's Node-side napi-rs wrapper over the ADBC C API). Reading Apache's implementation produced three categories of findings.

**Patterns to adopt verbatim** (§19.2):

- Capture-then-reject error pattern (sketched below). Never use `env.throw()` inside `AsyncTask::reject` — it creates a pending V8 exception that escapes as an `uncaughtException` instead of rejecting the Promise. Fix: capture the raw error in `compute()` into a field on the Task, then construct a structured JS Error object in `reject()` via `create_error` + `set_named_property`. Non-negotiable — getting this wrong produces production-breaking `uncaughtException`s.
- `Option<Arc<T>>` state with `close()` = `take()` — uniform across Database/Connection/Statement/ResultIterator.
- Raw napi classes prefixed with underscore, TS wrappers mandatory (our translation layer).
- `Symbol.asyncDispose` support for `using` / `await using` ergonomics.
- String-keyed `HashMap` for options.

**Patterns to deliberately diverge from** (§19.3):

- Streaming granularity: Apache drains the entire `RecordBatchReader` into one IPC buffer per query, which would OOM on our EXTERNAL_LINKS GB-scale results. We stay with batch-at-a-time `next_batch()`.
- `AsyncTask` + libuv vs `#[napi] async fn` + tokio: Apache is sync-kernel, so `AsyncTask` fits; our kernel is async, so `#[napi] async fn` is right.
- Node version floor: we need Node 16+; Apache requires Node 22+. Not a choice for us.
- musl-linux: included in our matrix; Apache excludes it.
- `cancel()`: Apache omits it; we need it for long SEA queries.

**Validated existing design choices:**

- Arrow IPC bytes as the FFI wire format.
- Two-layer TS wrapper over raw napi classes.
- Structured JS error with `sqlState`.
- `IBackend` / `SeaBackend` split.

Four independent reasons reaffirm we cannot adopt Apache's npm package directly (§19.7): ADBC-compliance coupling risk, alpha status, the Node 22+ floor, and the auth feature gap. We stay aligned with their binding-layer conventions while building our own crate.

Adds two new open questions (#12 runtime model for Borrowed mode, #13 capture-then-reject style guide) and updates the file-by-file map with concrete version pins (napi-rs 3.0, napi-build 2, arrow-ipc/array/schema >=53.1.0).

Co-authored-by: Isaac
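A hedged sketch of the capture-then-reject pattern against napi-rs's `Task` trait. `KernelError`, `run_query`, and the property layout are placeholders, and it assumes (matching Apache's usage) that the `Ok` value returned from `reject()` becomes the Promise's rejection value.

```rust
use napi::{Env, Error, JsObject, Result, Task};

// Placeholders for the kernel side of the sketch.
struct KernelError { message: String, sql_state: String }
fn run_query(_sql: &str) -> std::result::Result<Vec<u8>, KernelError> { unimplemented!() }

struct QueryTask {
    sql: String,
    captured: Option<KernelError>, // raw error captured in compute()
}

impl Task for QueryTask {
    type Output = Vec<u8>; // e.g. Arrow IPC bytes
    type JsValue = JsObject;

    fn compute(&mut self) -> Result<Self::Output> {
        // Runs on a libuv worker thread: no Env is available here, so we
        // only capture the failure and return a plain napi Error.
        run_query(&self.sql).map_err(|e| {
            let reason = e.message.clone();
            self.captured = Some(e);
            Error::from_reason(reason)
        })
    }

    fn resolve(&mut self, env: Env, output: Self::Output) -> Result<Self::JsValue> {
        let mut obj = env.create_object()?;
        obj.set_named_property("ipc", env.create_buffer_with_data(output)?.into_raw())?;
        Ok(obj)
    }

    fn reject(&mut self, env: Env, err: Error) -> Result<Self::JsValue> {
        // Never env.throw() here: a pending V8 exception escapes as an
        // uncaughtException instead of rejecting the Promise. Build a
        // structured JS Error and hand it back as the rejection value.
        let mut js_err = env.create_error(err)?;
        if let Some(k) = self.captured.take() {
            js_err.set_named_property("sqlState", env.create_string(&k.sql_state)?)?;
        }
        Ok(js_err)
    }
}
```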
Adds a short §20 to the design covering the convergence plan with apache/arrow-adbc#3714 (an active upstream PR that adds async traits to `adbc_core`). Three items:

1. Adopt RPITIT style (`fn foo() -> impl Future<Output = T> + Send`, sketched below) instead of `#[async_trait]` from day one. Supersedes prior references to async-trait in §5.4/§5.5. Zero-cost convergence when #3714 lands.
2. Convergence is a mechanical ~200 LOC refactor when upstream merges — no behavioral change, and the napi-rs binding is unaffected. Our `AsyncResultReader` becomes `Pin<Box<dyn RecordBatchStream + Send>>`; our inherent methods become trait impls; our Databricks-specific metadata method variants go away since upstream `get_objects` covers them. Our `AsyncAuthProvider` stays — ADBC has no auth trait.
3. The deep-dive analysis of the upstream PR and the Apache Node driver manager (`@apache-arrow/adbc-driver-manager`) lives in a companion reference doc (adbc-upstream-references.md) to keep the design focused on architecture rather than investigation notes.

Why this was not folded into §19 (Apache Node binding validation): different subject matter (runtime trait shape vs binding-layer patterns) and different convergence paths (we already implement the `adbc_core` sync traits and will add the async ones; we don't adopt the Node package).

Co-authored-by: Isaac
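A minimal sketch of the RPITIT shape adopted in item 1 (Rust 1.75+); trait, struct, and method names are illustrative, not the kernel's actual surface.

```rust
use std::future::Future;

pub struct KernelError; // placeholder error type for the sketch

pub trait AsyncStatement {
    // The trait names its future inline (RPITIT): no #[async_trait] macro,
    // no boxed futures, and the same shape apache/arrow-adbc#3714 uses,
    // so later convergence is a mechanical rename, not a macro migration.
    fn execute_update(&mut self, sql: &str)
        -> impl Future<Output = Result<u64, KernelError>> + Send;
}

pub struct SeaStatement;

impl AsyncStatement for SeaStatement {
    fn execute_update(&mut self, _sql: &str)
        -> impl Future<Output = Result<u64, KernelError>> + Send {
        // A real implementation would drive the SEA HTTP calls here.
        async move { Ok(0) }
    }
}
```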
For (2), I assume you mean the Node.js side? @kentkwu can you help out here?
Summary of the design
Proposes a parallel, native-async public API on the Rust ADBC kernel. The existing synchronous ADBC trait implementations and the C FFI surface used by the ODBC driver are preserved byte-identically — ODBC keeps linking against the same symbols with the same behavior.
Motivation: the Node.js SQL driver's SEA support is being built as a napi-rs native addon that wraps this kernel. napi-rs cannot consume the current sync-over-async public surface because `block_on` / `block_in_place` inside a napi-rs tokio worker thread triggers nested-runtime panics or starves the worker pool.

Approach: publish the async infrastructure that already exists inside the kernel (e.g. `StreamingCloudFetchProvider::next_batch`, `#[async_trait] DatabricksClient`, the existing `CancellationToken` plumbing, memory-bounded concurrency via `chunks_in_memory: AtomicUsize`, and the `TokenStore` Fresh/Stale/Expired state machine) as an additive public contract. No new async primitives are built — this is mostly a visibility / surface-area change.

Estimated ~700 LOC added, ~0 LOC deleted, across a 9-phase implementation plan.
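A minimal reproduction of the failure mode driving this design (not kernel code; `fetch_token` stands in for the kernel's async-internal auth path):

```rust
use tokio::runtime::Handle;

// Stand-in for the kernel's async-internal work (e.g. the auth path).
async fn fetch_token() -> String {
    "token".into()
}

// Shape of the bridge behind today's sync public surface.
fn get_auth_header_sync(kernel_rt: &Handle) -> String {
    // From a plain ODBC C/C++ thread this is fine: the thread can block.
    // From inside a #[napi] async fn, which already runs on a tokio worker
    // thread, block_on panics with a nested-runtime error, and the
    // block_in_place variant ties up the worker pool instead.
    kernel_rt.block_on(fetch_token())
}
```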
This PR is docs-only. Once aligned, each phase in §10 becomes a separate code PR.
Key decisions and alternatives considered
Inherent async methods on existing structs, parallel to sync ADBC trait impls — chosen. Both surfaces share client, session, and handle state. Rejected alternative: separate crate for the async surface (would require lifting shared types into a third core crate with arbitrary-feeling visibility boundaries).
New `AsyncResultReader` trait next to the existing `ResultReader` — chosen. The existing `ResultReader` feeds `ResultReaderAdapter` → Arrow's `RecordBatchReader`, which is a fundamentally sync iterator and cannot be async. A parallel trait is cleaner than retrofitting.

`RuntimeMode` enum (`Owned(Runtime)` | `Borrowed(Handle)`) — proposed; see the sketch after this list. ODBC continues to get the `Owned` path (today's behavior, zero change). napi-rs passes a `Handle` and gets `Borrowed` mode. Alternative: always take an optional `Handle`. Flagged as an Open Question for reviewer input.

`AsyncAuthProvider` as a new trait, not a replacement for `AuthProvider` — chosen. Every concrete auth type (PAT, OAuth M2M, OAuth U2M) implements both. The sync trait stays forever as the ADBC contract. Removes the `block_in_place + Handle::current().block_on(...)` bridge inside `get_auth_header` for async callers.

Gated behind an `async-api` Cargo feature, enabled by default — proposed. Open Question for reviewers.

Drop in Borrowed mode spawns cleanup onto the `Handle` and returns immediately — chosen. Avoids JS-thread freezes during V8 GC of napi-rs wrappers. Callers should prefer explicit `close_async().await` for deterministic cleanup.
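A hedged sketch of the `RuntimeMode` decision and the Borrowed-mode Drop behavior described above; variant names mirror the design, and the cleanup body is illustrative.

```rust
use tokio::runtime::{Handle, Runtime};

pub enum RuntimeMode {
    /// Kernel creates and owns its runtime: today's ODBC path, unchanged.
    Owned(Runtime),
    /// Kernel borrows the host's runtime: the napi-rs path.
    Borrowed(Handle),
}

impl RuntimeMode {
    /// Uniform access point for spawning work in either mode.
    pub fn handle(&self) -> Handle {
        match self {
            RuntimeMode::Owned(rt) => rt.handle().clone(),
            RuntimeMode::Borrowed(h) => h.clone(),
        }
    }
}

struct Connection {
    mode: RuntimeMode,
    // ... session state ...
}

impl Drop for Connection {
    fn drop(&mut self) {
        if let RuntimeMode::Borrowed(handle) = &self.mode {
            // Spawn-and-forget: never block the JS thread during V8 GC.
            // Callers wanting determinism use close_async().await instead.
            handle.spawn(async move {
                // release the server-side session (illustrative)
            });
        }
        // Owned mode: dropping the Runtime shuts down as it does today.
    }
}
```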
Rejected alternatives (§14 of the doc):

- Removing `block_on` from sync methods entirely — sync-only C/C++ callers have no runtime to enter.
- Making the public surface async-only; each language binding (napi-rs, PyO3, JNI) consumes the Rust-native async API directly.

Areas needing specific review focus
**ADBC / ODBC byte-identical guarantee (§8 of the doc).** Does the plan actually leave ODBC behavior unchanged? The existing `databricks-odbc` integration test suite is the proposed tripwire — is that sufficient, or do we need additional gates?

**Runtime ownership model (§4.3, §6.5).** Is the `RuntimeMode` enum the right abstraction, or would an optional `Handle` (always Borrowed when provided, always Owned otherwise) be simpler?

**`AsyncResultReader` trait shape (§5.4; sketched below).** Does `schema()` / `next_batch()` / `cancel()` cover every realistic consumer need, or are we missing things like row-count hints, schema metadata propagation, or partition iteration?

**Drop semantics (§6.4).** Is spawn-and-forget in Borrowed mode acceptable, or do we need a stronger guarantee (e.g., a reaper task that awaits all outstanding cleanups on kernel shutdown)?

**Feature flag default (§9).** Enable `async-api` by default, or opt-in?

**Upstream coordination / rollout.** This design benefits from a discussion with kernel maintainers before the first code PR lands. Who owns that conversation, and on what timeline?
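For reference while reviewing the §5.4 question above, a sketch of the trait shape in the RPITIT style adopted in §20. Only the three method names come from the design; the error type and bounds are placeholders.

```rust
use std::future::Future;

use arrow_array::RecordBatch;
use arrow_schema::SchemaRef;

pub struct KernelError; // placeholder for the kernel's error type

pub trait AsyncResultReader: Send {
    /// Result-set schema, available before the first batch arrives.
    fn schema(&self) -> SchemaRef;

    /// Next batch, or Ok(None) once the stream is exhausted.
    fn next_batch(&mut self)
        -> impl Future<Output = Result<Option<RecordBatch>, KernelError>> + Send;

    /// Best-effort cancellation of the server-side query and in-flight fetches.
    fn cancel(&mut self)
        -> impl Future<Output = Result<(), KernelError>> + Send;
}
```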
Notes
The source repo (msrathore-db/databricks) is not GitHub-linked as a fork, so a cross-repo PR wasn't possible.

Co-authored-by: Isaac