Skip to content

feat(capi): C ABI for distance, VectorStore, HNSW, MMap#18

Open
tsvet01 wants to merge 12 commits into
mainfrom
docs/cpp-c-api-plan
Open

feat(capi): C ABI for distance, VectorStore, HNSW, MMap#18
tsvet01 wants to merge 12 commits into
mainfrom
docs/cpp-c-api-plan

Conversation

@tsvet01
Copy link
Copy Markdown
Collaborator

@tsvet01 tsvet01 commented May 29, 2026

Summary

Adds a shippable C ABI (extern "C", symbol prefix vanedb_cpp_) over the header-only C++ core, built as a static library libvanedb_cpp_capi.a behind a new VANEDB_BUILD_CAPI CMake option (default OFF). This is Plan 1 of 3 toward vanedb/vanedb-bench, the cross-language benchmark harness — it lets a Rust driver link the C++ implementation and call it through the same C ABI as the Rust implementation (vanedb_rs_*).

Includes the implementation plan doc (docs/superpowers/plans/2026-05-28-cpp-c-api.md).

What's exposed

  • Distance: vanedb_cpp_l2_sq / _cosine_distance / _dot_product
  • VectorStore: _store_new / _add / _search / _free
  • HNSW: _hnsw_new (seeded) / _add / _search (per-call ef_search) / _save / _load / _free
  • MMap: _mmap_build / _open / _search / _free

Design invariants (verified in review)

  • Exception safety: every wrapper that can throw is try/catch(...)-guarded — no C++ exception crosses the C boundary.
  • Null-handle safety: handle deref is guarded before the reinterpret_cast (a null deref is UB that catch(...) can't catch).
  • Ownership: every _new/_open/_load (.release()d) is balanced by one _free.
  • ABI parity: non-const handles + per-call ef_search are intentional, to stay byte-identical to the parallel Rust ABI. Documented in the header.

Testing

  • C smoke test tests/capi/test_capi.c (compiled as C, linked C++) exercises every function incl. HNSW save/load round-trip and negative paths.
  • Built TDD, one wrapper family per commit, each spec- + quality-reviewed.
  • Notable fix: the final holistic review caught that the test no-op'd under Release -DNDEBUG (asserts compiled out). Fixed by stripping NDEBUG from C release flags (matching the existing C++ treatment) and hoisting all calls out of assert(). Verified by negative control (a wrong assert now fails Release ctest) and by .bin artifacts appearing in Release.

Verification

cmake -B build-capi -DCMAKE_BUILD_TYPE=Release -DVANEDB_BUILD_CAPI=ON
cmake --build build-capi --target test_capi --parallel
ctest --test-dir build-capi -R capi --output-on-failure   # 1/1 passed: enum/distance/store/hnsw/mmap OK

Default builds are unaffected (VANEDB_BUILD_CAPI is OFF; the C flags only affect the lone C test source).

Follow-up (not in this PR)

  • Wire VANEDB_BUILD_CAPI=ON into a CI job so the C API is continuously guarded — most natural once Plan 3's harness consumes it.

🤖 Generated with Claude Code

@claude
Copy link
Copy Markdown

claude Bot commented May 29, 2026

Code Review: feat(capi): C ABI for distance, VectorStore, HNSW, MMap

Overall this is a clean, well-considered implementation. The exception-boundary safety, null-handle guards, and ABI-parity rationale are all well-handled. A few real issues and suggestions below.


Bugs / Correctness

set_ef_search side-effect is an observable mutation, not just a thread-safety concern
vanedb_cpp_hnsw_search calls idx->set_ef_search(ef_search) before search, which permanently mutates the handle. If a caller does:

vanedb_cpp_hnsw_search(h, q1, k, 200, ...);   // ef_search now 200
vanedb_cpp_hnsw_search(h, q2, k, 50,  ...);   // ef_search now 50

that's fine. But if search itself throws (caught by the outer catch), ef_search is still modified. The handle is left in a mutated state even though the call "failed" (returned 0). For a single-threaded benchmark this is harmless, but it's an inconsistency worth noting in the header comment alongside the thread-safety note.

to_metric silently maps any invalid enum value to L2
In C, callers can pass any int where vanedb_metric is expected. The default: return DistanceMetric::L2 branch is silent—no indication that the input was bad. Consider at minimum documenting this in the header: "undefined metric values are treated as L2."

Test binary artifacts are never cleaned up
capi_hnsw.bin and capi_mmap.bin are written to the ctest working directory and never removed. On repeated runs they persist, which could mask a broken _load (it loads the file from a previous run). Consider using cmake_tests's WORKING_DIRECTORY to scope artifacts, or add --cleanup steps after the test.


Design / Interface Concerns

target_compile_features changed from PUBLIC to PRIVATE
The plan doc specifies PUBLIC cxx_std_20; the actual CMakeLists uses PRIVATE. PRIVATE is the right call here (consumers only see the C header, not the C++ internals), but it's worth confirming no downstream target_link_libraries(... vanedb_cpp_capi) consumer expects the feature to propagate.

No VANEDB_CAPI_EXPORT macro
This is intentionally a static lib today, but if it ever needs to ship as a shared lib (e.g., for dynamic loading in the benchmark harness), every function will need visibility annotations. A no-op #define VANEDB_CAPI_EXPORT in the header now costs nothing and avoids a future ABI-breaking header change.

No ABI version constant in the header
The comment says "C API v0 — unstable until a tagged release" but there's no #define VANEDB_CAPI_VERSION 0. A version constant lets consuming code #if VANEDB_CAPI_VERSION >= 1 guard future additions without requiring out-of-band versioning.


Test Coverage Gaps

  • No negative path for vanedb_cpp_hnsw_load with a nonexistent or corrupt path — the most likely production failure mode.
  • No test for vanedb_cpp_mmap_build with n=0 (empty store).
  • vanedb_cpp_store_free(NULL) / vanedb_cpp_hnsw_free(NULL) / vanedb_cpp_mmap_free(NULL) are all valid (delete nullptr is defined), but none are exercised. Since you already test null-handle guards for add/search/save, completing the set here is low effort.
  • The test only exercises VANEDB_L2; no smoke pass for VANEDB_COSINE or VANEDB_DOT through the store or HNSW path.

Minor / Style

  • capi/vanedb_capi.cpp:99vanedb_cpp_store_add has the null guard, but the function body pattern differs slightly from hnsw_add: the try/catch wraps the null check in hnsw but the null check is outside the try in store. This is functionally equivalent but inconsistent. Prefer keeping the null check outside the try for clarity (current store style is better).
  • The plan document at docs/superpowers/plans/2026-05-28-cpp-c-api.md is useful as an internal artifact but may surprise external contributors who mistake it for public documentation. Worth a note in the PR about whether to keep it in-tree vs. link from a wiki/issue.

What's done well

  • Exception-to-error translation at every boundary — no leakage possible.
  • Null-handle guards on all mutating/output-writing calls.
  • target_compile_features PRIVATE cxx_std_20 correctly keeps C++ requirements from leaking to C consumers.
  • The enable_language(C) placement is correct (inside VANEDB_BUILD_CAPI) with a clear comment.
  • .release() on the unique_ptr from HNSWIndex::load is the right ownership transfer pattern.
  • C flags (CMAKE_C_FLAGS_*) matching the existing C++ flags is the right fix for the NDEBUG/assert issue.
  • The seed truncation (uint64_tuint32_t) is clearly documented at the call site.

🤖 Generated with Claude Code

@claude
Copy link
Copy Markdown

claude Bot commented May 29, 2026

Code Review: feat(capi): C ABI for distance, VectorStore, HNSW, MMap

Overall: Well-designed C ABI wrapper. Exception safety, null-handle guards, and the ABI comment block are solid. A few correctness gaps and one latent bug worth addressing before the bench harness lands.


Bugs / Correctness

[High] Null path parameters can SIGSEGV through the catch(...)

vanedb_cpp_hnsw_save, vanedb_cpp_hnsw_load, vanedb_cpp_mmap_build, and vanedb_cpp_mmap_open all accept a const char* path but pass it straight to C++ file operations without a null check. If a caller passes NULL, the OS will raise SIGSEGV — a signal, not a C++ exception — and catch(...) will not intercept it. The process dies instead of returning a safe error code.

// vanedb_capi.cpp — add to each path-taking function before the try block:
if (!path) return 1; // or return nullptr for load/open

The null-handle pattern already applied to store/hnsw/mmap handles should be extended to path parameters.

[Medium] vanedb_cpp_hnsw_search result count is uncapped

The wrapper writes res.size() elements into caller-supplied out_ids/out_dists of length k. The underlying HNSWIndex::search should never return more than k results, but the wrapper has no defensive cap:

size_t count = std::min(res.size(), k); // add this
for (size_t i = 0; i < count; ++i) { ... }
return count;

Same applies to store_search and mmap_search. Without this, a buggy core change becomes a silent buffer overflow at the ABI boundary.


Design Notes

[Minor] set_ef_search + search is a two-step mutation

The header comment documents this correctly ("not thread-safe to call concurrently with different ef_search values"). For the intended single-threaded bench consumer, this is fine. Worth flagging that if this ABI is later used in multi-threaded contexts, the atomicity gap will be a real hazard. A future v1 might prefer search(q, k, ef_search) at the core level.

[Minor] uint64_t seed silently truncated to uint32_t

The truncation comment in the .cpp is good. A note in the header's ABI block would also help callers who read only the header:

// seed: only the low 32 bits are used; high bits are silently dropped
vanedb_cpp_hnsw* vanedb_cpp_hnsw_new(..., uint64_t seed);

[Nit] cxx_std_20 is correctly PRIVATE

The plan doc says PUBLIC but the PR correctly uses PRIVATE — consumers link via the C header, not C++. The PR's choice is right.


CMake

[Minor] C compiler flags are set globally

The CMAKE_C_FLAGS_* additions apply project-wide, not just to the C test. This is harmless today (the only C source is the test), but it's surprising for a C++ project. Prefer target-level flags:

target_compile_options(test_capi PRIVATE -O3)

Or at minimum, scope the flag assignment inside the if(VANEDB_BUILD_CAPI) block with a comment explaining why.


Testing

Test artifacts are not cleaned up

capi_hnsw.bin and capi_mmap.bin are created in CWD (the build dir when ctest runs). They're not deleted after the test. This is fine for a smoke test, but worth noting — parallel ctest runs or stale files from a previous run could potentially cause interference. Consider using WORKING_DIRECTORY in add_test and a fixed temp path, or note it explicitly.

Negative paths are well-covered — dim=0 constructors, null handles across all types. Good work catching the NDEBUG strip issue.


CI Gap (acknowledged)

No CI job exercises VANEDB_BUILD_CAPI=ON. Given that the bench harness (Plan 3) is the consumer, wiring CI then is reasonable. Just note that any refactor of HNSWIndex, VectorStore, or MMapVectorStore interfaces between now and then could silently break the C API with no CI signal.


Summary

Area Status
Exception safety at C boundary ✅ Solid
Null handle guards ✅ Done
Null path guards ❌ Missing — can SIGSEGV
Output buffer bounds ❌ Uncapped — latent overflow
ABI documentation ✅ Clear and accurate
Test coverage ✅ Good (positive + negative paths)
CI coverage ⚠️ Deferred (acknowledged)

The null-path and output-buffer issues are the blockers to address. Everything else is polish.

@claude
Copy link
Copy Markdown

claude Bot commented May 30, 2026

Code Review — feat(capi): C ABI for distance, VectorStore, HNSW, MMap

Overall this is a clean, well-structured C ABI layer. The exception-boundary discipline, null-handle guards, and documented ABI conventions are all solid. A few things worth addressing before merge, plus some minor notes.


Issues

[major] No CI coverage for VANEDB_BUILD_CAPI

The PR acknowledges this as follow-up, but it's worth calling out: the C API is entirely untested in CI. Any future change to the C++ core (distance functions, HNSW save format, VectorStore behavior) could silently break the C ABI without any signal. The PR description suggests wiring a CI job once the bench harness (Plan 3) consumes it — that's acceptable for now, but I'd recommend tracking this as a concrete task rather than an open-ended follow-up.

[minor] hnsw_search mutates index state non-atomically

// capi/vanedb_capi.cpp:135-137
idx->set_ef_search(ef_search);
auto res = idx->search(q, k);

set_ef_search mutates the index before the search. Two concurrent callers on the same handle with different ef_search values will race. The header documents this as intentionally not thread-safe for the single-threaded benchmark consumer — that's fine — but this design means the C ABI can never be safely used for concurrent search on a shared handle without external locking. Worth a one-line note in the header that callers requiring concurrent search should use separate handles per thread.

[minor] uint64_t seed silently truncated to uint32_t

// capi/vanedb_capi.cpp:122
new HNSWIndex(dim, to_metric(metric), capacity, M, ef_construction,
              static_cast<uint32_t>(seed)));

The truncation is documented in the header and in a code comment, which is the right approach given the Rust ABI parity constraint. One additional suggestion: the header comment says "only the low 32 bits are used" — it might be worth adding that seeds ≥ 2³² will silently differ between C++ and any future Rust consumer that passes the full 64-bit value, since the Rust side may use the full seed. If both sides only use the low 32 bits, great — confirm that's the case in vanedb_rs_*.

[minor] to_metric silently maps unrecognized values to L2

// capi/vanedb_capi.cpp:77
default: return DistanceMetric::L2;

Documented in the header. For a closed enum (VANEDB_L2/COSINE/DOT) this is fine, but if vanedb_metric gains a new value in a future update and a consumer passes it before recompiling the C++ side, they'll silently get L2 semantics. This is a known tradeoff of the design, just worth acknowledging.


Nits

target_compile_features is PRIVATE in the actual CMakeLists.txt but PUBLIC in the plan doc

# CMakeLists.txt:38 (actual)
target_compile_features(vanedb_cpp_capi PRIVATE cxx_std_20)

# docs/superpowers/plans/... (plan doc, Task 1 Step 4)
target_compile_features(vanedb_cpp_capi PUBLIC cxx_std_20)

PRIVATE is correct here — C++20 is an implementation detail, and the public header is pure C. The plan doc has a stale discrepancy; not a problem in practice but worth noting if the plan doc is used as a reference for follow-up work.

Test writes temp files to the working directory without cleanup

capi_hnsw.bin and capi_mmap.bin are created during the test but never removed. In CI (ephemeral build dir) this doesn't matter, but locally they'll accumulate. If ctest_build_dir is under version control, these could show up as untracked files. Consider using CMAKE_CURRENT_BINARY_DIR-prefixed paths via a WORKING_DIRECTORY property on the test, or add a FIXTURES_CLEANUP ctest fixture.

enable_language(C) inside nested conditionals

# CMakeLists.txt:42
enable_language(C)

This is inside if(VANEDB_BUILD_CAPI) + if(VANEDB_BUILD_TESTS). Calling enable_language() after project() in a conditional block is technically supported in CMake 3.20+ but is uncommon and can cause generator-expression issues in some configurations. Works fine in the current setup; just an unusual pattern to be aware of.

Plan doc contains agent scaffolding instructions

docs/superpowers/plans/2026-05-28-cpp-c-api.md opens with:

REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development...

These are implementation-process artifacts, not contributor documentation. They're harmless in the repo, but a future contributor reading this file expecting design rationale will find agent workflow instructions instead. Consider stripping the scaffolding preamble, or moving plan docs to a location that signals they're not normative reference material (e.g., .claude/plans/ or similar).


What's done well

  • Exception boundary: every wrapper that can throw is guarded with try/catch(...). No C++ exceptions cross the ABI.
  • Null-handle checks: all non-constructor functions guard against null before dereferencing.
  • Result buffer cap: size_t n = res.size() < k ? res.size() : k prevents writing past the caller's out_ids/out_dists buffers.
  • Opaque handle pattern: clean typedef struct approach; callers never touch the internals.
  • VANEDB_BUILD_CAPI default OFF: default builds are fully unaffected by this PR.
  • Test coverage: both happy-path and negative-path (null handles, dim=0 constructors, null path args) are covered in the C smoke test, compiled as C to verify actual C linkage.
  • ABI conventions documented: the block comment in vanedb_capi.h is the right place for this; good that it's there from day one.

@claude
Copy link
Copy Markdown

claude Bot commented May 30, 2026

Code Review: feat(capi): C ABI for distance, VectorStore, HNSW, MMap

Overall this is a clean, well-structured piece of work. The design invariants are sound, the exception-boundary discipline is good, and TDD with negative paths is exactly the right approach for a C ABI. A few things to address before merge:


Correctness

Distance functions lack try/catch (capi/vanedb_capi.cpp, lines 121–128)

The PR's stated design invariant is "every wrapper that can throw is try/catch-guarded," but the three distance functions have no guard:

float vanedb_cpp_l2_sq(const float* a, const float* b, size_t dim) {
  return l2_sq(a, b, dim);   // no try/catch
}

The underlying SIMD code is pure float arithmetic and almost certainly won't throw today, but if a future change adds bounds checking or allocation to the hot path, an uncaught C++ exception crossing the C boundary is undefined behavior. Wrap these the same way as the rest, or add a comment explicitly accepting the risk.

set_ef_search mutation inside vanedb_cpp_hnsw_search is a structural thread-safety problem

idx->set_ef_search(ef_search);   // mutates shared state
auto res = idx->search(q, k);    // then reads

The header documents this as single-threaded only, which is fine for the benchmark consumer. But the mutation is baked into the handle — a later caller who does concurrent searches on separate handles shares no state, but two callers racing on the same handle will corrupt ef_search silently. Consider whether the C++ core can accept ef_search as a search parameter instead of storing it on the object. If the Rust ABI parity constraint locks you into the current per-call parameter shape, at least make set_ef_search + search atomic under a lock, or call it out explicitly in hnsw_index.h as a known thread-unsafe path (not just in the C header).

vanedb_cpp_mmap_build can leave a partial file on failure

If b.add() succeeds for 5000 vectors and then throws, the catch (...) returns 1, but the file at path may already exist with partial data. A caller that retries after a non-zero return will open a corrupt store. Consider removing the file in the catch block, or document that the caller must remove it.


Design / API

to_metric silently maps unknown values to L2

case VANEDB_L2:
default:  return DistanceMetric::L2;

This is documented in the header, but silent fallback to L2 when the caller passes an invalid metric (e.g., (vanedb_metric)99) can cause correct-looking results that are subtly wrong. Distance functions that don't crash are the worst kind of bug in a benchmark harness. Since this is an internal API used by a controlled harness, it's probably fine — but consider returning an error (or even assert(false) in debug builds) for the out-of-range case rather than silently defaulting.

Public include path exposes C++ headers to consumers of the static lib

target_include_directories(vanedb_cpp_capi PUBLIC
  ${CMAKE_CURRENT_SOURCE_DIR}/src    # <-- exposes all C++ core headers
  ${CMAKE_CURRENT_SOURCE_DIR})

The src/ path is PUBLIC, so anyone linking vanedb_cpp_capi gets the C++ headers on their include path. The benchmark harness (a Rust crate) only needs capi/vanedb_capi.h. Change src to PRIVATE — the .cpp impl file needs it, but downstream C consumers should not.


Test coverage

dim=0 negative tests assume the C++ constructors throw — is this guaranteed?

vanedb_cpp_store* s_bad = vanedb_cpp_store_new(0, VANEDB_L2);
assert(s_bad == NULL);

This test passes only if VectorStore(0, ...) throws. If the core silently accepts dim=0 and produces a zero-capacity store, the assert will fail with a misleading message. Add a comment citing the specific C++ behavior being relied on (or add a static_assert / unit test in the C++ layer that dim=0 throws).

HNSW .bin cleanup is skipped on failure

assert(rc_save == 0);
vanedb_cpp_hnsw_free(h);
// ... later ...
remove("capi_hnsw.bin");

If an earlier assert fires, remove() is never called, leaving capi_hnsw.bin in the test working directory. Not critical for CI (ephemeral runners), but can cause false failures on developer machines when re-running tests. Consider a RAII-style cleanup or just call remove() at the top of the block before the test runs.


Minor / nits

  • using namespace vanedb; in the .cpp file — fine here (not a header), just noting it for reviewers from other codebases.
  • Planning doc in-tree (docs/superpowers/plans/) — the comment says "internal planning artifact / historical record," which is fine, but 534 lines of step-by-step TDD instructions are a lot of noise in git log --follow. If the team wants to keep plans in-tree, a docs/internal/ or docs/adr/ structure would scale better.
  • CI job only on ubuntu-latest — acknowledged in the PR as a follow-up, but worth flagging that the MSVC path (Windows) is completely untested for the C API. MSVC's __declspec(dllexport) semantics and its C runtime are the most common source of C ABI bugs across platforms.

What's done well

  • Exception boundary discipline: every complex wrapper has try/catch(...), with null-handle guards placed before the cast (not inside the try, which wouldn't catch null-deref UB). This is the right order.
  • Ownership model: HNSWIndex::load() returns unique_ptr; .release() correctly transfers ownership to the C caller, and _free does the matching delete. No leak.
  • ABI comment block in the header: the documentation of non-const handles, per-call ef_search, seed truncation, and thread-safety is better than most C API headers in the wild.
  • Compiled-as-C test: test_capi.c is compiled by the C compiler (not C++), which is the correct way to validate that the header is valid C and that symbols link correctly. Good.
  • VANEDB_BUILD_CAPI=OFF default: existing builds are completely unaffected.

Summary: Three correctness items (missing try/catch on distance functions, potential corrupt file on mmap_build failure, ef_search thread-safety model) and one CMake visibility bug (PUBLICPRIVATE on src/) are worth fixing before merge. The rest are suggestions.

@claude
Copy link
Copy Markdown

claude Bot commented May 30, 2026

Code Review — feat(capi): C ABI for distance, VectorStore, HNSW, MMap

Overview

Clean addition of a extern \"C\" shim layer over the header-only C++ core. The design (opaque handles, reinterpret_cast, catch-all exception boundary, _new/_free ownership) is the right pattern for this kind of C++ → C wrapping. ABI-parity reasoning with the Rust side is well-documented.


Potential Bug: CI missing -DVANEDB_BUILD_TESTS=ON

The CI configure step is:

cmake -B build-capi -DCMAKE_BUILD_TYPE=Release -DVANEDB_BUILD_CAPI=ON -DVANEDB_BUILD_BENCHMARKS=OFF ...

But in CMakeLists.txt, test_capi is only registered when VANEDB_BUILD_TESTS is ON:

if(VANEDB_BUILD_TESTS)
  add_executable(test_capi ...)

If VANEDB_BUILD_TESTS defaults to OFF, the next CI step (--target test_capi) will fail with "target not found." This needs -DVANEDB_BUILD_TESTS=ON in the configure command, or the test should be unconditionally registered when VANEDB_BUILD_CAPI=ON (since the test only exercises the C API, not the broader suite).


Code Quality

capi/vanedb_capi.cpp — null-check inconsistency in search functions

The _add functions guard against a null handle, but the _search functions only guard the handle, not the output buffer pointers:

size_t vanedb_cpp_store_search(vanedb_cpp_store* s, const float* q, size_t k,
                               uint64_t* out_ids, float* out_dists) {
  if (!s) return 0;
  // ...writes to out_ids[i] and out_dists[i] without checking for NULL

If a caller passes k > 0 with out_ids = NULL, this is a crash. The header says "caller-owned buffers of length k" but a defensive guard (or at least a k == 0 early-return) would be consistent with how null handles are treated elsewhere. Same applies to hnsw_search and mmap_search.

to_metric silent fallthrough

case VANEDB_L2:
default: return DistanceMetric::L2;

Mapping any unrecognized metric value to L2 is documented in the header, but it silently swallows invalid inputs. Consider whether VANEDB_L2 and default should stay merged — separating them makes it easier to add a future error return if the ABI version allows it.

set_ef_search mutating shared index state

vanedb_cpp_hnsw_search calls idx->set_ef_search(ef_search) before each search, which mutates HNSWIndex state. The header documents the thread-safety caveat clearly ("not thread-safe to call concurrently with different ef_search values on the same handle"). This is fine for a single-threaded benchmark consumer — just noting it's an implicit contract that the ABI documentation needs to keep prominent as the handle gets shared more widely.

uint64_t seeduint32_t truncation

The comment in the .cpp is good:

// seed is uint64_t in the ABI (Rust parity) but the core takes uint32_t; high bits are dropped.

Since cross-implementation graphs differ by RNG regardless, this doesn't affect correctness. But the truncation is a future footgun if HNSWIndex ever upgrades to a 64-bit seed — worth adding static_assert(sizeof(seed) >= sizeof(uint32_t)) or a note in the header to revisit when the core changes.


Test Coverage

The smoke test is solid: enum values, all three distance functions with exact expected results, store/HNSW/mmap round-trips, HNSW save/load, and null-handle negative paths. A few gaps:

  • vanedb_cpp_store_free(NULL) and vanedb_cpp_hnsw_free(NULL) are not tested. delete nullptr is safe in C++, but it's worth having a line in the null-path block to confirm the C API doesn't crash on free(NULL) (which callers may reasonably try).
  • The negative path for vanedb_cpp_store_new(0, VANEDB_L2) expecting NULL relies on VectorStore throwing on dim=0. This assumption is implicit — a comment linking it to the expected exception would help.
  • No test for vanedb_cpp_mmap_build with a non-writable path (e.g., "" or /nonexistent/path.bin). The null-path guards section tests NULL path but not an invalid-but-non-NULL path.

CMake / Build

  • C language declaration: Adding LANGUAGES CXX C to project() is correct and needed for the set_target_properties(test_capi PROPERTIES LINKER_LANGUAGE CXX) to work reliably cross-platform.
  • CMAKE_C_FLAGS_* mirroring the C++ flags: Correct; this ensures the C test gets the same -O3/no-NDEBUG treatment as the C++ code.
  • install() placement: The install(TARGETS vanedb_cpp_capi ...) and install(FILES capi/vanedb_capi.h ...) are inside if(VANEDB_BUILD_CAPI), which is correct — the header shouldn't be installed when the lib isn't built.
  • Cache key: hashFiles('CMakeLists.txt') for capi deps is correct.

Planning Document (docs/superpowers/plans/2026-05-28-cpp-c-api.md)

This is 534 lines of task-by-task implementation scaffolding committed as a repository artifact. The file itself labels itself "Internal planning artifact — historical record." A few considerations:

  1. Repo maintenance posture: CLAUDE.md lists what this repo accepts (bug fixes, perf work, platform support, format additions) — internal planning docs aren't in that list. If docs/superpowers/ grows over time, it becomes non-user-facing noise in the tree.
  2. The plan contains intermediate code that differs from the final implementation (e.g., Task 3 Step 3 has vanedb_cpp_store_add without the null guard; final code added it). This means the plan is already partially inaccurate as a "historical record."

Suggestion: keep it in the PR description / commit messages rather than the tree, or move to a docs/internal/ path with a note that these are ephemeral. Not a blocker, but worth a team decision.


Summary

Category Finding Severity
Bug CI configure missing -DVANEDB_BUILD_TESTS=ON (test target may not exist) High
Code quality No null-check for out_ids/out_dists in search functions Medium
Code quality to_metric default-to-L2 swallows invalid enum values silently Low
Tests free(NULL) not exercised; mmap bad-path only tests NULL not invalid path Low
Docs Planning artifact in tree is already stale vs. final code Low

The C ABI design itself is correct and well-considered. The CI test-target issue is the one thing that should be verified/fixed before merge.

@claude
Copy link
Copy Markdown

claude Bot commented May 30, 2026

Code review posted below (written in plain text to avoid shell escaping issues).

@claude
Copy link
Copy Markdown

claude Bot commented May 30, 2026

Code Review: feat(capi) C ABI for distance, VectorStore, HNSW, MMap

Overview

This PR adds a shippable extern "C" surface over the header-only C++ core, built as an opt-in static library (VANEDB_BUILD_CAPI=OFF by default). The design rationale (ABI parity with Rust, benchmark harness consumption) is well-motivated and the implementation is mostly clean. A few issues worth addressing before merge.


Issues

1. Thread-safety footgun in vanedb_cpp_hnsw_search (correctness risk)

set_ef_search + search is not atomic. Two concurrent callers on the same handle will race on ef_search. The header block comment documents this, but I'd suggest also adding a short /* NOT THREAD-SAFE: see ABI conventions */ inline comment directly on the vanedb_cpp_hnsw_search declaration in the header, so callers see it at the call site. The longer-term fix is to pass ef_search directly into search() in the C++ core.

2. CI job does not pin VANEDB_BUILD_TESTS (latent failure risk)

The test target is gated on VANEDB_BUILD_TESTS, but the CI configure step does not pass -DVANEDB_BUILD_TESTS=ON. This works today because that option presumably defaults to ON. If a future PR changes the default (e.g. to OFF for embedded consumers), this CI job silently builds nothing and passes vacuously. Explicitly adding -DVANEDB_BUILD_TESTS=ON to the configure step fixes this.

3. out_ids/out_dists null pointers not guarded

In vanedb_cpp_store_search, vanedb_cpp_hnsw_search, and vanedb_cpp_mmap_search, if the caller passes k > 0 but null output buffers, the function writes through a null pointer. A segfault is not a C++ exception so catch(...) will not save it. Adding if (!out_ids || !out_dists) return 0; early in each search function is consistent with how null handles are already guarded.


Minor Suggestions

4. Plan document in the repodocs/superpowers/plans/2026-05-28-cpp-c-api.md is 534 lines labeled "internal planning artifact / not user-facing." Consider keeping it in the PR description or a wiki, or moving it to a clearly marked internal path. CLAUDE.md does not describe a docs/superpowers/ convention for this repo.

5. Seed truncation comment placement — The uint64_tuint32_t comment in the .cpp is good. Mirror it on the vanedb_cpp_hnsw_new declaration in vanedb_capi.h so callers see it at the call site.

6. to_metric silent fallback to L2 — Documented as intentional in the header, fine for v0. Worth revisiting when the ABI stabilizes, as an uninitialized/wrong metric silently becomes L2.

7. target_include_directories PUBLIC scope leaks internalssrc/ being PUBLIC transitively exposes internal headers to consumers. Consider PRIVATE for src/ (needed only to compile the .cpp) and INTERFACE for the project root (needed to resolve capi/vanedb_capi.h).


What is Done Well

  • Exception boundary is airtight. Every wrapper that can throw is inside try/catch(...), null handles are guarded before the reinterpret_cast, the ordering is correct throughout.
  • Smoke test is thorough. Covers all functions, positive and negative paths (null handles, null paths, dim=0 constructors). The NDEBUG fix was caught and proven by the test.
  • Default-OFF flag. VANEDB_BUILD_CAPI=OFF ensures existing builds are unaffected. CI covers Linux, Windows, and macOS.
  • ABI conventions documented. Non-const handle rationale and ef_search threading caveat are documented in the header where callers will read them.
  • File cleanup. Test artifacts removed with remove(). build-capi/ added to .gitignore.

Summary

Issues 1–3 are worth fixing before merge: issue 2 is a one-liner, issue 3 is a few lines matching existing patterns, and issue 1 is a comment addition at the declaration site. The rest is minor polish. Overall a solid C ABI implementation — exception-to-error boundary, null-guard discipline, and smoke test coverage are all well executed.

tsvet01 and others added 12 commits May 30, 2026 12:01
TDD plan for the vanedb-cpp extern "C" surface (distance, VectorStore,
HNSW, MMap) consumed by vanedb-bench. Plans 2 (Rust vanedb-capi) and 3
(bench harness) follow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds capi/vanedb_capi.h with the full extern "C" contract (distance, VectorStore,
HNSW, MMap), an empty-bodied capi/vanedb_capi.cpp with metric mapping, a C smoke
test (tests/capi/test_capi.c) that validates enum values, and CMake wiring for
the VANEDB_BUILD_CAPI=ON static library target. Also adds build-capi/ to .gitignore
and enables the C language when VANEDB_BUILD_CAPI+VANEDB_BUILD_TESTS are both ON
(project declares LANGUAGES CXX only).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add vanedb_cpp_mmap_build/_open/_search/_free wrappers; transient
MMapVectorStoreBuilder in _build, MMapVectorStore* behind opaque handle.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The test put load-bearing calls inside assert(), which are compiled out
under -DNDEBUG. The project strips NDEBUG from CXX release flags but not
C, so the C test silently no-op'd in Release. Fix: set CMAKE_C_FLAGS_RELEASE
without NDEBUG (matching CXX), and hoist all calls out of assert() so they
execute regardless of NDEBUG. Also document the hnsw seed narrowing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- null const char* path args reached fstream/mmap before any guard; a
  null path raises SIGSEGV (a signal, not a C++ exception) which catch(...)
  cannot intercept. Guard path on hnsw_save/load and mmap_build/open.
- cap _search copy loops at k (defense-in-depth; the core already caps,
  but the wrapper now enforces it too).
- document to_metric's L2-default and the hnsw seed narrowing in the header.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…, project(C)

- add a dedicated 'C API' CI job (configures VANEDB_BUILD_CAPI=ON, builds+runs the C smoke test)
- header: note per-thread handles for concurrent hnsw_search; clarify seed low-32 behavior
- declare C at project() level (LANGUAGES CXX C); drop the nested enable_language(C)
- test: remove capi_hnsw.bin/capi_mmap.bin artifacts after use
- plan doc: fix stale PUBLIC->PRIVATE snippet; mark as internal artifact
MSVC is exactly where the C API's C-source-linked-against-C++-static-lib
pattern (LINKER_LANGUAGE CXX) can diverge, so a Linux-only smoke test
gives false cross-platform confidence. Matrix it; --config Release for
the MSVC multi-config generator.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ctest uses -C/--build-config; --config is a cmake --build flag. The
build step was correct; the ctest step errored 'Unknown argument:
--config' on all three OSes. --build-config is needed by MSVC's
multi-config generator and ignored by single-config Unix generators.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@tsvet01 tsvet01 force-pushed the docs/cpp-c-api-plan branch from e3a61bd to e97f177 Compare May 30, 2026 11:02
@claude
Copy link
Copy Markdown

claude Bot commented May 30, 2026

Code Review: feat(capi): C ABI for distance, VectorStore, HNSW, MMap

Overall: This is a well-structured C ABI layer. The three design invariants called out in the PR description (exception safety, null-handle safety, ownership) are all properly implemented and documented. Some specific points below.


Correctness

capi/vanedb_capi.cpp

  • vanedb_cpp_hnsw_search: non-atomic set_ef_search + search — The header documents that concurrent calls with different ef_search values are unsafe, but the real hazard is broader: set_ef_search and search aren't atomic, so any concurrent call on the same handle through this path races on the ef_search field. The comment should say the handle is not safe for any concurrent access (not just "different ef_search values"), so consumers don't infer that same-ef_search concurrent reads are OK.

  • vanedb_cpp_store_add null-check placementif (!s) return 1 is correct and outside the try block, which is fine. But vanedb_cpp_hnsw_add and vanedb_cpp_hnsw_search take the same approach — consistent and good.

  • No null check on vector data pointers (v, q) — Callers passing nullptr for vector data would crash inside the core. For a low-level C API this is arguably caller responsibility, but a brief note in the header would set expectations. Alternatively, guard with if (!v) return 1 / if (!q) return 0 at the wrapper layer before the try block.

  • to_metric silent default — The silent fallback to L2 for unrecognized metric values is documented in the header, which is good. One concern: this means a typo in a metric enum cast (e.g., passing 3 or a wrong cast from Rust) silently succeeds as L2. Fine for v0, but worth flagging.


CMake / Build

  • VANEDB_BUILD_TESTS implicit dependency in CI — The CI job (c-api) runs cmake --build build-capi --target test_capi but never passes -DVANEDB_BUILD_TESTS=ON. The test_capi target is only added inside if(VANEDB_BUILD_TESTS). This works if VANEDB_BUILD_TESTS defaults to ON, but it's fragile — if the default ever changes to OFF, the CI step silently breaks (the target won't exist and CMake will error on the missing --target). Suggest either explicitly passing -DVANEDB_BUILD_TESTS=ON in the CI configure step, or decoupling the C API test target from VANEDB_BUILD_TESTS so it's always present when VANEDB_BUILD_CAPI=ON.

  • NDEBUG stripping via CMAKE_C_FLAGS_RELEASE — Setting CMAKE_C_FLAGS_RELEASE = "-O3" (omitting CMake's default -DNDEBUG) is the right fix for keeping asserts alive in the test, and the PR description explains the rationale well. One note: this affects all future C files added to the project. Since the only C code today is the test, the risk is low, but a comment in CMakeLists.txt noting why -DNDEBUG is intentionally absent from C release flags would help the next developer who wonders why.

  • target_compile_features(vanedb_cpp_capi PRIVATE cxx_std_20)PRIVATE here is correct: consumers of the static lib don't need C++20. Good.


Testing

  • Coverage is solid — Every function family is exercised, including HNSW save/load round-trip and null-handle negative paths. The vanedb_cpp_store_new(0, ...) and vanedb_cpp_hnsw_new(0, ...) dim=0 tests verify the exception-to-NULL translation.

  • Missing: null vector data pathsvanedb_cpp_store_add(s, id, NULL) and vanedb_cpp_store_search(s, NULL, ...) aren't tested. Given the API doesn't validate these, a crash here would be silent in the smoke test.

  • No sanitizer coverage — The C test doesn't run under ASan/UBSan. The reinterpret_cast ownership path (especially hnsw_load releasing a unique_ptr) would benefit from a sanitizer pass to rule out double-free or leak on the failure paths. Noted as a follow-up but worth flagging.

  • Test artifact cleanupremove("capi_hnsw.bin") and remove("capi_mmap.bin") are present. Good. The HNSW bin is cleaned up at the end of the HNSW block but a test failure before remove() would leave the files behind — not a correctness issue for ctest but worth noting.


Documentation

  • docs/superpowers/plans/ — The 534-line planning document is explicitly labeled an internal artifact. It's thorough and clearly useful as an audit trail for the agentic dev workflow. One suggestion: the self-review checklist items at the end of the plan still have - [ ] unchecked checkboxes (the "executor steps") — if this document is meant as a historical record, marking those done would make it clearer this was fully implemented.

Minor

  • vanedb_cpp_store_free(nullptr) is safe (C++ delete nullptr is a no-op). Same for the other _free functions. Good.
  • The min(res.size(), k) cap in search wrappers is redundant if the core already returns ≤ k, but it's a safe defensive pattern.
  • The closing } of extern "C" in vanedb_capi.cpp could use a matching comment } // extern "C" for readability, consistent with the } // namespace style already used.

Summary: The implementation is correct and the invariants are sound. The two most actionable items before merge are (1) explicitly passing -DVANEDB_BUILD_TESTS=ON in the CI configure command so the test_capi target dependency is explicit, and (2) tightening the thread-safety documentation for hnsw_search to cover the full handle (not just the ef_search parameter). Everything else is polish.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant