DO-NOT-MERGE: RFC: NVMe-oF orchestrator coexistence — nvme-discoverd, ownership registry, and exclusion list#3442
RFCs submitted for review only. Not intended to be merged. Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Signed-off-by: Martin Belanger <martin.belanger@dell.com> Assisted-by: Claude:claude-sonnet-4-6 [Claude Code]
Registry:
- Show only the guarded udev rule in §4; remove the naive unguarded version that preceded it, since a skimming reader would copy the wrong one
- Aperiodic audit is now bidirectional: re-assert ownership for live controllers with missing entries, not only remove stale ones
- Document the residual TOCTOU race (dangerous direction) explicitly

Exclusion list:
- Fix Python match semantics: None/NULL caller params skip the comparison (same rule as C); they do not block a match
- Clarify host-iface= scope: only matches interface-pinned connections; manual controller= entries without host-iface= are not matched
- Add address normalization note (inet_pton/inet_ntop, not strcmp)
- connect-all --nbft is exempt from the exclusion list

Orchestrator coexistence:
- Scope "no D-Bus signal" to between orchestrators
- "No special-case logic required" scoped to disconnect logic in nvme-stas
- connect-all --nbft exemption noted in Tier 2 summary
- Add versioning note: these behaviors co-ship in nvme-cli 3.0 / nvme-stas 3.0; earlier pairings do not provide the guarantees

nvme-discoverd:
- Add RuntimeDirectoryPreserve=yes; without it systemd removes /run/nvme/discoverd on every stop/crash, destroying devid files needed by active ExecStop= lines and state files for crash recovery
- Add registry ownership check before adopting pre-existing connections: skip any controller owned by neither discoverd nor nbft
- Correct varlink -> D-Bus throughout: StartTransientUnit, StopUnit, RestartUnit, and JobRemoved are all org.freedesktop.systemd1.Manager
- Scope "never issues disconnects" to steady-state operation
- Clarify referral DC fizzle-out is intentional; add open questions: stale-cache aging (#2), shutdown vs. mounts (#3), NBFT root (#4)
- Minor: "three lines" -> "four lines" for --devid-file specifier; GPL-2.0-only (not -or-later); inline unit comments moved to their own lines (systemd unit syntax has no end-of-line comments)

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Assisted-by: Claude:claude-sonnet-4-6 [Claude Code]
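The two exclusion-list rules in this commit (a None caller parameter skips the comparison, and addresses are compared after inet_pton/inet_ntop normalization rather than as raw strings) can be illustrated with a small Python sketch. This is not the libnvme binding: the function names, the dict-based entry layout, and the `ADDRESS_KEYS` set are assumptions made for illustration only.

```python
import socket

# Hypothetical: which entry keys hold IP addresses that need normalization.
ADDRESS_KEYS = {"traddr", "host-traddr"}


def normalize_addr(addr):
    """Return the canonical text form of an IPv4/IPv6 address.

    Strings that are not IP addresses (for example FC WWNs) pass through
    unchanged.
    """
    for family in (socket.AF_INET, socket.AF_INET6):
        try:
            return socket.inet_ntop(family, socket.inet_pton(family, addr))
        except OSError:
            continue
    return addr


def field_matches(key, entry_value, caller_value):
    """Compare one field; a None caller value skips the comparison."""
    if caller_value is None:
        return True  # skipped, so it cannot block a match
    if key in ADDRESS_KEYS:
        return normalize_addr(entry_value) == normalize_addr(caller_value)
    return entry_value == caller_value


def entry_matches(entry, caller):
    """True when every field present in the entry matches the caller's params."""
    return all(field_matches(k, v, caller.get(k)) for k, v in entry.items())
```

With this sketch, an entry written as `2001:DB8:0:0:0:0:0:1` matches a caller passing `2001:db8::1`, which a plain string comparison would miss.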
Add §3.9 explaining the varlink/D-Bus situation for the systemd interface. Not depending on D-Bus was a design wish — systemd is migrating to varlink and new code should follow. However, varlink is not yet sufficient: io.systemd.Unit.StartTransient was only added in systemd v260 (March 2026), and StopUnit, RestartUnit, and ResetFailedUnit have no varlink equivalent at all. Since discoverd already depends on libsystemd for sd_event, using sd-bus (systemd's own D-Bus implementation, part of libsystemd) adds no new library dependency. The D-Bus usage is scoped to the systemd interface only; discoverd's own client-facing socket (§3.7) remains varlink-only. Add open question #5 to track the future migration once the varlink unit lifecycle API is complete. Signed-off-by: Martin Belanger <martin.belanger@dell.com> Assisted-by: Claude:claude-sonnet-4-6 [Claude Code]
Content fixes, verified against the kernel, systemd, libnvme, and nvme-stas sources where applicable:
- io.systemd.Unit.StartTransient is in systemd v261 (still unreleased), not v260 as previously stated; v261 still lacks StopUnit, RestartUnit, and ResetFailedUnit, strengthening the D-Bus rationale (§3.9)
- a unique-NQN DC gets the kernel's 5 s NVME_DEFAULT_KATO, not a dead session; --keep-alive-tmo=30 normalizes the DC kato regardless of NQN
- FC WWN traddr comparison is case-insensitive string equality; nothing strips colon separators
- TID hash field order now matches nvme-stas staslib/trid.py exactly
- atomic write protocol documented as mkstemp() with random suffix, matching the implementation
- disconnect-all confirmation: prompt on TTY for both --force and --owner; non-interactive invocations proceed without prompting

Design additions from review:
- transient units carry Before=nvme-discoverd.service so discoverd stops before shutdown-time disconnects, preventing reconnect attempts (and FC kickstart re-issue) against a shutting-down system
- NBFT entries can be Discovery Controllers, not just IOCs; the --owner nbft substitution applies to both unit types
- host-traddr= exclusion entries provide interface exclusion for RDMA and FC, where host-iface does not apply
- registry §4.4 covers both initramfs boot paths: NBFT and the FC kickstart (unowned until discoverd adopts them at startup)
- the recycled-devid stale-entry edge case requires the udev rule race and a libnvme bypass to coincide; normally the rule has already cleaned up

Readability: split large sections into topical subsections (registry §1/§4, exclusion §3/§7, discoverd §3.2, coexistence §5) and break up long paragraphs throughout. No content changes from restructuring.

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Assisted-by: Claude:claude-fable-5 [Claude Code]
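For readers unfamiliar with the mkstemp()-based atomic write mentioned above, the following Python sketch shows the general pattern. It assumes the usual write-to-temporary-then-rename sequence in the same directory; the commit message only states that mkstemp() with a random suffix is used, so the fsync and rename steps here are assumptions, and `atomic_write` is a hypothetical name rather than a libnvme function.

```python
import os
import tempfile


def atomic_write(path, data):
    """Write bytes to path so readers see either the old or the new content."""
    directory = os.path.dirname(path) or "."
    # mkstemp() creates a uniquely named file (random component) in the
    # same directory, so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(prefix=os.path.basename(path) + ".", dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # Assumed final step: an atomic rename over the destination.
        os.replace(tmp_path, path)
    except BaseException:
        # Do not leave a stray temporary file behind on failure.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

The random name matters when two writers race: each gets its own temporary file, and the last rename wins without either writer observing a half-written entry.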
Add §11 "Future Release: DC Retention Policy" documenting how discovery-derived configuration is retained after a discovery source becomes unavailable. Introduce `discovery-retention-time` (replacing the mDNS-only `zeroconf-stale-timeout`) as a unified parameter covering all dynamic discovery sources: mDNS, referral DCs, and FC kickstart DCs. Statically configured and NBFT-derived DCs are retried indefinitely — they represent explicit intent and are not subject to the retention timer. Introduce `fc-kickstart-interval-minutes` for periodic FC fabric probing, motivated by the equipment replacement scenario where a new DC may have a different address/NQN and can only be discovered via an active kickstart. Update §4 and §7.1 to note that FC kickstart has no TID cache in the initial release, but the retention policy will require one. Update §5 retry policy to reflect the static vs. dynamic DC distinction. Align §11.1 timer-start semantics with §11.2 (timer starts on source disappearance regardless of connection state). Signed-off-by: Martin Belanger <martin.belanger@dell.com> Assisted-by: Claude:claude-sonnet-4-6 [Claude Code]
Periodic FC kickstart is opt-in, not on by default. nvme-discoverd ships on all Linux systems — enabling periodic fabric probing by default on laptops and desktops with no FC infrastructure would be wasteful and surprising. Use 0 to disable (consistent with systemd's convention for interval parameters, e.g. WatchdogSec=0). 0 was previously invalid; infinity was the disable value. Any value >= 1 is a valid interval in minutes. Fix the consistency issue introduced by the default change: soften "not acceptable in a production environment" to "may not be acceptable in an FC production environment", and reframe the orthogonality paragraph to read naturally with the default-is-0 framing. Signed-off-by: Martin Belanger <martin.belanger@dell.com> Assisted-by: Claude:claude-sonnet-4-6 [Claude Code]
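The interval semantics described above (0 disables, any value >= 1 is minutes) could be validated along these lines. This is a hedged sketch: `parse_fc_kickstart_interval` is a hypothetical helper, and rejecting the former `infinity` value is an assumption, since the commit message does not say whether it is still accepted.

```python
def parse_fc_kickstart_interval(raw):
    """Return the probing interval in seconds, or None when disabled.

    0 disables periodic FC kickstart; any integer >= 1 is minutes.
    """
    # int() raises ValueError for non-integers, including "infinity"
    # (assumed here to be no longer accepted).
    value = int(str(raw).strip())
    if value < 0:
        raise ValueError("fc-kickstart-interval-minutes must be >= 0")
    if value == 0:
        return None  # opt-in feature: disabled
    return value * 60
```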
The entry-ID hash was unnecessary complexity: exclusion list management is human-only, so a sequential number scoped to a single interactive `nvme exclusion remove` invocation suffices, and `libnvmf_exclusion_remove()` now matches by exact entry-string content instead. The Python bindings no longer expose exclusion_add/remove/create/delete at all — providing a programmatic management API would contradict the human-administered design; only the read-only exclusion_match/lists/entries() remain. Signed-off-by: Martin Belanger <martin.belanger@dell.com> Assisted-by: Claude:claude-sonnet-4-6 [Claude Code]
The RFC referenced the legacy NVMe-oF autoconnect components only in passing (§1, §7.1), with no single place describing what nvme-cli installs today, when each piece fires, and how nvme-discoverd subsumes it. Add §12 with a full inventory and a clear split between components that establish connections (replaced) and those doing orthogonal tuning, key provisioning, or interface naming (kept). Document the NBFT late-connect path in particular: nvmf-connect-nbft.service has no [Install] section and is driven only by a NetworkManager dispatcher script, so it never fires under systemd-networkd or other managers. discoverd's retry loop covers the same case manager-agnostically, making the late-connect a latency optimization rather than a correctness requirement. Renumber Open Questions to §13 and Glossary to §14. Signed-off-by: Martin Belanger <martin.belanger@dell.com> Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Expand the RFC with reviewer-driven clarifications:
- §6.1: explain why discoverd uses its own discoverd.conf rather than reusing config.json (json-c dependency; JSON has no comments) or discovery.conf (DC-only; cannot express IOCs or global toggles).
- §12.1/§12.2: nvmf-connect.target was only a collective handle for the daemon-less design; the discoverd daemon is that coordinator, so no target equivalent is needed.
- §12.2: scope the autoconnect-rule replacement to the running system. Initramfs (Phase 1) connect is dracut's job (74nvmf, formerly 95nvmf), which has never used 70-nvmf-autoconnect.rules. Note that StartTransientUnit works over systemd's private socket without the D-Bus daemon, so discoverd is kept out of Phase 1 by design, not impossibility. Recommend removing the dead 70-nvmf-autoconnect.conf snippet from nvme-cli.
- §12.5: emphasize systemd-networkd as the common NetworkManager alternative that the legacy NBFT dispatcher path does not cover.

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Reconcile the exclusion RFC with what was actually built (review item L13):
- §6.1: libnvmf_exclusion_match() takes a struct libnvmf_tid * rather than seven positional string arguments (fewer call-site mistakes). Document the new libnvmf_exclusion_entry_valid() helper used to pre-validate hand-edited files without filesystem side effects.
- §6.3: mark the proposed key_value_list_parse()/_free() utilities as intentionally not provided; libnvmf_tid_parse() already covers parsing a semicolon-separated key=value string, and the matcher parses entries internally.
- §5: show the actual --name/--entry option form instead of positional <name>/<entry> arguments.

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
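As a rough illustration of what parsing a semicolon-separated key=value string involves, here is a Python sketch. It does not reproduce libnvmf_tid_parse(): the function name `parse_kv_entry`, the whitespace handling, and the rejection of duplicate keys are all assumptions.

```python
def parse_kv_entry(text):
    """Parse 'key=value;key=value' into a dict, rejecting malformed parts."""
    fields = {}
    for part in text.split(";"):
        part = part.strip()
        if not part:
            continue  # tolerate a trailing or doubled ';'
        key, sep, value = part.partition("=")
        key = key.strip()
        if not sep or not key:
            raise ValueError(f"malformed field: {part!r}")
        if key in fields:
            raise ValueError(f"duplicate key: {key!r}")
        fields[key] = value.strip()
    return fields
```

A pure function like this has no filesystem side effects, which is the property the commit message attributes to the pre-validation helper for hand-edited files.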
tbzatek left a comment
Well written design documents, I appreciate the level of detail!
Question about the legacy nvme connect-all --nbft: it is still used by dracut, so I guess it stays, yet discoverd reimplements some of its logic. It would be nice to unify the codebase to avoid duplication. My concern is the interpretation of various NBFT flags and processes... although certain best practices got added to the Boot Spec, there are still gaps. There has always been disparity between various UEFI implementations and the OS, even though we generally tend to avoid adding quirks for firmware bugs in the nvme-cli NBFT code.
**Tools that bypass libnvme or use NULL owner** produce unowned connections: no registry entry is written, and `disconnect-all` treats them as freely disconnectable. The most common sources are:

- **UDisks**: a D-Bus daemon that provides block-device management to desktop environments; it calls libblockdev (`bd_nvme_connect`), which calls libnvme internally, but neither participates in registry ownership.
The plan is to always supply owner information, e.g. 'udisks', with a possibility to override this via an optional argument to any arbitrary string.
Since UDisks is a rather high-level layer, we're able to respect exclusions actively and refuse connection, unless forced. At this point only simple connect & disconnect commands are provided, falling into Tier 1. However there were plans to provide simple discovery incl. mDNS support browser functionality in the future, leaning towards Tier 2 (as there will likely always be an external consumer to trigger any action first).
> The plan is to always supply owner information, e.g. 'udisks', ...
Thanks — I'll update the UDisks entry accordingly: supplies owner='udisks' (overridable to an arbitrary string), actively respects exclusions and refuses unless forced, Tier 1 today. I'll add this to the RFC.
> However there were plans to provide simple discovery incl. mDNS support browser functionality in the future...
On the future mDNS direction, though, I'd like to clarify before we pencil in a Tier 2 / mDNS-browser role for UDisks — because mDNS/DNS-SD discovery of Discovery Controllers is a genuine can of worms, and a third independent browser would compound it. We're already wrestling with how to coordinate mDNS browsing between just two orchestrators, nvme-discoverd and nvme-stas: having both browse at once is a misconfiguration we have to actively detect and arbitrate (the "zeroconf-conflict" problem, still unsolved in the general case). If UDisks adds its own mDNS browser, that's a third layer independently discovering the same DCs and potentially connecting, with no shared owner of "who browses mDNS on this host." Could you say more about what UDisks intends — a full mDNS DC-discovery browser, or something narrower (e.g. surfacing controllers another component already discovered)? If it's the former, I think we need a cross-orchestrator story for mDNS-browser ownership before multiple layers start doing it independently — two is already hard, and three concurrent browsers on one host would be painful to reason about. I'd rather flag it now than debug three browsers fighting later.
Generally speaking, any new party can come to the game anytime.
The plan with UDisks is to provide underlying services for gvfs - it might actually be gvfs simply adding another protocol to its existing avahi-based network browser. UDisks would then provide high-level discovery and connect services, guarded by polkit rules. This is all on demand and it's tiered - i.e. the base gvfsd-network browser would just report machines in the network exposing _nvme-disc._tcp, and discovery is only made upon user intention - i.e. discovery is not supposed to be done proactively. The target audience is end users, with the expectation that neither nvme-discoverd nor nvme-stas is going to be configured for zeroconf on such systems. Neither gvfs nor UDisks can use discoverd as that's systemd-only; this will need to be a separate reimplementation, reusing as many general parts from libnvme as possible.
- An NBFT boot-path controller carries `owner=nbft`; nvme-stas never disconnects it.
- Since nvme-stas never disconnects a controller owned by another orchestrator, there is nothing for nvme-discoverd to reconnect, so a bounce loop cannot occur.

**NBFT controllers are also immune to nvme-discoverd's exclusion list.** nvme-discoverd reconnects NBFT-sourced controllers unconditionally, regardless of any matching exclusion list entry. An exclusion entry targeting a boot device is a misconfiguration; nvme-discoverd logs a warning and reconnects. See `rfc-nvme-registry.md` §4.4 for the `owner=nbft` semantics.
We need this for NBFT testing... although the design is well suited for real-world scenarios, we'd like to have an option to manually disconnect a particular NBFT controller and prevent it from being reconnected.
Further thoughts:
- explicitly stop `nvme-discoverd` for such tests?
- what about a multipath scenario where a particular network interface is intentionally going down: perform disconnect first and then tear down the interface to avoid the kernel host nvme driver reconnection mechanism?
Good — there's a clean answer to all three.
Manually disconnect a particular NBFT controller, and keep it disconnected. There's a single command for exactly this: nvme disconnect --exclude <device>. It derives an exclusion entry from the controller's sysfs attributes (transport, traddr, trsvcid, subsystem NQN — plus host-iface when the connection is pinned to one), writes that entry to the exclusion list first, and only then disconnects. Writing the exclusion before the disconnect is deliberate: it guarantees the entry is in place before the device-removal event ever reaches discoverd, so discoverd gets no window to reconnect. That's the surgical, race-free option — it suppresses just that one controller while discoverd keeps managing everything else.
This works because we're relaxing the previous "NBFT bypasses exclusion" rule: discoverd now honours the exclusion list for NBFT controllers too. (owner=nbft still protects the controller from other orchestrators — only discoverd's own reconnect yields to your explicit exclusion. And since excluding a boot device is a foot-cannon, the nvme exclusion add path warns when an entry would match an owner=nbft controller, skippable with --force.) If you instead want to disconnect without excluding — i.e. let discoverd bring it straight back — plain nvme disconnect <device> is the no-checks single-device escape hatch with no exclusion side effect.
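The exclude-then-disconnect ordering described above can be sketched in Python as follows. Everything here is illustrative: the function names, the `nqn` key, and the semicolon-separated entry format are assumptions, and the two callables stand in for whatever actually writes the exclusion list and tears down the controller.

```python
# Hypothetical field order for the derived entry; host-iface is only
# present when the connection is pinned to an interface.
ENTRY_KEYS = ("transport", "traddr", "trsvcid", "nqn", "host-iface")


def exclusion_entry_from_attrs(attrs):
    """Build an entry string from controller attributes, skipping unset ones."""
    return ";".join(f"{k}={attrs[k]}" for k in ENTRY_KEYS if attrs.get(k))


def disconnect_with_exclude(attrs, add_exclusion, disconnect):
    """Write the exclusion first, then disconnect.

    If add_exclusion() raises, disconnect() is never called, so the
    controller is not torn down without the exclusion in place.
    """
    entry = exclusion_entry_from_attrs(attrs)
    add_exclusion(entry)  # must complete before the removal event exists
    disconnect()
    return entry
```

Because the entry is persisted before the device-removal event can be generated, a daemon that reacts to that event already sees the exclusion when it decides whether to reconnect.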
Stopping discoverd for tests? You can, and for a quick "stop managing everything" it's fine. But note it's the blunt instrument: discoverd is the single monitor for every connection on every interface, so stopping it suspends management of all paths, not just the controller under test. If you want to isolate one controller while the rest stay managed, the exclusion list above is the better lever.
The multipath interface-down case. Here I'm less sure there's a problem to solve, so let me lay out what actually happens and you can tell me if you're after something more. When a local interface is brought down under a live controller, the kernel doesn't special-case it — it enters its normal reconnect loop, which is slow (one attempt per reconnect-delay, default 10 s) and, when the connection is pinned with --host-iface, fails fast each time. It doesn't flood anything, and multipath has already failed I/O over to the live paths; when the interface returns, the path reconnects. discoverd, being connect-only, keeps that controller in its desired set and reconnects on return — the intended behaviour. So "disconnect first, then ifdown" is a reasonable manual hygiene step to avoid the (harmless, slow) reconnect churn, but it isn't something discoverd needs to orchestrate. If your goal is to permanently retire a path, that's a desired-set/config change (discoverd should stop wanting it) — and again the exclusion list is the tool. If you're seeing an actual failure beyond the cosmetic reconnect churn, tell me what it is and I'll dig in; otherwise I'd rather not build orderly single-path drain into a connect-only daemon without a concrete problem that needs it.
> discoverd now honours the exclusion list for NBFT controllers too.
That's great to hear, thanks.
> The multipath interface-down case.
This was mentioned as just another test case we've been performing to test the link-up reconnection hooks, i.e. avoiding the kernel connection recovery mechanism and starting from a clean slate. This is obviously supposed to be handled by discoverd now; we'll align our tests accordingly. There are some more extreme use cases like in-place upgrades with intermediary userspace that is performing dark magic... unimportant for now I'd say.
| `nvmf-connect-nbft.service` | systemd oneshot | On demand, started by the NM dispatcher on NBFT-interface up | **Replaced**: discoverd adopts/reconnects NBFT controllers from its NBFT cache (§7.1); see §12.5 |
| `80-nvmf-connect-nbft.sh` | NM dispatcher | NetworkManager interface-up for `nbft*`/HFI connections | **Replaced**: discoverd's retry loop, manager-agnostic (§12.5) |
| `70-nvmf-autoconnect.conf` | dracut conf | Build-time `install_items+=` snippet that copies `70-nvmf-autoconnect.rules` into the initramfs | **Remove outright** (pre-existing cruft). dracut has never used `70-nvmf-autoconnect.rules` and ships its own initramfs mechanism, so this snippet only copies an inert rule into the initrd. Safe to delete from nvme-cli independently of discoverd; early-boot connect is dracut's job (§7.1, §12.2) |
| `65-persistent-net-nbft.rules` | udev rule | Naming of `nbft*` interfaces | **Kept**: interface naming, not connect logic |
Not sure whether this ever served its purpose... @mwilck ?
I suspect you're right that it never really served a purpose, and the reason fits the rest of the picture. This rule does one thing — pin nbft* interface names so udev doesn't rename them — and the only thing that ever consumed that name was the NM dispatcher (80-nvmf-connect-nbft.sh), which matches nbft* to trigger the late-NBFT connect — i.e. the very path you just described as never really working. So with no working consumer, the naming rule has nothing to serve. I currently list it as "Kept" (orthogonal naming), but I'm inclined to move it to "Remove" alongside the NM-dispatcher path: if the late-NBFT machinery goes, this goes with it. I'll defer the definitive history to @mwilck — he'd know if anything else ever relied on the nbft* name — and tie the final disposition to the broader question of whether discoverd carries NBFT handling at all.
Updates following peer reviews Signed-off-by: Martin Belanger <martin.belanger@dell.com>
This PR is for review only and will not be merged. It contains four RFC documents proposing a coordinated set of features for NVMe-oF orchestrator coexistence in nvme-cli 3.0. Please use inline comments to provide feedback on specific sections.
Background
As NVMe-oF deployments grow, hosts increasingly run multiple tools that manage NVMe-oF connections: nvme-stas, dracut/initramfs scripts, and soon nvme-discoverd. Without coordination, these orchestrators can conflict — one may disconnect a controller that another is actively managing, or connect to a controller that another has deliberately excluded.
This RFC set proposes two libnvme building blocks to solve the coexistence problem (ownership registry and exclusion list), and a new daemon orchestrator (nvme-discoverd) that puts them to use.
The four RFCs
rfc-nvme-orchestrator-coexistence.md — Start here
The top-level document. Frames the two distinct conflict scenarios (accidental disconnect, accidental connect), introduces the two prevention mechanisms (ownership registry and exclusion list), defines a three-tier orchestrator hierarchy (raw commands → manual orchestrators → daemon orchestrators), and shows how nvme-discoverd and nvme-stas naturally partition work without requiring IPC coupling between them.
rfc-nvme-registry.md — Ownership registry
A lightweight, cooperative registry under `/run/nvme/registry/` that lets orchestrators declare ownership of connected controllers. `nvme disconnect-all` consults the registry before acting so it never disconnects a controller managed by a running daemon. The registry is a libnvme building block: a C API (`libnvmf_registry_*`), a Python binding, and a `nvme registry` CLI command family. An implementation PR already exists: #3425.
rfc-nvme-exclusion.md — Exclusion list
A human-administered exclusion list at `/etc/nvme/exclusions/` that prevents orchestrators from auto-connecting to controllers the administrator wants excluded. Its design center is auto-discovered controllers (mDNS, CDC DLP) where there is no configuration entry to remove. Managed via `nvme exclusion` CRUD commands. Enforcement is cooperative: each orchestrator reads the list and skips matching controllers; libnvme does not enforce it.
rfc-nvme-discoverd.md — nvme-discoverd
A new daemon proposed for nvme-cli 3.0. It manages NVMe-oF connections for statically configured controllers, NBFT boot controllers, FC-discovered controllers, and (in a future release) mDNS-discovered controllers. Key design choices:
- `libnvmf_connect()` directly; blocking `/dev/nvme-fabrics` writes happen in child processes managed by systemd. Achieves Daniel's no-threading goal.
- `--owner discoverd` on each generated unit; NBFT reconnects use `--owner nbft` to preserve the lifetime invariant set at boot.
What feedback is sought
- `nvme exclusion` command design?
- `owner=nbft` lifetime invariant, NBFT exclusion bypass, unowned-connections-are-fair-game semantics: do these cover the cases you care about?
Related
registry-v2)