From d5bffccb2086ad3c889acd04db652ea0036adf83 Mon Sep 17 00:00:00 2001 From: Marko Vejnovic Date: Thu, 25 Jun 2026 20:09:22 +0000 Subject: [PATCH 01/46] feat(suidhelper): mix suidhelper.install wrapping cargo xtask install --- lib/mix/tasks/suidhelper.install.ex | 40 +++++++++++++++++++++++++++++ 1 file changed, 40 insertions(+) create mode 100644 lib/mix/tasks/suidhelper.install.ex diff --git a/lib/mix/tasks/suidhelper.install.ex b/lib/mix/tasks/suidhelper.install.ex new file mode 100644 index 00000000..de27fbce --- /dev/null +++ b/lib/mix/tasks/suidhelper.install.ex @@ -0,0 +1,40 @@ +defmodule Mix.Tasks.Suidhelper.Install do + @shortdoc "Build, stamp, and install the setuid helper (wraps `cargo xtask install`)" + @moduledoc """ + Builds, stamps, and installs the Rust setuid helper by wrapping + `cargo xtask install` in `native/suidhelper`. + + mix suidhelper.install + + The xtask first stamps the release binary (BLAKE3 self-checksum into + `.note.sum`, the same step the `:suidhelper_stamp` compiler runs) and then + installs it setuid-root to `/usr/local/bin/hyper-suidhelper` via `sudo + install`. `sudo` may prompt for a password on the controlling terminal. + + This is the privileged counterpart to `mix suidhelper.stamp`: that one only + rebuilds, stamps, and re-captures the embedded build identity; this one also + places the binary on `PATH` setuid-root. `cargo` and the helper's toolchain + (see `native/suidhelper/rust-toolchain.toml`) must be installed. + """ + + use Mix.Task + + @helper_dir "native/suidhelper" + + @impl Mix.Task + def run(argv) do + {_, 0} = + System.cmd("cargo", ["xtask", "install" | argv], + cd: @helper_dir, + into: IO.stream(:stdio, :line) + ) + rescue + MatchError -> + Mix.raise(""" + `cargo xtask install` failed installing the suidhelper. + + Ensure `cargo` and the helper's toolchain (see #{@helper_dir}/rust-toolchain.toml) + are installed, and that `sudo` is available for the setuid install step. 
+ """) + end +end From 9201730a0f02f8bddba6cfb29c4c8dea28d60063 Mon Sep 17 00:00:00 2001 From: Marko Vejnovic Date: Thu, 25 Jun 2026 20:16:17 +0000 Subject: [PATCH 02/46] docs: document full node requirements (postgres, dm targets, kvm, cgroup v2, suidhelper) --- docs/cookbook/intro.md | 59 +++++++++++++++++++++++++++++++++++++----- 1 file changed, 52 insertions(+), 7 deletions(-) diff --git a/docs/cookbook/intro.md b/docs/cookbook/intro.md index a0030d35..2c69380a 100644 --- a/docs/cookbook/intro.md +++ b/docs/cookbook/intro.md @@ -19,13 +19,58 @@ The absolute best way to get started with `Hyper` is to play with it. ### Requirements -Hyper requires the following software be installed on each node running it: - - - [`skopeo`](https://github.com/containers/skopeo) - - [`e2fsprogs`](https://github.com/tytso/e2fsprogs) - -Hyper has more runtime dependencies, but they are automatically redistributed -by Hyper. +#### External services + +Hyper needs a **PostgreSQL** server reachable from every node — it is the image +database and the only stateful external dependency. + +#### System binaries + +The following must be on each node's `PATH` (the bracketed override is the +`config :hyper` key you can set if the binary lives elsewhere): + + - [`skopeo`](https://github.com/containers/skopeo) — pulls OCI images + (`skopeo_path`) + - [`e2fsprogs`](https://github.com/tytso/e2fsprogs) — provides `mke2fs`, which + builds the ext4 rootfs (`mke2fs_path`) + - `losetup`, `blockdev` (from **util-linux**) — loop-device setup + (`losetup_path`, `blockdev_path`) + - `dmsetup` (from **lvm2** / device-mapper) — dm-snapshot and thin-pool + layering (`dmsetup_path`). Frequently *not* installed by default — check + this one first. + - `du`, `getent` (from **coreutils** and **glibc**) — rootfs sizing and user + resolution. Present on essentially every distro. 
+ +#### Kernel features + +The host kernel must provide: + + - **KVM** — `/dev/kvm` must exist and be accessible to the per-VM users (see + the `uid_gid_range` configuration). + - **cgroup v2** — the unified hierarchy mounted at `/sys/fs/cgroup`. v1-only + hosts are not supported. + - **device-mapper targets** `snapshot`, `thin`, and `thin-pool` — load the + `dm_snapshot` and `dm_thin_pool` modules (`modprobe dm_snapshot + dm_thin_pool`). Hyper refuses to start its device helper without them. + - **loop devices** — the `loop` module, used to attach layer images as block + devices. + +#### Privileged setup + + - The **setuid-root device helper** (`hyper-suidhelper`) must be installed. + Run `mix suidhelper.install`, which builds, stamps, and places it + setuid-root on `PATH`. Every privileged operation (losetup, dmsetup, mknod, + chroot jails) routes through it; the BEAM itself runs unprivileged. + - A **parent cgroup** named by `cgroup_parent` (default `hyper`) must exist + under `/sys/fs/cgroup`; Hyper creates each VM's cgroup beneath it. + - The host UID/GID range given by `uid_gid_range` must be free for Hyper to + allocate per-VM users from. + +#### Auto-redistributed + +The remaining runtime dependencies — `firecracker`, `jailer`, `umoci`, and the +guest `vmlinux` kernels — are downloaded, checksum-verified, and managed by +Hyper itself; you do not install them. ### Installation From b68dd5e7ebb72d650229001369d08e366addf5bd Mon Sep 17 00:00:00 2001 From: Marko Vejnovic Date: Thu, 25 Jun 2026 20:16:47 +0000 Subject: [PATCH 03/46] docs: replace em-dashes with ASCII hyphens in intro.md --- docs/cookbook/intro.md | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/docs/cookbook/intro.md b/docs/cookbook/intro.md index 2c69380a..526c7e98 100644 --- a/docs/cookbook/intro.md +++ b/docs/cookbook/intro.md @@ -21,7 +21,7 @@ The absolute best way to get started with `Hyper` is to play with it. 
#### External services -Hyper needs a **PostgreSQL** server reachable from every node — it is the image +Hyper needs a **PostgreSQL** server reachable from every node - it is the image database and the only stateful external dependency. #### System binaries @@ -29,30 +29,30 @@ database and the only stateful external dependency. The following must be on each node's `PATH` (the bracketed override is the `config :hyper` key you can set if the binary lives elsewhere): - - [`skopeo`](https://github.com/containers/skopeo) — pulls OCI images + - [`skopeo`](https://github.com/containers/skopeo) - pulls OCI images (`skopeo_path`) - - [`e2fsprogs`](https://github.com/tytso/e2fsprogs) — provides `mke2fs`, which + - [`e2fsprogs`](https://github.com/tytso/e2fsprogs) - provides `mke2fs`, which builds the ext4 rootfs (`mke2fs_path`) - - `losetup`, `blockdev` (from **util-linux**) — loop-device setup + - `losetup`, `blockdev` (from **util-linux**) - loop-device setup (`losetup_path`, `blockdev_path`) - - `dmsetup` (from **lvm2** / device-mapper) — dm-snapshot and thin-pool - layering (`dmsetup_path`). Frequently *not* installed by default — check + - `dmsetup` (from **lvm2** / device-mapper) - dm-snapshot and thin-pool + layering (`dmsetup_path`). Frequently *not* installed by default - check this one first. - - `du`, `getent` (from **coreutils** and **glibc**) — rootfs sizing and user + - `du`, `getent` (from **coreutils** and **glibc**) - rootfs sizing and user resolution. Present on essentially every distro. #### Kernel features The host kernel must provide: - - **KVM** — `/dev/kvm` must exist and be accessible to the per-VM users (see + - **KVM** - `/dev/kvm` must exist and be accessible to the per-VM users (see the `uid_gid_range` configuration). - - **cgroup v2** — the unified hierarchy mounted at `/sys/fs/cgroup`. v1-only + - **cgroup v2** - the unified hierarchy mounted at `/sys/fs/cgroup`. v1-only hosts are not supported. 
- - **device-mapper targets** `snapshot`, `thin`, and `thin-pool` — load the + - **device-mapper targets** `snapshot`, `thin`, and `thin-pool` - load the `dm_snapshot` and `dm_thin_pool` modules (`modprobe dm_snapshot dm_thin_pool`). Hyper refuses to start its device helper without them. - - **loop devices** — the `loop` module, used to attach layer images as block + - **loop devices** - the `loop` module, used to attach layer images as block devices. #### Privileged setup @@ -68,8 +68,8 @@ The host kernel must provide: #### Auto-redistributed -The remaining runtime dependencies — `firecracker`, `jailer`, `umoci`, and the -guest `vmlinux` kernels — are downloaded, checksum-verified, and managed by +The remaining runtime dependencies - `firecracker`, `jailer`, `umoci`, and the +guest `vmlinux` kernels - are downloaded, checksum-verified, and managed by Hyper itself; you do not install them. ### Installation From e2253f61146e677a9432e12c3aec863e89325fb0 Mon Sep 17 00:00:00 2001 From: Marko Vejnovic Date: Thu, 25 Jun 2026 20:51:04 +0000 Subject: [PATCH 04/46] fix(suidhelper): make mix suidhelper.install tty-safe Mix spawns subprocesses in their own session with no controlling terminal (erl_child_setup calls setsid), so the nested `sudo install` could never prompt for a password and always failed with "a terminal is required". Split the task: `cargo xtask stamp` builds + stamps unprivileged, then the privileged copy runs via sudo only when it is already non-interactive (`sudo -n` succeeds); otherwise it prints the exact command to run by hand. 
--- lib/mix/tasks/suidhelper.install.ex | 101 +++++++++++++++++++++------- 1 file changed, 77 insertions(+), 24 deletions(-) diff --git a/lib/mix/tasks/suidhelper.install.ex b/lib/mix/tasks/suidhelper.install.ex index de27fbce..2085de3c 100644 --- a/lib/mix/tasks/suidhelper.install.ex +++ b/lib/mix/tasks/suidhelper.install.ex @@ -1,40 +1,93 @@ defmodule Mix.Tasks.Suidhelper.Install do - @shortdoc "Build, stamp, and install the setuid helper (wraps `cargo xtask install`)" + @shortdoc "Build, stamp, and install the setuid helper" @moduledoc """ - Builds, stamps, and installs the Rust setuid helper by wrapping - `cargo xtask install` in `native/suidhelper`. + Builds, stamps, and installs the Rust setuid helper. mix suidhelper.install - The xtask first stamps the release binary (BLAKE3 self-checksum into - `.note.sum`, the same step the `:suidhelper_stamp` compiler runs) and then - installs it setuid-root to `/usr/local/bin/hyper-suidhelper` via `sudo - install`. `sudo` may prompt for a password on the controlling terminal. + Two steps: - This is the privileged counterpart to `mix suidhelper.stamp`: that one only - rebuilds, stamps, and re-captures the embedded build identity; this one also - places the binary on `PATH` setuid-root. `cargo` and the helper's toolchain - (see `native/suidhelper/rust-toolchain.toml`) must be installed. + 1. `cargo xtask stamp` in `native/suidhelper` builds the release binary and + writes its BLAKE3 self-checksum into `.note.sum` (the same step the + `:suidhelper_stamp` compiler runs). + 2. The stamped binary is copied setuid-root (mode `4755`) to + `/usr/local/bin/hyper-suidhelper`. + + The copy needs root, but Mix runs every subprocess in its own session with no + controlling terminal (`erl_child_setup` calls `setsid`), so a nested `sudo` + cannot open `/dev/tty` to prompt for a password. This task therefore only runs + `sudo` itself when it is already non-interactive (`sudo -n` succeeds, e.g. + `NOPASSWD` or a usable cached credential). 
Otherwise it prints the exact + privileged command for you to run in your own terminal. + + This is the privileged counterpart to `mix suidhelper.stamp`, which stamps + only. `cargo` and the helper's toolchain (see + `native/suidhelper/rust-toolchain.toml`) must be installed. """ use Mix.Task @helper_dir "native/suidhelper" + @source Path.join(@helper_dir, "target/release/hyper-suidhelper") + # Must match `Hyper.Config`'s default `suid_helper` path and the xtask's + # `INSTALL_PATH`: a `PATH` location the unprivileged node can exec. + @install_path "/usr/local/bin/hyper-suidhelper" @impl Mix.Task def run(argv) do - {_, 0} = - System.cmd("cargo", ["xtask", "install" | argv], - cd: @helper_dir, - into: IO.stream(:stdio, :line) - ) - rescue - MatchError -> - Mix.raise(""" - `cargo xtask install` failed installing the suidhelper. - - Ensure `cargo` and the helper's toolchain (see #{@helper_dir}/rust-toolchain.toml) - are installed, and that `sudo` is available for the setuid install step. - """) + stamp!(argv) + install_privileged() + end + + defp stamp!(argv) do + case System.cmd("cargo", ["xtask", "stamp" | argv], + cd: @helper_dir, + into: IO.stream(:stdio, :line) + ) do + {_, 0} -> + :ok + + {_, _} -> + Mix.raise(""" + `cargo xtask stamp` failed building the suidhelper. + + Ensure `cargo` and the helper's toolchain (see #{@helper_dir}/rust-toolchain.toml) + are installed. + """) + end + end + + defp install_privileged do + if passwordless_sudo?() do + Mix.shell().info("Installing #{@source} -> #{@install_path} (setuid root)") + + case System.cmd("sudo", install_argv(), into: IO.stream(:stdio, :line)) do + {_, 0} -> Mix.shell().info("installed #{@install_path} (setuid root)") + {_, _} -> Mix.raise(manual_instructions()) + end + else + Mix.shell().info(manual_instructions()) + end + end + + # `sudo -n true` exits 0 only when sudo can run without prompting. 
With no + # controlling terminal a cached `tty_tickets` credential is invisible, so this + # is true essentially only under `NOPASSWD` -- exactly the case where the + # nested `sudo install` below can succeed. + defp passwordless_sudo? do + match?({_, 0}, System.cmd("sudo", ["-n", "true"], stderr_to_stdout: true)) + end + + defp install_argv, + do: ["install", "-o", "root", "-g", "root", "-m", "4755", @source, @install_path] + + defp manual_instructions do + """ + + The binary is built and stamped, but installing it setuid-root needs a + password and `sudo` has no terminal to prompt on here. Run the copy yourself: + + sudo #{Enum.join(install_argv(), " ")} + """ end end From 4ddcd4d5a18f29eb27464394d8805e77b979f7ab Mon Sep 17 00:00:00 2001 From: Marko Vejnovic Date: Thu, 25 Jun 2026 20:51:14 +0000 Subject: [PATCH 05/46] feat(suidhelper): source device binaries from config, drop caller --bin The unprivileged node used to pass `--bin ` to the setuid helper for every losetup/dmsetup/blockdev op. Letting the caller name the binary the helper escalates to run is a needless trust hole, even with SafeBin checks. Move the paths into the helper-owned config (/etc/hyper/config.toml) with sane defaults (/usr/sbin/{losetup,dmsetup,blockdev}); the helper validates each as a SafeBin (absolute, root-owned, non-writable, exact basename) at dispatch, as the real uid, before acquiring root. The `--bin` argument is gone from both sides. Also: - Add a `dmsetup targets` op so the dm-target readiness probe runs through the helper (it opens /dev/mapper/control, which needs root) instead of shelling dmsetup directly as the BEAM user. - An absent config file now falls back to the built-in defaults (trusted, compiled into the root-owned binary); a present-but-untrusted file stays fatal. Drop the Elixir-side *_path config and per-tool presence checks. - Add :mix to the dialyzer PLT so the Mix tasks resolve Mix.raise/shell. 
--- lib/hyper/config.ex | 27 +++--- lib/hyper/suid_helper.ex | 18 ++-- lib/hyper/suid_helper/blockdev.ex | 11 +-- lib/hyper/suid_helper/dmsetup.ex | 43 +++------- lib/hyper/suid_helper/losetup.ex | 22 +---- mix.exs | 4 + native/suidhelper/src/config.rs | 76 ++++++++++++++++- native/suidhelper/src/tools/dmsetup/mod.rs | 10 +++ native/suidhelper/src/tools/mod.rs | 47 ++++++----- native/suidhelper/src/util/safe_bin.rs | 60 ++++++------- native/suidhelper/tests/e2e/argv.rs | 97 ++++++++++++++-------- native/suidhelper/tests/e2e/config.rs | 25 ++++-- native/suidhelper/tests/util/safe_bin.rs | 4 +- 13 files changed, 265 insertions(+), 179 deletions(-) diff --git a/lib/hyper/config.ex b/lib/hyper/config.ex index 1332d9ee..e3c056d2 100644 --- a/lib/hyper/config.ex +++ b/lib/hyper/config.ex @@ -16,9 +16,6 @@ defmodule Hyper.Config do @parent_cgroup Application.compile_env(:hyper, :cgroup_parent, "hyper") @uid_gid_range Application.compile_env!(:hyper, :uid_gid_range) @layer_dir Application.compile_env!(:hyper, :layer_dir) - @losetup_path Application.compile_env(:hyper, :losetup_path, "losetup") - @dmsetup_path Application.compile_env(:hyper, :dmsetup_path, "dmsetup") - @blockdev_path Application.compile_env(:hyper, :blockdev_path, "blockdev") @skopeo_path Application.compile_env(:hyper, :skopeo_path, "skopeo") @umoci_path Application.compile_env(:hyper, :umoci_path, nil) @mke2fs_path Application.compile_env(:hyper, :mke2fs_path, "mke2fs") @@ -111,15 +108,6 @@ defmodule Hyper.Config do @spec layer_dir :: Path.t() def layer_dir, do: @layer_dir - @doc "Path to the losetup binary." - def losetup_path, do: @losetup_path - - @doc "Path to the dmsetup binary." - def dmsetup_path, do: @dmsetup_path - - @doc "Path to the blockdev binary." - def blockdev_path, do: @blockdev_path - @doc "Path to the skopeo binary (used by `Hyper.Img.OciLoader` to pull OCI images)." 
def skopeo_path, do: @skopeo_path @@ -132,15 +120,20 @@ defmodule Hyper.Config do @doc "Path to the mke2fs binary (used by `Hyper.Img.OciLoader` to build the ext4 rootfs)." def mke2fs_path, do: @mke2fs_path + # Where `cargo xtask install` (via `mix suidhelper.install`) drops the helper. + @default_suid_helper "/usr/local/bin/hyper-suidhelper" + @doc """ - Path to the setuid-root device helper (`hyper-suidhelper`). Required: the node - runs unprivileged and routes every `losetup`/`dmsetup`/`blockdev` operation - through it. + Path to the setuid-root device helper (`hyper-suidhelper`). The node runs + unprivileged and routes every `losetup`/`dmsetup`/`blockdev` operation through + it. - Runtime config (host-specific), so it can be set per node without recompiling. + Defaults to `#{@default_suid_helper}`, the install path used by `mix + suidhelper.install`. Runtime config (host-specific), so an operator who + installs it elsewhere can override per node without recompiling. """ @spec suid_helper :: Path.t() - def suid_helper, do: Application.fetch_env!(:hyper, :suid_helper) + def suid_helper, do: Application.get_env(:hyper, :suid_helper, @default_suid_helper) @doc """ Directory for per-VM scratch (writable-layer COW) files. Must be node-local and diff --git a/lib/hyper/suid_helper.ex b/lib/hyper/suid_helper.ex index 690f004a..6d4ff56d 100644 --- a/lib/hyper/suid_helper.ex +++ b/lib/hyper/suid_helper.ex @@ -16,7 +16,7 @@ defmodule Hyper.SuidHelper do self-test and reports the base path it was compiled against. """ - alias Hyper.SuidHelper.{Blockdev, Dmsetup, Expected, Losetup} + alias Hyper.SuidHelper.{Dmsetup, Expected} use OpenTelemetryDecorator @@ -51,18 +51,20 @@ defmodule Hyper.SuidHelper do end @doc """ - Check that the setuid helper and every tool it execs are usable on this - machine: the helper binary is present, is the build this release expects - (`verify_version/0`), then each tool submodule's own check. 
+ Check that the setuid helper is usable on this machine: the helper binary is + present, is the build this release expects (`verify_version/0`), and the kernel + exposes the device-mapper targets we need (`Dmsetup.test_system/0`, which also + exercises the helper's configured `dmsetup` binary). + + The `losetup`/`blockdev` binaries are validated by the helper the first time + each is used; their paths live in the helper's own config, not here. """ @spec test_system() :: :ok | {:error, term()} @decorate with_span("Hyper.SuidHelper.test_system") def test_system do with :ok <- helper_present(), - :ok <- verify_version(), - :ok <- Losetup.test_system(), - :ok <- Dmsetup.test_system() do - Blockdev.test_system() + :ok <- verify_version() do + Dmsetup.test_system() end end diff --git a/lib/hyper/suid_helper/blockdev.ex b/lib/hyper/suid_helper/blockdev.ex index 9f675ea5..760fae84 100644 --- a/lib/hyper/suid_helper/blockdev.ex +++ b/lib/hyper/suid_helper/blockdev.ex @@ -11,18 +11,9 @@ defmodule Hyper.SuidHelper.Blockdev do @spec device_sectors(Path.t()) :: {:ok, pos_integer()} | {:error, err()} @decorate with_span("Hyper.SuidHelper.Blockdev.device_sectors", include: [:path]) def device_sectors(path) do - case SuidHelper.exec(["blockdev", "--bin", Hyper.Config.blockdev_path(), "--getsz", path]) do + case SuidHelper.exec(["blockdev", "--getsz", path]) do {:ok, %{"sectors" => n}} -> {:ok, n} {:error, _} = err -> err end end - - @doc "Check the blockdev binary is present." 
- @spec test_system() :: :ok | {:error, :blockdev_not_found} - @decorate with_span("Hyper.SuidHelper.Blockdev.test_system") - def test_system do - if System.find_executable(Hyper.Config.blockdev_path()), - do: :ok, - else: {:error, :blockdev_not_found} - end end diff --git a/lib/hyper/suid_helper/dmsetup.ex b/lib/hyper/suid_helper/dmsetup.ex index ef670634..e32ec847 100644 --- a/lib/hyper/suid_helper/dmsetup.ex +++ b/lib/hyper/suid_helper/dmsetup.ex @@ -59,14 +59,7 @@ defmodule Hyper.SuidHelper.Dmsetup do @spec remove(String.t()) :: :ok | {:error, err()} @decorate with_span("Hyper.SuidHelper.Dmsetup.remove", include: [:name]) def remove(name) do - case SuidHelper.exec([ - "dmsetup", - "--bin", - Hyper.Config.dmsetup_path(), - "remove", - "--retry", - name - ]) do + case SuidHelper.exec(["dmsetup", "remove", "--retry", name]) do {:ok, _} -> :ok {:error, _} = err -> err end @@ -76,39 +69,31 @@ defmodule Hyper.SuidHelper.Dmsetup do @spec message(String.t(), String.t()) :: :ok | {:error, err()} @decorate with_span("Hyper.SuidHelper.Dmsetup.message", include: [:name, :message]) def message(name, message) do - argv = - ["dmsetup", "--bin", Hyper.Config.dmsetup_path(), "message", name, "--message", message] - - case SuidHelper.exec(argv) do + case SuidHelper.exec(["dmsetup", "message", name, "--message", message]) do {:ok, _} -> :ok {:error, _} = err -> err end end @doc """ - Check the dmsetup binary is present and the kernel exposes the dm targets we - use (snapshot, thin, thin-pool). + Verify the kernel exposes the dm targets we use (snapshot, thin, thin-pool). + + Routes through the setuid helper: `dmsetup targets` opens `/dev/mapper/control`, + which needs root, and the BEAM runs unprivileged. The helper validates its + configured `dmsetup` binary before running it, so a missing or unsafe binary + surfaces here too. 
""" @spec test_system() :: :ok | {:error, term()} @decorate with_span("Hyper.SuidHelper.Dmsetup.test_system") def test_system do - if System.find_executable(Hyper.Config.dmsetup_path()), - do: test_targets(), - else: {:error, :dmsetup_not_found} - end - - @doc "Verify the kernel exposes the dm targets we use (snapshot, thin, thin-pool)." - @spec test_targets() :: :ok | {:error, term()} - @decorate with_span("Hyper.SuidHelper.Dmsetup.test_targets") - def test_targets do - case System.cmd(Hyper.Config.dmsetup_path(), ["targets"], stderr_to_stdout: true) do - {out, 0} -> + case SuidHelper.exec(["dmsetup", "targets"]) do + {:ok, %{"output" => out}} -> have = parse_targets(out) missing = Enum.reject(@required_targets, &MapSet.member?(have, &1)) if missing == [], do: :ok, else: {:error, {:missing_dm_targets, missing}} - {out, code} -> - {:error, {:dmsetup_targets_failed, code, String.trim(out)}} + {:error, {code, msg}} -> + {:error, {:dmsetup_targets_failed, code, msg}} end end @@ -145,9 +130,7 @@ defmodule Hyper.SuidHelper.Dmsetup do # create flags (e.g. `--readonly`). Returns the `/dev/mapper/` path. 
@spec create(String.t(), String.t(), [String.t()]) :: {:ok, Path.t()} | {:error, err()} defp create(name, table, flags) do - argv = - ["dmsetup", "--bin", Hyper.Config.dmsetup_path(), "create", name] ++ - flags ++ ["--table", table] + argv = ["dmsetup", "create", name] ++ flags ++ ["--table", table] case SuidHelper.exec(argv) do {:ok, %{"device" => dev}} -> {:ok, dev} diff --git a/lib/hyper/suid_helper/losetup.ex b/lib/hyper/suid_helper/losetup.ex index d825b731..405c5cad 100644 --- a/lib/hyper/suid_helper/losetup.ex +++ b/lib/hyper/suid_helper/losetup.ex @@ -11,7 +11,7 @@ defmodule Hyper.SuidHelper.Losetup do @spec attach_ro(Path.t()) :: {:ok, Path.t()} | {:error, err()} @decorate with_span("Hyper.SuidHelper.Losetup.attach_ro", include: [:path]) def attach_ro(path) do - case SuidHelper.exec(["losetup", "--bin", Hyper.Config.losetup_path(), "attach", path]) do + case SuidHelper.exec(["losetup", "attach", path]) do {:ok, %{"device" => dev}} -> {:ok, dev} {:error, _} = err -> err end @@ -21,14 +21,7 @@ defmodule Hyper.SuidHelper.Losetup do @spec attach_rw(Path.t()) :: {:ok, Path.t()} | {:error, err()} @decorate with_span("Hyper.SuidHelper.Losetup.attach_rw", include: [:path]) def attach_rw(path) do - case SuidHelper.exec([ - "losetup", - "--bin", - Hyper.Config.losetup_path(), - "attach", - "--rw", - path - ]) do + case SuidHelper.exec(["losetup", "attach", "--rw", path]) do {:ok, %{"device" => dev}} -> {:ok, dev} {:error, _} = err -> err end @@ -38,18 +31,9 @@ defmodule Hyper.SuidHelper.Losetup do @spec detach(Path.t()) :: :ok | {:error, err()} @decorate with_span("Hyper.SuidHelper.Losetup.detach", include: [:dev]) def detach(dev) do - case SuidHelper.exec(["losetup", "--bin", Hyper.Config.losetup_path(), "detach", dev]) do + case SuidHelper.exec(["losetup", "detach", dev]) do {:ok, _} -> :ok {:error, _} = err -> err end end - - @doc "Check the losetup binary is present." 
- @spec test_system() :: :ok | {:error, :losetup_not_found} - @decorate with_span("Hyper.SuidHelper.Losetup.test_system") - def test_system do - if System.find_executable(Hyper.Config.losetup_path()), - do: :ok, - else: {:error, :losetup_not_found} - end end diff --git a/mix.exs b/mix.exs index 7906c6a1..5c620b81 100644 --- a/mix.exs +++ b/mix.exs @@ -24,6 +24,10 @@ defmodule Hyper.MixProject do # Cache the PLTs in a stable, gitignored dir so CI can cache them. plt_local_path: "priv/plts", plt_core_path: "priv/plts", + # `:mix` is needed so the Mix tasks under `lib/mix/tasks` (which call + # `Mix.raise/1`, `Mix.shell/0`, and implement the `Mix.Task` behaviour) + # resolve instead of tripping `unknown_function`. + plt_add_apps: [:mix], # Verify @specs against actual returns, and flag ignored return values. flags: [:unmatched_returns, :extra_return, :missing_return] ] diff --git a/native/suidhelper/src/config.rs b/native/suidhelper/src/config.rs index 425ba386..2c94ba03 100644 --- a/native/suidhelper/src/config.rs +++ b/native/suidhelper/src/config.rs @@ -1,6 +1,7 @@ // SPDX-License-Identifier: AGPL-3.0-only //! Runtime host configuration, read from a single root-owned TOML file. +use crate::util::safe_bin::{self, SafeBin}; use crate::util::safe_file::{self, IsRegularFile, OnlyRootWritable, RootOwner, SafeFile}; use crate::util::safe_path::{self, IsAbsolute, SafePath, StrictComponents}; use nix::fcntl::OFlag; @@ -41,16 +42,57 @@ fn config_path() -> PathBuf { } /// Hyper's /etc/hyper/config.toml file format. +/// +/// The device-tool paths are read from here (never from the unprivileged +/// caller, which is why there is no `--bin` argument): the helper alone decides +/// which binary it escalates to run. Each defaults to its usual location and is +/// validated as a [`SafeBin`] before use. 
#[derive(Debug, Clone, Deserialize)] pub struct Config { work_dir: PathBuf, + #[serde(default = "default_dmsetup")] + dmsetup: PathBuf, + #[serde(default = "default_losetup")] + losetup: PathBuf, + #[serde(default = "default_blockdev")] + blockdev: PathBuf, +} + +// The default data root. Must match the Elixir node's `@dev_work_dir`, which it +// uses when the same config file is absent, so both sides agree (see +// `Hyper.Node.check_helper_base`). +fn default_work_dir() -> PathBuf { + PathBuf::from("/srv/hyper") +} + +fn default_dmsetup() -> PathBuf { + PathBuf::from("/usr/sbin/dmsetup") +} + +fn default_losetup() -> PathBuf { + PathBuf::from("/usr/sbin/losetup") +} + +fn default_blockdev() -> PathBuf { + PathBuf::from("/usr/sbin/blockdev") +} + +impl Default for Config { + fn default() -> Self { + Self { + work_dir: default_work_dir(), + dmsetup: default_dmsetup(), + losetup: default_losetup(), + blockdev: default_blockdev(), + } + } } impl Config { /// The process-wide config, loaded once (and forced unprivileged by - /// [`Config::init`]). A load failure is fatal: the helper cannot safely - /// operate without a trusted data root, so it prints the error and exits - /// rather than guessing a default. + /// [`Config::init`]). An absent file yields the built-in defaults; a + /// *present but untrusted* file (wrong owner/mode, malformed) is fatal - + /// the helper prints the error and exits rather than trusting it. pub fn get() -> &'static Config { LazyLock::force(&CONFIG) } @@ -74,6 +116,21 @@ impl Config { self.work_dir.join("jails") } + /// The validated `dmsetup` binary the helper will run. + pub fn dmsetup(&self) -> Result<SafeBin<"dmsetup">, safe_bin::Error> { + SafeBin::from_path(&self.dmsetup) + } + + /// The validated `losetup` binary the helper will run. + pub fn losetup(&self) -> Result<SafeBin<"losetup">, safe_bin::Error> { + SafeBin::from_path(&self.losetup) + } + + /// The validated `blockdev` binary the helper will run. 
+ pub fn blockdev(&self) -> Result<SafeBin<"blockdev">, safe_bin::Error> { + SafeBin::from_path(&self.blockdev) + } + /// Read, ownership-check, parse, and validate the config file. See the module /// docs for the trust model. pub fn safe_load() -> Result { @@ -82,7 +139,18 @@ impl Config { let safe_path: SafePath = path.clone().try_into()?; let file: SafeFile = - SafeFile::open(&safe_path, OFlag::O_RDONLY)?; + match SafeFile::open(&safe_path, OFlag::O_RDONLY) { + Ok(file) => file, + // A genuinely-absent file means "use the built-in defaults": those + // are compiled into this root-owned binary, so they are trusted. Any + // OTHER failure - a present but wrong-owner/mode file, an I/O error - + // stays fatal, because it is a signal (someone put an untrusted file + // there), not an absence. + Err(safe_file::ValidationError::Open(nix::errno::Errno::ENOENT)) => { + return Ok(Self::default()) + } + Err(e) => return Err(e.into()), + }; let body = std::io::read_to_string(std::fs::File::from(file.into_owned_fd())) .map_err(|_| LoadingError::Unreadable(path.clone()))?; diff --git a/native/suidhelper/src/tools/dmsetup/mod.rs b/native/suidhelper/src/tools/dmsetup/mod.rs index eef016ad..ca405610 100644 --- a/native/suidhelper/src/tools/dmsetup/mod.rs +++ b/native/suidhelper/src/tools/dmsetup/mod.rs @@ -52,6 +52,9 @@ enum DmOp { #[arg(long)] message: ThinMessage, }, + /// List the target types the kernel device-mapper exposes. Read-only, but + /// still needs root: it opens `/dev/mapper/control`. + Targets, } #[derive(Serialize)] @@ -60,6 +63,7 @@ pub enum DmsetupOut { Created { device: PathBuf }, Removed, Messaged, + Targets { output: String }, } pub struct Dmsetup { @@ -105,6 +109,9 @@ impl IsTool for Dmsetup { .arg("0") .arg(message.to_string()); } + DmOp::Targets => { + cmd.arg("targets"); + } } cmd.env_clear().output() @@ -124,6 +131,9 @@ impl IsTool for Dmsetup { }, DmOp::Remove { .. } => DmsetupOut::Removed, DmOp::Message { .. 
} => DmsetupOut::Messaged, + DmOp::Targets => DmsetupOut::Targets { + output: String::from_utf8_lossy(&out.stdout).into_owned(), + }, }) } } diff --git a/native/suidhelper/src/tools/mod.rs b/native/suidhelper/src/tools/mod.rs index 10c53c47..24a2be42 100644 --- a/native/suidhelper/src/tools/mod.rs +++ b/native/suidhelper/src/tools/mod.rs @@ -1,7 +1,8 @@ //! Per-tool CLI fragments and their `IsTool` implementations. Each tool lives in -//! its own submodule and owns its own error type, operand validation, and `--bin` -//! parser; this module owns the shared trait, the `Tool` subcommand tree, and the -//! privilege boundary. +//! its own submodule and owns its own error type and operand validation; this +//! module owns the shared trait, the `Tool` subcommand tree, and the privilege +//! boundary. The binary each tool runs is resolved from the trusted config here, +//! never passed by the caller. mod blockdev; pub mod chroot_jail; @@ -13,15 +14,15 @@ pub use chroot_jail::ChrootJailOp; pub use dmsetup::{DmTable, Dmsetup, DmsetupArgs, ThinMessage}; pub use losetup::{Losetup, LosetupArgs}; -use crate::util::safe_bin::SafeBin; +use crate::config::Config; use crate::util::setuid_privileged::{self, Privileged}; use clap::Subcommand; use serde::Serialize; use thiserror::Error as ThisError; -/// Errors of the dispatch layer: whatever the privilege guard or the chosen tool -/// raises on the way out. (`--bin` and operand validation are handled by clap at -/// parse time, so they never reach here.) +/// Errors of the dispatch layer: an invalid configured binary (`SafeBin`), the +/// privilege guard, or the chosen tool's own failure on the way out. (Operand +/// validation is handled by clap at parse time, so it never reaches here.) 
#[derive(Debug, ThisError)] pub enum Error { #[error(transparent)] @@ -65,28 +66,23 @@ pub trait IsTool { } } -/// The subcommand tree: one subcommand per tool, each taking its own `--bin` -/// with the tool-specific args flattened in from the submodule. +/// The subcommand tree: one subcommand per tool, with the tool-specific args +/// flattened in from the submodule. The binary each tool runs is not a caller +/// argument - it comes from the root-owned config (see [`Config`]). #[derive(Subcommand)] pub enum Tool { /// Attach/detach loop devices. Losetup { - #[arg(long)] - bin: SafeBin<"losetup">, #[command(flatten)] args: LosetupArgs, }, /// Create/remove device-mapper snapshot devices. Dmsetup { - #[arg(long)] - bin: SafeBin<"dmsetup">, #[command(flatten)] args: DmsetupArgs, }, /// Query a block device's size. Blockdev { - #[arg(long)] - bin: SafeBin<"blockdev">, #[command(flatten)] args: BlockdevArgs, }, @@ -99,13 +95,24 @@ pub enum Tool { impl Tool { /// Dispatch to the selected tool's `run` (or, for `chroot-jail`, its nested - /// op), returning its already-serialized `Value`. The `--bin` is already - /// validated (it is a `SafeBin`, constructed only by its value parser). + /// op), returning its already-serialized `Value`. The binary path is taken + /// from the trusted config and validated (`SafeBin`) here, as the real uid, + /// before any privilege is acquired. 
pub fn run(self) -> Result { + let config = Config::get(); match self { - Tool::Losetup { bin, args } => Losetup::new(bin.into(), args).run(), - Tool::Dmsetup { bin, args } => Dmsetup::new(bin.into(), args).run(), - Tool::Blockdev { bin, args } => Blockdev::new(bin.into(), args).run(), + Tool::Losetup { args } => { + let bin = config.losetup().map_err(|e| Error::Tool(Box::new(e)))?; + Losetup::new(bin.into(), args).run() + } + Tool::Dmsetup { args } => { + let bin = config.dmsetup().map_err(|e| Error::Tool(Box::new(e)))?; + Dmsetup::new(bin.into(), args).run() + } + Tool::Blockdev { args } => { + let bin = config.blockdev().map_err(|e| Error::Tool(Box::new(e)))?; + Blockdev::new(bin.into(), args).run() + } Tool::ChrootJail { op } => op.run(), } } diff --git a/native/suidhelper/src/util/safe_bin.rs b/native/suidhelper/src/util/safe_bin.rs index 8f323aa7..91084407 100644 --- a/native/suidhelper/src/util/safe_bin.rs +++ b/native/suidhelper/src/util/safe_bin.rs @@ -1,17 +1,16 @@ -//! A validated `--bin` path. +//! A validated tool-binary path. //! -//! The caller names the binary to run, but it must be the expected tool and a -//! binary only root could have produced. [`SafeBin`] is a newtype whose only -//! constructor runs those checks, so holding one is proof the path was -//! validated. The const string parameter `NAME` is the basename it was validated -//! against - a `SafeBin<"losetup">` can never be passed where a -//! `SafeBin<"dmsetup">` is wanted. Combined with the [`FromStr`] impl (see -//! `tools`), clap validates the path at argument-parse time with no per-tool -//! boilerplate. +//! The path of each device tool (`losetup`, `dmsetup`, `blockdev`) comes from +//! the root-owned config file, never from the unprivileged caller. [`SafeBin`] +//! is a newtype whose only constructor runs the safety checks, so holding one is +//! proof the path was validated. The const string parameter `NAME` is the +//! 
basename it was validated against - a `SafeBin<"losetup">` can never be +//! passed where a `SafeBin<"dmsetup">` is wanted. //! -//! These checks are what keep this from being arbitrary-root-execution: an -//! unprivileged caller cannot point us at a binary it controls (must be -//! root-owned, not group/other-writable, not a symlink, exact basename). +//! These checks are what keep this from being arbitrary-root-execution: even a +//! mistaken config entry cannot point us at a binary a non-root user controls +//! (must be an absolute path, the exact basename, root-owned, not a symlink, not +//! group/other-writable). use std::ffi::OsStr; use std::fs; @@ -23,14 +22,14 @@ use thiserror::Error as ThisError; #[derive(Debug, ThisError)] pub enum Error { - #[error("--bin must be an absolute path: {0}")] + #[error("binary path must be absolute: {0}")] NotAbsolute(PathBuf), - #[error("--bin basename must be `{expected}`: {got}")] + #[error("binary basename must be `{expected}`: {got}")] Name { expected: &'static str, got: PathBuf, }, - #[error("--bin {path}: {source}")] + #[error("binary {path}: {source}")] Stat { path: PathBuf, #[source] @@ -44,22 +43,17 @@ pub enum Error { Writable(PathBuf), } -/// A `--bin` path validated to have basename `NAME`. The wrapped path is private -/// and the only constructor is the [`FromStr`] impl, so a `SafeBin` value cannot -/// exist without having been checked. +/// A tool-binary path validated to have basename `NAME`. The wrapped path is +/// private and the only constructor is [`SafeBin::from_path`], so a `SafeBin` +/// value cannot exist without having been checked. #[derive(Debug, Clone)] pub struct SafeBin(PathBuf); -// Lets clap validate `--bin` at parse time straight into a `SafeBin`, with -// no per-tool value parser: the const basename is the whole spec. Validates that -// `s` is an absolute path with basename `NAME`, a regular root-owned file no -// non-root user could have written. 
-impl FromStr for SafeBin { - type Err = Error; - - fn from_str(s: &str) -> Result { - let bin = Path::new(s); - +impl SafeBin { + /// Validate `bin` as the `NAME` tool binary: an absolute path with basename + /// `NAME`, a real (non-symlink) regular file owned by root that no non-root + /// user could have written. These checks are the whole point of the type. + pub fn from_path(bin: &Path) -> Result { if !bin.is_absolute() { return Err(Error::NotAbsolute(bin.to_path_buf())); } @@ -93,6 +87,16 @@ impl FromStr for SafeBin { } } +// Lets a string parse straight into a validated `SafeBin` (used by the +// test suite); delegates to the single `from_path` constructor. +impl FromStr for SafeBin { + type Err = Error; + + fn from_str(s: &str) -> Result { + Self::from_path(Path::new(s)) + } +} + // Read the validated path back out; the "validated" guarantee stays attached to // the `SafeBin` type until this conversion. impl From> for PathBuf { diff --git a/native/suidhelper/tests/e2e/argv.rs b/native/suidhelper/tests/e2e/argv.rs index 6ebf997a..519d664e 100644 --- a/native/suidhelper/tests/e2e/argv.rs +++ b/native/suidhelper/tests/e2e/argv.rs @@ -1,7 +1,7 @@ //! L4: prove the exact argv (and empty env) the helper hands to the child tool — -//! the one thing the design deliberately hides from the caller. We point `--bin` -//! at a root-owned fake that writes its argv+env to a file as JSON, then assert -//! on the reconstructed command line. +//! the one thing the design deliberately hides from the caller. We point the +//! tool's config path at a root-owned fake that writes its argv+env to a file as +//! JSON, then assert on the reconstructed command line. 
#![cfg(feature = "insecure_test_seams")] use std::fs; @@ -31,9 +31,16 @@ fn install_fake(dir: &Path, basename: &str, record: &Path, stdout_line: &str) -> path // root-owned because this test runs as root } -fn write_root_config(dir: &Path) -> PathBuf { +/// Write a root-owned config that points the named tools at the given (fake) +/// binaries, so the helper resolves each tool's path from config rather than a +/// caller argument. +fn write_root_config(dir: &Path, bins: &[(&str, &Path)]) -> PathBuf { let p = dir.join("config.toml"); - fs::write(&p, "work_dir = \"/srv/hyper\"\n").unwrap(); + let mut body = String::from("work_dir = \"/srv/hyper\"\n"); + for (key, path) in bins { + body.push_str(&format!("{key} = \"{}\"\n", path.display())); + } + fs::write(&p, body).unwrap(); fs::set_permissions(&p, fs::Permissions::from_mode(0o644)).unwrap(); p } @@ -60,17 +67,15 @@ fn dmsetup_create_snapshot_reconstructs_canonical_table_as_root() { return; } let tmp = tempfile::tempdir().unwrap(); - let cfg = write_root_config(tmp.path()); let rec = tmp.path().join("argv.json"); let bin = install_fake(tmp.path(), "dmsetup", &rec, ""); + let cfg = write_root_config(tmp.path(), &[("dmsetup", &bin)]); // Deliberately weird inner spacing; the helper must re-render canonically. 
let out = run( &cfg, &[ "dmsetup", - "--bin", - bin.to_str().unwrap(), "create", "hyper-vm1", "--readonly", @@ -106,21 +111,11 @@ fn dmsetup_remove_retry_toggle_as_root() { return; } let tmp = tempfile::tempdir().unwrap(); - let cfg = write_root_config(tmp.path()); let rec = tmp.path().join("argv.json"); let bin = install_fake(tmp.path(), "dmsetup", &rec, ""); + let cfg = write_root_config(tmp.path(), &[("dmsetup", &bin)]); - let out = run( - &cfg, - &[ - "dmsetup", - "--bin", - bin.to_str().unwrap(), - "remove", - "--retry", - "hyper-vm1", - ], - ); + let out = run(&cfg, &["dmsetup", "remove", "--retry", "hyper-vm1"]); assert_eq!(out.status.code(), Some(0)); assert_eq!(recorded_argv(&rec), vec!["remove", "--retry", "hyper-vm1"]); } @@ -132,16 +127,14 @@ fn dmsetup_message_create_thin_as_root() { return; } let tmp = tempfile::tempdir().unwrap(); - let cfg = write_root_config(tmp.path()); let rec = tmp.path().join("argv.json"); let bin = install_fake(tmp.path(), "dmsetup", &rec, ""); + let cfg = write_root_config(tmp.path(), &[("dmsetup", &bin)]); let out = run( &cfg, &[ "dmsetup", - "--bin", - bin.to_str().unwrap(), "message", "hyper-pool", "--message", @@ -156,6 +149,51 @@ fn dmsetup_message_create_thin_as_root() { ); } +#[test] +fn dmsetup_targets_argv_and_parse_as_root() { + if !is_root() { + eprintln!("SKIP dmsetup_targets: needs root"); + return; + } + let tmp = tempfile::tempdir().unwrap(); + let rec = tmp.path().join("argv.json"); + // Fake prints one `dmsetup targets` row; the helper returns it verbatim. 
+ let bin = install_fake(tmp.path(), "dmsetup", &rec, "snapshot v1.16.0"); + let cfg = write_root_config(tmp.path(), &[("dmsetup", &bin)]); + + let out = run(&cfg, &["dmsetup", "targets"]); + assert_eq!( + out.status.code(), + Some(0), + "stderr: {}", + String::from_utf8_lossy(&out.stderr) + ); + assert_eq!(recorded_argv(&rec), vec!["targets"]); + let json: serde_json::Value = serde_json::from_slice(&out.stdout).unwrap(); + assert_eq!(json["result"], "targets"); + assert_eq!(json["output"], "snapshot v1.16.0\n"); +} + +#[test] +fn dmsetup_rejects_configured_bin_with_wrong_basename_as_root() { + if !is_root() { + eprintln!("SKIP dmsetup_rejects_bin: needs root to own the config file"); + return; + } + let tmp = tempfile::tempdir().unwrap(); + // A real, root-owned system file, but the wrong basename for `dmsetup`. + let cfg = write_root_config(tmp.path(), &[("dmsetup", Path::new("/usr/bin/env"))]); + + let out = run(&cfg, &["dmsetup", "targets"]); + assert_ne!( + out.status.code(), + Some(0), + "a configured binary with the wrong basename must be refused" + ); + let err = String::from_utf8_lossy(&out.stderr); + assert!(err.contains("basename must be"), "stderr: {err}"); +} + #[test] fn blockdev_getsz_argv_and_parse_as_root() { if !is_root() { @@ -163,21 +201,12 @@ fn blockdev_getsz_argv_and_parse_as_root() { return; } let tmp = tempfile::tempdir().unwrap(); - let cfg = write_root_config(tmp.path()); let rec = tmp.path().join("argv.json"); // Fake prints "2048" as the sector count for the helper to parse. 
let bin = install_fake(tmp.path(), "blockdev", &rec, "2048"); + let cfg = write_root_config(tmp.path(), &[("blockdev", &bin)]); - let out = run( - &cfg, - &[ - "blockdev", - "--bin", - bin.to_str().unwrap(), - "--getsz", - "/dev/loop0", - ], - ); + let out = run(&cfg, &["blockdev", "--getsz", "/dev/loop0"]); assert_eq!( out.status.code(), Some(0), diff --git a/native/suidhelper/tests/e2e/config.rs b/native/suidhelper/tests/e2e/config.rs index cd3ba333..100a686a 100644 --- a/native/suidhelper/tests/e2e/config.rs +++ b/native/suidhelper/tests/e2e/config.rs @@ -34,17 +34,28 @@ fn write_root_config(dir: &Path, body: &str) -> std::path::PathBuf { p } -/// A missing config file is refused at SafeFile::open with exit code 2. -/// The error is LoadingError::File(ValidationError::Open(ENOENT)), -/// which displays as "open failed: ". +/// A genuinely-absent config file is NOT an error: the helper falls back to the +/// built-in defaults (compiled into this root-owned binary, hence trusted). The +/// default `work_dir` is `/srv/hyper`. Needs root because `sys-test` then +/// acquires privileges to prove it can promote. 
#[test] -fn missing_config_exits_2() { +fn missing_config_falls_back_to_defaults_as_root() { + if !is_root() { + eprintln!("SKIP missing_config defaults: sys-test needs root"); + return; + } let tmp = tempfile::tempdir().unwrap(); let missing = tmp.path().join("nope.toml"); let out = run_with_config(&missing, &["sys-test"]); - assert_eq!(out.status.code(), Some(2), "missing config must exit 2"); - let err = String::from_utf8_lossy(&out.stderr); - assert!(err.contains("open failed"), "stderr: {err}"); + assert_eq!( + out.status.code(), + Some(0), + "absent config should use defaults; stderr: {}", + String::from_utf8_lossy(&out.stderr) + ); + let json: serde_json::Value = serde_json::from_slice(&out.stdout).expect("stdout is JSON"); + assert_eq!(json["sys_test"], "ok"); + assert_eq!(json["hyper_base"], "/srv/hyper"); } #[test] diff --git a/native/suidhelper/tests/util/safe_bin.rs b/native/suidhelper/tests/util/safe_bin.rs index 950e06b7..e7522439 100644 --- a/native/suidhelper/tests/util/safe_bin.rs +++ b/native/suidhelper/tests/util/safe_bin.rs @@ -1,5 +1,5 @@ -//! `SafeBin` is what stops `--bin` from pointing the helper at an -//! attacker-controlled binary it would then run as root. The constructor demands +//! `SafeBin` is what stops a configured path from pointing the helper at +//! an attacker-controlled binary it would then run as root. The constructor demands //! an absolute path, exact basename, a real (non-symlink) regular file owned by //! root and not group/other-writable. These assert the refusal axes; the symlink //! axis is root-independent, the owner axis is asserted both ways. 
From f340571089761f7ae686f630ac4bfbba03f1ed23 Mon Sep 17 00:00:00 2001 From: Marko Vejnovic Date: Thu, 25 Jun 2026 20:51:19 +0000 Subject: [PATCH 06/46] docs: Postgres quickstart, optional helper config, install caveats Document the Docker Postgres setup matching config.exs defaults, the now config-sourced (and optional) device-binary paths, and the sudo/tty caveat for mix suidhelper.install. --- docs/cookbook/intro.md | 72 ++++++++++++++++++++++++++++++++++++++---- 1 file changed, 65 insertions(+), 7 deletions(-) diff --git a/docs/cookbook/intro.md b/docs/cookbook/intro.md index 526c7e98..7da19091 100644 --- a/docs/cookbook/intro.md +++ b/docs/cookbook/intro.md @@ -24,23 +24,70 @@ The absolute best way to get started with `Hyper` is to play with it. Hyper needs a **PostgreSQL** server reachable from every node - it is the image database and the only stateful external dependency. +For local development the quickest path is Docker. The connection details below +match the defaults in `config/config.exs` (`Hyper.Img.Db.Repo`): + +```sh +docker run -d --name hyper-pg \ + -e POSTGRES_USER=postgres \ + -e POSTGRES_PASSWORD=postgres \ + -e POSTGRES_DB=hyper_dev \ + -p 5432:5432 \ + postgres:16 +``` + +Once it is up, create and migrate the schema (the repo is not in `ecto_repos`, +so pass it with `-r`): + +```sh +mix ecto.create -r Hyper.Img.Db.Repo +mix ecto.migrate -r Hyper.Img.Db.Repo +``` + +The container is ephemeral; `docker start hyper-pg` brings it back after a +reboot. To point Hyper at an existing server instead, override the +`Hyper.Img.Db.Repo` block in your `config.exs`. 
+ #### System binaries -The following must be on each node's `PATH` (the bracketed override is the -`config :hyper` key you can set if the binary lives elsewhere): +These are used by the unprivileged node directly; each must be on the node's +`PATH` (the bracketed override is the `config :hyper` key you can set if the +binary lives elsewhere): - [`skopeo`](https://github.com/containers/skopeo) - pulls OCI images (`skopeo_path`) - [`e2fsprogs`](https://github.com/tytso/e2fsprogs) - provides `mke2fs`, which builds the ext4 rootfs (`mke2fs_path`) - - `losetup`, `blockdev` (from **util-linux**) - loop-device setup - (`losetup_path`, `blockdev_path`) - - `dmsetup` (from **lvm2** / device-mapper) - dm-snapshot and thin-pool - layering (`dmsetup_path`). Frequently *not* installed by default - check - this one first. - `du`, `getent` (from **coreutils** and **glibc**) - rootfs sizing and user resolution. Present on essentially every distro. +The privileged device binaries - `losetup`, `blockdev` (from **util-linux**) +and `dmsetup` (from **lvm2** / device-mapper) - are run only by the setuid +helper, never named by the unprivileged caller. Their paths therefore live in +the helper's own config, `/etc/hyper/config.toml`, and default to +`/usr/sbin/{losetup,blockdev,dmsetup}`. + +**The config file is optional.** If it is absent the helper uses the built-in +defaults below (and `work_dir = "/srv/hyper"`, matching the node's own +fallback). Create one only to override a default - and if you do, it must be +root-owned and not group/other-writable, or the helper refuses to start (a +present-but-untrusted file is treated as an attack signal, unlike a missing +one): + +```toml +# /etc/hyper/config.toml (root-owned, mode 0644) - every line optional +work_dir = "/srv/hyper" + +# Each must be an absolute path to a root-owned, non-world-writable binary; +# the helper validates this before it will exec the tool. 
+dmsetup = "/usr/sbin/dmsetup" +losetup = "/usr/sbin/losetup" +blockdev = "/usr/sbin/blockdev" +``` + +`dmsetup` (lvm2) is frequently *not* installed by default - check that one +first. + #### Kernel features The host kernel must provide: @@ -61,6 +108,17 @@ The host kernel must provide: Run `mix suidhelper.install`, which builds, stamps, and places it setuid-root on `PATH`. Every privileged operation (losetup, dmsetup, mknod, chroot jails) routes through it; the BEAM itself runs unprivileged. + + The final `sudo install` step runs without a controlling terminal (Mix + captures the nested `cargo` output), so on a typical `tty_tickets` sudo + setup it cannot prompt for a password. If it fails, the build has already + stamped the binary -- just run the copy yourself: + + ```sh + sudo install -o root -g root -m 4755 \ + native/suidhelper/target/release/hyper-suidhelper \ + /usr/local/bin/hyper-suidhelper + ``` - A **parent cgroup** named by `cgroup_parent` (default `hyper`) must exist under `/sys/fs/cgroup`; Hyper creates each VM's cgroup beneath it. - The host UID/GID range given by `uid_gid_range` must be free for Hyper to From 23a58830bb91aadf146f8199ad09bcf9d85b735a Mon Sep 17 00:00:00 2001 From: Marko Vejnovic Date: Thu, 25 Jun 2026 21:12:00 +0000 Subject: [PATCH 07/46] docs: how to load device-mapper modules for the dm targets Spell out modprobe + dmsetup targets verify, the linux-modules-extra fallback for stripped cloud kernels, and persisting via modules-load.d - the missing_dm_targets boot failure points here. --- docs/cookbook/intro.md | 29 ++++++++++++++++++++++++++--- 1 file changed, 26 insertions(+), 3 deletions(-) diff --git a/docs/cookbook/intro.md b/docs/cookbook/intro.md index 7da19091..57615f63 100644 --- a/docs/cookbook/intro.md +++ b/docs/cookbook/intro.md @@ -96,12 +96,35 @@ The host kernel must provide: the `uid_gid_range` configuration). - **cgroup v2** - the unified hierarchy mounted at `/sys/fs/cgroup`. v1-only hosts are not supported. 
- - **device-mapper targets** `snapshot`, `thin`, and `thin-pool` - load the - `dm_snapshot` and `dm_thin_pool` modules (`modprobe dm_snapshot - dm_thin_pool`). Hyper refuses to start its device helper without them. + - **device-mapper targets** `snapshot`, `thin`, and `thin-pool` - from the + `dm_snapshot` (provides `snapshot`) and `dm_thin_pool` (provides `thin` and + `thin-pool`) modules. Hyper refuses to start without all three; on boot it + fails with `{:missing_dm_targets, [...]}` listing whichever are absent. - **loop devices** - the `loop` module, used to attach layer images as block devices. +Load the modules and confirm the targets are present: + +```sh +sudo modprobe -a dm_snapshot dm_thin_pool loop +sudo dmsetup targets # must list snapshot, thin, and thin-pool +``` + +If `modprobe` reports the module is missing, the running kernel lacks it - +minimal cloud images often strip device-mapper. On Debian/Ubuntu, install the +extra modules for the running kernel, then load them: + +```sh +sudo apt-get install -y linux-modules-extra-$(uname -r) +sudo modprobe -a dm_snapshot dm_thin_pool loop +``` + +Make the modules load on every boot: + +```sh +printf 'dm_snapshot\ndm_thin_pool\nloop\n' | sudo tee /etc/modules-load.d/hyper.conf +``` + #### Privileged setup - The **setuid-root device helper** (`hyper-suidhelper`) must be installed. From b0c0b5f882520510104fb5abf0d0dd8185d8aafb Mon Sep 17 00:00:00 2001 From: Marko Vejnovic Date: Thu, 25 Jun 2026 21:25:57 +0000 Subject: [PATCH 08/46] docs: how to create the parent cgroup + delegate controllers mkdir under the cgroup-v2 hierarchy, enable cpu/memory in subtree_control (root-level fallback), and persist via systemd-tmpfiles since cgroupfs is memory-backed. The :missing_parent_cgroup boot failure points here.
--- docs/cookbook/intro.md | 26 +++++++++++++++++++++++++- 1 file changed, 25 insertions(+), 1 deletion(-) diff --git a/docs/cookbook/intro.md b/docs/cookbook/intro.md index 57615f63..99c80123 100644 --- a/docs/cookbook/intro.md +++ b/docs/cookbook/intro.md @@ -143,7 +143,31 @@ printf 'dm_snapshot\ndm_thin_pool\nloop\n' | sudo tee /etc/modules-load.d/hyper. /usr/local/bin/hyper-suidhelper ``` - A **parent cgroup** named by `cgroup_parent` (default `hyper`) must exist - under `/sys/fs/cgroup`; Hyper creates each VM's cgroup beneath it. + under the cgroup-v2 hierarchy; Hyper creates each VM's cgroup beneath it and + fails to boot with `:missing_parent_cgroup` if it is absent. Create it and + delegate the `cpu` and `memory` controllers so the per-VM cgroups can set + `cpu.max` / `memory.max`: + + ```sh + sudo mkdir -p /sys/fs/cgroup/hyper + echo '+cpu +memory' | sudo tee /sys/fs/cgroup/hyper/cgroup.subtree_control + ``` + + If that last write errors, the root hierarchy is not delegating those + controllers down yet - enable them there first, then retry the line above: + + ```sh + echo '+cpu +memory' | sudo tee /sys/fs/cgroup/cgroup.subtree_control + ``` + + The cgroup hierarchy is memory-backed, so `/sys/fs/cgroup/hyper` does **not** + survive a reboot. Re-create it each boot, or persist it with + `systemd-tmpfiles`: + + ```sh + echo 'd /sys/fs/cgroup/hyper 0755 root root -' \ + | sudo tee /etc/tmpfiles.d/hyper-cgroup.conf + ``` - The host UID/GID range given by `uid_gid_range` must be free for Hyper to allocate per-VM users from. From 5761a4ef82f4aa3a49c5f9d07e523229e5ef8caf Mon Sep 17 00:00:00 2001 From: Marko Vejnovic Date: Thu, 25 Jun 2026 21:27:52 +0000 Subject: [PATCH 09/46] fix(node): start Layer before Budget.Supervisor Budget.Advertiser advertises on init, which calls Hyper.Node.Layer.active/0 and selects on Hyper.Node.Layer.Registry. 
That registry is owned by Hyper.Node.Layer, which was ordered after Budget.Supervisor - so boot crashed with `unknown registry: Hyper.Node.Layer.Registry`. Order Layer first. --- lib/hyper/node.ex | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/lib/hyper/node.ex b/lib/hyper/node.ex index 6282eaa7..6892eef1 100644 --- a/lib/hyper/node.ex +++ b/lib/hyper/node.ex @@ -49,9 +49,12 @@ defmodule Hyper.Node do def init(_opts) do children = [ Hyper.Node.Users, + # Layer owns Hyper.Node.Layer.Registry, which Budget.Advertiser queries + # (via Hyper.Node.Layer.active/0) as it advertises on init - so Layer must + # be up first. + Hyper.Node.Layer, Hyper.Node.Budget.Supervisor, {DynamicSupervisor, name: @vm_sup, strategy: :one_for_one}, - Hyper.Node.Layer, Hyper.Node.Img ] From 8dc61d35439af8c27c56348492897c664e067462 Mon Sep 17 00:00:00 2001 From: Marko Vejnovic Date: Thu, 25 Jun 2026 21:33:47 +0000 Subject: [PATCH 10/46] fix: quiet benign ThinPool port exits and keyless OTLP export ThinPool traps exits (for terminate/2 device teardown), so the normal close of each System.cmd port arrived as an unhandled {:EXIT, port, :normal} and was logged as an error. Add a handle_info clause to ignore normal exits. Tracing defaulted to Honeycomb's endpoint but only sent the auth header when HONEYCOMB_API_KEY was set, so keyless runs 401'd on every batch. Export only when a key or a custom OTEL endpoint is configured; otherwise disable the exporter. --- config/runtime.exs | 32 +++++++++++++++++++++----------- lib/hyper/node/img/thin_pool.ex | 7 +++++++ 2 files changed, 28 insertions(+), 11 deletions(-) diff --git a/config/runtime.exs b/config/runtime.exs index a3b4fd0f..c9ca89fb 100644 --- a/config/runtime.exs +++ b/config/runtime.exs @@ -14,17 +14,27 @@ config :hyper, Hyper.Node.Config.Budget, # Where to send traces. Defaults to Honeycomb; override OTEL_EXPORTER_OTLP_* # to point at any OTLP/HTTP backend (Collector, Grafana, etc).
if config_env() != :test do - endpoint = System.get_env("OTEL_EXPORTER_OTLP_ENDPOINT", "https://api.honeycomb.io") + custom_endpoint = System.get_env("OTEL_EXPORTER_OTLP_ENDPOINT") + api_key = System.get_env("HONEYCOMB_API_KEY") - headers = - case System.get_env("HONEYCOMB_API_KEY") do - nil -> [] - "" -> [] - key -> [{"x-honeycomb-team", key}] - end + cond do + api_key not in [nil, ""] -> + config :opentelemetry_exporter, + otlp_protocol: :http_protobuf, + otlp_endpoint: custom_endpoint || "https://api.honeycomb.io", + otlp_headers: [{"x-honeycomb-team", api_key}] - config :opentelemetry_exporter, - otlp_protocol: :http_protobuf, - otlp_endpoint: endpoint, - otlp_headers: headers + custom_endpoint not in [nil, ""] -> + # A custom OTLP backend (e.g. a local Collector) needs no Honeycomb key. + config :opentelemetry_exporter, + otlp_protocol: :http_protobuf, + otlp_endpoint: custom_endpoint, + otlp_headers: [] + + true -> + # No backend configured: exporting to the Honeycomb default with no key + # 401s on every batch. Stay silent instead (typical for local dev). Set + # HONEYCOMB_API_KEY or OTEL_EXPORTER_OTLP_ENDPOINT to enable tracing. + config :opentelemetry, traces_exporter: :none + end end diff --git a/lib/hyper/node/img/thin_pool.ex b/lib/hyper/node/img/thin_pool.ex index f3ec3e19..9599da23 100644 --- a/lib/hyper/node/img/thin_pool.ex +++ b/lib/hyper/node/img/thin_pool.ex @@ -94,6 +94,13 @@ defmodule Hyper.Node.Img.ThinPool do {:reply, :ok, id_free(state, id)} end + @impl true + # Each privileged command runs through `System.cmd`, which links a transient + # port to this process; because we trap exits (for `terminate/2` teardown), + # the port's normal close is delivered here. Ignore it. An abnormal exit + # carries a non-`:normal` reason and falls through to the default handler. 
+ def handle_info({:EXIT, _port, :normal}, state), do: {:noreply, state} + @impl true def terminate(_reason, state) do _ = SuidHelper.Dmsetup.remove(@pool_name) From 1320700f0d3d088828382a0dd9d40b6ace50a255 Mon Sep 17 00:00:00 2001 From: Marko Vejnovic Date: Thu, 25 Jun 2026 21:42:40 +0000 Subject: [PATCH 11/46] fix(thin_pool): reclaim a stale dm pool on init An unclean shutdown (SIGKILL, so terminate/2 never ran) leaves hyper-thinpool behind; the next boot failed with "device-mapper: create ioctl ... Device or resource busy". Best-effort remove any pre-existing pool before rebuilding, ahead of zero_metadata so a still-live pool is never corrupted. --- lib/hyper/node/img/thin_pool.ex | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/lib/hyper/node/img/thin_pool.ex b/lib/hyper/node/img/thin_pool.ex index 9599da23..2302df85 100644 --- a/lib/hyper/node/img/thin_pool.ex +++ b/lib/hyper/node/img/thin_pool.ex @@ -53,6 +53,7 @@ defmodule Hyper.Node.Img.ThinPool do with :ok <- File.mkdir_p(Hyper.Config.scratch_dir()), {:ok, meta} <- ensure_backing(@meta_file, ImgConfig.thin_pool_meta_size()), {:ok, data} <- ensure_backing(@data_file, ImgConfig.thin_pool_data_size()), + :ok <- reclaim_stale_pool(), :ok <- zero_metadata(meta), {:ok, meta_loop} <- SuidHelper.Losetup.attach_rw(meta), {:ok, data_loop} <- SuidHelper.Losetup.attach_rw(data), @@ -118,6 +119,17 @@ defmodule Hyper.Node.Img.ThinPool do @spec id_free(map(), non_neg_integer()) :: map() def id_free(%{freed: freed} = s, id), do: %{s | freed: [id | freed]} + # A run that did not shut down cleanly (e.g. SIGKILL, so `terminate/2` never + # ran) leaves the dm pool behind, and recreating it fails with "Device or + # resource busy". Best-effort remove any stale pool of our name so this boot + # can build a fresh one. Runs before `zero_metadata`, which would otherwise + # corrupt a still-live pool. 
+ @spec reclaim_stale_pool() :: :ok + defp reclaim_stale_pool do + _ = SuidHelper.Dmsetup.remove(@pool_name) + :ok + end + # Create a sparse file of `size` if absent; reuse it if already present. @spec ensure_backing(String.t(), Information.t()) :: {:ok, Path.t()} | {:error, term()} defp ensure_backing(file, size) do From 6ef56d28ea96d95a4d41c7a230f2b64bafe8bb9d Mon Sep 17 00:00:00 2001 From: Marko Vejnovic Date: Thu, 25 Jun 2026 21:50:14 +0000 Subject: [PATCH 12/46] fix(node): validate OCI loader tools at boot Hyper.Img.OciLoader.test_system/0 (checks skopeo/umoci/mke2fs) existed but was never called from Hyper.Node.test_system, so a missing skopeo only surfaced at image-load time. Wire it in after Umoci.ensure_installed so the node refuses to boot when an OCI loader tool is absent, like every other host requirement. --- lib/hyper/node.ex | 1 + 1 file changed, 1 insertion(+) diff --git a/lib/hyper/node.ex b/lib/hyper/node.ex index 6892eef1..c9ab6ad9 100644 --- a/lib/hyper/node.ex +++ b/lib/hyper/node.ex @@ -151,6 +151,7 @@ defmodule Hyper.Node do :ok <- Hyper.Node.FireVMM.VmLinux.Provider.ensure_installed(), :ok <- Hyper.Node.Vmlinux.test_system(), :ok <- Hyper.Img.OciLoader.Umoci.ensure_installed(), + :ok <- Hyper.Img.OciLoader.test_system(), :ok <- Hyper.Node.Users.test_system(), :ok <- Hyper.Node.Layer.Repo.test_system(), :ok <- Hyper.SuidHelper.test_system(), From 06cc05bab14841dc843a14f02acc589594cdc800 Mon Sep 17 00:00:00 2001 From: Marko Vejnovic Date: Thu, 25 Jun 2026 21:55:07 +0000 Subject: [PATCH 13/46] deslop --- native/suidhelper/src/config.rs | 5 ----- 1 file changed, 5 deletions(-) diff --git a/native/suidhelper/src/config.rs b/native/suidhelper/src/config.rs index 2c94ba03..42079a11 100644 --- a/native/suidhelper/src/config.rs +++ b/native/suidhelper/src/config.rs @@ -42,11 +42,6 @@ fn config_path() -> PathBuf { } /// Hyper's /etc/hyper/config.toml file format. 
-/// -/// The device-tool paths are read from here (never from the unprivileged -/// caller, which is why there is no `--bin` argument): the helper alone decides -/// which binary it escalates to run. Each defaults to its usual location and is -/// validated as a [`SafeBin`] before use. #[derive(Debug, Clone, Deserialize)] pub struct Config { work_dir: PathBuf, From c0685ae553d4a113aa19ebcde76df2923cbb1bcc Mon Sep 17 00:00:00 2001 From: Marko Vejnovic Date: Thu, 25 Jun 2026 21:57:07 +0000 Subject: [PATCH 14/46] fix: ignore benign port exits in the remaining trap_exit servers Layer.Server, Img.Mutable and Img.Server all trap exits (for terminate/2 device teardown) and shell privileged commands through System.cmd, which links a transient port to the process. The port's normal close arrived as {:EXIT, port, :normal} with no matching handle_info clause - Layer.Server crash-looped on it while mounting a layer, which surfaced to create_vm as :no_capacity. Add the same ignore-clause already in ThinPool to each. --- lib/hyper/node/img/mutable.ex | 6 ++++++ lib/hyper/node/img/server.ex | 6 ++++++ lib/hyper/node/layer/server.ex | 6 ++++++ 3 files changed, 18 insertions(+) diff --git a/lib/hyper/node/img/mutable.ex b/lib/hyper/node/img/mutable.ex index 386e36f0..4dbab644 100644 --- a/lib/hyper/node/img/mutable.ex +++ b/lib/hyper/node/img/mutable.ex @@ -121,6 +121,12 @@ defmodule Hyper.Node.Img.Mutable do @impl true def handle_info(:idle_timeout, state), do: {:noreply, state} + @impl true + # Each privileged command runs through `System.cmd`, which links a transient + # port to this process; because we trap exits (for `terminate/2` teardown), + # the port's normal close is delivered here. Ignore it. 
+ def handle_info({:EXIT, _port, :normal}, state), do: {:noreply, state} + @impl true def terminate(_reason, state) do # Destroy the thin volume, then release the image (its monitor on us also diff --git a/lib/hyper/node/img/server.ex b/lib/hyper/node/img/server.ex index d51fe3ba..de6579f5 100644 --- a/lib/hyper/node/img/server.ex +++ b/lib/hyper/node/img/server.ex @@ -131,6 +131,12 @@ defmodule Hyper.Node.Img.Server do {:noreply, state} end + @impl true + # Each privileged command runs through `System.cmd`, which links a transient + # port to this process; because we trap exits (for `terminate/2` teardown), + # the port's normal close is delivered here. Ignore it. + def handle_info({:EXIT, _port, :normal}, state), do: {:noreply, state} + @impl true def terminate(_reason, %State{dm_names: dm_names}) do # Remove top-down (a snapshot's origin is the device below it). Layers are diff --git a/lib/hyper/node/layer/server.ex b/lib/hyper/node/layer/server.ex index 188e135b..f717e155 100644 --- a/lib/hyper/node/layer/server.ex +++ b/lib/hyper/node/layer/server.ex @@ -122,6 +122,12 @@ defmodule Hyper.Node.Layer.Server do {:noreply, state} end + @impl true + # Each privileged command runs through `System.cmd`, which links a transient + # port to this process; because we trap exits (for `terminate/2` teardown), + # the port's normal close is delivered here. Ignore it. + def handle_info({:EXIT, _port, :normal}, state), do: {:noreply, state} + @impl true def terminate(_reason, %State{blk_path: blk_path}) do case SuidHelper.Losetup.detach(blk_path) do From 7fa0480ce7ec495f5c2b14f6c9c376a496ba85fc Mon Sep 17 00:00:00 2001 From: Marko Vejnovic Date: Thu, 25 Jun 2026 22:04:37 +0000 Subject: [PATCH 15/46] fix(fire_vmm): child_spec key must be :id, not :vm_id DynamicSupervisor.start_child rejected the FireVMM child spec with {:invalid_child_spec, ...} because the map used a :vm_id key where a supervisor child spec requires :id - so no VM could ever boot. Use :id. 
Add @spec child_spec(Opts.t()) :: Supervisor.child_spec(). child_spec/1 is a plain def (not a typed @callback), so without a spec dialyzer never compared the returned map against the child-spec contract and the typo passed the gate. With the spec, the bad key fails dialyzer as invalid_contract. --- lib/hyper/node/fire_vmm.ex | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/lib/hyper/node/fire_vmm.ex b/lib/hyper/node/fire_vmm.ex index 7e5603b2..59388626 100644 --- a/lib/hyper/node/fire_vmm.ex +++ b/lib/hyper/node/fire_vmm.ex @@ -50,11 +50,12 @@ defmodule Hyper.Node.FireVMM do Supervisor.start_link(__MODULE__, opts, name: via(opts.vm_id)) end + @spec child_spec(Opts.t()) :: Supervisor.child_spec() def child_spec(opts) do # Keyed by VM id and :transient so a cleanly-stopped VM is not rebooted by # the node-level DynamicSupervisor. %{ - vm_id: {__MODULE__, opts.vm_id}, + id: {__MODULE__, opts.vm_id}, start: {__MODULE__, :start_link, [opts]}, type: :supervisor, restart: :transient From 2ac546ad5f2e23f01c7adc93e6c2ea271ea53fb1 Mon Sep 17 00:00:00 2001 From: Marko Vejnovic Date: Thu, 25 Jun 2026 22:04:37 +0000 Subject: [PATCH 16/46] fix(scheduler): log why a candidate refused placement place/3 reduced every candidate failure into a blanket {:error, :no_capacity}, so a real boot error on the only candidate was indistinguishable from genuine lack of capacity. Log the actual refusal reason at each candidate. 
--- lib/hyper/cluster/scheduler.ex | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/lib/hyper/cluster/scheduler.ex b/lib/hyper/cluster/scheduler.ex index 1412a0b2..4877658c 100644 --- a/lib/hyper/cluster/scheduler.ex +++ b/lib/hyper/cluster/scheduler.ex @@ -16,6 +16,8 @@ defmodule Hyper.Cluster.Scheduler do alias Hyper.Vm.Instance.Spec alias Unit.Information + require Logger + use OpenTelemetryDecorator @type layer_sizes :: [{Hyper.Layer.id(), Unit.Information.t()}] @@ -45,8 +47,15 @@ defmodule Hyper.Cluster.Scheduler do |> candidates(layers) |> Enum.reduce_while({:error, :no_capacity}, fn node, acc -> case attempt.(node) do - {:ok, result} -> {:halt, {:ok, {node, result}}} - {:error, _reason} -> {:cont, acc} + {:ok, result} -> + {:halt, {:ok, {node, result}}} + + {:error, reason} -> + # The candidate fit the snapshot but refused at confirmation time. + # Log the real reason: otherwise an actual boot failure on the only + # candidate is indistinguishable from genuine `:no_capacity`. + Logger.warning("scheduler: #{inspect(node)} refused placement: #{inspect(reason)}") + {:cont, acc} end end) end From 68c8d4b189ddd28463d4e6f563d829bd5cd23b56 Mon Sep 17 00:00:00 2001 From: Marko Vejnovic Date: Thu, 25 Jun 2026 22:25:29 +0000 Subject: [PATCH 17/46] feat(node): reclaim orphaned dm/loop devices at boot An unclean shutdown (SIGKILL / :erlang.halt, where terminate/2 never runs) leaves hyper dm devices and loop devices behind. The next boot then crashed in ThinPool.init with "device-mapper: create ioctl on hyper-thinpool: Device or resource busy", and a leaked thin volume held the pool open so the previous remove-the-pool-only reclaim could not clear it. 
Add read-only `dmsetup ls` and `losetup --list` ops to the suidhelper, and a Hyper.Node.Reclaim pass that runs once before any device GenServer starts: it removes every hyper-prefixed dm device leaf-first (retrying leftovers until a pass clears nothing new, so stacked snapshots and the pool-under-volume case resolve) then detaches loop devices backing files under the data dirs. Replaces ThinPool.init's pool-only reclaim. Best-effort; logs and continues. --- lib/hyper/node.ex | 10 +- lib/hyper/node/img/thin_pool.ex | 12 --- lib/hyper/node/reclaim.ex | 99 +++++++++++++++++++ lib/hyper/suid_helper/dmsetup.ex | 26 +++++ lib/hyper/suid_helper/losetup.ex | 25 +++++ native/suidhelper/src/tools/dmsetup/mod.rs | 9 ++ native/suidhelper/src/tools/losetup.rs | 15 +++ native/suidhelper/tests/e2e/argv.rs | 65 ++++++++++++ .../suid_helper/dmsetup_properties_test.exs | 13 +++ test/hyper/suid_helper/losetup_test.exs | 37 +++++++ 10 files changed, 297 insertions(+), 14 deletions(-) create mode 100644 lib/hyper/node/reclaim.ex create mode 100644 test/hyper/suid_helper/losetup_test.exs diff --git a/lib/hyper/node.ex b/lib/hyper/node.ex index c9ab6ad9..eb3bb738 100644 --- a/lib/hyper/node.ex +++ b/lib/hyper/node.ex @@ -40,8 +40,14 @@ defmodule Hyper.Node do def start_link(opts \\ []) do case test_system() do - :ok -> Supervisor.start_link(__MODULE__, opts, name: __MODULE__) - {:error, reason} -> {:error, reason} + :ok -> + # Clear any dm/loop devices a previous unclean shutdown left behind, + # before the device-owning children start and collide with them. 
+ :ok = Hyper.Node.Reclaim.run() + Supervisor.start_link(__MODULE__, opts, name: __MODULE__) + + {:error, reason} -> + {:error, reason} end end diff --git a/lib/hyper/node/img/thin_pool.ex b/lib/hyper/node/img/thin_pool.ex index 2302df85..9599da23 100644 --- a/lib/hyper/node/img/thin_pool.ex +++ b/lib/hyper/node/img/thin_pool.ex @@ -53,7 +53,6 @@ defmodule Hyper.Node.Img.ThinPool do with :ok <- File.mkdir_p(Hyper.Config.scratch_dir()), {:ok, meta} <- ensure_backing(@meta_file, ImgConfig.thin_pool_meta_size()), {:ok, data} <- ensure_backing(@data_file, ImgConfig.thin_pool_data_size()), - :ok <- reclaim_stale_pool(), :ok <- zero_metadata(meta), {:ok, meta_loop} <- SuidHelper.Losetup.attach_rw(meta), {:ok, data_loop} <- SuidHelper.Losetup.attach_rw(data), @@ -119,17 +118,6 @@ defmodule Hyper.Node.Img.ThinPool do @spec id_free(map(), non_neg_integer()) :: map() def id_free(%{freed: freed} = s, id), do: %{s | freed: [id | freed]} - # A run that did not shut down cleanly (e.g. SIGKILL, so `terminate/2` never - # ran) leaves the dm pool behind, and recreating it fails with "Device or - # resource busy". Best-effort remove any stale pool of our name so this boot - # can build a fresh one. Runs before `zero_metadata`, which would otherwise - # corrupt a still-live pool. - @spec reclaim_stale_pool() :: :ok - defp reclaim_stale_pool do - _ = SuidHelper.Dmsetup.remove(@pool_name) - :ok - end - # Create a sparse file of `size` if absent; reuse it if already present. 
@spec ensure_backing(String.t(), Information.t()) :: {:ok, Path.t()} | {:error, term()} defp ensure_backing(file, size) do diff --git a/lib/hyper/node/reclaim.ex b/lib/hyper/node/reclaim.ex new file mode 100644 index 00000000..f9513f89 --- /dev/null +++ b/lib/hyper/node/reclaim.ex @@ -0,0 +1,99 @@ +defmodule Hyper.Node.Reclaim do + @moduledoc """ + Boot-time reclamation of device-mapper and loop devices orphaned by an unclean + shutdown (SIGKILL or `:erlang.halt`, where the owning GenServers' `terminate/2` + never ran to tear them down). + + Hyper names every dm device it creates with a `hyper-` prefix (`hyper-thinpool`, + `hyper-rw-`, `hyper-img--`), so this removes exactly those - never an + operator's unrelated dm devices. Removal is leaf-first: a device still open by + another (the pool under a thin volume, a snapshot under the next in its chain) + refuses until its dependents are gone, so leftovers are retried until a pass + removes nothing new. Loop devices backing files under Hyper's data dirs are then + detached (the dm devices that held them are gone by that point). + + Entirely best-effort: every failure is logged and boot continues. It runs once, + before any device-owning GenServer starts, so the freshly-booting node never + collides with its own previous instance's leftovers. 
+ """ + + alias Hyper.SuidHelper.{Dmsetup, Losetup} + + require Logger + + @dm_prefix "hyper-" + + @spec run() :: :ok + def run do + reclaim_dm() + reclaim_loops() + :ok + end + + defp reclaim_dm do + case Dmsetup.list() do + {:ok, names} -> + case Enum.filter(names, &String.starts_with?(&1, @dm_prefix)) do + [] -> + :ok + + stale -> + Logger.warning( + "reclaim: removing #{length(stale)} stale dm device(s): #{inspect(stale)}" + ) + + remove_dm(stale) + end + + {:error, reason} -> + Logger.warning("reclaim: could not list dm devices: #{inspect(reason)}") + end + end + + @spec remove_dm([String.t()]) :: :ok + defp remove_dm([]), do: :ok + + defp remove_dm(names) do + {failed, removed_any?} = + Enum.reduce(names, {[], false}, fn name, {failed, any?} -> + case Dmsetup.remove(name) do + :ok -> {failed, true} + {:error, _} -> {[name | failed], any?} + end + end) + + cond do + failed == [] -> :ok + # A pass made progress: a retry may now clear the devices that were still + # held by the ones just removed. + removed_any? 
-> remove_dm(failed) + true -> Logger.error("reclaim: could not remove dm devices: #{inspect(failed)}") + end + end + + defp reclaim_loops do + case Losetup.list() do + {:ok, pairs} -> + for {dev, backing} <- pairs, under_data_dirs?(backing) do + case Losetup.detach(dev) do + :ok -> + :ok + + {:error, reason} -> + Logger.warning("reclaim: could not detach #{dev} (#{backing}): #{inspect(reason)}") + end + end + + :ok + + {:error, reason} -> + Logger.warning("reclaim: could not list loop devices: #{inspect(reason)}") + end + end + + @spec under_data_dirs?(Path.t()) :: boolean() + defp under_data_dirs?(path) do + String.starts_with?(path, Hyper.Config.scratch_dir() <> "/") or + String.starts_with?(path, Hyper.Config.layer_dir() <> "/") + end +end diff --git a/lib/hyper/suid_helper/dmsetup.ex b/lib/hyper/suid_helper/dmsetup.ex index e32ec847..3c57c6b5 100644 --- a/lib/hyper/suid_helper/dmsetup.ex +++ b/lib/hyper/suid_helper/dmsetup.ex @@ -107,6 +107,32 @@ defmodule Hyper.SuidHelper.Dmsetup do |> MapSet.new() end + @doc "Names of every device-mapper device currently present on this host." + @spec list() :: {:ok, [String.t()]} | {:error, err()} + @decorate with_span("Hyper.SuidHelper.Dmsetup.list") + def list do + case SuidHelper.exec(["dmsetup", "ls"]) do + {:ok, %{"output" => out}} -> {:ok, parse_names(out)} + {:error, _} = err -> err + end + end + + @doc false + @spec parse_names(String.t()) :: [String.t()] + def parse_names(out) do + case String.trim(out) do + # `dmsetup ls` prints this sentinel (not a device row) when there are none. 
+ "No devices found" -> + [] + + _ -> + out + |> String.split("\n", trim: true) + |> Enum.map(&(&1 |> String.split() |> List.first())) + |> Enum.reject(&is_nil/1) + end + end + @doc false @spec snapshot_table(Path.t(), Path.t(), pos_integer(), pos_integer()) :: String.t() def snapshot_table(origin_dev, cow_dev, sectors, chunk_sectors) do diff --git a/lib/hyper/suid_helper/losetup.ex b/lib/hyper/suid_helper/losetup.ex index 405c5cad..3abf4fba 100644 --- a/lib/hyper/suid_helper/losetup.ex +++ b/lib/hyper/suid_helper/losetup.ex @@ -36,4 +36,29 @@ defmodule Hyper.SuidHelper.Losetup do {:error, _} = err -> err end end + + @doc "Currently-attached loop devices as `{device, backing_file}` pairs." + @spec list() :: {:ok, [{Path.t(), Path.t()}]} | {:error, err()} + @decorate with_span("Hyper.SuidHelper.Losetup.list") + def list do + case SuidHelper.exec(["losetup", "list"]) do + {:ok, %{"output" => out}} -> {:ok, parse_list(out)} + {:error, _} = err -> err + end + end + + @doc false + @spec parse_list(String.t()) :: [{Path.t(), Path.t()}] + def parse_list(out) do + out + |> String.split("\n", trim: true) + |> Enum.flat_map(fn line -> + # `NAME BACK-FILE` rows; a loop with no backing file has only one column + # (nothing for us to reclaim by file), so skip it. + case String.split(line, " ", parts: 2, trim: true) do + [dev, backing] -> [{dev, String.trim(backing)}] + _ -> [] + end + end) + end end diff --git a/native/suidhelper/src/tools/dmsetup/mod.rs b/native/suidhelper/src/tools/dmsetup/mod.rs index ca405610..d77674df 100644 --- a/native/suidhelper/src/tools/dmsetup/mod.rs +++ b/native/suidhelper/src/tools/dmsetup/mod.rs @@ -55,6 +55,8 @@ enum DmOp { /// List the target types the kernel device-mapper exposes. Read-only, but /// still needs root: it opens `/dev/mapper/control`. Targets, + /// List the names of existing dm devices (for stale-device reclaim). 
+ Ls, } #[derive(Serialize)] @@ -64,6 +66,7 @@ pub enum DmsetupOut { Removed, Messaged, Targets { output: String }, + Listed { output: String }, } pub struct Dmsetup { @@ -112,6 +115,9 @@ impl IsTool for Dmsetup { DmOp::Targets => { cmd.arg("targets"); } + DmOp::Ls => { + cmd.arg("ls"); + } } cmd.env_clear().output() @@ -134,6 +140,9 @@ impl IsTool for Dmsetup { DmOp::Targets => DmsetupOut::Targets { output: String::from_utf8_lossy(&out.stdout).into_owned(), }, + DmOp::Ls => DmsetupOut::Listed { + output: String::from_utf8_lossy(&out.stdout).into_owned(), + }, }) } } diff --git a/native/suidhelper/src/tools/losetup.rs b/native/suidhelper/src/tools/losetup.rs index fa8a49d9..aadb5b67 100644 --- a/native/suidhelper/src/tools/losetup.rs +++ b/native/suidhelper/src/tools/losetup.rs @@ -66,6 +66,8 @@ enum LosetupOp { Attach(AttachArgs), /// Detach a loop device. Detach { dev: LoopDev }, + /// List loop devices as `NAME BACK-FILE` rows (for stale-device reclaim). + List, } #[derive(Serialize)] @@ -73,6 +75,7 @@ enum LosetupOp { pub enum LosetupOut { Attached { device: PathBuf }, Detached, + Listed { output: String }, } pub struct Losetup { @@ -101,6 +104,15 @@ impl IsTool for Losetup { let dev: &Path = dev.as_ref(); cmd.arg("-d").arg(dev); } + LosetupOp::List => { + cmd.args([ + "--list", + "--noheadings", + "--raw", + "--output", + "NAME,BACK-FILE", + ]); + } } cmd.env_clear().output() @@ -119,6 +131,9 @@ impl IsTool for Losetup { device: PathBuf::from(String::from_utf8_lossy(&out.stdout).trim()), }, LosetupOp::Detach { .. 
} => LosetupOut::Detached, + LosetupOp::List => LosetupOut::Listed { + output: String::from_utf8_lossy(&out.stdout).into_owned(), + }, }) } } diff --git a/native/suidhelper/tests/e2e/argv.rs b/native/suidhelper/tests/e2e/argv.rs index 519d664e..dcda8db9 100644 --- a/native/suidhelper/tests/e2e/argv.rs +++ b/native/suidhelper/tests/e2e/argv.rs @@ -174,6 +174,71 @@ fn dmsetup_targets_argv_and_parse_as_root() { assert_eq!(json["output"], "snapshot v1.16.0\n"); } +#[test] +fn dmsetup_ls_argv_and_parse_as_root() { + if !is_root() { + eprintln!("SKIP dmsetup_ls: needs root"); + return; + } + let tmp = tempfile::tempdir().unwrap(); + let rec = tmp.path().join("argv.json"); + let bin = install_fake(tmp.path(), "dmsetup", &rec, "hyper-thinpool\\nhyper-rw-abc"); + let cfg = write_root_config(tmp.path(), &[("dmsetup", &bin)]); + + let out = run(&cfg, &["dmsetup", "ls"]); + assert_eq!( + out.status.code(), + Some(0), + "stderr: {}", + String::from_utf8_lossy(&out.stderr) + ); + assert_eq!(recorded_argv(&rec), vec!["ls"]); + let json: serde_json::Value = serde_json::from_slice(&out.stdout).unwrap(); + assert_eq!(json["result"], "listed"); + assert_eq!(json["output"], "hyper-thinpool\nhyper-rw-abc\n"); +} + +#[test] +fn losetup_list_argv_and_parse_as_root() { + if !is_root() { + eprintln!("SKIP losetup_list: needs root"); + return; + } + let tmp = tempfile::tempdir().unwrap(); + let rec = tmp.path().join("argv.json"); + let bin = install_fake( + tmp.path(), + "losetup", + &rec, + "/dev/loop0 /srv/hyper/scratch/thinpool.meta", + ); + let cfg = write_root_config(tmp.path(), &[("losetup", &bin)]); + + let out = run(&cfg, &["losetup", "list"]); + assert_eq!( + out.status.code(), + Some(0), + "stderr: {}", + String::from_utf8_lossy(&out.stderr) + ); + assert_eq!( + recorded_argv(&rec), + vec![ + "--list", + "--noheadings", + "--raw", + "--output", + "NAME,BACK-FILE" + ] + ); + let json: serde_json::Value = serde_json::from_slice(&out.stdout).unwrap(); + assert_eq!(json["result"], 
"listed"); + assert_eq!( + json["output"], + "/dev/loop0 /srv/hyper/scratch/thinpool.meta\n" + ); +} + #[test] fn dmsetup_rejects_configured_bin_with_wrong_basename_as_root() { if !is_root() { diff --git a/test/hyper/suid_helper/dmsetup_properties_test.exs b/test/hyper/suid_helper/dmsetup_properties_test.exs index 73858fc6..a7e49d81 100644 --- a/test/hyper/suid_helper/dmsetup_properties_test.exs +++ b/test/hyper/suid_helper/dmsetup_properties_test.exs @@ -73,4 +73,17 @@ defmodule Hyper.SuidHelper.DmsetupPropertiesTest do assert Dmsetup.parse_targets(out) == MapSet.new(targets) end end + + property "parse_names recovers the device name from each `dmsetup ls` row" do + check all(names <- uniq_list_of(dev(), min_length: 1, max_length: 6)) do + # `dmsetup ls` rows are "\t(major:minor)"; order is preserved. + out = Enum.map_join(names, "\n", fn n -> "#{n}\t(254:0)" end) + assert Dmsetup.parse_names(out) == names + end + end + + test "parse_names reads the empty-table sentinel as no devices" do + assert Dmsetup.parse_names("No devices found\n") == [] + assert Dmsetup.parse_names("") == [] + end end diff --git a/test/hyper/suid_helper/losetup_test.exs b/test/hyper/suid_helper/losetup_test.exs new file mode 100644 index 00000000..4cdc6f0f --- /dev/null +++ b/test/hyper/suid_helper/losetup_test.exs @@ -0,0 +1,37 @@ +defmodule Hyper.SuidHelper.LosetupTest do + @moduledoc """ + `parse_list/1` turns `losetup --list --output NAME,BACK-FILE` rows into + `{device, backing_file}` pairs. The reclaim pass relies on it to recognise loop + devices backing Hyper's files, so the edges that matter are: a loop with no + backing file (skipped, nothing to reclaim by file) and a `(deleted)` backing + suffix (kept, so the data-dir prefix still matches). 
+ """ + use ExUnit.Case, async: true + + alias Hyper.SuidHelper.Losetup + + test "pairs device with backing file and skips rows that have no backing file" do + out = """ + /dev/loop0 /srv/hyper/scratch/thinpool.meta + /dev/loop1 /srv/hyper/layers/blob + /dev/loop2 + """ + + assert Losetup.parse_list(out) == [ + {"/dev/loop0", "/srv/hyper/scratch/thinpool.meta"}, + {"/dev/loop1", "/srv/hyper/layers/blob"} + ] + end + + test "keeps a `(deleted)` backing suffix so data-dir prefix matching still works" do + out = "/dev/loop0 /srv/hyper/scratch/thinpool.data (deleted)\n" + + assert Losetup.parse_list(out) == [ + {"/dev/loop0", "/srv/hyper/scratch/thinpool.data (deleted)"} + ] + end + + test "an empty listing yields no pairs" do + assert Losetup.parse_list("") == [] + end +end From b6ce604a0e2231a4e04fd50b665c03287983fa1b Mon Sep 17 00:00:00 2001 From: Marko Vejnovic Date: Thu, 25 Jun 2026 23:07:40 +0000 Subject: [PATCH 18/46] feat(suidhelper): add firecracker/jailer/uid_gid_range to config MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Extends the suidhelper config with four new fields needed before the jailer exec path can be wired up: - firecracker / jailer: Option<PathBuf> — no built-in default; accessors return BinError::Unconfigured when absent so callers get a clear "operator must configure this" signal rather than a path-validation error. - parent_cgroup: String — defaults to "hyper", matching Elixir's @parent_cgroup (operators changing it must keep both in sync). - uid_gid_range: Option<UidGidRange> — total accessor returning (900_000, 999_999) when absent; a present range with min==0 or min>max is fatal at safe_load time, consistent with the "present-but-untrusted is fatal" trust model. Adds BinError (Unconfigured | Bin) so firecracker/jailer accessors have a richer error type than the existing tool accessors (which can never be Unconfigured). 
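
For orientation, the uid_gid_range rule from the list above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the helper's code: `RangeError` and `effective_range` are hypothetical names made up for the sketch, and the real load-time check and accessor live in native/suidhelper/src/config.rs.

```rust
// Illustrative sketch only: stand-in types, not the helper's `config` module.

/// Inclusive UID/GID band, as it would appear under `[uid_gid_range]`.
#[derive(Debug, Clone, Copy, PartialEq)]
struct UidGidRange {
    min: u32,
    max: u32,
}

/// Hypothetical error type for this sketch (the patch uses `LoadingError`).
#[derive(Debug, PartialEq)]
enum RangeError {
    Bad { min: u32, max: u32 },
}

/// Defaults used when the key is absent from the config file.
const DEFAULT_UID_GID: (u32, u32) = (900_000, 999_999);

/// Reject a band that includes uid 0 (root) or whose bounds are inverted.
fn validate(r: &UidGidRange) -> Result<(), RangeError> {
    if r.min == 0 || r.min > r.max {
        return Err(RangeError::Bad { min: r.min, max: r.max });
    }
    Ok(())
}

/// Hypothetical helper: absent means "use the defaults", present means
/// "validate first, then use it". A present-but-invalid band is an error.
fn effective_range(configured: Option<UidGidRange>) -> Result<(u32, u32), RangeError> {
    match configured {
        None => Ok(DEFAULT_UID_GID),
        Some(r) => {
            validate(&r)?;
            Ok((r.min, r.max))
        }
    }
}
```

Doing the check once at load time is what allows the accessor to stay total: after a config has loaded successfully, callers reading the band never have to handle an error.
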
Adds validate_uid_gid_range as a public pure function so the refusal contract can be property-tested without touching the file system. New test target config_uid_gid_range exercises: absent → default, valid range round-trips via TOML, min==0 always rejected, min>max always rejected. Extends e2e/config with: Unconfigured on absent keys, basename mismatch rejected, Ok on root-owned correct-name binaries (root-guarded), bad uid_gid_range binary exit 2 (root-guarded). --- native/suidhelper/Cargo.toml | 5 + native/suidhelper/src/config.rs | 108 +++++++++++++++++- .../suidhelper/tests/config_uid_gid_range.rs | 59 ++++++++++ native/suidhelper/tests/e2e/config.rs | 94 +++++++++++++++ 4 files changed, 263 insertions(+), 3 deletions(-) create mode 100644 native/suidhelper/tests/config_uid_gid_range.rs diff --git a/native/suidhelper/Cargo.toml b/native/suidhelper/Cargo.toml index f5b4b1b7..550ba2db 100644 --- a/native/suidhelper/Cargo.toml +++ b/native/suidhelper/Cargo.toml @@ -72,6 +72,10 @@ path = "tests/e2e/argv.rs" name = "e2e_chroot_jail" path = "tests/e2e/chroot_jail.rs" +[[test]] +name = "config_uid_gid_range" +path = "tests/config_uid_gid_range.rs" + [dependencies] clap = { version = "4", features = ["derive"] } hyper-suidhelper-meta = { path = "meta" } @@ -84,6 +88,7 @@ toml = { version = "0.8", default-features = false, features = ["parse"] } [dev-dependencies] proptest = "1" tempfile = "3" +toml = { version = "0.8", default-features = false, features = ["parse"] } [profile.release] strip = true diff --git a/native/suidhelper/src/config.rs b/native/suidhelper/src/config.rs index 42079a11..76a7f221 100644 --- a/native/suidhelper/src/config.rs +++ b/native/suidhelper/src/config.rs @@ -1,5 +1,12 @@ // SPDX-License-Identifier: AGPL-3.0-only //! Runtime host configuration, read from a single root-owned TOML file. +//! +//! ## UID/GID range divergence +//! +//! Elixir keeps `compile_env` default `{900_000, 999_999}` that governs which +//! 
UIDs the node hands *out*; this helper reads `[uid_gid_range]` from +//! config.toml to decide which UIDs it *accepts* (default `{900_000, 999_999}` +//! when the key is absent). Operators narrowing the range must set **both**. use crate::util::safe_bin::{self, SafeBin}; use crate::util::safe_file::{self, IsRegularFile, OnlyRootWritable, RootOwner, SafeFile}; @@ -22,11 +29,48 @@ pub enum LoadingError { Malformed(PathBuf), #[error("work_dir in {0:?} must be an absolute path")] Relative(PathBuf), + #[error("uid_gid_range.min must be >= 1 and <= max (got min={min}, max={max})")] + BadUidGidRange { min: u32, max: u32 }, +} + +/// Error returned by config accessors for tool binaries derived from config. +#[derive(Debug, Error)] +pub enum BinError { + #[error("required binary `{0}` is not configured in /etc/hyper/config.toml")] + Unconfigured(&'static str), + #[error(transparent)] + Bin(#[from] safe_bin::Error), } const CONFIG_PATHSTR: &str = "/etc/hyper/config.toml"; const INSECURE_CONFIG_PATH_ENV: &str = "HYPER_SETUIDHELPER_CONFIG_PATH"; +/// UID/GID allocation band, read from `[uid_gid_range]` in config.toml. +/// Controls which UIDs the helper *accepts* from the BEAM — see module docs. +#[derive(Debug, Clone, Copy, Deserialize)] +pub struct UidGidRange { + pub min: u32, + pub max: u32, +} + +// Band defaults match Elixir's `compile_env` allocation defaults so that an +// unconfigured helper and an unconfigured node agree out of the box. +const DEFAULT_UID_GID: (u32, u32) = (900_000, 999_999); + +/// Validate a uid_gid_range value. A present range where min==0 or min>max is +/// treated as a config trust violation — fatal at load, consistent with the +/// "present but untrusted" model. Exposed so tests can verify the contract +/// without touching the file system. 
+pub fn validate_uid_gid_range(r: &UidGidRange) -> Result<(), LoadingError> { + if r.min == 0 || r.min > r.max { + return Err(LoadingError::BadUidGidRange { + min: r.min, + max: r.max, + }); + } + Ok(()) +} + /// The config file path. In production this is the fixed `/etc/hyper/config.toml`. /// Only in INSECURE TEST MODE (both gates open) may an env var redirect it — the /// secure arm is always the hardcoded path, so a release build cannot be steered. @@ -51,6 +95,14 @@ pub struct Config { losetup: PathBuf, #[serde(default = "default_blockdev")] blockdev: PathBuf, + #[serde(default)] + firecracker: Option<PathBuf>, + #[serde(default)] + jailer: Option<PathBuf>, + #[serde(default = "default_parent_cgroup")] + parent_cgroup: String, + #[serde(default)] + uid_gid_range: Option<UidGidRange>, } // The default data root. Must match the Elixir node's `@dev_work_dir`, which it @@ -72,6 +124,11 @@ fn default_blockdev() -> PathBuf { PathBuf::from("/usr/sbin/blockdev") } +fn default_parent_cgroup() -> String { + // Must match Elixir node's `@parent_cgroup`; operators need to keep them in sync. + "hyper".into() +} + impl Default for Config { fn default() -> Self { Self { @@ -79,6 +136,10 @@ impl Default for Config { dmsetup: default_dmsetup(), losetup: default_losetup(), blockdev: default_blockdev(), + firecracker: None, + jailer: None, + parent_cgroup: default_parent_cgroup(), + uid_gid_range: None, } } } @@ -86,7 +147,7 @@ impl Default for Config { impl Config { /// The process-wide config, loaded once (and forced unprivileged by /// [`Config::init`]). An absent file yields the built-in defaults; a - /// *present but untrusted* file (wrong owner/mode, malformed) is fatal - + /// *present but untrusted* file (wrong owner/mode, malformed) is fatal — /// the helper prints the error and exits rather than trusting it. pub fn get() -> &'static Config { LazyLock::force(&CONFIG) @@ -95,7 +156,7 @@ impl Config { /// Force the config to load now. 
Call this once at the very start of `main`, /// after privileges have already been dropped (the `.preinit_array` entry in /// `setuid_privileged` runs before `main`), so the file is never first read - /// lazily from inside a `Privileged` scope - i.e. it is guaranteed to be read + /// lazily from inside a `Privileged` scope — i.e. it is guaranteed to be read /// as the real uid, not as root. pub fn init() { let _ = Self::get(); @@ -126,6 +187,42 @@ impl Config { SafeBin::from_path(&self.blockdev) } + /// The Firecracker VMM binary, validated as root-owned and correctly named. + /// Errors [`BinError::Unconfigured`] when absent from config — an operator + /// must set this key before any VM can be launched. + pub fn firecracker(&self) -> Result, BinError> { + self.firecracker + .as_deref() + .ok_or(BinError::Unconfigured("firecracker")) + .and_then(|p| SafeBin::from_path(p).map_err(BinError::Bin)) + } + + /// The Firecracker jailer binary, validated as root-owned and correctly named. + /// Errors [`BinError::Unconfigured`] when absent from config — an operator + /// must set this key before any VM can be launched. + pub fn jailer(&self) -> Result, BinError> { + self.jailer + .as_deref() + .ok_or(BinError::Unconfigured("jailer")) + .and_then(|p| SafeBin::from_path(p).map_err(BinError::Bin)) + } + + /// The jailer `--parent-cgroup` value. Defaults to `"hyper"`, matching the + /// Elixir node's `@parent_cgroup`. + pub fn parent_cgroup(&self) -> &str { + &self.parent_cgroup + } + + /// The UID/GID band the helper accepts from the BEAM. Defaults to + /// `(900_000, 999_999)` when the key is absent (matching Elixir's defaults). + /// A present range with min==0 or min>max is rejected at load time by + /// [`Config::safe_load`], so this accessor is always total. + pub fn uid_gid_range(&self) -> (u32, u32) { + self.uid_gid_range + .map(|r| (r.min, r.max)) + .unwrap_or(DEFAULT_UID_GID) + } + /// Read, ownership-check, parse, and validate the config file. 
See the module /// docs for the trust model. pub fn safe_load() -> Result { @@ -138,7 +235,7 @@ impl Config { Ok(file) => file, // A genuinely-absent file means "use the built-in defaults": those // are compiled into this root-owned binary, so they are trusted. Any - // OTHER failure - a present but wrong-owner/mode file, an I/O error - + // OTHER failure — a present but wrong-owner/mode file, an I/O error — // stays fatal, because it is a signal (someone put an untrusted file // there), not an absence. Err(safe_file::ValidationError::Open(nix::errno::Errno::ENOENT)) => { @@ -155,6 +252,11 @@ impl Config { if !config.work_dir.is_absolute() { return Err(LoadingError::Relative(path)); } + + if let Some(r) = &config.uid_gid_range { + validate_uid_gid_range(r)?; + } + Ok(config) } } diff --git a/native/suidhelper/tests/config_uid_gid_range.rs b/native/suidhelper/tests/config_uid_gid_range.rs new file mode 100644 index 00000000..1f176e9f --- /dev/null +++ b/native/suidhelper/tests/config_uid_gid_range.rs @@ -0,0 +1,59 @@ +//! Properties of the `uid_gid_range` configuration field. +//! +//! Contracts under test: +//! - A valid range (min >= 1, min <= max) is always accepted. +//! - Absent range yields the built-in default (900_000, 999_999). +//! - A valid range round-trips through TOML deserialization + uid_gid_range(). +//! - min == 0 is always rejected (uid 0 means root; the jailer must never +//! receive it — it skips its privilege drop when uid == 0). +//! - min > max is always rejected (incoherent range; likely a config typo). + +use hyper_suidhelper::config::{validate_uid_gid_range, Config, LoadingError, UidGidRange}; +use proptest::prelude::*; + +#[test] +fn absent_range_yields_default() { + assert_eq!(Config::default().uid_gid_range(), (900_000, 999_999)); +} + +proptest! { + #[test] + fn valid_range_accepted(min in 1u32.., delta in 0u32..) { + // max = min + delta, saturating so it never wraps past u32::MAX. 
+ let max = min.saturating_add(delta); + let r = UidGidRange { min, max }; + prop_assert!(validate_uid_gid_range(&r).is_ok()); + } + + #[test] + fn valid_range_round_trips_via_toml(min in 1u32.., delta in 0u32..) { + let max = min.saturating_add(delta); + let body = format!( + "work_dir = \"/srv/hyper\"\n[uid_gid_range]\nmin = {min}\nmax = {max}\n" + ); + let config: Config = toml::from_str(&body).expect("valid TOML"); + prop_assert_eq!(config.uid_gid_range(), (min, max)); + } + + #[test] + fn zero_min_always_rejected(max in 0u32..) { + let r = UidGidRange { min: 0, max }; + let rejected = matches!( + validate_uid_gid_range(&r), + Err(LoadingError::BadUidGidRange { min: 0, .. }) + ); + prop_assert!(rejected); + } + + #[test] + fn min_exceeds_max_always_rejected(max in 0u32..u32::MAX) { + // min = max + 1 is always strictly greater than max and always >= 1. + let min = max + 1; + let r = UidGidRange { min, max }; + let rejected = matches!( + validate_uid_gid_range(&r), + Err(LoadingError::BadUidGidRange { .. }) + ); + prop_assert!(rejected); + } +} diff --git a/native/suidhelper/tests/e2e/config.rs b/native/suidhelper/tests/e2e/config.rs index 100a686a..ef57c97b 100644 --- a/native/suidhelper/tests/e2e/config.rs +++ b/native/suidhelper/tests/e2e/config.rs @@ -3,6 +3,8 @@ //! tempfile instead of /etc/hyper, so tests never touch the real host. #![cfg(feature = "insecure_test_seams")] +use hyper_suidhelper::config::{BinError, Config}; +use hyper_suidhelper::util::safe_bin; use std::fs; use std::os::unix::fs::PermissionsExt; use std::path::Path; @@ -126,3 +128,95 @@ fn valid_config_and_setuid_yields_sys_test_ok_as_root() { assert_eq!(json["sys_test"], "ok"); assert_eq!(json["hyper_base"], "/srv/hyper"); } + +#[test] +fn firecracker_unconfigured_when_absent() { + // Config::default() has firecracker == None; the accessor must signal this + // distinctly so callers can report a missing-configuration error rather than + // a safe_bin validation error. 
+ let err = Config::default() + .firecracker() + .expect_err("absent firecracker must return Unconfigured"); + assert!( + matches!(err, BinError::Unconfigured("firecracker")), + "expected Unconfigured(\"firecracker\"), got {err:?}", + ); +} + +#[test] +fn jailer_unconfigured_when_absent() { + let err = Config::default() + .jailer() + .expect_err("absent jailer must return Unconfigured"); + assert!( + matches!(err, BinError::Unconfigured("jailer")), + "expected Unconfigured(\"jailer\"), got {err:?}", + ); +} + +#[test] +fn jailer_basename_mismatch_rejected() { + // The basename check in SafeBin::from_path precedes the stat, so we do not + // need a real file — any absolute path with the wrong leaf name is enough. + let body = "work_dir = \"/srv/hyper\"\njailer = \"/usr/local/bin/not-jailer\"\n"; + let config: Config = toml::from_str(body).unwrap(); + let err = config + .jailer() + .expect_err("wrong-basename jailer path must be rejected"); + assert!( + matches!(err, BinError::Bin(safe_bin::Error::Name { .. })), + "expected a Name error, got {err:?}", + ); +} + +#[test] +fn firecracker_and_jailer_return_ok_when_root_owned_as_root() { + if !is_root() { + eprintln!("SKIP firecracker_jailer_configured: needs root to create root-owned binaries"); + return; + } + let tmp = tempfile::tempdir().unwrap(); + let fc = tmp.path().join("firecracker"); + let jr = tmp.path().join("jailer"); + // 0o755: root-owned, not group/other-writable — satisfies SafeBin's checks. 
+ for p in [&fc, &jr] { + fs::write(p, b"#!/bin/sh\n").unwrap(); + fs::set_permissions(p, fs::Permissions::from_mode(0o755)).unwrap(); + } + let body = format!( + "work_dir = \"/srv/hyper\"\nfirecracker = \"{}\"\njailer = \"{}\"\n", + fc.display(), + jr.display(), + ); + let config: Config = toml::from_str(&body).unwrap(); + assert!( + config.firecracker().is_ok(), + "root-owned firecracker with correct basename must be accepted" + ); + assert!( + config.jailer().is_ok(), + "root-owned jailer with correct basename must be accepted" + ); +} + +#[test] +fn bad_uid_gid_range_exits_2_as_root() { + if !is_root() { + eprintln!("SKIP bad_uid_gid_range: needs root to own the config file"); + return; + } + let tmp = tempfile::tempdir().unwrap(); + // min = 0 is the clearest violation: uid 0 is root, which the jailer must + // never receive because it skips its privilege drop when uid == 0. + let p = write_root_config( + tmp.path(), + "work_dir = \"/srv/hyper\"\n[uid_gid_range]\nmin = 0\nmax = 100\n", + ); + let out = run_with_config(&p, &["sys-test"]); + assert_eq!(out.status.code(), Some(2)); + let err = String::from_utf8_lossy(&out.stderr); + assert!( + err.contains("uid_gid_range"), + "expected a uid_gid_range error in stderr, got: {err}", + ); +} From 8f671a34906f32a1990d0ea1bb46c794c541415b Mon Sep 17 00:00:00 2001 From: Marko Vejnovic Date: Thu, 25 Jun 2026 23:22:41 +0000 Subject: [PATCH 19/46] feat(suidhelper): add jailer subcommand that execs the jailer as root --- native/suidhelper/Cargo.toml | 10 +- native/suidhelper/src/main.rs | 13 +- native/suidhelper/src/tools/jailer.rs | 357 ++++++++++++++++++ native/suidhelper/src/tools/mod.rs | 1 + .../suidhelper/src/util/setuid_privileged.rs | 39 +- native/suidhelper/tests/e2e/jailer.rs | 193 ++++++++++ native/suidhelper/tests/tools/jailer.rs | 162 ++++++++ 7 files changed, 770 insertions(+), 5 deletions(-) create mode 100644 native/suidhelper/src/tools/jailer.rs create mode 100644 native/suidhelper/tests/e2e/jailer.rs 
create mode 100644 native/suidhelper/tests/tools/jailer.rs diff --git a/native/suidhelper/Cargo.toml b/native/suidhelper/Cargo.toml index 550ba2db..9b2b6cbe 100644 --- a/native/suidhelper/Cargo.toml +++ b/native/suidhelper/Cargo.toml @@ -48,6 +48,10 @@ path = "tests/util/safe_bin.rs" name = "tools_dmsetup_parsers" path = "tests/tools/dmsetup_parsers.rs" +[[test]] +name = "tools_jailer" +path = "tests/tools/jailer.rs" + [[test]] name = "util_confinement" path = "tests/util/confinement.rs" @@ -68,6 +72,10 @@ path = "tests/e2e/config.rs" name = "e2e_argv" path = "tests/e2e/argv.rs" +[[test]] +name = "e2e_jailer" +path = "tests/e2e/jailer.rs" + [[test]] name = "e2e_chroot_jail" path = "tests/e2e/chroot_jail.rs" @@ -79,7 +87,7 @@ path = "tests/config_uid_gid_range.rs" [dependencies] clap = { version = "4", features = ["derive"] } hyper-suidhelper-meta = { path = "meta" } -nix = { version = "0.29", features = ["user", "fs", "dir"] } +nix = { version = "0.29", features = ["user", "fs", "dir", "process"] } serde = { version = "1", features = ["derive"] } serde_json = "1" thiserror = "1" diff --git a/native/suidhelper/src/main.rs b/native/suidhelper/src/main.rs index 617fd6d5..665370bc 100644 --- a/native/suidhelper/src/main.rs +++ b/native/suidhelper/src/main.rs @@ -15,6 +15,7 @@ use clap::{Parser, Subcommand}; use hyper_suidhelper::config; +use hyper_suidhelper::tools::jailer::{self, JailerArgs}; use hyper_suidhelper::tools::Tool; use hyper_suidhelper::util::setuid_privileged::{self, Privileged}; use serde::Serialize; @@ -37,6 +38,9 @@ enum Command { Tool(Tool), /// Check the helper is correctly installed (can promote to root). SysTest, + /// Validate the caller's args, become root, and `execve` the firecracker + /// jailer in our place. Prints nothing on success (the image is replaced). + Jailer(JailerArgs), /// Print the build version and BLAKE3 checksum of this binary. 
Version, } @@ -101,12 +105,19 @@ fn main() { config::Config::init(); // Each command yields a serializable value (errors stringified to unify); we - // render the final JSON line here. + // render the final JSON line here. The jailer is the exception: it `execve`s + // in place, so on success it never returns and never emits JSON, and on + // failure it reports to stderr and exits before reaching the JSON pipeline. let output = match command { Command::Tool(tool) => tool.run().map(Output::Tool).map_err(|e| e.to_string()), Command::SysTest => SysTest::perform() .map(Output::SysTest) .map_err(|e| e.to_string()), + Command::Jailer(args) => { + let e = jailer::run(args).expect_err("jailer::run only returns on error"); + eprintln!("{e}"); + std::process::exit(2); + } Command::Version => unreachable!("handled above"), }; diff --git a/native/suidhelper/src/tools/jailer.rs b/native/suidhelper/src/tools/jailer.rs new file mode 100644 index 00000000..d816de11 --- /dev/null +++ b/native/suidhelper/src/tools/jailer.rs @@ -0,0 +1,357 @@ +// SPDX-License-Identifier: AGPL-3.0-only +//! The `jailer` subcommand: validate the BEAM's arguments, re-acquire root +//! permanently, and `execve` the firecracker jailer in our place. +//! +//! Unlike the device tools this is **not** an [`crate::tools::IsTool`]: it does +//! not run a child and parse JSON, it *becomes* the jailer via `execve`, so the +//! unprivileged BEAM's MuonTrap port keeps supervising the resulting process +//! across the image replacement. There is no output and no return on success. +//! +//! Threat model: the BEAM is untrusted. It supplies only `--id`, `--uid`, +//! `--gid`, repeated `--cgroup KEY=VALUE`, and `--api-sock`. Every privileged +//! path (the jailer binary, the firecracker `--exec-file`, the chroot base, the +//! parent cgroup) comes from the root-owned config, never the caller. The +//! refusal contracts on the newtypes below are the security core: a compromised +//! 
BEAM must not be able to name a privileged path, request uid 0, traverse out +//! of the chroot base, inject a flag, or smuggle an environment/fd into root. +//! +//! ## Validator laws (property-tested in `tests/tools/jailer.rs`) +//! - [`validate_id_number`] accepts iff `n != 0 && lo <= n <= hi`; 0 is rejected +//! for *every* range (uid 0 makes the jailer skip its privilege drop). +//! - [`VmId`] round-trips exactly the allowed charset/length and rejects any +//! separator, dot, NUL, whitespace, leading dash, empty, or over-long input. +//! - [`CgroupSetting`] re-renders a valid pair to its canonical `key=value` and +//! rejects unknown keys and values outside the per-key grammar. +//! - [`JailSock`] accepts exactly `/` + one filename and rejects multi-component, +//! relative, `..`, and NUL/whitespace inputs. + +use crate::config::{BinError, Config}; +use crate::util::setuid_privileged; +use clap::Args; +use nix::errno::Errno; +use std::ffi::CString; +use std::fmt; +use std::os::unix::ffi::OsStrExt; +use std::path::{Path, PathBuf}; +use std::str::FromStr; +use thiserror::Error as ThisError; + +/// The jailer protects at most a handful of controllers; an unbounded `--cgroup` +/// list is a caller trying something. Cap it well above any legitimate need. +const MAX_CGROUPS: usize = 16; + +/// `sun_path` in `sockaddr_un` is 108 bytes on Linux; a socket path longer than +/// that can never be bound, so reject it up front. 
+const MAX_SOCK_LEN: usize = 108; + +#[derive(Debug, ThisError)] +pub enum Error { + #[error("invalid --id {0:?}: must be 1..=64 chars of [A-Za-z0-9_-], not starting with '-'")] + VmId(String), + #[error("invalid --cgroup {0:?}: unknown key or value outside its grammar")] + Cgroup(String), + #[error("invalid --api-sock {0:?}: must be /<name> with name in [A-Za-z0-9_.-]")] + Sock(String), + #[error("--uid/--gid {value} is zero or outside the configured range [{lo}, {hi}]")] + IdNumber { value: u32, lo: u32, hi: u32 }, + #[error("too many --cgroup settings: {0} (max {MAX_CGROUPS})")] + TooManyCgroups(usize), + #[error(transparent)] + Bin(#[from] BinError), + #[error(transparent)] + Privilege(#[from] setuid_privileged::Error), + #[error("argument contains an interior NUL byte")] + NulArgument, + #[error("execve {path:?} failed: {errno}")] + Exec { path: PathBuf, errno: Errno }, +} + +/// `n != 0 && lo <= n <= hi`. uid/gid 0 is rejected unconditionally: a jailer run +/// with uid 0 skips its privilege drop and leaves firecracker running as root. +pub fn validate_id_number(n: u32, range: (u32, u32)) -> Result<u32, Error> { + let (lo, hi) = range; + if n != 0 && lo <= n && n <= hi { + Ok(n) + } else { + Err(Error::IdNumber { value: n, lo, hi }) + } +} + +/// A VM id used as a chroot subdirectory name: `[A-Za-z0-9_-]`, length `1..=64`, +/// first character not `-` (so it can never be read as a flag). No `/`, `.`, NUL, +/// or whitespace can appear, so it can never traverse out of the chroot base.
+#[derive(Debug, Clone)] +pub struct VmId(String); + +impl FromStr for VmId { + type Err = Error; + + fn from_str(s: &str) -> Result<Self, Self::Err> { + let reject = || Error::VmId(s.to_string()); + let bytes = s.as_bytes(); + if !(1..=64).contains(&bytes.len()) { + return Err(reject()); + } + if bytes[0] == b'-' { + return Err(reject()); + } + if !bytes + .iter() + .all(|&b| b.is_ascii_alphanumeric() || b == b'_' || b == b'-') + { + return Err(reject()); + } + Ok(Self(s.to_string())) + } +} + +impl fmt::Display for VmId { + fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { + f.write_str(&self.0) + } +} + +/// `1..=20` ASCII digits — bounds a numeric cgroup limit without pulling in a +/// regex engine. The upper bound comfortably exceeds `u64::MAX`'s 20 digits. +fn is_digits_1_20(s: &str) -> bool { + !s.is_empty() && s.len() <= 20 && s.bytes().all(|b| b.is_ascii_digit()) +} + +/// A single `KEY=VALUE` cgroup setting from an allowlist. The helper re-emits +/// `key=value` itself from the canonical key, so the jailer never sees the +/// caller's raw bytes. Per-key value grammar: +/// - `memory.max` : `[0-9]{1,20}` or the literal `max` +/// - `cpu.max` : `[0-9]{1,20} [0-9]{1,20}` or `max [0-9]{1,20}` +#[derive(Debug, Clone)] +pub struct CgroupSetting { + key: &'static str, + value: String, +} + +impl FromStr for CgroupSetting { + type Err = Error; + + fn from_str(s: &str) -> Result<Self, Self::Err> { + let reject = || Error::Cgroup(s.to_string()); + // Split on the FIRST `=`. None of the value grammars contains a `=`, so a + // second `=` lands in `value` and is rejected by the grammar check below.
+ let (raw_key, value) = s.split_once('=').ok_or_else(reject)?; + + let key: &'static str = match raw_key { + "memory.max" => "memory.max", + "cpu.max" => "cpu.max", + _ => return Err(reject()), + }; + + let valid = match key { + "memory.max" => value == "max" || is_digits_1_20(value), + "cpu.max" => match value.split_once(' ') { + Some((quota, period)) => { + (quota == "max" || is_digits_1_20(quota)) && is_digits_1_20(period) + } + None => false, + }, + _ => false, + }; + + if valid { + Ok(Self { + key, + value: value.to_string(), + }) + } else { + Err(reject()) + } + } +} + +impl fmt::Display for CgroupSetting { + fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { + write!(f, "{}={}", self.key, self.value) + } +} + +/// The firecracker API socket path: an absolute path that is exactly `/` plus one +/// filename in `[A-Za-z0-9_.-]`. The charset excludes `/`, so the value is always +/// a direct child of `/` with no extra components and no traversal; `.`/`..` as +/// the whole filename are rejected explicitly. +#[derive(Debug, Clone)] +pub struct JailSock(String); + +impl FromStr for JailSock { + type Err = Error; + + fn from_str(s: &str) -> Result<Self, Self::Err> { + let reject = || Error::Sock(s.to_string()); + if s.len() > MAX_SOCK_LEN { + return Err(reject()); + } + let name = s.strip_prefix('/').ok_or_else(reject)?; + if name.is_empty() || name == "." || name == ".." { + return Err(reject()); + } + if !name + .bytes() + .all(|b| b.is_ascii_alphanumeric() || b == b'_' || b == b'.' || b == b'-') + { + return Err(reject()); + } + Ok(Self(s.to_string())) + } +} + +impl fmt::Display for JailSock { + fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { + f.write_str(&self.0) + } +} + +#[derive(Args)] +pub struct JailerArgs { + /// Microvm id; becomes the chroot subdirectory name. + #[arg(long)] + id: VmId, + /// Unprivileged uid the jailer drops to (rejected if 0 or out of range).
+ #[arg(long)] + uid: u32, + /// Unprivileged gid the jailer drops to (rejected if 0 or out of range). + #[arg(long)] + gid: u32, + /// Repeatable `KEY=VALUE` cgroup setting from the allowlist. + #[arg(long = "cgroup")] + cgroup: Vec<CgroupSetting>, + /// Absolute firecracker API socket path (single filename under `/`). + #[arg(long = "api-sock")] + api_sock: JailSock, +} + +fn cstr_bytes(bytes: &[u8]) -> Result<CString, Error> { + CString::new(bytes).map_err(|_| Error::NulArgument) +} + +fn cstr_str(s: &str) -> Result<CString, Error> { + cstr_bytes(s.as_bytes()) +} + +fn cstr_path(p: &Path) -> Result<CString, Error> { + cstr_bytes(p.as_os_str().as_bytes()) +} + +/// Build the exact argv handed to the jailer. argv[0] is the jailer path. The +/// caller never names the jailer, the `--exec-file`, the chroot base, the cgroup +/// version, or the parent cgroup — those are derived from trusted config here. +#[allow(clippy::too_many_arguments)] +fn build_argv( + jailer: &Path, + id: &VmId, + firecracker: &Path, + uid: u32, + gid: u32, + jail_base: &Path, + parent_cgroup: &str, + cgroups: &[CgroupSetting], + api_sock: &JailSock, +) -> Result<Vec<CString>, Error> { + let mut argv = vec![ + cstr_path(jailer)?, + cstr_str("--id")?, + cstr_str(&id.to_string())?, + cstr_str("--exec-file")?, + cstr_path(firecracker)?, + cstr_str("--uid")?, + cstr_str(&uid.to_string())?, + cstr_str("--gid")?, + cstr_str(&gid.to_string())?, + cstr_str("--chroot-base-dir")?, + cstr_path(jail_base)?, + cstr_str("--cgroup-version")?, + cstr_str("2")?, + cstr_str("--parent-cgroup")?, + cstr_str(parent_cgroup)?, + ]; + + for cg in cgroups { + argv.push(cstr_str("--cgroup")?); + argv.push(cstr_str(&cg.to_string())?); + } + + argv.push(cstr_str("--")?); + argv.push(cstr_str("--api-sock")?); + argv.push(cstr_str(&api_sock.to_string())?); + + Ok(argv) +} + +/// Close every inherited fd above stdio so a compromised BEAM cannot smuggle an +/// open fd into the root jailer. Keep 0/1/2: MuonTrap supervises the jailer +/// through stdio, and stderr carries our exec-failure message.
`close_range(2)` +/// needs Linux 5.9+; on ENOSYS we fall back to closing each fd up to the limit. +fn close_inherited_fds() { + const FIRST: u32 = 3; + // SAFETY: raw syscall with no memory operands; closing fds has no UB. + let rc = unsafe { nix::libc::close_range(FIRST, u32::MAX, 0) }; + if rc == 0 || Errno::last() != Errno::ENOSYS { + return; + } + + // SAFETY: sysconf is a pure query of a system limit. + let max = unsafe { nix::libc::sysconf(nix::libc::_SC_OPEN_MAX) }; + let max = if max < 0 { 4096 } else { max as i32 }; + for fd in (FIRST as i32)..max { + // SAFETY: closing an arbitrary fd is safe; EBADF for unused fds is ignored. + unsafe { + nix::libc::close(fd); + } + } +} + +/// Validate the caller's args, then permanently become root and `execve` the +/// jailer. On success this never returns (the process image is replaced); the +/// `Ok` arm is therefore [`std::convert::Infallible`]. Every failure is returned +/// as [`Error`] for the caller to print and exit non-zero. +pub fn run(args: JailerArgs) -> Result<std::convert::Infallible, Error> { + let config = Config::get(); + + // Resolve everything that can fail as the REAL uid, before any privilege is + // raised: config accessors, binary validation, range, and arg validation. + let jailer: PathBuf = config.jailer()?.into(); + let firecracker: PathBuf = config.firecracker()?.into(); + let jail_base = config.jail_base(); + let parent_cgroup = config.parent_cgroup(); + let range = config.uid_gid_range(); + + let uid = validate_id_number(args.uid, range)?; + let gid = validate_id_number(args.gid, range)?; + + if args.cgroup.len() > MAX_CGROUPS { + return Err(Error::TooManyCgroups(args.cgroup.len())); + } + + let argv = build_argv( + &jailer, + &args.id, + &firecracker, + uid, + gid, + &jail_base, + parent_cgroup, + &args.cgroup, + &args.api_sock, + )?; + let jailer_cstr = cstr_path(&jailer)?; + + // Point of no return: from here we are permanently root and must execve.
+ setuid_privileged::become_root_permanently()?; + close_inherited_fds(); + + // Empty envp: once ruid==0 the kernel clears AT_SECURE, so a smuggled + // LD_PRELOAD/LD_LIBRARY_PATH would be honored by the dynamic loader and + // hijack the root jailer. We pass nothing and let the jailer build its own. + let empty_env: [CString; 0] = []; + let errno = nix::unistd::execve(&jailer_cstr, &argv, &empty_env) + .expect_err("execve only returns on failure"); + Err(Error::Exec { + path: jailer, + errno, + }) +} diff --git a/native/suidhelper/src/tools/mod.rs b/native/suidhelper/src/tools/mod.rs index 24a2be42..9f9fedae 100644 --- a/native/suidhelper/src/tools/mod.rs +++ b/native/suidhelper/src/tools/mod.rs @@ -7,6 +7,7 @@ mod blockdev; pub mod chroot_jail; mod dmsetup; +pub mod jailer; mod losetup; pub use blockdev::{Blockdev, BlockdevArgs}; diff --git a/native/suidhelper/src/util/setuid_privileged.rs b/native/suidhelper/src/util/setuid_privileged.rs index 3c58bd39..e334b115 100644 --- a/native/suidhelper/src/util/setuid_privileged.rs +++ b/native/suidhelper/src/util/setuid_privileged.rs @@ -12,17 +12,20 @@ //! constructor nor a Drop impl can report an error, and silently staying root //! would defeat the point. -use nix::unistd::{getuid, seteuid, Uid}; +use nix::unistd::{getuid, seteuid, setgroups, setresgid, setresuid, Gid, Uid}; use thiserror::Error as ThisError; -/// Failures of the privilege transitions. Both are fatal: if we can't raise we -/// aren't installed setuid root, and if we can't lower we refuse to keep running. +/// Failures of the privilege transitions. All fatal: if we can't raise we aren't +/// installed setuid root, if we can't lower we refuse to keep running, and if we +/// can't seal permanent root we refuse to hand off to the execve target. 
#[derive(Debug, ThisError)] pub enum Error { #[error("not installed setuid root")] NotSetuidRoot, #[error("failed to drop privileges")] DropPrivileges, + #[error("failed to acquire permanent root for execve handoff")] + PermanentRoot, } /// `.preinit_array` runs before `.init_array` and before any shared-library @@ -76,6 +79,36 @@ impl Privileged { } } +/// Re-acquire full root **permanently** for an `execve` handoff and return — +/// there is deliberately no [`Privileged`] Drop guard, because `execve` replaces +/// the entire process image: nothing of this process survives to run a destructor, +/// and the new image (the firecracker jailer) is responsible for dropping its own +/// privileges. We must hand it a *genuine* root process (all of real, effective +/// and saved uid == 0) so the jailer's own privilege-drop is the real thing. +/// +/// Order matters and each step needs the privilege the previous one preserves: +/// 1. `seteuid(0)` — regain effective root; without it the rest are EPERM. +/// 2. `setresgid(0,0,0)` — set every gid to root *before* touching uids, while +/// we still hold euid 0 (gid changes require privilege). +/// 3. `setgroups([0])` — `drop_to_real` only lowered euid; it never touched the +/// caller's supplementary groups, so the jailer would otherwise inherit them. +/// Reset to just {0} now, while still privileged. +/// 4. `setresuid(0,0,0)` LAST — this seals real+effective+saved uid to root. It +/// goes last because once the saved uid is 0 there is no escape hatch left +/// (which is the point: permanent), and because it must follow the gid/group +/// changes that needed our euid-0 to be permitted. 
+pub fn become_root_permanently() -> Result<(), Error> { + let root_uid = Uid::from_raw(0); + let root_gid = Gid::from_raw(0); + + seteuid(root_uid).map_err(|_| Error::NotSetuidRoot)?; + setresgid(root_gid, root_gid, root_gid).map_err(|_| Error::PermanentRoot)?; + setgroups(&[root_gid]).map_err(|_| Error::PermanentRoot)?; + setresuid(root_uid, root_uid, root_uid).map_err(|_| Error::PermanentRoot)?; + + Ok(()) +} + impl Drop for Privileged { fn drop(&mut self) { if lower().is_err() { diff --git a/native/suidhelper/tests/e2e/jailer.rs b/native/suidhelper/tests/e2e/jailer.rs new file mode 100644 index 00000000..87084aef --- /dev/null +++ b/native/suidhelper/tests/e2e/jailer.rs @@ -0,0 +1,193 @@ +//! L4 for the jailer handoff: prove the exact argv (and an EMPTY environment) the +//! helper hands to the jailer via `execve`, and that the jailer's exit status +//! propagates. We point the config's `jailer` at a root-owned recorder that +//! writes its argv and its `/proc/self/environ` to files, then exits with a known +//! code. Root-only: `become_root_permanently` requires we are already root. +#![cfg(feature = "insecure_test_seams")] + +use std::fs; +use std::os::unix::fs::PermissionsExt; +use std::path::{Path, PathBuf}; +use std::process::Command; + +const HELPER: &str = env!("CARGO_BIN_EXE_hyper-suidhelper"); +const RECORDER_EXIT: i32 = 7; + +fn is_root() -> bool { + nix::unistd::geteuid().is_root() +} + +fn cat_bin() -> &'static str { + ["/bin/cat", "/usr/bin/cat"] + .into_iter() + .find(|p| Path::new(p).exists()) + .expect("a `cat` binary for the recorder") +} + +/// Install a root-owned recorder named `jailer` that writes its argv (minus +/// argv[0]) as a JSON array to `argv_rec` and copies its `/proc/self/environ` to +/// `env_rec`, then exits `RECORDER_EXIT`. Paths are baked into the script text +/// because the recorder runs with an empty environment and absolute `cat` so it +/// needs no `PATH`. Returns the recorder's absolute path. 
+fn install_recorder(dir: &Path, argv_rec: &Path, env_rec: &Path) -> PathBuf { + let path = dir.join("jailer"); + let script = format!( + "#!/bin/sh\n\ + (\n printf '['\n sep=''\n for a in \"$@\"; do printf '%s\"%s\"' \"$sep\" \"$a\"; sep=','; done\n printf ']'\n) > '{argv}'\n\ + {cat} /proc/self/environ > '{env}'\n\ + exit {code}\n", + argv = argv_rec.display(), + cat = cat_bin(), + env = env_rec.display(), + code = RECORDER_EXIT, + ); + fs::write(&path, script).unwrap(); + fs::set_permissions(&path, fs::Permissions::from_mode(0o755)).unwrap(); + path +} + +/// A root-owned plain file with basename `firecracker` — the `--exec-file`. It is +/// never executed by us (the jailer would), only validated as a `SafeBin`. +fn install_firecracker(dir: &Path) -> PathBuf { + let path = dir.join("firecracker"); + fs::write(&path, b"#!/bin/true\n").unwrap(); + fs::set_permissions(&path, fs::Permissions::from_mode(0o644)).unwrap(); + path +} + +fn write_root_config(dir: &Path, jailer: &Path, firecracker: &Path) -> PathBuf { + let p = dir.join("config.toml"); + let body = format!( + "work_dir = \"/srv/hyper\"\njailer = \"{}\"\nfirecracker = \"{}\"\n", + jailer.display(), + firecracker.display(), + ); + fs::write(&p, body).unwrap(); + fs::set_permissions(&p, fs::Permissions::from_mode(0o644)).unwrap(); + p +} + +fn run(config: &Path, args: &[&str]) -> std::process::Output { + Command::new(HELPER) + .args(args) + .env_clear() + .env("HYPER_SETUIDHELPER_IS_INSECURE_MODE", "1") + .env("HYPER_SETUIDHELPER_CONFIG_PATH", config) + .output() + .expect("spawn helper") +} + +#[test] +fn execs_jailer_with_canonical_argv_and_empty_env_as_root() { + if !is_root() { + eprintln!("SKIP jailer exec: needs root to become_root_permanently + own the fakes"); + return; + } + let tmp = tempfile::tempdir().unwrap(); + let argv_rec = tmp.path().join("argv.json"); + let env_rec = tmp.path().join("environ.bin"); + let jailer = install_recorder(tmp.path(), &argv_rec, &env_rec); + let firecracker = 
install_firecracker(tmp.path()); + let cfg = write_root_config(tmp.path(), &jailer, &firecracker); + + let out = run( + &cfg, + &[ + "jailer", + "--id", + "vm1", + "--uid", + "900001", + "--gid", + "900002", + "--cgroup", + "memory.max=1048576", + "--cgroup", + "cpu.max=100000 100000", + "--api-sock", + "/api.sock", + ], + ); + + // The jailer's own exit status must propagate through the execve handoff. + assert_eq!( + out.status.code(), + Some(RECORDER_EXIT), + "exit status did not propagate; stderr: {}", + String::from_utf8_lossy(&out.stderr), + ); + + let argv: Vec<String> = + serde_json::from_str(&fs::read_to_string(&argv_rec).expect("recorded argv")).unwrap(); + assert_eq!( + argv, + vec![ + "--id", + "vm1", + "--exec-file", + &firecracker.to_string_lossy(), + "--uid", + "900001", + "--gid", + "900002", + "--chroot-base-dir", + "/srv/hyper/jails", + "--cgroup-version", + "2", + "--parent-cgroup", + "hyper", + "--cgroup", + "memory.max=1048576", + "--cgroup", + "cpu.max=100000 100000", + "--", + "--api-sock", + "/api.sock", + ], + "helper handed the jailer a non-canonical argv", + ); + + // The execve envp must be empty: once ruid==0 a smuggled LD_PRELOAD would be + // honored, so nothing of the caller's environment may reach the root jailer.
+ let environ = fs::read(&env_rec).expect("recorded environ"); + assert!( + environ.is_empty(), + "jailer inherited a non-empty environment: {environ:?}", + ); +} + +#[test] +fn refuses_uid_zero_without_exec_as_root() { + if !is_root() { + eprintln!("SKIP jailer uid 0: needs root"); + return; + } + let tmp = tempfile::tempdir().unwrap(); + let argv_rec = tmp.path().join("argv.json"); + let env_rec = tmp.path().join("environ.bin"); + let jailer = install_recorder(tmp.path(), &argv_rec, &env_rec); + let firecracker = install_firecracker(tmp.path()); + let cfg = write_root_config(tmp.path(), &jailer, &firecracker); + + let out = run( + &cfg, + &[ + "jailer", + "--id", + "vm1", + "--uid", + "0", + "--gid", + "900002", + "--api-sock", + "/api.sock", + ], + ); + + assert_ne!(out.status.code(), Some(0), "uid 0 must be refused"); + assert_eq!(out.status.code(), Some(2), "validation failure exits 2"); + assert!( + !argv_rec.exists(), + "the jailer must never have been exec'd for uid 0", + ); +} diff --git a/native/suidhelper/tests/tools/jailer.rs b/native/suidhelper/tests/tools/jailer.rs new file mode 100644 index 00000000..9ab75447 --- /dev/null +++ b/native/suidhelper/tests/tools/jailer.rs @@ -0,0 +1,162 @@ +//! Refusal contracts for the jailer's pure validators — the security core. A +//! valid input must round-trip to its canonical form; an invalid one must +//! *always* be rejected, never silently accepted. These properties are what stop +//! a compromised BEAM from naming uid 0, a privileged path, a traversal, or a +//! flag through the jailer subcommand. + +use hyper_suidhelper::tools::jailer::{validate_id_number, CgroupSetting, JailSock, VmId}; +use proptest::prelude::*; +use std::str::FromStr; + +proptest! { + /// uid/gid 0 is rejected for EVERY range — a jailer run with uid 0 skips its + /// privilege drop and leaves firecracker running as root. 
+ #[test] + fn id_zero_always_rejected(lo in any::<u32>(), span in any::<u32>()) { + let hi = lo.saturating_add(span); + prop_assert!(validate_id_number(0, (lo, hi)).is_err()); + } + + /// Any nonzero value inside the (nonempty) range is accepted unchanged. + #[test] + fn id_in_range_nonzero_accepted( + lo in 1u32..=1_000_000, + span in 0u32..=1_000_000, + off in 0u32..=1_000_000, + ) { + let hi = lo + span; + let n = lo + (off % (span + 1)); // off bounded into [0, span] => n in [lo, hi] + prop_assert_eq!(validate_id_number(n, (lo, hi)).ok(), Some(n)); + } + + /// Values just below `lo` (still nonzero) and just above `hi` are rejected. + #[test] + fn id_out_of_range_rejected(lo in 2u32..=1_000_000, span in 0u32..=1_000_000) { + let hi = lo + span; + prop_assert!(validate_id_number(lo - 1, (lo, hi)).is_err()); + if hi < u32::MAX { + prop_assert!(validate_id_number(hi + 1, (lo, hi)).is_err()); + } + } + + /// Every string over the allowed charset/length with a non-dash leader parses + /// and renders back to itself. + #[test] + fn vmid_valid_round_trips(s in "[A-Za-z0-9_][A-Za-z0-9_-]{0,63}") { + prop_assert_eq!(VmId::from_str(&s).unwrap().to_string(), s); + } + + /// A leading dash is always rejected (no flag injection via `--id`). + #[test] + fn vmid_leading_dash_rejected(s in "-[A-Za-z0-9_-]{0,63}") { + prop_assert!(VmId::from_str(&s).is_err()); + } + + /// Any embedded path separator is rejected (no chroot traversal via `--id`). + #[test] + fn vmid_slash_rejected(s in "[A-Za-z0-9_]{0,10}/[A-Za-z0-9_]{0,10}") { + prop_assert!(VmId::from_str(&s).is_err()); + } + + /// Over-length ids (>64) are rejected. + #[test] + fn vmid_too_long_rejected(s in "[A-Za-z][A-Za-z0-9_-]{64,90}") { + prop_assert!(VmId::from_str(&s).is_err()); + } + + /// A valid `memory.max` setting re-renders to exactly `key=value`.
+ #[test] + fn cgroup_memory_round_trips(s in "memory[.]max=([0-9]{1,20}|max)") { + prop_assert_eq!(CgroupSetting::from_str(&s).unwrap().to_string(), s); + } + + /// A valid `cpu.max` setting re-renders to exactly `key=value`. + #[test] + fn cgroup_cpu_round_trips(s in "cpu[.]max=([0-9]{1,20} [0-9]{1,20}|max [0-9]{1,20})") { + prop_assert_eq!(CgroupSetting::from_str(&s).unwrap().to_string(), s); + } + + /// A single-filename absolute socket path round-trips; `.`/`..` filenames are + /// rejected even though they are within the charset. + #[test] + fn jailsock_single_filename(name in "[A-Za-z0-9_.-]{1,40}") { + let s = format!("/{name}"); + let res = JailSock::from_str(&s); + if name == "." || name == ".." { + prop_assert!(res.is_err()); + } else { + prop_assert_eq!(res.unwrap().to_string(), s); + } + } + + /// A second path component is always rejected (the socket must be a direct + /// child of `/`). + #[test] + fn jailsock_multi_component_rejected( + a in "[A-Za-z0-9_]{1,10}", + b in "[A-Za-z0-9_]{1,10}", + ) { + let s = format!("/{a}/{b}"); + prop_assert!(JailSock::from_str(&s).is_err()); + } +} + +#[test] +fn vmid_rejects_known_bad_shapes() { + for bad in [ + "", // empty + "-leading", // leading dash + "a/b", // separator + "a.b", // dot + "a b", // whitespace + "a\tb", // tab + "a\0b", // NUL + "naïve", // non-ascii + &"x".repeat(65), // too long + ] { + assert!(VmId::from_str(bad).is_err(), "VmId accepted {bad:?}"); + } +} + +#[test] +fn cgroup_rejects_known_bad_shapes() { + for bad in [ + "linear=10", // unknown key + "memory.high=10", // unknown key + "memory.max=", // empty value + "memory.max=12x", // non-digit + "memory.max=1=2", // second '=' + "memory.max", // no '=' + "cpu.max=100000", // missing period field + "cpu.max=100000 100000 5", // extra field + "cpu.max=x 100000", // bad quota + "cpu.max=max max", // bad period + "cpu.max=max", // missing period + ] { + assert!( + CgroupSetting::from_str(bad).is_err(), + "CgroupSetting accepted {bad:?}" + 
); + } +} + +#[test] +fn jailsock_rejects_known_bad_shapes() { + for bad in [ + "", // empty + "relative", // not absolute + "/", // no filename + "/a/b", // multi-component + "/../etc", // traversal + "/..", // traversal as whole filename + "/.", // current dir + "/a b", // whitespace + "/a\0b", // NUL + "//x", // empty leading component + ] { + assert!( + JailSock::from_str(bad).is_err(), + "JailSock accepted {bad:?}" + ); + } +} From a5e098c732ef4e2609f92f3bae347bce8fe47c4c Mon Sep 17 00:00:00 2001 From: Marko Vejnovic Date: Thu, 25 Jun 2026 23:27:22 +0000 Subject: [PATCH 20/46] fix(suidhelper): fail closed on close_range error; _exit on jailer exec failure --- native/suidhelper/src/main.rs | 4 +++- native/suidhelper/src/tools/jailer.rs | 5 +++-- 2 files changed, 6 insertions(+), 3 deletions(-) diff --git a/native/suidhelper/src/main.rs b/native/suidhelper/src/main.rs index 665370bc..11b08319 100644 --- a/native/suidhelper/src/main.rs +++ b/native/suidhelper/src/main.rs @@ -116,7 +116,9 @@ fn main() { Command::Jailer(args) => { let e = jailer::run(args).expect_err("jailer::run only returns on error"); eprintln!("{e}"); - std::process::exit(2); + // _exit bypasses atexit handlers; safe because we are permanently root + // at this point and must not run any cleanup registered before escalation. + unsafe { nix::libc::_exit(2) } } Command::Version => unreachable!("handled above"), }; diff --git a/native/suidhelper/src/tools/jailer.rs b/native/suidhelper/src/tools/jailer.rs index d816de11..ec29ac8a 100644 --- a/native/suidhelper/src/tools/jailer.rs +++ b/native/suidhelper/src/tools/jailer.rs @@ -285,12 +285,13 @@ fn build_argv( /// Close every inherited fd above stdio so a compromised BEAM cannot smuggle an /// open fd into the root jailer. Keep 0/1/2: MuonTrap supervises the jailer /// through stdio, and stderr carries our exec-failure message. `close_range(2)` -/// needs Linux 5.9+; on ENOSYS we fall back to closing each fd up to the limit. 
+/// needs Linux 5.9+; on any failure (ENOSYS or otherwise) we fall back to +/// closing each fd individually — fail closed before handing root to the jailer. fn close_inherited_fds() { const FIRST: u32 = 3; // SAFETY: raw syscall with no memory operands; closing fds has no UB. let rc = unsafe { nix::libc::close_range(FIRST, u32::MAX, 0) }; - if rc == 0 || Errno::last() != Errno::ENOSYS { + if rc == 0 { return; } From 5f9f36b9ba0addca6dbd6d95e2e1e560a637d384 Mon Sep 17 00:00:00 2001 From: Marko Vejnovic Date: Thu, 25 Jun 2026 23:41:35 +0000 Subject: [PATCH 21/46] refactor(fire_vmm): drive jailer through suidhelper; drop Provider MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The BEAM no longer names a privileged binary path. `Jailer.command/1` now launches `hyper-suidhelper jailer` with only the id, uid/gid, cgroup flags, and api-sock; the helper reads firecracker/jailer binary paths, chroot base, parent cgroup, and cgroup version from its trusted /etc/hyper/config.toml, re-acquires root, and execve's the jailer (same pid, so MuonTrap owns the lifetime). Changes: - config.ex: add config_toml/0 (file → persistent_term cache); rewrite work_dir/0 over it; add firecracker_bin/0 + jailer_bin/0 via fetch_bin!/1 (raises when key absent); remove firecracker_install_dir/0. - provider.ex: deleted. - jailer.ex: rewire command/1 and exec_name/0; drop Provider alias. - daemon.ex: note that supervised process is the helper (execve → jailer). - node.ex: replace Provider.ensure_installed() with check_firecracker_bins/0. - hyper.ex: prefix gen_vm_id/0 with "v" so ids never start with "-". - jailer_test.exs: new pure test suite; load-bearing assertion that args contain no --exec-file / --chroot-base-dir / -- flags. 
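
Illustration only (not part of this change): a minimal sketch of how the
helper side could combine trusted config values with the caller-supplied
ones into the final jailer argv. `TrustedConfig`, `CallerArgs`, and
`build_jailer_argv` are hypothetical names chosen for the example and are
not the helper's actual types. The flag names are the ones the old
`Jailer.command/1` used to pass, and the caller values are assumed to be
validated already.

```rust
// Hypothetical sketch; names and shapes are assumptions, not the real helper.

/// Values read from the root-owned config file, never from the caller.
struct TrustedConfig {
    firecracker: String,
    chroot_base: String,
    parent_cgroup: String,
    cgroup_version: u8,
}

/// Values supplied by the unprivileged caller (assumed validated upstream).
struct CallerArgs {
    id: String,
    uid: u32,
    gid: u32,
    cgroups: Vec<String>, // each entry is "KEY=VALUE"
    api_sock: String,
}

/// Assemble the argument vector handed to the jailer.
fn build_jailer_argv(cfg: &TrustedConfig, args: &CallerArgs) -> Vec<String> {
    let mut argv: Vec<String> = vec![
        "--id".into(),
        args.id.clone(),
        "--exec-file".into(),
        cfg.firecracker.clone(),
        "--uid".into(),
        args.uid.to_string(),
        "--gid".into(),
        args.gid.to_string(),
        "--chroot-base-dir".into(),
        cfg.chroot_base.clone(),
        "--cgroup-version".into(),
        cfg.cgroup_version.to_string(),
        "--parent-cgroup".into(),
        cfg.parent_cgroup.clone(),
    ];

    // One `--cgroup KEY=VALUE` pair per caller-supplied setting.
    for setting in &args.cgroups {
        argv.push("--cgroup".into());
        argv.push(setting.clone());
    }

    // The separator is added here, by the helper; the caller never sends it.
    // Everything after it is forwarded to firecracker.
    argv.push("--".into());
    argv.push("--api-sock".into());
    argv.push(args.api_sock.clone());
    argv
}
```

The point of the split is that the caller cannot influence `--exec-file`,
`--chroot-base-dir`, `--cgroup-version`, or `--parent-cgroup`, which is the
property the new jailer_test.exs checks from the BEAM side.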
--- lib/hyper.ex | 2 +- lib/hyper/config.ex | 79 +++++++++++++------- lib/hyper/node.ex | 15 +++- lib/hyper/node/fire_vmm/daemon.ex | 10 +-- lib/hyper/node/fire_vmm/jailer.ex | 48 +++++-------- lib/hyper/node/fire_vmm/provider.ex | 92 ------------------------ test/hyper/node/fire_vmm/jailer_test.exs | 81 +++++++++++++++++++++ 7 files changed, 171 insertions(+), 156 deletions(-) delete mode 100644 lib/hyper/node/fire_vmm/provider.ex create mode 100644 test/hyper/node/fire_vmm/jailer_test.exs diff --git a/lib/hyper.ex b/lib/hyper.ex index fb0a626e..c29d2f03 100644 --- a/lib/hyper.ex +++ b/lib/hyper.ex @@ -36,7 +36,7 @@ defmodule Hyper do @doc "Generate a fresh VM id (url-safe base64, dm-name compatible)." @spec gen_vm_id() :: Hyper.Vm.id() - def gen_vm_id, do: Base.url_encode64(:crypto.strong_rand_bytes(9), padding: false) + def gen_vm_id, do: "v" <> Base.url_encode64(:crypto.strong_rand_bytes(9), padding: false) @spec resolve_arch(Hyper.Vm.Instance.arch() | nil) :: {:ok, Hyper.Vm.Instance.arch()} | {:error, term()} diff --git a/lib/hyper/config.ex b/lib/hyper/config.ex index e3c056d2..250bfa01 100644 --- a/lib/hyper/config.ex +++ b/lib/hyper/config.ex @@ -2,10 +2,10 @@ defmodule Hyper.Config do @moduledoc """ Host configuration, read from `config :hyper, ...` (see `config/config.exs`). - `work_dir` is the one value shared with the setuid helper - (`native/suidhelper`); both sides read it from `/etc/hyper/config.toml` so the - data root has a single source of truth (see `work_dir/0`). Everything else is - compile-time. + Runtime values shared with the setuid helper (`native/suidhelper`) — `work_dir`, + `firecracker_bin`, `jailer_bin` — are read from `/etc/hyper/config.toml` (the + single source of truth for both sides) the first time they are needed, then + cached in `:persistent_term`. Everything else is compile-time. 
""" # The shared data-root config file, read by both this node and the setuid @@ -24,39 +24,66 @@ defmodule Hyper.Config do Root work directory for this node. All firecracker paths derive from it. Read from `#{@config_path}` (the single source of truth shared with the setuid - helper) the first time it is needed, then cached in `:persistent_term`. Falls - back to `#{@dev_work_dir}` when the file is absent (local dev / CI, where the - helper is not installed anyway). + helper) the first time it is needed, then cached via `config_toml/0`. Falls back + to `#{@dev_work_dir}` when the file is absent (local dev / CI, where the helper + is not installed anyway). """ @spec work_dir :: Path.t() - def work_dir do - case :persistent_term.get({__MODULE__, :work_dir}, nil) do + def work_dir, do: Map.get(config_toml(), "work_dir", @dev_work_dir) + + @doc "Directory holding redistributable binaries downloaded by the node." + @spec redist_dir :: Path.t() + def redist_dir, do: Path.join(work_dir(), "redist") + + @doc """ + Absolute path to the firecracker binary, as set in `#{@config_path}` under the + `firecracker_bin` key. Raises if the key is absent — the operator must configure + it; there is no default. + """ + @spec firecracker_bin :: Path.t() + def firecracker_bin, do: fetch_bin!("firecracker_bin") + + @doc """ + Absolute path to the jailer binary, as set in `#{@config_path}` under the + `jailer_bin` key. Raises if the key is absent — the operator must configure it; + there is no default. 
+ """ + @spec jailer_bin :: Path.t() + def jailer_bin, do: fetch_bin!("jailer_bin") + + @spec fetch_bin!(String.t()) :: Path.t() + defp fetch_bin!(key) do + case Map.fetch(config_toml(), key) do + {:ok, path} -> + path + + :error -> + raise "#{@config_path}: key #{inspect(key)} is not set; " <> + "operator must configure it before starting the node" + end + end + + @spec config_toml :: map() + defp config_toml do + case :persistent_term.get({__MODULE__, :config_toml}, nil) do nil -> - work_dir = load_work_dir() - :persistent_term.put({__MODULE__, :work_dir}, work_dir) - work_dir + cfg = load_config_toml() + :persistent_term.put({__MODULE__, :config_toml}, cfg) + cfg - work_dir -> - work_dir + cfg -> + cfg end end - @spec load_work_dir :: Path.t() - defp load_work_dir do + @spec load_config_toml :: map() + defp load_config_toml do case File.read(@config_path) do - {:ok, body} -> body |> Toml.decode!() |> Map.fetch!("work_dir") - {:error, _} -> @dev_work_dir + {:ok, body} -> Toml.decode!(body) + {:error, _} -> %{} end end - @doc "Directory holding redistributable binaries downloaded by the node." - @spec redist_dir :: Path.t() - def redist_dir, do: Path.join(work_dir(), "redist") - - @doc "Directory where `Hyper.Node.FireVMM.Provider` installs the firecracker release." - @spec firecracker_install_dir :: Path.t() - def firecracker_install_dir, do: Path.join(redist_dir(), "firecracker") - @doc "Directory where `Hyper.Node.FireVMM.VmLinux.Provider` installs guest kernels." 
@spec vmlinux_install_dir :: Path.t() def vmlinux_install_dir, do: Path.join(redist_dir(), "vmlinux") diff --git a/lib/hyper/node.ex b/lib/hyper/node.ex index eb3bb738..5c2bd3fe 100644 --- a/lib/hyper/node.ex +++ b/lib/hyper/node.ex @@ -153,7 +153,7 @@ defmodule Hyper.Node do @spec test_system :: :ok | {:error, term()} def test_system do with {:ok, _} <- Hyper.Node.Config.Budget.load(), - :ok <- Hyper.Node.FireVMM.Provider.ensure_installed(), + :ok <- check_firecracker_bins(), :ok <- Hyper.Node.FireVMM.VmLinux.Provider.ensure_installed(), :ok <- Hyper.Node.Vmlinux.test_system(), :ok <- Hyper.Img.OciLoader.Umoci.ensure_installed(), @@ -167,6 +167,19 @@ defmodule Hyper.Node do end end + @spec check_firecracker_bins :: + :ok | {:error, {:firecracker_bin_missing | :jailer_bin_missing, Path.t()}} + defp check_firecracker_bins do + fc = Hyper.Config.firecracker_bin() + jail = Hyper.Config.jailer_bin() + + cond do + not Sys.Posix.executable?(fc) -> {:error, {:firecracker_bin_missing, fc}} + not Sys.Posix.executable?(jail) -> {:error, {:jailer_bin_missing, jail}} + true -> :ok + end + end + @spec check_helper_base(Path.t()) :: :ok | {:error, {:suid_helper_base_mismatch, Path.t(), Path.t()}} defp check_helper_base(base) do diff --git a/lib/hyper/node/fire_vmm/daemon.ex b/lib/hyper/node/fire_vmm/daemon.ex index 7cbc159c..c139340b 100644 --- a/lib/hyper/node/fire_vmm/daemon.ex +++ b/lib/hyper/node/fire_vmm/daemon.ex @@ -4,15 +4,17 @@ defmodule Hyper.Node.FireVMM.Daemon do `Hyper.Node.FireVMM.Core`. Lifecycle is supervisor-owned. 
On every (re)start it first resets any stale - jail left by a prior incarnation - the firecracker jailer refuses to reuse an - existing chroot - then builds the jailer command and runs it under + jail left by a prior incarnation — the firecracker jailer refuses to reuse an + existing chroot — then builds the jailer command and runs it under `MuonTrap.Daemon`, which kills the OS process when its port closes (controller crash, container teardown, or BEAM death). So no firecracker process outlives the supervisor, and `Core`'s `:one_for_all` restarting this child (e.g. after a firecracker crash) cleanly cold-boots against a fresh jail. - The supervised process *is* the `MuonTrap.Daemon` - `start_link/1` does the - reset, then delegates and returns that pid. + The supervised process is `hyper-suidhelper jailer ...`, which `execve`s into + the jailer (same pid) so MuonTrap owns the firecracker lifetime without needing + to know the jailer path. `start_link/1` does the reset, then delegates and + returns that pid. """ alias Hyper.Node.FireVMM.{Jailer, Opts} diff --git a/lib/hyper/node/fire_vmm/jailer.ex b/lib/hyper/node/fire_vmm/jailer.ex index 6dd1b800..355ad0d7 100644 --- a/lib/hyper/node/fire_vmm/jailer.ex +++ b/lib/hyper/node/fire_vmm/jailer.ex @@ -1,27 +1,26 @@ defmodule Hyper.Node.FireVMM.Jailer do @moduledoc """ - Builds the firecracker - [jailer](https://github.com/firecracker-microvm/firecracker/blob/main/docs/jailer.md) - command for one VM. + Builds the `hyper-suidhelper jailer` command for one VM. - The jailer sets up the chroot, namespaces, cgroup (via `Hyper.Vm.Instance` - flags) and drops privileges, then exec's firecracker. We run the jailer (not - firecracker directly) under `MuonTrap.Daemon`; MuonTrap only supervises the OS - process, the jailer owns isolation. + The BEAM does not invoke the jailer directly. 
Instead it calls the setuid helper + with the `jailer` subcommand; the helper reads the firecracker binary path, chroot + base, parent cgroup, and cgroup version from its trusted `/etc/hyper/config.toml`, + re-acquires root, and `execve`s the jailer (same pid, so `MuonTrap.Daemon` keeps + supervising it). + + This means the BEAM passes only untrusted-origin values: `--id`, `--uid`, `--gid`, + repeated `--cgroup KEY=VALUE`, and `--api-sock`. The helper derives and validates + everything else; it also inserts the `--` separator between its own flags and + firecracker's flags. Because firecracker is chrooted to `///root`, the API - socket it opens at `/api.socket` lives at `host_socket` on the host - that's the + socket it opens at `/api.socket` lives at `host_socket` on the host — that's the path the controller connects to. - - Host config: paths are derived from `config :hyper, work_dir: ...`. The - firecracker + jailer binaries are installed under `/redist/firecracker` - by `Hyper.Node.FireVMM.Provider`; the chroot base is `/jails`. """ use OpenTelemetryDecorator alias Hyper.Node.FireVMM - alias Hyper.Node.FireVMM.Provider alias Hyper.Vm.Instance # firecracker's API socket path *inside* the chroot. 
@@ -94,26 +93,11 @@ defmodule Hyper.Node.FireVMM.Jailer do @spec command(FireVMM.Opts.t()) :: t() def command(opts) do args = - [ - "--id", - opts.vm_id, - "--exec-file", - Provider.firecracker_bin(), - "--uid", - to_string(opts.uid), - "--gid", - to_string(opts.gid), - "--chroot-base-dir", - Hyper.Config.chroot_base(), - "--cgroup-version", - "2", - "--parent-cgroup", - Hyper.Config.parent_cgroup() - ] ++ + ["jailer", "--id", opts.vm_id, "--uid", to_string(opts.uid), "--gid", to_string(opts.gid)] ++ cgroup_flags(opts.type) ++ - ["--", "--api-sock", "/" <> @jail_socket] + ["--api-sock", "/" <> @jail_socket] - %{binary: Provider.jailer_bin(), args: args, host_socket: host_socket(opts.vm_id)} + %{binary: Hyper.Config.suid_helper(), args: args, host_socket: host_socket(opts.vm_id)} end # Find the appropriate jailer cgroup flags for the given instance type. @@ -166,5 +150,5 @@ defmodule Hyper.Node.FireVMM.Jailer do ]) end - defp exec_name, do: Path.basename(Provider.firecracker_bin()) + defp exec_name, do: Path.basename(Hyper.Config.firecracker_bin()) end diff --git a/lib/hyper/node/fire_vmm/provider.ex b/lib/hyper/node/fire_vmm/provider.ex deleted file mode 100644 index e50aee16..00000000 --- a/lib/hyper/node/fire_vmm/provider.ex +++ /dev/null @@ -1,92 +0,0 @@ -defmodule Hyper.Node.FireVMM.Provider do - @moduledoc """ - Installs the firecracker release for the current architecture into - `Hyper.Config.firecracker_install_dir/0` (`/redist/firecracker`). - - `ensure_installed/0` is idempotent: if the binaries are already present and - executable it returns `:ok` without touching the network. Otherwise it fetches - the official firecracker release tarball for the detected architecture via - `Redist.Targz` (download, SHA-256 verify, extract). The archive is - extracted as-is - the binaries live under `release-v-/` exactly as - firecracker ships them, and `firecracker_bin/0` / `jailer_bin/0` resolve those - paths. 
- """ - - alias Redist.Targz - - @downloads %{ - x86_64: %{ - url: - "https://github.com/firecracker-microvm/firecracker/releases/download/v1.16.0/firecracker-v1.16.0-x86_64.tgz", - sha256: "bd04e26952d4e158085778c6230a0b383d2619c319182e27eaa9d61a212e92d6", - firecracker: "release-v1.16.0-x86_64/firecracker-v1.16.0-x86_64", - jailer: "release-v1.16.0-x86_64/jailer-v1.16.0-x86_64" - }, - aarch64: %{ - url: - "https://github.com/firecracker-microvm/firecracker/releases/download/v1.16.0/firecracker-v1.16.0-aarch64.tgz", - sha256: "531c713cdbc37d4b8bc2533d851aabc0267096afa1768086a37672abb668efd7", - firecracker: "release-v1.16.0-aarch64/firecracker-v1.16.0-aarch64", - jailer: "release-v1.16.0-aarch64/jailer-v1.16.0-aarch64" - } - } - - @doc "Absolute path to the installed firecracker binary." - @spec firecracker_bin() :: Path.t() - def firecracker_bin, do: bin_path(:firecracker) - - @doc "Absolute path to the installed jailer binary." - @spec jailer_bin() :: Path.t() - def jailer_bin, do: bin_path(:jailer) - - @doc "Ensure the firecracker release is installed for this node." - @spec ensure_installed() :: :ok | {:error, term()} - def ensure_installed do - with {:ok, arch} <- Sys.Arch.current() do - dl = Map.fetch!(@downloads, arch) - - case check_install(dl) do - :ok -> :ok - {:error, :not_installed} -> install(dl) - {:error, :bad_install} -> reinstall(dl) - end - end - end - - # `:ok` if `dl`'s version-specific binaries are present and executable; - # `{:error, :not_installed}` if the install dir is empty/absent; otherwise - # `{:error, :bad_install}` - something is there but it's the wrong version, - # partial, or corrupt, which we cannot fix in place because `Targz` keeps - # existing files. The remedy is to wipe and reinstall. 
- @spec check_install(map()) :: :ok | {:error, :not_installed | :bad_install} - defp check_install(dl) do - fc = Path.join(install_dir(), dl.firecracker) - jail = Path.join(install_dir(), dl.jailer) - - cond do - Sys.Posix.executable?(fc) and Sys.Posix.executable?(jail) -> - :ok - - File.dir?(install_dir()) and File.ls!(install_dir()) != [] -> - {:error, :bad_install} - - true -> - {:error, :not_installed} - end - end - - defp install(dl), do: Targz.install(dl.url, dl.sha256, install_dir()) - - defp reinstall(dl) do - _ = File.rm_rf!(install_dir()) - install(dl) - end - - defp bin_path(key) do - {:ok, arch} = Sys.Arch.current() - dl = Map.fetch!(@downloads, arch) - Path.join(install_dir(), Map.fetch!(dl, key)) - end - - defp install_dir, do: Hyper.Config.firecracker_install_dir() -end diff --git a/test/hyper/node/fire_vmm/jailer_test.exs b/test/hyper/node/fire_vmm/jailer_test.exs new file mode 100644 index 00000000..9a231118 --- /dev/null +++ b/test/hyper/node/fire_vmm/jailer_test.exs @@ -0,0 +1,81 @@ +defmodule Hyper.Node.FireVMM.JailerTest do + @moduledoc """ + Properties and examples for `Hyper.Node.FireVMM.Jailer.command/1`. + + Load-bearing invariant: the BEAM must never place a privileged binary path + (firecracker, jailer) or lifecycle flags owned by the suidhelper (`--exec-file`, + `--chroot-base-dir`, `--cgroup-version`, `--parent-cgroup`, `--`) in the args + it hands to the helper. The helper derives those from its trusted config. + """ + + use ExUnit.Case, async: false + use ExUnitProperties + + alias Hyper.Node.FireVMM + alias Hyper.Node.FireVMM.Jailer + + @vm_id "vmtest01" + + # Stub config_toml persistent_term so firecracker_bin/jailer_bin resolve + # to dummy paths without requiring /etc/hyper/config.toml on the test host. + # async: false because persistent_term is global state. 
+ setup do + :persistent_term.put({Hyper.Config, :config_toml}, %{ + "firecracker_bin" => "/usr/local/bin/firecracker-v1.16.0-x86_64", + "jailer_bin" => "/usr/local/bin/jailer-v1.16.0-x86_64" + }) + + on_exit(fn -> :persistent_term.erase({Hyper.Config, :config_toml}) end) + end + + defp micro_opts do + %FireVMM.Opts{ + vm_id: @vm_id, + uid: 900_001, + gid: 900_001, + type: :micro, + arch: :x86_64, + mutable: nil, + kernel: "/srv/hyper/redist/vmlinux/vmlinux-x86_64-6.1", + boot_args: nil + } + end + + test "binary is the suid helper" do + assert Jailer.command(micro_opts()).binary == Hyper.Config.suid_helper() + end + + test "args start with the jailer subcommand" do + %{args: [first | _]} = Jailer.command(micro_opts()) + assert first == "jailer" + end + + test "args contain --id, --uid, --gid with the opts values" do + %{args: args} = Jailer.command(micro_opts()) + assert "--id" in args + assert @vm_id in args + assert "--uid" in args + assert "--gid" in args + assert "900001" in args + end + + test "args end with --api-sock /api.socket" do + %{args: args} = Jailer.command(micro_opts()) + assert Enum.take(args, -2) == ["--api-sock", "/api.socket"] + end + + test "args do not contain privileged flags owned by the suidhelper" do + %{args: args} = Jailer.command(micro_opts()) + refute "--exec-file" in args + refute "--chroot-base-dir" in args + refute "--cgroup-version" in args + refute "--parent-cgroup" in args + refute "--" in args + end + + property "gen_vm_id/0 never produces an id starting with -" do + check all(_ <- StreamData.constant(nil)) do + refute String.starts_with?(Hyper.gen_vm_id(), "-") + end + end +end From 3a58b77a38bc29db83664f87026e8a62fdea53d8 Mon Sep 17 00:00:00 2001 From: Marko Vejnovic Date: Thu, 25 Jun 2026 23:53:37 +0000 Subject: [PATCH 22/46] feat(mix): add firecracker.install task to fetch+configure firecracker/jailer MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Downloads the pinned Firecracker 
v1.16.0 release via Redist.Targz.install/3 (download → SHA-256 verify → extract), copies the version-stamped binaries to bare-basename paths (/firecracker and /jailer) required by the suidhelper's SafeBin validator, and prints the config snippets the operator needs to paste. --- lib/mix/tasks/firecracker.install.ex | 141 +++++++++++++++++++++++++++ 1 file changed, 141 insertions(+) create mode 100644 lib/mix/tasks/firecracker.install.ex diff --git a/lib/mix/tasks/firecracker.install.ex b/lib/mix/tasks/firecracker.install.ex new file mode 100644 index 00000000..5d2a8723 --- /dev/null +++ b/lib/mix/tasks/firecracker.install.ex @@ -0,0 +1,141 @@ +defmodule Mix.Tasks.Firecracker.Install do + @shortdoc "Download, verify, and install the pinned Firecracker release" + @moduledoc """ + Downloads, verifies, and installs the pinned Firecracker release (v1.16.0) + for the current CPU architecture. + + mix firecracker.install [--prefix DIR] + + Steps performed: + + 1. Detects the CPU architecture (`x86_64` or `aarch64`). + 2. Downloads the release tarball and verifies its SHA-256 checksum. + 3. Extracts the tarball, then copies the binaries to `/firecracker` + and `/jailer` using the **bare basenames** `firecracker` and + `jailer`. The setuid helper validates binaries via `SafeBin<"firecracker">` + and `SafeBin<"jailer">`, which match on basename only — version-stamped + names such as `firecracker-v1.16.0-x86_64` are rejected unconditionally. + 4. Marks both binaries executable (`0o755`). + 5. Prints the configuration snippets the operator needs to paste. + + This task installs **unprivileged** binaries and prints configuration. + Privilege at runtime is handled by `hyper-suidhelper` (the setuid helper). + This task does **not** setuid `firecracker` or `jailer`. Install and setuid + the helper separately with `mix suidhelper.install`. + + ## Options + + * `--prefix DIR` — installation directory (default: `/opt/firecracker`). 
+ + ## Security requirements + + After installing, ensure: + + * The binaries are root-owned and **not** group- or world-writable. + The suidhelper refuses binaries with loose permissions. + * `/etc/hyper/config.toml` is root-owned with mode `0644`. + """ + + use Mix.Task + + @version "1.16.0" + @default_prefix "/opt/firecracker" + + @impl Mix.Task + @spec run([String.t()]) :: :ok + def run(argv) do + {opts, _rest, _invalid} = OptionParser.parse(argv, strict: [prefix: :string]) + prefix = Keyword.get(opts, :prefix, @default_prefix) + + arch = detect_arch!() + + case Application.ensure_all_started(:req) do + {:ok, _} -> :ok + {:error, {reason, app}} -> Mix.raise("Cannot start HTTP client #{app}: #{inspect(reason)}") + end + + install!(release_for(arch), prefix) + print_config(prefix) + end + + defp detect_arch! do + case Sys.Arch.current() do + {:ok, arch} -> + arch + + {:error, {:unsupported_arch, raw}} -> + Mix.raise( + "Unsupported CPU architecture #{inspect(raw)}; " <> + "Firecracker supports x86_64 and aarch64." 
+ ) + end + end + + defp release_for(:x86_64) do + %{ + url: + "https://github.com/firecracker-microvm/firecracker/releases/download/" <> + "v#{@version}/firecracker-v#{@version}-x86_64.tgz", + sha256: "bd04e26952d4e158085778c6230a0b383d2619c319182e27eaa9d61a212e92d6", + firecracker_path: "release-v#{@version}-x86_64/firecracker-v#{@version}-x86_64", + jailer_path: "release-v#{@version}-x86_64/jailer-v#{@version}-x86_64" + } + end + + defp release_for(:aarch64) do + %{ + url: + "https://github.com/firecracker-microvm/firecracker/releases/download/" <> + "v#{@version}/firecracker-v#{@version}-aarch64.tgz", + sha256: "531c713cdbc37d4b8bc2533d851aabc0267096afa1768086a37672abb668efd7", + firecracker_path: "release-v#{@version}-aarch64/firecracker-v#{@version}-aarch64", + jailer_path: "release-v#{@version}-aarch64/jailer-v#{@version}-aarch64" + } + end + + defp install!( + %{url: url, sha256: sha256, firecracker_path: fc_rel, jailer_path: jailer_rel}, + prefix + ) do + extract_dir = Path.join(prefix, ".firecracker-extract") + + Mix.shell().info("Downloading Firecracker v#{@version} from #{url} ...") + + case Redist.Targz.install(url, sha256, extract_dir) do + :ok -> :ok + {:error, reason} -> Mix.raise("Download from #{url} failed: #{inspect(reason)}") + end + + dst_fc = Path.join(prefix, "firecracker") + dst_jailer = Path.join(prefix, "jailer") + + # The release ships version-stamped names; copy to bare basenames so SafeBin + # validation passes. The helper matches on basename, not full path. 
+ File.cp!(Path.join(extract_dir, fc_rel), dst_fc) + File.cp!(Path.join(extract_dir, jailer_rel), dst_jailer) + File.chmod!(dst_fc, 0o755) + File.chmod!(dst_jailer, 0o755) + _ = File.rm_rf!(extract_dir) + + Mix.shell().info("Installed #{dst_fc}") + Mix.shell().info("Installed #{dst_jailer}") + end + + defp print_config(prefix) do + fc = Path.join(prefix, "firecracker") + jailer = Path.join(prefix, "jailer") + + Mix.shell().info(""" + + Add to /etc/hyper/config.toml (file: root-owned, mode 0644; binaries: + root-owned, not group- or world-writable): + + firecracker = "#{fc}" + jailer = "#{jailer}" + + Or in your Elixir config (e.g. config/runtime.exs): + + config :hyper, firecracker_bin: "#{fc}", jailer_bin: "#{jailer}" + """) + end +end From a8743fc9e3d8a47a6376298709ede041b4c30d0b Mon Sep 17 00:00:00 2001 From: Marko Vejnovic Date: Fri, 26 Jun 2026 00:06:04 +0000 Subject: [PATCH 23/46] docs: firecracker/jailer are operator-installed via mix firecracker.install Remove firecracker/jailer from the auto-redistributed list; operators now install them via `mix firecracker.install [--prefix ]`. Update /etc/hyper/config.toml example to show the required firecracker/jailer keys (no default, root-owned + non-world-writable, bare basenames validated by the helper) and the optional [uid_gid_range] table. Update `config :hyper` snippet to drop the auto-download TODOs and point at the install-produced paths. Document that uid_gid_range in config :hyper and [uid_gid_range] in config.toml must be kept in sync. --- docs/cookbook/intro.md | 65 +++++++++++++++++++++++++++++------------- 1 file changed, 45 insertions(+), 20 deletions(-) diff --git a/docs/cookbook/intro.md b/docs/cookbook/intro.md index 99c80123..ce3c3bf5 100644 --- a/docs/cookbook/intro.md +++ b/docs/cookbook/intro.md @@ -67,22 +67,38 @@ helper, never named by the unprivileged caller. 
Their paths therefore live in the helper's own config, `/etc/hyper/config.toml`, and default to `/usr/sbin/{losetup,blockdev,dmsetup}`. -**The config file is optional.** If it is absent the helper uses the built-in -defaults below (and `work_dir = "/srv/hyper"`, matching the node's own -fallback). Create one only to override a default - and if you do, it must be -root-owned and not group/other-writable, or the helper refuses to start (a -present-but-untrusted file is treated as an attack signal, unlike a missing -one): +**The config file must exist** to set `firecracker` and `jailer` (no built-in +defaults for those). The device-tool paths (`dmsetup`, `losetup`, `blockdev`) +and `work_dir` do have built-in defaults, so if you only need those defaults +and are not running VMs you may omit the file entirely. When the file is +present it must be root-owned and not group/other-writable, or the helper +refuses to start (a present-but-untrusted file is treated as an attack signal, +unlike a missing one): ```toml -# /etc/hyper/config.toml (root-owned, mode 0644) - every line optional +# /etc/hyper/config.toml (root-owned, mode 0644) work_dir = "/srv/hyper" -# Each must be an absolute path to a root-owned, non-world-writable binary; -# the helper validates this before it will exec the tool. +# REQUIRED - no default. Each must be an absolute path to a root-owned, +# non-group/world-writable binary named exactly "firecracker" or "jailer" +# (the helper validates the basename). Run `mix firecracker.install` to +# download the pinned release and print these values. +firecracker = "/opt/firecracker/firecracker" +jailer = "/opt/firecracker/jailer" + +# Optional device-tool overrides; default to /usr/sbin/{dmsetup,losetup,blockdev}. +# Each must be root-owned and not group/world-writable. dmsetup = "/usr/sbin/dmsetup" losetup = "/usr/sbin/losetup" blockdev = "/usr/sbin/blockdev" + +# Optional. Governs which uid/gid values the helper accepts when launching the +# jailer. 
Must satisfy min > 0 and min <= max. Defaults to {900000, 999999}. +# If you narrow this range, set the same bounds in `config :hyper, uid_gid_range:` +# so the node hands out only uids the helper will accept. +[uid_gid_range] +min = 900000 +max = 999999 ``` `dmsetup` (lvm2) is frequently *not* installed by default - check that one @@ -168,14 +184,22 @@ printf 'dm_snapshot\ndm_thin_pool\nloop\n' | sudo tee /etc/modules-load.d/hyper. echo 'd /sys/fs/cgroup/hyper 0755 root root -' \ | sudo tee /etc/tmpfiles.d/hyper-cgroup.conf ``` - - The host UID/GID range given by `uid_gid_range` must be free for Hyper to - allocate per-VM users from. + - The host UID/GID range must be free for Hyper to allocate per-VM users + from. The node's range is set by `uid_gid_range` in `config :hyper`; the + helper independently reads `[uid_gid_range]` from `/etc/hyper/config.toml` + (see below) and only accepts jailer `--uid`/`--gid` within that range. + Keep the two in sync. #### Auto-redistributed -The remaining runtime dependencies - `firecracker`, `jailer`, `umoci`, and the -guest `vmlinux` kernels - are downloaded, checksum-verified, and managed by -Hyper itself; you do not install them. +`umoci` and the guest `vmlinux` kernels are downloaded, checksum-verified, and +managed by Hyper itself; you do not install them. + +`firecracker` and `jailer` are not auto-downloaded. Install them with +`mix firecracker.install [--prefix ]` (default prefix `/opt/firecracker`), +which downloads the pinned v1.16.0 release, places the binaries at +`/firecracker` and `/jailer`, and prints the config snippets to +paste into `/etc/hyper/config.toml` and `config.exs`. ### Installation @@ -190,18 +214,19 @@ configuration. ```elixir config :hyper, - # TODO(markovejnovic): Remove this after it gets auto-downloaded. - jailer_bin: "/opt/firecracker/jailer-v1.16.0-x86_64", - # TODO(markovejnovic): Remove this after it gets auto-downloaded. 
- firecracker_bin: "/opt/firecracker/firecracker-v1.16.0-x86_64", + # REQUIRED. Must point at the bare-basename binaries installed by + # `mix firecracker.install`. The setuid helper validates these paths + # (root-owned, non-group/world-writable, basename exactly "firecracker"/"jailer"). + firecracker_bin: "/opt/firecracker/firecracker", + jailer_bin: "/opt/firecracker/jailer", # You must create a parent cgroup on your system. Continue reading for # further details. cgroup_parent: "hyper", - # TODO(markovejnovic): Merge these directories into one. jailer_chroot_base: "/srv/hyper/jails", socket_dir: "/srv/hyper/socks", scratch_dir: "/srv/hyper/scratch", - # Hyper requires that each VM you pass + # Must match the [uid_gid_range] table in /etc/hyper/config.toml so the node + # hands out only uids the helper will accept. uid_gid_range: {900_000, 999_999}, layer_dir: "/srv/hyper/layers" ``` From 5770f57fd0a09e26cc9087b793e00ac41f24453f Mon Sep 17 00:00:00 2001 From: Marko Vejnovic Date: Fri, 26 Jun 2026 00:17:21 +0000 Subject: [PATCH 24/46] fix(config): node and helper read the same firecracker/jailer TOML keys MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The Elixir node was fetching TOML keys "firecracker_bin"/"jailer_bin" while the setuid helper (and the TOML example in docs) uses "firecracker"/"jailer". A correctly-installed host therefore crashed on launch with an unset-key error even when the file was present. - config.ex: fix fetch_bin! keys to "firecracker"/"jailer"; add non-raising firecracker_bin_configured/0 and jailer_bin_configured/0 returning {:ok, path} | :error for use by pre-launch checks. - node.ex: rewrite check_firecracker_bins/0 to use the non-raising accessors so a missing key returns {:error, :firecracker_not_configured} instead of raising, honouring the @spec contract. 
- firecracker.install.ex: drop the dead "config :hyper, firecracker_bin:" snippet from print_config/1 — nothing reads Application env for these paths. - docs/cookbook/intro.md: remove firecracker_bin/jailer_bin from config :hyper block; clarify paths live only in /etc/hyper/config.toml. - jailer_test.exs: align stub map keys to "firecracker"/"jailer" (the bug that masked this); add positive --cgroup assertion for :micro type. - Cargo.toml: remove redundant toml dev-dependency (already a normal dep). --- docs/cookbook/intro.md | 13 ++++----- lib/hyper/config.ex | 36 ++++++++++++++++++------ lib/hyper/node.ex | 21 ++++++++------ lib/mix/tasks/firecracker.install.ex | 8 ++---- native/suidhelper/Cargo.toml | 1 - test/hyper/node/fire_vmm/jailer_test.exs | 11 ++++++-- 6 files changed, 58 insertions(+), 32 deletions(-) diff --git a/docs/cookbook/intro.md b/docs/cookbook/intro.md index ce3c3bf5..57aa4ff0 100644 --- a/docs/cookbook/intro.md +++ b/docs/cookbook/intro.md @@ -198,8 +198,8 @@ managed by Hyper itself; you do not install them. `firecracker` and `jailer` are not auto-downloaded. Install them with `mix firecracker.install [--prefix ]` (default prefix `/opt/firecracker`), which downloads the pinned v1.16.0 release, places the binaries at -`/firecracker` and `/jailer`, and prints the config snippets to -paste into `/etc/hyper/config.toml` and `config.exs`. +`/firecracker` and `/jailer`, and prints the `/etc/hyper/config.toml` +snippet to paste in. ### Installation @@ -214,11 +214,6 @@ configuration. ```elixir config :hyper, - # REQUIRED. Must point at the bare-basename binaries installed by - # `mix firecracker.install`. The setuid helper validates these paths - # (root-owned, non-group/world-writable, basename exactly "firecracker"/"jailer"). - firecracker_bin: "/opt/firecracker/firecracker", - jailer_bin: "/opt/firecracker/jailer", # You must create a parent cgroup on your system. Continue reading for # further details. 
cgroup_parent: "hyper", @@ -231,6 +226,10 @@ config :hyper, layer_dir: "/srv/hyper/layers" ``` +The `firecracker` and `jailer` binary paths are **not** set here — they are read +from `/etc/hyper/config.toml` (the single source of truth shared with the setuid +helper). See the `config.toml` example above. + ### Usage diff --git a/lib/hyper/config.ex b/lib/hyper/config.ex index 250bfa01..0704c647 100644 --- a/lib/hyper/config.ex +++ b/lib/hyper/config.ex @@ -3,9 +3,9 @@ defmodule Hyper.Config do Host configuration, read from `config :hyper, ...` (see `config/config.exs`). Runtime values shared with the setuid helper (`native/suidhelper`) — `work_dir`, - `firecracker_bin`, `jailer_bin` — are read from `/etc/hyper/config.toml` (the - single source of truth for both sides) the first time they are needed, then - cached in `:persistent_term`. Everything else is compile-time. + `firecracker`, `jailer` — are read from `/etc/hyper/config.toml` (the single + source of truth for both sides) the first time they are needed, then cached in + `:persistent_term`. Everything else is compile-time. """ # The shared data-root config file, read by both this node and the setuid @@ -37,19 +37,39 @@ defmodule Hyper.Config do @doc """ Absolute path to the firecracker binary, as set in `#{@config_path}` under the - `firecracker_bin` key. Raises if the key is absent — the operator must configure - it; there is no default. + `firecracker` key. Raises if the key is absent — the operator must configure it; + there is no default. + + For the launch path only. Pre-launch checks should use `firecracker_bin_configured/0` + so a missing key returns a typed error rather than crashing. """ @spec firecracker_bin :: Path.t() - def firecracker_bin, do: fetch_bin!("firecracker_bin") + def firecracker_bin, do: fetch_bin!("firecracker") + + @doc """ + Non-raising form of `firecracker_bin/0`. Returns `{:ok, path}` when the + `firecracker` key is present in `#{@config_path}`, or `:error` when it is absent. 
+ """ + @spec firecracker_bin_configured :: {:ok, Path.t()} | :error + def firecracker_bin_configured, do: Map.fetch(config_toml(), "firecracker") @doc """ Absolute path to the jailer binary, as set in `#{@config_path}` under the - `jailer_bin` key. Raises if the key is absent — the operator must configure it; + `jailer` key. Raises if the key is absent — the operator must configure it; there is no default. + + For the launch path only. Pre-launch checks should use `jailer_bin_configured/0` + so a missing key returns a typed error rather than crashing. """ @spec jailer_bin :: Path.t() - def jailer_bin, do: fetch_bin!("jailer_bin") + def jailer_bin, do: fetch_bin!("jailer") + + @doc """ + Non-raising form of `jailer_bin/0`. Returns `{:ok, path}` when the `jailer` key + is present in `#{@config_path}`, or `:error` when it is absent. + """ + @spec jailer_bin_configured :: {:ok, Path.t()} | :error + def jailer_bin_configured, do: Map.fetch(config_toml(), "jailer") @spec fetch_bin!(String.t()) :: Path.t() defp fetch_bin!(key) do diff --git a/lib/hyper/node.ex b/lib/hyper/node.ex index 5c2bd3fe..b4721167 100644 --- a/lib/hyper/node.ex +++ b/lib/hyper/node.ex @@ -168,15 +168,20 @@ defmodule Hyper.Node do end @spec check_firecracker_bins :: - :ok | {:error, {:firecracker_bin_missing | :jailer_bin_missing, Path.t()}} + :ok + | {:error, {:firecracker_bin_missing | :jailer_bin_missing, Path.t()}} + | {:error, :firecracker_not_configured | :jailer_not_configured} defp check_firecracker_bins do - fc = Hyper.Config.firecracker_bin() - jail = Hyper.Config.jailer_bin() - - cond do - not Sys.Posix.executable?(fc) -> {:error, {:firecracker_bin_missing, fc}} - not Sys.Posix.executable?(jail) -> {:error, {:jailer_bin_missing, jail}} - true -> :ok + with {:fc, {:ok, fc}} <- {:fc, Hyper.Config.firecracker_bin_configured()}, + {:jail, {:ok, jail}} <- {:jail, Hyper.Config.jailer_bin_configured()} do + cond do + not Sys.Posix.executable?(fc) -> {:error, {:firecracker_bin_missing, fc}} + 
not Sys.Posix.executable?(jail) -> {:error, {:jailer_bin_missing, jail}} + true -> :ok + end + else + {:fc, :error} -> {:error, :firecracker_not_configured} + {:jail, :error} -> {:error, :jailer_not_configured} end end diff --git a/lib/mix/tasks/firecracker.install.ex b/lib/mix/tasks/firecracker.install.ex index 5d2a8723..9db3e2e1 100644 --- a/lib/mix/tasks/firecracker.install.ex +++ b/lib/mix/tasks/firecracker.install.ex @@ -16,7 +16,7 @@ defmodule Mix.Tasks.Firecracker.Install do and `SafeBin<"jailer">`, which match on basename only — version-stamped names such as `firecracker-v1.16.0-x86_64` are rejected unconditionally. 4. Marks both binaries executable (`0o755`). - 5. Prints the configuration snippets the operator needs to paste. + 5. Prints the `/etc/hyper/config.toml` snippet the operator needs to paste. This task installs **unprivileged** binaries and prints configuration. Privilege at runtime is handled by `hyper-suidhelper` (the setuid helper). @@ -131,11 +131,7 @@ defmodule Mix.Tasks.Firecracker.Install do root-owned, not group- or world-writable): firecracker = "#{fc}" - jailer = "#{jailer}" - - Or in your Elixir config (e.g. 
config/runtime.exs): - - config :hyper, firecracker_bin: "#{fc}", jailer_bin: "#{jailer}" + jailer = "#{jailer}" """) end end diff --git a/native/suidhelper/Cargo.toml b/native/suidhelper/Cargo.toml index 9b2b6cbe..154a5926 100644 --- a/native/suidhelper/Cargo.toml +++ b/native/suidhelper/Cargo.toml @@ -96,7 +96,6 @@ toml = { version = "0.8", default-features = false, features = ["parse"] } [dev-dependencies] proptest = "1" tempfile = "3" -toml = { version = "0.8", default-features = false, features = ["parse"] } [profile.release] strip = true diff --git a/test/hyper/node/fire_vmm/jailer_test.exs b/test/hyper/node/fire_vmm/jailer_test.exs index 9a231118..9ac17536 100644 --- a/test/hyper/node/fire_vmm/jailer_test.exs +++ b/test/hyper/node/fire_vmm/jailer_test.exs @@ -21,8 +21,8 @@ defmodule Hyper.Node.FireVMM.JailerTest do # async: false because persistent_term is global state. setup do :persistent_term.put({Hyper.Config, :config_toml}, %{ - "firecracker_bin" => "/usr/local/bin/firecracker-v1.16.0-x86_64", - "jailer_bin" => "/usr/local/bin/jailer-v1.16.0-x86_64" + "firecracker" => "/usr/local/bin/firecracker", + "jailer" => "/usr/local/bin/jailer" }) on_exit(fn -> :persistent_term.erase({Hyper.Config, :config_toml}) end) @@ -73,6 +73,13 @@ defmodule Hyper.Node.FireVMM.JailerTest do refute "--" in args end + test "args include --cgroup cpu.max and memory.max for :micro type" do + %{args: args} = Jailer.command(micro_opts()) + assert "--cgroup" in args + assert Enum.any?(args, &String.starts_with?(&1, "cpu.max=")) + assert Enum.any?(args, &String.starts_with?(&1, "memory.max=")) + end + property "gen_vm_id/0 never produces an id starting with -" do check all(_ <- StreamData.constant(nil)) do refute String.starts_with?(Hyper.gen_vm_id(), "-") From ffd46a4c4e19f425a26986807d9fd08d421f9af7 Mon Sep 17 00:00:00 2001 From: Marko Vejnovic Date: Fri, 26 Jun 2026 00:33:23 +0000 Subject: [PATCH 25/46] feat(mix): firecracker.install prints the chown/chmod root commands The task 
runs unprivileged, so the binaries land owned by the invoking user. SafeBin in the suidhelper refuses any jailer/firecracker not owned by root, so every launch fails closed until they are chowned. Print the exact chown/chmod commands (with the real installed paths) instead of a vague 'ensure root-owned' note. --- lib/mix/tasks/firecracker.install.ex | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/lib/mix/tasks/firecracker.install.ex b/lib/mix/tasks/firecracker.install.ex index 9db3e2e1..03ada2be 100644 --- a/lib/mix/tasks/firecracker.install.ex +++ b/lib/mix/tasks/firecracker.install.ex @@ -125,10 +125,20 @@ defmodule Mix.Tasks.Firecracker.Install do fc = Path.join(prefix, "firecracker") jailer = Path.join(prefix, "jailer") + # This task runs unprivileged, so the binaries land owned by the invoking + # user. The suidhelper's SafeBin refuses any binary not owned by root or not + # free of group/other write bits, so the operator MUST chown/chmod them or + # every jailer launch fails closed. Print the exact commands rather than a + # vague "ensure root-owned". Mix.shell().info(""" - Add to /etc/hyper/config.toml (file: root-owned, mode 0644; binaries: - root-owned, not group- or world-writable): + Almost done.
Run these as root so the setuid helper will accept the binaries + (it refuses any jailer/firecracker not owned by root): + + sudo chown root:root #{fc} #{jailer} + sudo chmod 0755 #{fc} #{jailer} + + Then add to /etc/hyper/config.toml (file: root-owned, mode 0644): firecracker = "#{fc}" jailer = "#{jailer}" From 202141be1e259d5d7775c181e91edcae27d3338f Mon Sep 17 00:00:00 2001 From: Marko Vejnovic Date: Fri, 26 Jun 2026 01:11:18 +0000 Subject: [PATCH 26/46] feat(fire_vmm): log jailer/firecracker output and real exit status The supervised process is MuonTrap.Daemon, so route the jailed process's stdout+stderr (guest serial console included) to the Logger via log_output, and map the exit status into {:firecracker_exited, status} so a crash names the real exit code instead of the opaque :error_exit_status. Add explicit launch/launch-failure log lines keyed by vm id, so a boot loop is visible without reading console scrollback. --- lib/hyper/node/fire_vmm/daemon.ex | 25 ++++++++++++++++++++++--- 1 file changed, 22 insertions(+), 3 deletions(-) diff --git a/lib/hyper/node/fire_vmm/daemon.ex b/lib/hyper/node/fire_vmm/daemon.ex index c139340b..b6257732 100644 --- a/lib/hyper/node/fire_vmm/daemon.ex +++ b/lib/hyper/node/fire_vmm/daemon.ex @@ -23,6 +23,8 @@ defmodule Hyper.Node.FireVMM.Daemon do use OpenTelemetryDecorator + require Logger + @shutdown_timeout Time.s(5) @spec child_spec(Opts.t()) :: Supervisor.child_spec() @@ -46,9 +48,26 @@ defmodule Hyper.Node.FireVMM.Daemon do with :ok <- SuidHelper.ChrootJail.remove(Jailer.chroot_dir(id), Jailer.cgroup_dir(id)) do cmd = Jailer.command(opts) - case MuonTrap.Daemon.start_link(cmd.binary, cmd.args, []) do - {:ok, pid} -> {:ok, pid} - {:error, _} = err -> err + # Surface what the jailed process actually does: `log_output` routes the + # helper/jailer/firecracker stdout+stderr (guest serial console included) + # to the Logger, and `exit_status_to_reason` turns MuonTrap's opaque + # `:error_exit_status` into 
`{:firecracker_exited, status}` so a crash + # report names the real exit code instead of hiding it. + daemon_opts = [ + log_output: :info, + log_prefix: "vm #{id} firecracker: ", + stderr_to_stdout: true, + exit_status_to_reason: &{:firecracker_exited, &1} + ] + + case MuonTrap.Daemon.start_link(cmd.binary, cmd.args, daemon_opts) do + {:ok, pid} -> + Logger.info("vm #{id}: jailer launched under MuonTrap (#{inspect(pid)})") + {:ok, pid} + + {:error, reason} = err -> + Logger.error("vm #{id}: jailer failed to launch: #{inspect(reason)}") + err end end end From 0ecccf558d9ee6753ac66c91f68acec2e7cac5a2 Mon Sep 17 00:00:00 2001 From: Marko Vejnovic Date: Fri, 26 Jun 2026 01:33:11 +0000 Subject: [PATCH 27/46] feat(fire_vmm): surface the API readiness-probe failure reason The :awaiting_api probe swallowed the describe_instance error and stopped with a bare :daemon_unready, hiding why a healthy-looking firecracker is unreachable. Log the last probe error on deadline and carry it in the stop reason ({:daemon_unready, reason}), so a host->jail socket permission/path problem is diagnosable instead of looking like an unexplained 5s restart loop. --- lib/hyper/node/fire_vmm/state.ex | 16 ++++++++++++++-- 1 file changed, 14 insertions(+), 2 deletions(-) diff --git a/lib/hyper/node/fire_vmm/state.ex b/lib/hyper/node/fire_vmm/state.ex index b390965c..552da0f1 100644 --- a/lib/hyper/node/fire_vmm/state.ex +++ b/lib/hyper/node/fire_vmm/state.ex @@ -113,6 +113,8 @@ defmodule Hyper.Node.FireVMM.State do alias Hyper.Node.FireVMM.{Client, Opts} alias Unit.Time + require Logger + # How often to probe the daemon's API while waiting for it. 
@probe_interval Time.ms(50) @@ -123,9 +125,19 @@ defmodule Hyper.Node.FireVMM.State do {:ok, %InstanceInfo{}} -> {:next_state, :configuring, data, [{:state_timeout, 0, :configure}]} - {:error, _reason} -> + {:error, reason} -> if System.monotonic_time(:millisecond) >= data.boot_deadline do - {:stop, {:shutdown, {:boot_failed, :daemon_unready}}, data} + # firecracker's own log shows the API server is up, so a persistent + # probe failure points at the host->jail socket (path or, more often, + # permissions: the jailer drops firecracker to the per-VM uid and the + # unprivileged controller cannot reach the jailed socket). Surface the + # last error rather than swallowing it into a bare :daemon_unready. + Logger.warning( + "vm #{id}: firecracker API not reachable before deadline; " <> + "last probe error: #{inspect(reason)}" + ) + + {:stop, {:shutdown, {:boot_failed, {:daemon_unready, reason}}}, data} else {:keep_state_and_data, [{:state_timeout, Time.as_ms(@probe_interval), :probe}]} end From 2d48872d902c311776e1739d345f642e7912101b Mon Sep 17 00:00:00 2001 From: Marko Vejnovic Date: Fri, 26 Jun 2026 01:51:13 +0000 Subject: [PATCH 28/46] feat(suidhelper): add chroot-jail grant-api to hand the API socket to the node user The jailer drops firecracker to a per-VM uid/gid and chroots it, so the API socket it creates at /root/api.socket is owned by that per-VM id, mode 0755. Connecting a unix socket needs write permission, so the unprivileged node controller gets EACCES on connect(). grant-api confines the socket path under JAIL_BASE (SafePath + O_NOFOLLOW walk, fd-relative on the pinned root dir), verifies the leaf is a real socket via fstatat(AT_SYMLINK_NOFOLLOW) (a planted file/symlink is refused, never touched), then chowns it to the helper's caller (getuid/getgid, the real=caller ids inside the privileged scope) and chmods 0660. A not-yet-created socket is reported Pending, not an error. Adds SafeDir::stat and SafeDir::chmod fd-relative primitives. 
--- native/suidhelper/Cargo.toml | 4 + .../src/tools/chroot_jail/grant_api.rs | 160 +++++++++++++ .../suidhelper/src/tools/chroot_jail/mod.rs | 5 + native/suidhelper/src/util/safe_dir.rs | 37 ++- .../tests/tools/chroot_jail_grant_api.rs | 224 ++++++++++++++++++ 5 files changed, 429 insertions(+), 1 deletion(-) create mode 100644 native/suidhelper/src/tools/chroot_jail/grant_api.rs create mode 100644 native/suidhelper/tests/tools/chroot_jail_grant_api.rs diff --git a/native/suidhelper/Cargo.toml b/native/suidhelper/Cargo.toml index 154a5926..be6a449c 100644 --- a/native/suidhelper/Cargo.toml +++ b/native/suidhelper/Cargo.toml @@ -60,6 +60,10 @@ path = "tests/util/confinement.rs" name = "tools_chroot_jail_remove" path = "tests/tools/chroot_jail_remove.rs" +[[test]] +name = "tools_chroot_jail_grant_api" +path = "tests/tools/chroot_jail_grant_api.rs" + [[test]] name = "util_chroot_jail" path = "tests/util/chroot_jail.rs" diff --git a/native/suidhelper/src/tools/chroot_jail/grant_api.rs b/native/suidhelper/src/tools/chroot_jail/grant_api.rs new file mode 100644 index 00000000..a74b5481 --- /dev/null +++ b/native/suidhelper/src/tools/chroot_jail/grant_api.rs @@ -0,0 +1,160 @@ +// SPDX-License-Identifier: AGPL-3.0-only +//! `chroot-jail grant-api`: hand the firecracker API socket to the node user so +//! the unprivileged controller can `connect()` to it. +//! +//! The jailer drops firecracker to a per-VM uid/gid and chroots it; firecracker +//! then creates its API socket at `/root/api.socket` owned by that per-VM +//! id, mode `0755`. Connecting a unix socket needs *write* permission on the +//! node, so the node user (a different uid) gets `EACCES`. This op chowns just +//! that one socket to the helper's CALLER — `getuid()`/`getgid()`, which inside +//! the privileged scope are the real (caller) ids while euid is 0 — and chmods +//! it `0660`. The node thus connects as owner, and humans added to the node's +//! group connect via the group bit. 
Per-VM isolation is otherwise untouched: +//! only this single socket moves, nothing else in the jail. +//! +//! Security: the socket path is validated as a `SafePath` and reached by an +//! `O_NOFOLLOW` walk from `JAIL_BASE`, so a symlinked component cannot redirect +//! the chown outside the jail, and every op is fd-relative on the pinned `root` +//! dir fd, never by re-resolved name. The leaf must be exactly `api.socket` +//! `//root` below the base, and `fstatat(AT_SYMLINK_NOFOLLOW)` must +//! report a *socket* — a regular file or symlink planted at that name is an +//! attack and is refused, never chmod'd. A missing socket (`ENOENT`, anywhere on +//! the path) is `Pending`, not an error: firecracker has not created it yet, so +//! the controller keeps probing. + +use crate::config::Config; +use crate::tools::IsTool; +use crate::util::safe_dir::{self, SafeDir}; +use crate::util::safe_path::{self, IsAbsolute, SafePath, StrictComponents}; +use clap::Args; +use nix::errno::Errno; +use nix::sys::stat::SFlag; +use nix::unistd::{getgid, getuid}; +use serde::Serialize; +use std::path::{Path, PathBuf}; +use thiserror::Error as ThisError; + +/// The fixed in-jail socket name firecracker opens (mirrors the Elixir +/// `Hyper.Node.FireVMM.Jailer` `@jail_socket`). +const SOCKET_NAME: &str = "api.socket"; + +/// The socket sits at `///root/api.socket`: three parent +/// components (``, ``, `root`) before the leaf. +const SOCKET_PARENT_DEPTH: usize = 3; + +/// Mode handed to the node: owner+group read/write, no world access. 
+const SOCKET_MODE: u32 = 0o660; + +type LexicalPath = SafePath; + +#[derive(Debug, ThisError)] +pub enum Error { + #[error("--socket path: {0}")] + SocketPath(#[source] safe_path::ValidationError), + #[error("--socket must be exactly //root/api.socket below JAIL_BASE: {0:?}")] + SocketShape(PathBuf), + #[error("--socket leaf must be {SOCKET_NAME:?}: {0:?}")] + SocketName(PathBuf), + #[error("walking to the jail root: {0}")] + Walk(#[source] safe_dir::Error), + #[error("api.socket is not a socket (or is a symlink); refusing to touch it")] + NotASocket, + #[error("statting the socket: {0}")] + Stat(#[source] safe_dir::Error), + #[error("chowning the socket to the caller: {0}")] + Chown(#[source] safe_dir::Error), + #[error("chmoding the socket: {0}")] + Chmod(#[source] safe_dir::Error), +} + +#[derive(Args)] +pub struct GrantApiArgs { + /// Host path of the firecracker API socket, shape + /// ///root/api.socket. + #[arg(long)] + socket: PathBuf, +} + +#[derive(Debug, Serialize)] +#[serde(tag = "result", rename_all = "snake_case")] +pub enum GrantOut { + /// The socket was handed to the caller (chowned + chmoded). + Granted, + /// The socket does not exist yet; the caller should keep waiting. + Pending, +} + +/// Run the `grant-api` op in its own privileged scope (returns its serialized `Value`). +pub fn run(args: GrantApiArgs) -> Result { + GrantApi { args }.run() +} + +struct GrantApi { + args: GrantApiArgs, +} + +impl IsTool for GrantApi { + type Args = GrantApiArgs; + type Output = GrantOut; + type RunT = Result; + + fn run_privileged(&self) -> Self::RunT { + grant_api_under(&Config::get().jail_base(), &self.args.socket) + } + + fn parse(&self, res: Self::RunT) -> Result> { + Ok(res?) + } +} + +/// Hand `socket` (`///root/api.socket`) to the helper's +/// caller, fd-relative after an `O_NOFOLLOW` walk from `jail_base`. Returns +/// `Pending` if any path component or the socket itself is not yet present. 
+pub fn grant_api_under(jail_base: &Path, socket: &Path) -> Result { + let path: LexicalPath = socket.to_path_buf().try_into().map_err(Error::SocketPath)?; + let (parents, leaf) = path.relative_to(jail_base).map_err(Error::SocketPath)?; + if parents.len() != SOCKET_PARENT_DEPTH { + return Err(Error::SocketShape(socket.to_path_buf())); + } + if leaf != Path::new(SOCKET_NAME) { + return Err(Error::SocketName(socket.to_path_buf())); + } + + let Some(root) = walk(jail_base.to_path_buf(), &parents)? else { + return Ok(GrantOut::Pending); // jail not fully created yet + }; + + let leaf = Path::new(SOCKET_NAME); + let stat = match root.stat(leaf) { + Ok(stat) => stat, + Err(e) if e.errno() == Some(Errno::ENOENT) => return Ok(GrantOut::Pending), + Err(e) => return Err(Error::Stat(e)), + }; + // `stat` used AT_SYMLINK_NOFOLLOW, so a symlink reports as S_IFLNK and fails + // this check too: only a real socket is accepted, anything else is refused. + if stat.st_mode & SFlag::S_IFMT.bits() != SFlag::S_IFSOCK.bits() { + return Err(Error::NotASocket); + } + + root.chown(leaf, getuid().as_raw(), getgid().as_raw()) + .map_err(Error::Chown)?; + root.chmod(leaf, SOCKET_MODE).map_err(Error::Chmod)?; + Ok(GrantOut::Granted) +} + +/// Open `base` and walk `parents` from it (`O_NOFOLLOW` each step). Returns +/// `Ok(None)` if `base` or any parent is not yet present (`ENOENT`), so the +/// caller can treat a half-built jail as `Pending` rather than an error. 
+fn walk(base: PathBuf, parents: &[PathBuf]) -> Result, Error> { + let base_path: LexicalPath = base.try_into().map_err(Error::SocketPath)?; + let anchor = match SafeDir::open(&base_path) { + Ok(dir) => dir, + Err(e) if e.errno() == Some(Errno::ENOENT) => return Ok(None), + Err(e) => return Err(Error::Walk(e)), + }; + match anchor.descend(parents) { + Ok(dir) => Ok(Some(dir)), + Err(e) if e.errno() == Some(Errno::ENOENT) => Ok(None), + Err(e) => Err(Error::Walk(e)), + } +} diff --git a/native/suidhelper/src/tools/chroot_jail/mod.rs b/native/suidhelper/src/tools/chroot_jail/mod.rs index d65bac7b..3a06c35a 100644 --- a/native/suidhelper/src/tools/chroot_jail/mod.rs +++ b/native/suidhelper/src/tools/chroot_jail/mod.rs @@ -1,9 +1,11 @@ // SPDX-License-Identifier: AGPL-3.0-only //! `chroot-jail`: per-VM chroot/jail lifecycle. +pub mod grant_api; mod prepare; pub mod remove; +pub use grant_api::GrantApiArgs; pub use prepare::PrepareArgs; pub use remove::RemoveArgs; @@ -15,6 +17,8 @@ pub enum ChrootJailOp { Prepare(PrepareArgs), /// Remove a VM's stale chroot and cgroup leaf before relaunching the jailer. Remove(RemoveArgs), + /// Hand the firecracker API socket to the node user (chown to caller, 0660). 
+ GrantApi(GrantApiArgs), } impl ChrootJailOp { @@ -25,6 +29,7 @@ impl ChrootJailOp { match self { ChrootJailOp::Prepare(args) => prepare::run(args), ChrootJailOp::Remove(args) => remove::run(args), + ChrootJailOp::GrantApi(args) => grant_api::run(args), } } } diff --git a/native/suidhelper/src/util/safe_dir.rs b/native/suidhelper/src/util/safe_dir.rs index 388ae838..ed88f07d 100644 --- a/native/suidhelper/src/util/safe_dir.rs +++ b/native/suidhelper/src/util/safe_dir.rs @@ -17,7 +17,7 @@ use super::safe_path::SafePath; use nix::dir::{Dir, Type}; use nix::fcntl::{openat, AtFlags, OFlag}; use nix::libc::dev_t; -use nix::sys::stat::{mknodat, Mode, SFlag}; +use nix::sys::stat::{fchmodat, fstatat, mknodat, FchmodatFlags, FileStat, Mode, SFlag}; use nix::unistd::{dup, fchownat, linkat, unlinkat, Gid, Uid, UnlinkatFlags}; use std::ffi::OsStr; use std::os::unix::ffi::OsStrExt; @@ -37,6 +37,10 @@ pub enum Error { Mknod { name: PathBuf, source: nix::Error }, #[error("fchownat {name:?}: {source}")] Chown { name: PathBuf, source: nix::Error }, + #[error("fchmodat {name:?}: {source}")] + Chmod { name: PathBuf, source: nix::Error }, + #[error("fstatat {name:?}: {source}")] + Stat { name: PathBuf, source: nix::Error }, #[error("linkat -> {name:?}: {source}")] Link { name: PathBuf, source: nix::Error }, #[error("dup: {0}")] @@ -52,6 +56,8 @@ impl Error { | Error::Unlink { source, .. } | Error::Mknod { source, .. } | Error::Chown { source, .. } + | Error::Chmod { source, .. } + | Error::Stat { source, .. } | Error::Link { source, .. } => Some(*source), Error::ReadDir(source) | Error::Dup(source) => Some(*source), } @@ -180,6 +186,35 @@ impl SafeDir { }) } + /// `fstat` entry `name` relative to this dir's fd without following a final + /// symlink (`AT_SYMLINK_NOFOLLOW`). A symlink stats as itself (`S_IFLNK`), + /// never its target, so a caller inspecting the file type can reject one. 
+ pub fn stat(&self, name: &Path) -> Result { + fstatat(Some(self.0.as_raw_fd()), name, AtFlags::AT_SYMLINK_NOFOLLOW).map_err(|source| { + Error::Stat { + name: name.to_path_buf(), + source, + } + }) + } + + /// `chmod` entry `name` to `mode`. Linux's `fchmodat` has no working + /// no-follow mode (it returns `ENOTSUP`), so this follows a final symlink; + /// call it only after [`stat`](Self::stat) has proven `name` is not a + /// symlink, so the follow is a no-op on a real (non-link) entry. + pub fn chmod(&self, name: &Path, mode: u32) -> Result<(), Error> { + fchmodat( + Some(self.0.as_raw_fd()), + name, + Mode::from_bits_truncate(mode), + FchmodatFlags::FollowSymlink, + ) + .map_err(|source| Error::Chmod { + name: name.to_path_buf(), + source, + }) + } + /// Remove the non-directory entry `name` from this directory. pub fn unlink(&self, name: &Path) -> Result<(), Error> { unlinkat(Some(self.0.as_raw_fd()), name, UnlinkatFlags::NoRemoveDir).map_err(|source| { diff --git a/native/suidhelper/tests/tools/chroot_jail_grant_api.rs b/native/suidhelper/tests/tools/chroot_jail_grant_api.rs new file mode 100644 index 00000000..3e6092f5 --- /dev/null +++ b/native/suidhelper/tests/tools/chroot_jail_grant_api.rs @@ -0,0 +1,224 @@ +//! Contracts of the `chroot-jail grant-api` op, driven through the base-injected +//! `grant_api_under` seam so they run unprivileged in a tempdir. The promises +//! under test (refusal contracts first — they are the security boundary): +//! * shape — the socket is accepted iff it is exactly +//! `//root/api.socket` below the jail base; any other depth or a +//! leaf that is not `api.socket` is refused before any chown; +//! * lexical — a `.`/`..`/empty component or a relative path is always rejected +//! before any filesystem access; +//! * type — a regular file or a symlink planted at `api.socket` is refused +//! (`NotASocket`) and left untouched, never chmod'd; only a real socket is +//! granted; +//! 
* confinement — a symlinked path component is never followed, so the chown +//! can never escape the anchored jail tree (the core TOCTOU guarantee); +//! * pending — a not-yet-created socket (or half-built jail) is `Pending`, not +//! an error, so the controller keeps probing; +//! * grant — a real socket is chowned to the caller and left mode 0660. + +use hyper_suidhelper::tools::chroot_jail::grant_api::{grant_api_under, Error, GrantOut}; +use hyper_suidhelper::util::safe_path::ValidationError; +use proptest::prelude::*; +use std::os::unix::fs::{symlink, PermissionsExt}; +use std::os::unix::net::UnixListener; +use std::path::{Path, PathBuf}; +use std::{fs, os::unix::fs::MetadataExt}; + +/// Build the canonical `/exec/id/root` parent dirs and return that dir. +fn make_root(jail: &Path) -> PathBuf { + let root = jail.join("exec").join("id").join("root"); + fs::create_dir_all(&root).unwrap(); + root +} + +#[test] +fn socket_outside_jail_base_is_rejected() { + let tmp = tempfile::tempdir().unwrap(); + let jail = tmp.path().join("jail"); + fs::create_dir(&jail).unwrap(); + let outside = tmp.path().join("elsewhere/exec/id/root/api.socket"); + let err = grant_api_under(&jail, &outside).unwrap_err(); + assert!( + matches!(err, Error::SocketPath(ValidationError::NotUnderBase)), + "got {err:?}", + ); +} + +#[test] +fn wrong_leaf_basename_is_rejected() { + let tmp = tempfile::tempdir().unwrap(); + let jail = tmp.path(); + let bad = jail.join("exec").join("id").join("root").join("evil.sock"); + let err = grant_api_under(jail, &bad).unwrap_err(); + assert!(matches!(err, Error::SocketName(_)), "got {err:?}"); +} + +#[test] +fn too_shallow_is_shape_error() { + let tmp = tempfile::tempdir().unwrap(); + let jail = tmp.path(); + let bad = jail.join("exec").join("id").join("api.socket"); // missing root/ + let err = grant_api_under(jail, &bad).unwrap_err(); + assert!(matches!(err, Error::SocketShape(_)), "got {err:?}"); +} + +#[test] +fn too_deep_is_shape_error() { + let tmp = 
tempfile::tempdir().unwrap(); + let jail = tmp.path(); + let bad = jail + .join("exec") + .join("id") + .join("root") + .join("extra") + .join("api.socket"); + let err = grant_api_under(jail, &bad).unwrap_err(); + assert!(matches!(err, Error::SocketShape(_)), "got {err:?}"); +} + +#[test] +fn dotdot_traversal_is_rejected() { + let tmp = tempfile::tempdir().unwrap(); + let jail = tmp.path(); + let bad = PathBuf::from(format!("{}/exec/../id/root/api.socket", jail.display())); + let err = grant_api_under(jail, &bad).unwrap_err(); + assert!( + matches!(err, Error::SocketPath(ValidationError::LooseComponents)), + "got {err:?}", + ); +} + +#[test] +fn relative_socket_is_rejected() { + let tmp = tempfile::tempdir().unwrap(); + let err = grant_api_under(tmp.path(), Path::new("exec/id/root/api.socket")).unwrap_err(); + assert!( + matches!(err, Error::SocketPath(ValidationError::NotAbsolute)), + "got {err:?}", + ); +} + +#[test] +fn missing_socket_is_pending() { + let tmp = tempfile::tempdir().unwrap(); + let jail = tmp.path(); + let root = make_root(jail); + let socket = root.join("api.socket"); // never created + let out = grant_api_under(jail, &socket).expect("missing socket must be Ok(Pending)"); + assert!(matches!(out, GrantOut::Pending), "got {out:?}"); +} + +#[test] +fn missing_jail_tree_is_pending() { + let tmp = tempfile::tempdir().unwrap(); + let jail = tmp.path(); + let socket = jail.join("exec").join("id").join("root").join("api.socket"); // nothing created + let out = grant_api_under(jail, &socket).expect("half-built jail must be Ok(Pending)"); + assert!(matches!(out, GrantOut::Pending), "got {out:?}"); +} + +// A real socket is granted: chowned to the caller (our own uid/gid, which a +// non-root process is allowed to set on a file it owns) and chmod'd 0660. 
+#[test] +fn real_socket_is_granted_and_chmod_0660() { + let tmp = tempfile::tempdir().unwrap(); + let jail = tmp.path(); + let root = make_root(jail); + let socket = root.join("api.socket"); + let _listener = UnixListener::bind(&socket).unwrap(); + fs::set_permissions(&socket, fs::Permissions::from_mode(0o755)).unwrap(); + + let out = grant_api_under(jail, &socket).expect("real socket must grant"); + assert!(matches!(out, GrantOut::Granted), "got {out:?}"); + + let meta = fs::symlink_metadata(&socket).unwrap(); + assert_eq!(meta.mode() & 0o777, 0o660, "socket must be chmod'd 0660"); + assert_eq!(meta.uid(), nix::unistd::getuid().as_raw()); + assert_eq!(meta.gid(), nix::unistd::getgid().as_raw()); +} + +// A regular file planted at api.socket is refused and left untouched (not chmod'd). +#[test] +fn regular_file_at_leaf_is_refused_and_untouched() { + let tmp = tempfile::tempdir().unwrap(); + let jail = tmp.path(); + let root = make_root(jail); + let imposter = root.join("api.socket"); + fs::write(&imposter, b"not a socket").unwrap(); + fs::set_permissions(&imposter, fs::Permissions::from_mode(0o600)).unwrap(); + + let err = grant_api_under(jail, &imposter).unwrap_err(); + assert!(matches!(err, Error::NotASocket), "got {err:?}"); + assert_eq!( + fs::symlink_metadata(&imposter).unwrap().mode() & 0o777, + 0o600, + "imposter file must not be chmod'd", + ); +} + +// A symlink planted at api.socket is refused: fstatat(AT_SYMLINK_NOFOLLOW) stats +// the link itself (S_IFLNK), so it is never seen as a socket and never followed. 
+#[test] +fn symlink_at_leaf_is_refused() { + let tmp = tempfile::tempdir().unwrap(); + let jail = tmp.path(); + let root = make_root(jail); + let target = tmp.path().join("real-target"); + fs::write(&target, b"secret").unwrap(); + let link = root.join("api.socket"); + symlink(&target, &link).unwrap(); + + let err = grant_api_under(jail, &link).unwrap_err(); + assert!(matches!(err, Error::NotASocket), "got {err:?}"); +} + +// A symlinked path component must NOT be followed: the walk fails rather than +// reaching through it, so nothing outside the jail is touched. +#[test] +fn symlinked_component_does_not_escape() { + let tmp = tempfile::tempdir().unwrap(); + let jail = tmp.path().join("jail"); + fs::create_dir(&jail).unwrap(); + + let sentinel = tmp.path().join("sentinel"); + fs::create_dir_all(sentinel.join("id").join("root")).unwrap(); + let outside_socket = sentinel.join("id").join("root").join("api.socket"); + let _listener = UnixListener::bind(&outside_socket).unwrap(); + fs::set_permissions(&outside_socket, fs::Permissions::from_mode(0o700)).unwrap(); + + // `jail/exec` is a symlink to the external sentinel dir. + symlink(&sentinel, jail.join("exec")).unwrap(); + + let socket = jail.join("exec").join("id").join("root").join("api.socket"); + let _ = grant_api_under(&jail, &socket); // O_NOFOLLOW makes the walk refuse + + assert_eq!( + fs::symlink_metadata(&outside_socket).unwrap().mode() & 0o777, + 0o700, + "grant escaped through a symlinked component", + ); +} + +proptest! { + // For a socket `depth` components below the jail base with leaf `api.socket` + // (target never created), grant_api_under returns Ok(Pending) iff depth == 4 + // (i.e. 3 parents), else SocketShape. The generator emits only plain names so + // the lexical gate never fires and the leaf is always `api.socket`.
+ #[test] + fn shape_classification( + parents in prop::collection::vec("[a-z][a-z0-9]{0,5}", 1..6) + ) { + let tmp = tempfile::tempdir().unwrap(); + let jail = tmp.path(); + let mut socket = jail.to_path_buf(); + for c in &parents { + socket.push(c); + } + socket.push("api.socket"); + let res = grant_api_under(jail, &socket); + if parents.len() == 3 { + prop_assert!(matches!(res, Ok(GrantOut::Pending)), "3 parents must be Pending, got {res:?}"); + } else { + prop_assert!(matches!(res, Err(Error::SocketShape(_))), "got {res:?}"); + } + } +} From 85bc4a4455ccae5ca3aadca76a253fa8b126eb3b Mon Sep 17 00:00:00 2001 From: Marko Vejnovic Date: Fri, 26 Jun 2026 01:51:19 +0000 Subject: [PATCH 29/46] feat(fire_vmm): grant the API socket before probing so the controller can connect AwaitingApi now hands the jailed API socket to the node user (via the new chroot-jail grant-api op) before probing for readiness. firecracker creates the socket owned by the per-VM uid, so the controller gets EACCES until the helper chowns it; granting first lets the probe (and every later API call) connect. The grant is attempted before each probe until it succeeds, then skipped (tracked by State.api_granted); :socket_pending and transient grant errors keep the controller waiting until the existing boot deadline rather than crashing, via a shared deadline-aware keep_probing path.
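The control flow described above can be modelled roughly as follows. This is a reading aid in Rust, not the Elixir state machine this patch changes: `Controller`, `Grant`, `Step`, and `on_probe` are hypothetical names, and the grant and probe calls are passed in as closures so the sketch stays self-contained.

```rust
/// Hypothetical stand-in for the helper's answer to a grant request.
#[derive(Debug, PartialEq)]
enum Grant {
    Granted,
    Pending, // socket not created yet
    Failed,  // any other helper error
}

/// What the controller does after one probe tick.
#[derive(Debug, PartialEq)]
enum Step {
    Configure,  // API answered: move on to configuring
    RetryProbe, // keep waiting, re-arm the probe timer
    FailBoot,   // deadline lapsed: give up on the boot
}

/// Hypothetical model of the controller's data (assumed field names).
struct Controller {
    api_granted: bool,
    boot_deadline_ms: u64,
}

impl Controller {
    /// One probe tick: grant first (only until it has succeeded), then probe.
    fn on_probe(
        &mut self,
        now_ms: u64,
        grant: impl FnOnce() -> Grant,
        probe: impl FnOnce() -> bool,
    ) -> Step {
        if !self.api_granted {
            match grant() {
                Grant::Granted => self.api_granted = true,
                // Pending and hard errors are both tolerated until the deadline.
                Grant::Pending | Grant::Failed => return self.keep_probing(now_ms),
            }
        }
        if probe() {
            Step::Configure
        } else {
            self.keep_probing(now_ms)
        }
    }

    /// Shared deadline check used by both the grant and the probe failure paths.
    fn keep_probing(&self, now_ms: u64) -> Step {
        if now_ms >= self.boot_deadline_ms {
            Step::FailBoot
        } else {
            Step::RetryProbe
        }
    }
}
```

In this model a pending grant never reaches the probe, a successful grant is remembered so later ticks skip it, and every failure path funnels through the same deadline check.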
--- lib/hyper/node/fire_vmm/state.ex | 87 ++++++++++++++++++++-------- lib/hyper/suid_helper/chroot_jail.ex | 21 +++++++ 2 files changed, 84 insertions(+), 24 deletions(-) diff --git a/lib/hyper/node/fire_vmm/state.ex b/lib/hyper/node/fire_vmm/state.ex index 552da0f1..0ca98e7a 100644 --- a/lib/hyper/node/fire_vmm/state.ex +++ b/lib/hyper/node/fire_vmm/state.ex @@ -39,12 +39,13 @@ defmodule Hyper.Node.FireVMM.State do } @enforce_keys [:opts] - defstruct [:opts, :spec, :boot_deadline] + defstruct [:opts, :spec, :boot_deadline, api_granted: false] @type t :: %State{ opts: Opts.t(), spec: BootSpec.Cold.t() | nil, - boot_deadline: integer() | nil + boot_deadline: integer() | nil, + api_granted: boolean() } # How long to wait for the daemon's API to come up before failing the boot. @@ -110,7 +111,8 @@ defmodule Hyper.Node.FireVMM.State do @moduledoc "Poll the (already-launched) daemon's API socket, then advance to `:configuring`." alias Hyper.Firecracker.Api.{InstanceInfo, Operations} - alias Hyper.Node.FireVMM.{Client, Opts} + alias Hyper.Node.FireVMM.{Client, Jailer, Opts} + alias Hyper.SuidHelper.ChrootJail alias Unit.Time require Logger @@ -118,35 +120,72 @@ defmodule Hyper.Node.FireVMM.State do # How often to probe the daemon's API while waiting for it. @probe_interval Time.ms(50) - # Poll the daemon's API until it answers, then configure. Give up if the - # readiness deadline passes first. + # Hand the jailed API socket to the node user, then poll the daemon's API + # until it answers and advance to `:configuring`. Give up if the readiness + # deadline passes first. The grant must happen before the probe: firecracker + # creates the socket owned by the per-VM uid, so the unprivileged controller + # gets EACCES on connect until the helper chowns it to us. 
def handle(:state_timeout, :probe, %{opts: %Opts{vm_id: id}} = data) do - case Client.run(Client.via(id), &Operations.describe_instance/1) do - {:ok, %InstanceInfo{}} -> - {:next_state, :configuring, data, [{:state_timeout, 0, :configure}]} - - {:error, reason} -> - if System.monotonic_time(:millisecond) >= data.boot_deadline do - # firecracker's own log shows the API server is up, so a persistent - # probe failure points at the host->jail socket (path or, more often, - # permissions: the jailer drops firecracker to the per-VM uid and the - # unprivileged controller cannot reach the jailed socket). Surface the - # last error rather than swallowing it into a bare :daemon_unready. - Logger.warning( - "vm #{id}: firecracker API not reachable before deadline; " <> - "last probe error: #{inspect(reason)}" - ) - - {:stop, {:shutdown, {:boot_failed, {:daemon_unready, reason}}}, data} - else - {:keep_state_and_data, [{:state_timeout, Time.as_ms(@probe_interval), :probe}]} + case ensure_api_granted(id, data) do + {:cont, data} -> + case Client.run(Client.via(id), &Operations.describe_instance/1) do + {:ok, %InstanceInfo{}} -> + {:next_state, :configuring, data, [{:state_timeout, 0, :configure}]} + + {:error, reason} -> + keep_probing(id, data, reason) end + + {:wait, data, reason} -> + keep_probing(id, data, reason) end end def handle({:call, from}, :stop, data) do {:next_state, :stopping, data, [{:reply, from, :ok}]} end + + # Ensure the jailed API socket has been handed to the node user. Idempotent + # once granted (we record it in `data` so we ask the helper only once). + # `:socket_pending` means firecracker has not created the socket yet, so we + # keep waiting; a hard error is logged but also tolerated until the deadline + # (the probe that follows would fail with EACCES anyway and drive the stop). 
+ @spec ensure_api_granted(Hyper.Vm.id(), State.t()) :: + {:cont, State.t()} | {:wait, State.t(), term()} + defp ensure_api_granted(_id, %{api_granted: true} = data), do: {:cont, data} + + defp ensure_api_granted(id, data) do + case ChrootJail.grant_api(Jailer.host_socket(id)) do + :ok -> + {:cont, %{data | api_granted: true}} + + {:error, :socket_pending} -> + {:wait, data, :socket_pending} + + {:error, reason} -> + Logger.warning("vm #{id}: grant-api failed: #{inspect(reason)}") + {:wait, data, {:grant_api, reason}} + end + end + + # Keep waiting for readiness, re-arming the probe timer, unless the deadline + # has lapsed - then fail the boot, surfacing `reason` rather than swallowing + # it into a bare `:daemon_unready`. A persistent failure here points at the + # host->jail socket (path or, more often, the grant/permission step above). + @spec keep_probing(Hyper.Vm.id(), State.t(), term()) :: + {:keep_state, State.t(), list()} | {:stop, term(), State.t()} + defp keep_probing(id, data, reason) do + if System.monotonic_time(:millisecond) >= data.boot_deadline do + Logger.warning( + "vm #{id}: firecracker API not reachable before deadline; " <> + "last probe error: #{inspect(reason)}" + ) + + {:stop, {:shutdown, {:boot_failed, {:daemon_unready, reason}}}, data} + else + {:keep_state, data, [{:state_timeout, Time.as_ms(@probe_interval), :probe}]} + end + end end defmodule Configuring do diff --git a/lib/hyper/suid_helper/chroot_jail.ex b/lib/hyper/suid_helper/chroot_jail.ex index fdc0f1ce..cf8370c0 100644 --- a/lib/hyper/suid_helper/chroot_jail.ex +++ b/lib/hyper/suid_helper/chroot_jail.ex @@ -57,4 +57,25 @@ defmodule Hyper.SuidHelper.ChrootJail do {:error, _} = err -> err end end + + @doc """ + Hand the firecracker API `socket` to the node user so the unprivileged + controller can `connect()` to it. 
The jailer drops firecracker to a per-VM + uid/gid and chroots it, so the socket it creates is owned by that per-VM id and + the node (a different uid) gets `EACCES` on connect. The helper chowns just + that one socket to its caller (the node user) and chmods it `0660`, leaving the + rest of the per-VM isolation intact. + + Returns `{:error, :socket_pending}` while firecracker has not yet created the + socket, so the caller can keep waiting. + """ + @spec grant_api(Path.t()) :: :ok | {:error, :socket_pending} | {:error, err()} + @decorate with_span("Hyper.SuidHelper.ChrootJail.grant_api", include: [:socket]) + def grant_api(socket) do + case SuidHelper.exec(["chroot-jail", "grant-api", "--socket", socket]) do + {:ok, %{"result" => "granted"}} -> :ok + {:ok, %{"result" => "pending"}} -> {:error, :socket_pending} + {:error, _} = err -> err + end + end end From 71d64403d43880f5889844bb2819bf24f7228905 Mon Sep 17 00:00:00 2001 From: Marko Vejnovic Date: Fri, 26 Jun 2026 02:19:17 +0000 Subject: [PATCH 30/46] fix(hyper): generate alphanumeric vm ids (firecracker rejects _) gen_vm_id used Base.url_encode64, which emits - and _. firecracker rejects _ in an instance id (InvalidInstanceId / "Invalid char (_)"), so any id containing one crash-looped the jailer at boot. Switch to lowercase base32 ([a-z2-7], alphanumeric only) - the intersection of the firecracker, dm/jailer, and registry-key constraints. Strengthen the property from "no leading -" to "strictly alphanumeric". --- lib/hyper.ex | 19 +++++++++++++++++-- test/hyper/node/fire_vmm/jailer_test.exs | 7 +++++-- 2 files changed, 22 insertions(+), 4 deletions(-) diff --git a/lib/hyper.ex b/lib/hyper.ex index c29d2f03..ce946a6a 100644 --- a/lib/hyper.ex +++ b/lib/hyper.ex @@ -34,9 +34,24 @@ defmodule Hyper do end end - @doc "Generate a fresh VM id (url-safe base64, dm-name compatible)." + @doc """ + Generate a fresh VM id: a `v` prefix followed by lowercase base32 of 10 random + bytes, charset `[a-z2-7]`. 
+ + Alphanumeric only - no `-`, `_`, or other punctuation. That is the intersection + of three independent constraints the id must satisfy at once: + + * firecracker rejects `_` in an instance id (`InvalidInstanceId`); + * dm/jailer names must not start with `-`; + * registry keys and chroot path components stay trivially safe. + + The previous base64url encoding emitted `-` and `_`, so it could produce ids + firecracker refused at boot (`Invalid char (_)`). + """ @spec gen_vm_id() :: Hyper.Vm.id() - def gen_vm_id, do: "v" <> Base.url_encode64(:crypto.strong_rand_bytes(9), padding: false) + def gen_vm_id do + "v" <> Base.encode32(:crypto.strong_rand_bytes(10), padding: false, case: :lower) + end @spec resolve_arch(Hyper.Vm.Instance.arch() | nil) :: {:ok, Hyper.Vm.Instance.arch()} | {:error, term()} diff --git a/test/hyper/node/fire_vmm/jailer_test.exs b/test/hyper/node/fire_vmm/jailer_test.exs index 9ac17536..1ca862fe 100644 --- a/test/hyper/node/fire_vmm/jailer_test.exs +++ b/test/hyper/node/fire_vmm/jailer_test.exs @@ -80,9 +80,12 @@ defmodule Hyper.Node.FireVMM.JailerTest do assert Enum.any?(args, &String.starts_with?(&1, "memory.max=")) end - property "gen_vm_id/0 never produces an id starting with -" do + # firecracker rejects an instance id containing `_` (and dm/jailer names must + # not lead with `-`), so the id must be strictly alphanumeric. 
+ property "gen_vm_id/0 produces only alphanumeric ids" do check all(_ <- StreamData.constant(nil)) do - refute String.starts_with?(Hyper.gen_vm_id(), "-") + id = Hyper.gen_vm_id() + assert id =~ ~r/\A[A-Za-z0-9]+\z/ end end end From b3c2315bf94462af071fc1203a9bf1cfe5d7d599 Mon Sep 17 00:00:00 2001 From: Marko Vejnovic Date: Fri, 26 Jun 2026 02:19:26 +0000 Subject: [PATCH 31/46] fix(fire_vmm): self-register per-VM names in init, not via a start name Starting a process with a {:via, Horde.Registry, _} name makes OTP run gen:get_proc_name right after start: it calls whereis_name immediately after the synchronous register. Horde materialises the name into local ETS only asynchronously (DeltaCRDT diff loop), so under registry churn the read loses the race and startup aborts with {:process_not_registered_via, Horde.Registry}. The crash-storm from the bad vm_id flooded the CRDT and tipped this over, killing FireVMM.State. Add Routing.register_self/1 and have the supervisor, client, and state machine register from their own init (started unnamed). Names stay cluster-resolvable via via/1 once the diff propagates - callers already tolerate that lag. Core starts unnamed entirely (resolved by nobody). --- lib/hyper/cluster/routing.ex | 20 ++++++++++++++++++++ lib/hyper/node/fire_vmm.ex | 30 +++++++++++++++++++----------- lib/hyper/node/fire_vmm/client.ex | 31 +++++++++++++++++++++---------- lib/hyper/node/fire_vmm/core.ex | 5 ++++- lib/hyper/node/fire_vmm/state.ex | 28 ++++++++++++++++++++-------- 5 files changed, 84 insertions(+), 30 deletions(-) diff --git a/lib/hyper/cluster/routing.ex b/lib/hyper/cluster/routing.ex index c7802595..7b478363 100644 --- a/lib/hyper/cluster/routing.ex +++ b/lib/hyper/cluster/routing.ex @@ -28,6 +28,26 @@ defmodule Hyper.Cluster.Routing do @spec via(term()) :: {:via, module(), {atom(), term()}} def via(key), do: {:via, Horde.Registry, {@name, key}} + @doc """ + Register the calling process under `key` from inside its own `init`. 
+ + Prefer this over starting a process with a `{:via, Horde.Registry, _}` name. + OTP's post-start name check (`gen:get_proc_name`) calls `whereis_name` + immediately after the synchronous `register`, but Horde materialises the name + into its local ETS only asynchronously, via the DeltaCRDT diff loop. Under + registry churn that read loses the race and OTP aborts startup with + `{:process_not_registered_via, Horde.Registry}`. Registering from within + `init` carries no such self-check, while leaving the name cluster-resolvable + through `via/1` once the diff propagates (callers already tolerate that lag). + """ + @spec register_self(term()) :: :ok | {:error, {:already_registered, pid()}} + def register_self(key) do + case Horde.Registry.register(@name, key, nil) do + {:ok, _pid} -> :ok + {:error, {:already_registered, _pid}} = err -> err + end + end + @doc "Which node currently runs `vm_id`? `nil` if unknown." @spec whereis(Hyper.Vm.id()) :: node() | nil @decorate with_span("Hyper.Cluster.Routing.whereis", include: [:vm_id]) diff --git a/lib/hyper/node/fire_vmm.ex b/lib/hyper/node/fire_vmm.ex index 59388626..f9e45fbc 100644 --- a/lib/hyper/node/fire_vmm.ex +++ b/lib/hyper/node/fire_vmm.ex @@ -47,7 +47,7 @@ defmodule Hyper.Node.FireVMM do @spec start_link(Opts.t()) :: Supervisor.on_start() def start_link(opts) do - Supervisor.start_link(__MODULE__, opts, name: via(opts.vm_id)) + Supervisor.start_link(__MODULE__, opts) end @spec child_spec(Opts.t()) :: Supervisor.child_spec() @@ -64,18 +64,26 @@ defmodule Hyper.Node.FireVMM do @impl true def init(opts) do - children = [ - # Client must be registered before Core: Core starts the State machine, - # which calls Client.run while waiting for the daemon's API. Client depends - # only on vm_id (an independent peer), so it has no reverse dependency. 
- {Client, %Client.Opts{vm_id: opts.vm_id}}, - {Core, opts} - ] + # Self-register the cluster routing entry here rather than via a start name; + # see `Hyper.Cluster.Routing.register_self/1`. A fresh random vm_id never + # collides, so `:already_registered` only happens against a stale dead + # incarnation - decline the start and let the supervisor retry clean. + case Hyper.Cluster.Routing.register_self({opts.vm_id, :supervisor}) do + :ok -> + children = [ + # Client must be registered before Core: Core starts the State machine, + # which calls Client.run while waiting for the daemon's API. Client + # depends only on vm_id (an independent peer), so no reverse dependency. + {Client, %Client.Opts{vm_id: opts.vm_id}}, + {Core, opts} + ] - Supervisor.init(children, strategy: :one_for_one) - end + Supervisor.init(children, strategy: :one_for_one) - defp via(vm_id), do: Hyper.Cluster.Routing.via({vm_id, :supervisor}) + {:error, _} -> + :ignore + end + end @doc "Test whether the system can run firecracker VMMs." @spec test_system() :: :ok | {:error, term()} diff --git a/lib/hyper/node/fire_vmm/client.ex b/lib/hyper/node/fire_vmm/client.ex index a336b095..2ed991ce 100644 --- a/lib/hyper/node/fire_vmm/client.ex +++ b/lib/hyper/node/fire_vmm/client.ex @@ -57,15 +57,12 @@ defmodule Hyper.Node.FireVMM.Client do @type t :: %__MODULE__{socket_path: Path.t()} end + # Prod path (vm_id, no explicit name) starts unnamed and self-registers in + # `init` - see `Hyper.Cluster.Routing.register_self/1`. A `:name` override + # (test stand-ins) is honoured as a plain local name and skips registration. 
@spec start_link(Opts.t()) :: GenServer.on_start() def start_link(%Opts{} = opts) do - name = - case opts.name do - nil when not is_nil(opts.vm_id) -> via(opts.vm_id) - other -> other - end - - GenServer.start_link(__MODULE__, opts, gen_opts(name)) + GenServer.start_link(__MODULE__, opts, gen_opts(opts.name)) end @spec via(Hyper.Vm.id()) :: GenServer.name() @@ -78,12 +75,26 @@ defmodule Hyper.Node.FireVMM.Client do end @impl true - @spec init(Opts.t()) :: {:ok, State.t()} + @spec init(Opts.t()) :: {:ok, State.t()} | {:stop, {:already_registered, pid()}} def init(%Opts{} = opts) do - socket_path = opts.socket_path || Jailer.host_socket(opts.vm_id) - {:ok, %State{socket_path: socket_path}} + with :ok <- register(opts) do + socket_path = opts.socket_path || Jailer.host_socket(opts.vm_id) + {:ok, %State{socket_path: socket_path}} + end + end + + # Register cluster-wide under {vm_id, :client} on the prod path. With an + # explicit name (test stand-in), the name is the local registration, so skip. + @spec register(Opts.t()) :: :ok | {:stop, {:already_registered, pid()}} + defp register(%Opts{name: nil, vm_id: vm_id}) when not is_nil(vm_id) do + case Hyper.Cluster.Routing.register_self({vm_id, :client}) do + :ok -> :ok + {:error, reason} -> {:stop, reason} + end end + defp register(%Opts{}), do: :ok + @impl true def handle_call({:run, op_fun}, _from, %State{socket_path: socket_path} = state) do {:reply, op_fun.(socket_path: socket_path), state} diff --git a/lib/hyper/node/fire_vmm/core.ex b/lib/hyper/node/fire_vmm/core.ex index 8ccf6180..18dfeb50 100644 --- a/lib/hyper/node/fire_vmm/core.ex +++ b/lib/hyper/node/fire_vmm/core.ex @@ -28,9 +28,12 @@ defmodule Hyper.Node.FireVMM.Core do alias Hyper.Node.FireVMM.Daemon alias Hyper.Node.FireVMM.State + # Started unnamed: nothing resolves the core by name (it is addressed as a + # child of `Hyper.Node.FireVMM`), so it needs no registry entry - and avoids a + # needless racy Horde registration at startup. 
@spec start_link(FireVMM.Opts.t()) :: Supervisor.on_start() def start_link(opts) do - Supervisor.start_link(__MODULE__, opts, name: Hyper.Cluster.Routing.via({opts.vm_id, :core})) + Supervisor.start_link(__MODULE__, opts) end @impl true diff --git a/lib/hyper/node/fire_vmm/state.ex b/lib/hyper/node/fire_vmm/state.ex index 0ca98e7a..7a1749d9 100644 --- a/lib/hyper/node/fire_vmm/state.ex +++ b/lib/hyper/node/fire_vmm/state.ex @@ -55,8 +55,11 @@ defmodule Hyper.Node.FireVMM.State do %{id: __MODULE__, start: {__MODULE__, :start_link, [opts]}} end - def start_link(%Opts{vm_id: id} = opts) do - :gen_statem.start_link(via(id), __MODULE__, opts, []) + # Started unnamed; the controller self-registers under `{id, :state}` from + # `init` (see `Hyper.Cluster.Routing.register_self/1`). `stop/1` still resolves + # it cluster-wide through `via/1`. + def start_link(%Opts{} = opts) do + :gen_statem.start_link(__MODULE__, opts, []) end @spec stop(Hyper.Vm.id()) :: :ok @@ -73,12 +76,21 @@ defmodule Hyper.Node.FireVMM.State do # The daemon is already (being) started by `Core` as our sibling. Read the root # device off the per-VM mutable layer, resolve the boot spec, set the readiness # deadline, and start probing the API. 
- def init(%Opts{mutable: mutable, kernel: kernel, boot_args: boot_args, type: type} = opts) do - spec = BootSpec.resolve(boot_source(kernel, Mutable.blk_path(mutable), boot_args), type) - deadline = System.monotonic_time(:millisecond) + Time.as_ms(@ready_timeout) - data = %State{opts: opts, spec: spec, boot_deadline: deadline} - - {:ok, :awaiting_api, data, [{:state_timeout, 0, :probe}]} + def init( + %Opts{vm_id: id, mutable: mutable, kernel: kernel, boot_args: boot_args, type: type} = + opts + ) do + case Hyper.Cluster.Routing.register_self({id, :state}) do + :ok -> + spec = BootSpec.resolve(boot_source(kernel, Mutable.blk_path(mutable), boot_args), type) + deadline = System.monotonic_time(:millisecond) + Time.as_ms(@ready_timeout) + data = %State{opts: opts, spec: spec, boot_deadline: deadline} + + {:ok, :awaiting_api, data, [{:state_timeout, 0, :probe}]} + + {:error, reason} -> + {:stop, reason} + end end # Assemble the `Hyper.Vm.source()` BootSpec expects from the resolved kernel + From 41961533b98b65020457acaf4e61ca222bf8d2d1 Mon Sep 17 00:00:00 2001 From: Marko Vejnovic Date: Fri, 26 Jun 2026 02:44:27 +0000 Subject: [PATCH 32/46] fix(suidhelper): grant-api opens the jail root dir for node traversal grant-api chowned the API socket to the caller, but the jailer leaves /root as 0700 owned by the per-VM uid. connect() needs search (+x) on every ancestor, so the unprivileged node still got EACCES traversing into root - the socket's owner was irrelevant. The op now also opens that one directory to the caller's group: owner stays the per-VM uid (firecracker needs it), chgrp to the caller's gid, chmod 0710 (owner rwx, group --x traverse-not-list, other none). Unrelated users stay locked out; only the socket and its parent's group/mode move. Add SafeDir::chmod_self/chgrp_self (fchmod/fchown on the pinned root fd, not by name - TOCTOU-safe). Extend the grant test to assert root is chgrp'd to the caller and chmod'd 0710. 
--- .../src/tools/chroot_jail/grant_api.rs | 40 +++++++++++++++---- native/suidhelper/src/util/safe_dir.rs | 24 ++++++++++- .../tests/tools/chroot_jail_grant_api.rs | 19 ++++++++- 3 files changed, 73 insertions(+), 10 deletions(-) diff --git a/native/suidhelper/src/tools/chroot_jail/grant_api.rs b/native/suidhelper/src/tools/chroot_jail/grant_api.rs index a74b5481..4ec94a21 100644 --- a/native/suidhelper/src/tools/chroot_jail/grant_api.rs +++ b/native/suidhelper/src/tools/chroot_jail/grant_api.rs @@ -4,13 +4,22 @@ //! //! The jailer drops firecracker to a per-VM uid/gid and chroots it; firecracker //! then creates its API socket at `/root/api.socket` owned by that per-VM -//! id, mode `0755`. Connecting a unix socket needs *write* permission on the -//! node, so the node user (a different uid) gets `EACCES`. This op chowns just -//! that one socket to the helper's CALLER — `getuid()`/`getgid()`, which inside -//! the privileged scope are the real (caller) ids while euid is 0 — and chmods -//! it `0660`. The node thus connects as owner, and humans added to the node's -//! group connect via the group bit. Per-VM isolation is otherwise untouched: -//! only this single socket moves, nothing else in the jail. +//! id. Connecting a unix socket needs *write* permission on the node, so the +//! node user (a different uid) gets `EACCES`. This op chowns just that one +//! socket to the helper's CALLER — `getuid()`/`getgid()`, which inside the +//! privileged scope are the real (caller) ids while euid is 0 — and chmods it +//! `0660`. The node thus connects as owner, and humans added to the node's +//! group connect via the group bit. +//! +//! That alone is not enough: the jailer leaves `/root` as `0700` owned by +//! the per-VM uid, and connecting needs *search* (`+x`) on every ancestor, so +//! the node cannot even traverse into `root` to reach the (now its own) socket. +//! So this op also opens just that one directory to the caller's group: it keeps +//! 
the per-VM uid as owner (firecracker still needs it), chgrps `root` to the +//! caller's gid, and chmods it `0710` — owner `rwx`, group `--x` (traverse, not +//! list), other none. Per-VM isolation is otherwise untouched: only this socket +//! and its immediate parent's group/mode move, nothing else in the jail, and +//! unrelated users stay locked out. //! //! Security: the socket path is validated as a `SafePath` and reached by an //! `O_NOFOLLOW` walk from `JAIL_BASE`, so a symlinked component cannot redirect @@ -45,6 +54,11 @@ const SOCKET_PARENT_DEPTH: usize = 3; /// Mode handed to the node: owner+group read/write, no world access. const SOCKET_MODE: u32 = 0o660; +/// Mode set on the jail `root` dir so the node's group can *traverse* it to +/// reach the socket: owner `rwx` (the per-VM uid, unchanged), group `--x` +/// (traverse, not list), other none. +const JAIL_ROOT_MODE: u32 = 0o710; + type LexicalPath = SafePath; #[derive(Debug, ThisError)] @@ -65,6 +79,10 @@ pub enum Error { Chown(#[source] safe_dir::Error), #[error("chmoding the socket: {0}")] Chmod(#[source] safe_dir::Error), + #[error("chgrp-ing the jail root dir to the caller: {0}")] + ChgrpRoot(#[source] safe_dir::Error), + #[error("chmoding the jail root dir for traversal: {0}")] + ChmodRoot(#[source] safe_dir::Error), } #[derive(Args)] @@ -139,6 +157,14 @@ pub fn grant_api_under(jail_base: &Path, socket: &Path) -> Result Result<(), Error> { + fchmod(self.0.as_raw_fd(), Mode::from_bits_truncate(mode)).map_err(|source| Error::Chmod { + name: PathBuf::from("."), + source, + }) + } + + /// `fchown` this directory's group through its own held fd, preserving its + /// owner (no uid passed). Same TOCTOU guarantee as [`chmod_self`](Self::chmod_self). 
+ pub fn chgrp_self(&self, gid: u32) -> Result<(), Error> { + fchown(self.0.as_raw_fd(), None, Some(Gid::from_raw(gid))).map_err(|source| Error::Chown { + name: PathBuf::from("."), + source, + }) + } + /// Remove the non-directory entry `name` from this directory. pub fn unlink(&self, name: &Path) -> Result<(), Error> { unlinkat(Some(self.0.as_raw_fd()), name, UnlinkatFlags::NoRemoveDir).map_err(|source| { diff --git a/native/suidhelper/tests/tools/chroot_jail_grant_api.rs b/native/suidhelper/tests/tools/chroot_jail_grant_api.rs index 3e6092f5..e55a679b 100644 --- a/native/suidhelper/tests/tools/chroot_jail_grant_api.rs +++ b/native/suidhelper/tests/tools/chroot_jail_grant_api.rs @@ -13,7 +13,9 @@ //! can never escape the anchored jail tree (the core TOCTOU guarantee); //! * pending — a not-yet-created socket (or half-built jail) is `Pending`, not //! an error, so the controller keeps probing; -//! * grant — a real socket is chowned to the caller and left mode 0660. +//! * grant — a real socket is chowned to the caller and left mode 0660, and +//! its parent `root` dir is opened for the caller's group to traverse +//! (chgrp'd to the caller, chmod'd 0710) so the node can reach the socket. use hyper_suidhelper::tools::chroot_jail::grant_api::{grant_api_under, Error, GrantOut}; use hyper_suidhelper::util::safe_path::ValidationError; @@ -134,6 +136,21 @@ fn real_socket_is_granted_and_chmod_0660() { assert_eq!(meta.mode() & 0o777, 0o660, "socket must be chmod'd 0660"); assert_eq!(meta.uid(), nix::unistd::getuid().as_raw()); assert_eq!(meta.gid(), nix::unistd::getgid().as_raw()); + + // The parent `root` dir must also be opened for the caller's group to + // traverse, else the node could not reach the socket: chgrp'd to the caller, + // chmod'd 0710 (owner rwx, group --x, other none). 
+ let root_meta = fs::symlink_metadata(&root).unwrap(); + assert_eq!( + root_meta.mode() & 0o777, + 0o710, + "jail root must be chmod'd 0710 for traversal", + ); + assert_eq!( + root_meta.gid(), + nix::unistd::getgid().as_raw(), + "jail root must be chgrp'd to the caller", + ); } // A regular file planted at api.socket is refused and left untouched (not chmod'd). From 528487855a9960e12750d5fa022b14fdd2d28da2 Mon Sep 17 00:00:00 2001 From: Marko Vejnovic Date: Fri, 26 Jun 2026 05:22:04 +0000 Subject: [PATCH 33/46] fix(suidhelper): resolve dm symlink to real node before rootfs open The rootfs device handed to chroot-jail staging is /dev/mapper/hyper-rw-, a symlink into the dm farm. SafeFile opens it O_NOFOLLOW (it must never follow an attacker-supplied symlink), so the dm symlink itself was rejected with "file is not of the required type" and the guest never booted. Resolve the device to its real /dev/dm-N node via canonicalize before the open. Safe because the BlockDev lexical guard already proved the name is a hyper-owned dm device and /dev/mapper is a root-owned 0755 dir an unprivileged caller cannot redirect; loop nodes canonicalize to themselves. The re-opened target is still verified IsBlockDevice. --- native/suidhelper/src/util/chroot_jail.rs | 20 ++++++++++++++++++-- 1 file changed, 18 insertions(+), 2 deletions(-) diff --git a/native/suidhelper/src/util/chroot_jail.rs b/native/suidhelper/src/util/chroot_jail.rs index 70d4ce78..6a618187 100644 --- a/native/suidhelper/src/util/chroot_jail.rs +++ b/native/suidhelper/src/util/chroot_jail.rs @@ -43,6 +43,12 @@ pub enum Error { }, #[error("copy kernel: {0}")] Copy(#[source] io::Error), + #[error("resolving block device {path}: {source}")] + ResolveDev { + path: PathBuf, + #[source] + source: io::Error, + }, } /// An unfilled artifact slot. @@ -190,8 +196,18 @@ pub fn stage_kernel_under( /// `uid:gid`. 
The device is opened as a verified `SafeFile`, so the /// number comes from a real device node, never a caller-supplied value. fn make_rootfs(chroot: &SafeDir, device: &BlockDev, uid: u32, gid: u32) -> Result<(), Error> { - let dev_path: SafePath = - device.as_ref().to_path_buf().try_into()?; + // `device` is lexically validated as `/dev/loopN` or `/dev/mapper/hyper-*`. + // The dm form is a symlink under `/dev/mapper` — a root-owned `0755` dir an + // unprivileged caller cannot write — so resolving it to its real `/dev/dm-N` + // node cannot be redirected by the caller. We must resolve first: `SafeFile` + // opens `O_NOFOLLOW` (it must never follow an attacker-supplied symlink) and + // would otherwise reject the dm symlink itself. Loop nodes resolve to + // themselves. The re-opened target is still verified `IsBlockDevice`. + let real = std::fs::canonicalize(device.as_ref()).map_err(|source| Error::ResolveDev { + path: device.as_ref().to_path_buf(), + source, + })?; + let dev_path: SafePath = real.try_into()?; let dev = SafeFile::::open(&dev_path, OFlag::O_PATH)?; let rdev = dev.rdev()?; chroot.mknod_block(Path::new(ROOTFS_NAME), rdev, uid, gid)?; From f2dc209171a3e6af495cb34ca86abe0f201678f5 Mon Sep 17 00:00:00 2001 From: Marko Vejnovic Date: Fri, 26 Jun 2026 05:22:08 +0000 Subject: [PATCH 34/46] feat(fire_vmm): log boot failures in the Configuring state Both staging and config-apply failures end the boot via a :one_for_all supervisor restart, so without an explicit log the reason vanished into the restart cycle and the VM just appeared to relaunch for no visible cause. Log the reason at each failure path before stopping. 
--- lib/hyper/node/fire_vmm/state.ex | 16 +++++++++++++--- 1 file changed, 13 insertions(+), 3 deletions(-) diff --git a/lib/hyper/node/fire_vmm/state.ex b/lib/hyper/node/fire_vmm/state.ex index 7a1749d9..fa87d9b4 100644 --- a/lib/hyper/node/fire_vmm/state.ex +++ b/lib/hyper/node/fire_vmm/state.ex @@ -208,8 +208,12 @@ defmodule Hyper.Node.FireVMM.State do alias Hyper.Firecracker.Api.{InstanceActionInfo, Operations} alias Hyper.Node.FireVMM.{BootSpec, ChrootJail, Client, Opts} + require Logger + # Stage boot artifacts into the chroot, then issue the pre-boot config and - # start the guest. + # start the guest. Both failure paths end the boot via a supervisor restart, + # so log the reason here - otherwise it vanishes into the `:one_for_all` + # cycle and the VM just appears to relaunch for no visible cause. def handle( :state_timeout, :configure, @@ -218,11 +222,17 @@ defmodule Hyper.Node.FireVMM.State do case ChrootJail.stage(id, uid, gid, spec) do {:ok, jailed_spec} -> case apply_spec(id, jailed_spec) do - :ok -> {:next_state, :running, data} - {:error, reason} -> {:stop, {:shutdown, {:boot_failed, reason}}, data} + :ok -> + Logger.info("vm #{id}: configured, guest starting") + {:next_state, :running, data} + + {:error, reason} -> + Logger.error("vm #{id}: boot failed applying config: #{inspect(reason)}") + {:stop, {:shutdown, {:boot_failed, reason}}, data} end {:error, reason} -> + Logger.error("vm #{id}: boot failed staging jail: #{inspect(reason)}") {:stop, {:shutdown, {:boot_failed, {:staging, reason}}}, data} end end From 725e1ca4c6f20e6141e9b0f4ece19753a1b36b05 Mon Sep 17 00:00:00 2001 From: Marko Vejnovic Date: Fri, 26 Jun 2026 17:06:06 +0000 Subject: [PATCH 35/46] fix(suidhelper): cgroup.kill the leaf before removing it chroot-jail remove rmdir'd the per-VM cgroup leaf and swallowed ENOTEMPTY as success, so a firecracker still alive in the cgroup was never killed and leaked - holding its rootfs dm device and loop devices, wedging the host (a restart storm 
left 344 orphaned firecrackers that needed sudo pkill to clear). Write "1" to the leaf's cgroup.kill (v2; the jailer runs --cgroup-version 2) before rmdir, SIGKILLing the whole subtree regardless of session - the jailer setsid's firecracker out of the process group, so MuonTrap's group-kill misses it but cgroup.kill does not. Reorder so the cgroup teardown (the kill) runs before the chroot removal, and retry the rmdir while the killed cgroup drains (EBUSY/ENOTEMPTY tolerated; a persistent busy is left for a later sweep rather than failing a relaunch). Adds SafeDir::write_file (O_WRONLY|O_NOFOLLOW, no O_CREAT) for the pseudo-file write, fd-relative on the pinned leaf dir. --- .../src/tools/chroot_jail/remove.rs | 74 ++++++++++++++++--- native/suidhelper/src/util/safe_dir.rs | 38 +++++++++- .../tests/tools/chroot_jail_remove.rs | 47 ++++++++++++ native/suidhelper/tests/util/confinement.rs | 37 ++++++++++ 4 files changed, 186 insertions(+), 10 deletions(-) diff --git a/native/suidhelper/src/tools/chroot_jail/remove.rs b/native/suidhelper/src/tools/chroot_jail/remove.rs index 57e19495..7574b364 100644 --- a/native/suidhelper/src/tools/chroot_jail/remove.rs +++ b/native/suidhelper/src/tools/chroot_jail/remove.rs @@ -7,8 +7,11 @@ //! symlinked component cannot redirect the deletion outside the tree, and removal //! is fd-relative (`unlinkat`), never by re-resolved name. `--chroot` must be //! exactly `/` below `JAIL_BASE`; `--cgroup` at least two components -//! below its base (a non-recursive `rmdir`). Both deletes are idempotent: a -//! missing target (`ENOENT`, and for the cgroup `ENOTEMPTY`) is success. +//! below its base. Before removing the cgroup leaf we write `cgroup.kill` to +//! SIGKILL any process still in the subtree (a still-live firecracker would +//! otherwise keep the leaf non-empty and leak its loop/dm devices), then +//! `rmdir` it (non-recursive). Both deletes are idempotent: a missing target +//! (`ENOENT`, and for the cgroup `ENOTEMPTY`) is success. 
use crate::config::Config; use crate::tools::IsTool; @@ -18,11 +21,18 @@ use clap::Args; use nix::errno::Errno; use serde::Serialize; use std::path::{Path, PathBuf}; +use std::time::Duration; use thiserror::Error as ThisError; /// The cgroup virtual filesystem root. const CGROUP_BASE: &str = "/sys/fs/cgroup"; +/// How many times to retry the leaf `rmdir` while a killed cgroup drains, and the +/// pause between tries: ~`ATTEMPTS * BACKOFF` total (a cgroup reaps in a few ms), +/// after which a still-busy leaf is left for a later sweep rather than failing. +const RMDIR_ATTEMPTS: u32 = 20; +const RMDIR_BACKOFF_MS: u64 = 5; + type LexicalPath = SafePath; #[derive(Debug, ThisError)] @@ -41,6 +51,8 @@ pub enum Error { RemoveChroot(#[source] safe_dir::Error), #[error("removing cgroup: {0}")] RemoveCgroup(#[source] safe_dir::Error), + #[error("killing cgroup procs: {0}")] + KillCgroup(#[source] safe_dir::Error), } #[derive(Args)] @@ -74,8 +86,10 @@ impl IsTool for Remove { type RunT = Result<(), Error>; fn run_privileged(&self) -> Self::RunT { - remove_chroot_under(&Config::get().jail_base(), &self.args.chroot)?; + // Cgroup first: it SIGKILLs any firecracker still alive in the leaf, so the + // chroot teardown below is not yanking files out from under a live process. remove_cgroup_under(Path::new(CGROUP_BASE), &self.args.cgroup)?; + remove_chroot_under(&Config::get().jail_base(), &self.args.chroot)?; Ok(()) } @@ -105,9 +119,11 @@ pub fn remove_chroot_under(jail_base: &Path, chroot: &Path) -> Result<(), Error> } } -/// Remove the (empty) per-VM cgroup leaf, fd-relative after an `O_NOFOLLOW` walk -/// from `base`. `--cgroup` must be at least two components below `base` (one or -/// more parents, plus the leaf). Idempotent on `ENOENT`/`ENOTEMPTY`. +/// SIGKILL any process still in the per-VM cgroup leaf (via `cgroup.kill`), then +/// `rmdir` it, fd-relative after an `O_NOFOLLOW` walk from `base`. 
`--cgroup` must +/// be at least two components below `base` (one or more parents, plus the leaf). +/// Idempotent: a missing leaf is success, and a leaf that has not finished +/// draining after the kill (`EBUSY`/`ENOTEMPTY`) is left for a later sweep. pub fn remove_cgroup_under(base: &Path, cgroup: &Path) -> Result<(), Error> { let path: LexicalPath = cgroup.to_path_buf().try_into().map_err(Error::CgroupPath)?; let (parents, leaf) = path.relative_to(base).map_err(Error::CgroupPath)?; @@ -118,10 +134,50 @@ pub fn remove_cgroup_under(base: &Path, cgroup: &Path) -> Result<(), Error> { let Some(parent) = walk(base.to_path_buf(), &parents)? else { return Ok(()); // an ancestor is already gone }; - match parent.rmdir(&leaf) { + + // Kill any process still in the leaf cgroup before removing it. Open the leaf + // O_NOFOLLOW (one step past the parent walk) and write "1" to cgroup.kill. + match parent.openat_dir(&leaf) { + Ok(leaf_dir) => kill_cgroup(&leaf_dir)?, + Err(e) if e.errno() == Some(Errno::ENOENT) => return Ok(()), // leaf already gone + Err(e) => return Err(Error::Walk(e)), + } + + rmdir_drained(&parent, &leaf) +} + +/// `rmdir` the killed cgroup leaf, retrying while it drains. `cgroup.kill` signals +/// SIGKILL synchronously but the kernel reaps the processes and offlines the +/// cgroup asynchronously, so the `rmdir` briefly returns `EBUSY`/`ENOTEMPTY`. +/// Retry a bounded number of times; if it still has not drained, the processes are +/// already dead (the kill is the load-bearing op) and a later sweep or the next +/// relaunch clears the empty leaf, so a persistent busy is success — never fail +/// the caller (a relaunch) over leftover-dir cleanup. `ENOENT` means already gone. 
+fn rmdir_drained(parent: &SafeDir, leaf: &Path) -> Result<(), Error> { + for attempt in 0..RMDIR_ATTEMPTS { + match parent.rmdir(leaf) { + Ok(()) => return Ok(()), + Err(e) if e.errno() == Some(Errno::ENOENT) => return Ok(()), + Err(e) if matches!(e.errno(), Some(Errno::EBUSY | Errno::ENOTEMPTY)) => { + if attempt + 1 < RMDIR_ATTEMPTS { + std::thread::sleep(Duration::from_millis(RMDIR_BACKOFF_MS)); + } + } + Err(e) => return Err(Error::RemoveCgroup(e)), + } + } + Ok(()) +} + +/// Best-effort SIGKILL of every process in the v2 cgroup `leaf_dir`: write "1" to +/// its `cgroup.kill` pseudo-file. A missing file (`ENOENT`: pre-5.14/non-v2 +/// kernel, or already-emptied leaf) or a cgroup torn down concurrently (`ENODEV`) +/// is tolerated — killing must not hard-fail the remove. +fn kill_cgroup(leaf_dir: &SafeDir) -> Result<(), Error> { + match leaf_dir.write_file(Path::new("cgroup.kill"), b"1") { Ok(()) => Ok(()), - Err(e) if matches!(e.errno(), Some(Errno::ENOENT | Errno::ENOTEMPTY)) => Ok(()), - Err(e) => Err(Error::RemoveCgroup(e)), + Err(e) if matches!(e.errno(), Some(Errno::ENOENT | Errno::ENODEV)) => Ok(()), + Err(e) => Err(Error::KillCgroup(e)), } } diff --git a/native/suidhelper/src/util/safe_dir.rs b/native/suidhelper/src/util/safe_dir.rs index 0e4c6870..61808d8a 100644 --- a/native/suidhelper/src/util/safe_dir.rs +++ b/native/suidhelper/src/util/safe_dir.rs @@ -18,7 +18,7 @@ use nix::dir::{Dir, Type}; use nix::fcntl::{openat, AtFlags, OFlag}; use nix::libc::dev_t; use nix::sys::stat::{fchmod, fchmodat, fstatat, mknodat, FchmodatFlags, FileStat, Mode, SFlag}; -use nix::unistd::{dup, fchown, fchownat, linkat, unlinkat, Gid, Uid, UnlinkatFlags}; +use nix::unistd::{dup, fchown, fchownat, linkat, unlinkat, write, Gid, Uid, UnlinkatFlags}; use std::ffi::OsStr; use std::os::unix::ffi::OsStrExt; use std::os::unix::io::{AsRawFd, FromRawFd, OwnedFd, RawFd}; @@ -33,6 +33,8 @@ pub enum Error { ReadDir(#[source] nix::Error), #[error("unlinkat {name:?}: {source}")] Unlink 
{ name: PathBuf, source: nix::Error }, + #[error("write {name:?}: {source}")] + Write { name: PathBuf, source: nix::Error }, #[error("mknodat {name:?}: {source}")] Mknod { name: PathBuf, source: nix::Error }, #[error("fchownat {name:?}: {source}")] @@ -54,6 +56,7 @@ impl Error { match self { Error::Open { source, .. } | Error::Unlink { source, .. } + | Error::Write { source, .. } | Error::Mknod { source, .. } | Error::Chown { source, .. } | Error::Chmod { source, .. } @@ -139,6 +142,39 @@ impl SafeDir { Ok(unsafe { SafeFile::from_raw_fd(raw) }) } + /// Open existing file `name` write-only (`O_WRONLY|O_NOFOLLOW|O_CLOEXEC`, no + /// `O_CREAT`/`O_TRUNC`) and write `contents`. For kernel pseudo-files such as + /// `cgroup.kill` where a single `write` is the whole API. `O_NOFOLLOW` rejects a + /// symlinked `name`, so the write cannot be redirected out of this pinned dir. + pub fn write_file(&self, name: &Path, contents: &[u8]) -> Result<(), Error> { + let raw = openat( + Some(self.0.as_raw_fd()), + name, + OFlag::O_WRONLY | OFlag::O_NOFOLLOW | OFlag::O_CLOEXEC, + Mode::empty(), + ) + .map_err(|source| Error::Open { + name: name.to_path_buf(), + source, + })?; + // SAFETY: openat just handed us this fd; nobody else owns it. + let fd = unsafe { OwnedFd::from_raw_fd(raw) }; + let mut off = 0; + while off < contents.len() { + match write(&fd, &contents[off..]) { + Ok(0) => break, + Ok(n) => off += n, + Err(source) => { + return Err(Error::Write { + name: name.to_path_buf(), + source, + }) + } + } + } + Ok(()) + } + /// Create a block device node `name` in this directory with the given /// `rdev`, then chown it to `uid:gid`. 
pub fn mknod_block(&self, name: &Path, rdev: dev_t, uid: u32, gid: u32) -> Result<(), Error> { diff --git a/native/suidhelper/tests/tools/chroot_jail_remove.rs b/native/suidhelper/tests/tools/chroot_jail_remove.rs index 93377e25..7d0de542 100644 --- a/native/suidhelper/tests/tools/chroot_jail_remove.rs +++ b/native/suidhelper/tests/tools/chroot_jail_remove.rs @@ -164,6 +164,53 @@ fn symlinked_chroot_component_does_not_escape() { ); } +// cgroup.kill is written "1" before the rmdir: the fix SIGKILLs the subtree +// first. Here cgroup.kill keeps the leaf non-empty so rmdir returns ENOTEMPTY +// (tolerated) and the leaf survives — letting us assert the write happened. +#[test] +fn cgroup_kill_is_written_before_rmdir() { + let tmp = tempfile::tempdir().unwrap(); + let base = tmp.path(); + let leaf = base.join("slice").join("leaf"); + fs::create_dir_all(&leaf).unwrap(); + fs::write(leaf.join("cgroup.kill"), b"0").unwrap(); + + remove_cgroup_under(base, &leaf).expect("write+rmdir must be Ok"); + assert!(leaf.exists(), "non-empty leaf must survive"); + assert_eq!( + fs::read(leaf.join("cgroup.kill")).unwrap(), + b"1", + "cgroup.kill must have been written \"1\"", + ); +} + +// A leaf without a cgroup.kill file (pre-5.14/non-v2 kernel) is tolerated: the +// missing pseudo-file is a no-op and the empty leaf is removed normally. +#[test] +fn missing_cgroup_kill_is_tolerated() { + let tmp = tempfile::tempdir().unwrap(); + let base = tmp.path(); + let leaf = base.join("slice").join("leaf"); + fs::create_dir_all(&leaf).unwrap(); + + remove_cgroup_under(base, &leaf).expect("missing cgroup.kill must be Ok"); + assert!(!leaf.exists(), "empty leaf must be removed"); +} + +// The leaf is already gone but its parent survives: the new openat_dir(leaf) +// ENOENT branch returns Ok without touching the parent. 
+#[test] +fn missing_leaf_with_present_parent_is_ok() { + let tmp = tempfile::tempdir().unwrap(); + let base = tmp.path(); + let parent = base.join("slice"); + fs::create_dir(&parent).unwrap(); + let leaf = parent.join("leaf"); // never created + + remove_cgroup_under(base, &leaf).expect("missing leaf must be Ok"); + assert!(parent.exists(), "parent must survive"); +} + proptest! { // For a chroot `depth` components below the jail base (target never created), // remove_chroot_under returns Ok iff depth == 2, else ChrootDepth. The diff --git a/native/suidhelper/tests/util/confinement.rs b/native/suidhelper/tests/util/confinement.rs index 5cffd2b0..f537500e 100644 --- a/native/suidhelper/tests/util/confinement.rs +++ b/native/suidhelper/tests/util/confinement.rs @@ -8,6 +8,7 @@ use hyper_suidhelper::util::safe_file::{ Any, IsBlockDevice, IsRegularFile, OnlyRootWritable, RootOwner, SafeFile, }; use hyper_suidhelper::util::safe_path::{IsAbsolute, SafePath, StrictComponents}; +use nix::errno::Errno; use nix::fcntl::OFlag; use std::fs; use std::os::unix::fs::{symlink, PermissionsExt}; @@ -116,6 +117,42 @@ fn safefile_mode_axis_rejects_group_writable() { assert!(SafeFile::::open(&p, OFlag::O_PATH).is_ok()); } +// write_file writes its bytes to an existing file through the pinned dir fd. +#[test] +fn write_file_writes_bytes_to_existing_file() { + let tmp = tempfile::tempdir().unwrap(); + let f = tmp.path().join("f"); + fs::write(&f, b"x").unwrap(); + + let dir = SafeDir::open(&safe(tmp.path())).unwrap(); + dir.write_file(Path::new("f"), b"1").unwrap(); + assert_eq!(fs::read(&f).unwrap(), b"1"); +} + +// write_file refuses a symlinked target (O_NOFOLLOW) and leaves it untouched: a +// write cannot be redirected through a symlink out of the pinned dir. 
+#[test] +fn write_file_refuses_symlinked_target_and_leaves_it_unchanged() { + let tmp = tempfile::tempdir().unwrap(); + let target = tmp.path().join("target"); + fs::write(&target, b"keep").unwrap(); + symlink(&target, tmp.path().join("link")).unwrap(); + + let dir = SafeDir::open(&safe(tmp.path())).unwrap(); + assert!(dir.write_file(Path::new("link"), b"1").is_err()); + assert_eq!(fs::read(&target).unwrap(), b"keep"); +} + +// write_file on a missing name fails with ENOENT (no O_CREAT), so the caller can +// treat an absent pseudo-file as a no-op. +#[test] +fn write_file_missing_returns_enoent() { + let tmp = tempfile::tempdir().unwrap(); + let dir = SafeDir::open(&safe(tmp.path())).unwrap(); + let err = dir.write_file(Path::new("nope"), b"1").unwrap_err(); + assert_eq!(err.errno(), Some(Errno::ENOENT)); +} + // SafeFile::open refuses a symlinked final component (O_NOFOLLOW). #[test] fn safefile_open_rejects_final_symlink() { From 8fd381742300c595a3bfb2fc990598f878ccb8e5 Mon Sep 17 00:00:00 2001 From: Marko Vejnovic Date: Fri, 26 Jun 2026 17:06:16 +0000 Subject: [PATCH 36/46] fix(fire_vmm): guarantee firecracker death on teardown Two bugs let a stopped VM's firecracker survive: 1. Jailer.cgroup_dir/1 computed /sys/fs/cgroup///, but the jailer (cgroup v2) places firecracker at / with no level (confirmed via /proc//cgroup = 0:://). So the teardown's cgroup remove targeted a path that never existed - a silent no-op. Drop the level; add Jailer.cgroup_parent_dir/0 (the dir whose subdirs are the vm_id leaves) for the reaper to enumerate. 2. Daemon only cleared the jail on (re)start, never on a final stop, and relied on MuonTrap's port-close to kill firecracker - which it cannot, since the jailer setsid's firecracker into its own session. A graceful VM stop thus left firecracker running, holding its dm/loop devices. 
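For reference, the layout claim in (1) can be re-checked on a host along the following lines. This is a minimal Rust sketch and not part of this patch: `cgroup_v2_dir` is a hypothetical helper, and it assumes a cgroup v2 (unified) host where `/proc/<pid>/cgroup` carries a single `0::/<path>` entry.

```rust
use std::path::PathBuf;

/// Given the text of `/proc/<pid>/cgroup`, return the host directory of the
/// process's cgroup v2 membership, or `None` when no unified (`0::`) entry
/// is present (for example on a v1-only host).
/// Hypothetical helper for illustration; not a function from this series.
fn cgroup_v2_dir(proc_cgroup: &str) -> Option<PathBuf> {
    proc_cgroup
        .lines()
        // The unified hierarchy is always hierarchy id 0 with an empty
        // controller list, hence the literal "0::" prefix.
        .find_map(|line| line.strip_prefix("0::"))
        // The remainder is an absolute path inside the cgroup filesystem;
        // drop its leading '/' so `join` appends instead of replacing.
        .map(|rel| PathBuf::from("/sys/fs/cgroup").join(rel.trim_start_matches('/')))
}
```

With made-up values, a firecracker whose `/proc/<pid>/cgroup` reads `0::/hyper/vm-1` resolves to `/sys/fs/cgroup/hyper/vm-1`: two components below the cgroup root, with no exec-name level in between. That is the path the teardown has to target.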
Make Daemon a trap_exit GenServer whose terminate/2 runs the helper's cgroup.kill teardown, so firecracker is guaranteed dead before its mutable dm layer is torn down. A linked MuonTrap.Daemon exit still stops the server so Core cold-boots. --- lib/hyper/node/fire_vmm/core.ex | 5 +- lib/hyper/node/fire_vmm/daemon.ex | 142 ++++++++++++++++++++---------- lib/hyper/node/fire_vmm/jailer.ex | 22 ++++- 3 files changed, 119 insertions(+), 50 deletions(-) diff --git a/lib/hyper/node/fire_vmm/core.ex b/lib/hyper/node/fire_vmm/core.ex index 18dfeb50..34830b54 100644 --- a/lib/hyper/node/fire_vmm/core.ex +++ b/lib/hyper/node/fire_vmm/core.ex @@ -18,8 +18,9 @@ defmodule Hyper.Node.FireVMM.Core do * firecracker crash -> the `Daemon` child exits; both restart; `Daemon` resets the stale jail and relaunches, and the fresh controller cold-boots. - `MuonTrap` kills the OS process when its port closes (teardown or BEAM death), - so no firecracker process outlives the supervisor. + `Daemon` guarantees firecracker is dead on teardown via the helper's + `cgroup.kill` (MuonTrap's port-close kill misses the setsid'd firecracker), so + no firecracker process outlives a graceful supervisor shutdown. """ use Supervisor diff --git a/lib/hyper/node/fire_vmm/daemon.ex b/lib/hyper/node/fire_vmm/daemon.ex index b6257732..964a0923 100644 --- a/lib/hyper/node/fire_vmm/daemon.ex +++ b/lib/hyper/node/fire_vmm/daemon.ex @@ -3,30 +3,37 @@ defmodule Hyper.Node.FireVMM.Daemon do The jailed firecracker OS process for one microVM, as a static child of `Hyper.Node.FireVMM.Core`. - Lifecycle is supervisor-owned. On every (re)start it first resets any stale - jail left by a prior incarnation — the firecracker jailer refuses to reuse an - existing chroot — then builds the jailer command and runs it under - `MuonTrap.Daemon`, which kills the OS process when its port closes (controller - crash, container teardown, or BEAM death). 
So no firecracker process outlives - the supervisor, and `Core`'s `:one_for_all` restarting this child (e.g. after a - firecracker crash) cleanly cold-boots against a fresh jail. - - The supervised process is `hyper-suidhelper jailer ...`, which `execve`s into - the jailer (same pid) so MuonTrap owns the firecracker lifetime without needing - to know the jailer path. `start_link/1` does the reset, then delegates and - returns that pid. + A `trap_exit` GenServer that owns firecracker's lifetime end to end: + + * on every (re)start it resets any stale jail left by a prior incarnation — + the firecracker jailer refuses to reuse an existing chroot — then launches + the jailer under a linked `MuonTrap.Daemon`. The supervised process is + `hyper-suidhelper jailer ...`, which `execve`s into the jailer (same pid). + * if firecracker exits, the linked `MuonTrap.Daemon` exits and this server + stops with that reason, so `Core`'s `:one_for_all` cold-boots the pair. + * on teardown it **guarantees firecracker is dead**: `MuonTrap`'s port-close + kills by process group, but the jailer `setsid`s firecracker into its own + session, so it escapes that kill and would leak (holding the cgroup, the + rootfs dm device, loop devices). `terminate/2` therefore runs the helper's + `cgroup.kill` teardown (`ChrootJail.remove`), which SIGKILLs the whole leaf + cgroup regardless of session. The same call on (re)start cleans up after a + prior incarnation the BEAM could not (a SIGKILL'd node leaves no + `terminate/2`); the periodic `Hyper.Node.Reaper` is the final backstop. 
""" + use GenServer + use OpenTelemetryDecorator + alias Hyper.Node.FireVMM.{Jailer, Opts} alias Hyper.SuidHelper alias Unit.Time - use OpenTelemetryDecorator - require Logger @shutdown_timeout Time.s(5) + defstruct [:opts, :muontrap] + @spec child_spec(Opts.t()) :: Supervisor.child_spec() def child_spec(%Opts{} = opts) do %{ @@ -38,37 +45,82 @@ defmodule Hyper.Node.FireVMM.Daemon do } end - @doc """ - Reset the VM's stale jail, then launch the jailer under `MuonTrap.Daemon` and - return its pid. Fails (so the supervisor retries) if the reset cannot run. - """ - @spec start_link(Opts.t()) :: {:ok, pid()} | {:error, term()} - @decorate with_span("Hyper.Node.FireVMM.Daemon.start_link", include: [:id]) - def start_link(%Opts{vm_id: id} = opts) do - with :ok <- SuidHelper.ChrootJail.remove(Jailer.chroot_dir(id), Jailer.cgroup_dir(id)) do - cmd = Jailer.command(opts) - - # Surface what the jailed process actually does: `log_output` routes the - # helper/jailer/firecracker stdout+stderr (guest serial console included) - # to the Logger, and `exit_status_to_reason` turns MuonTrap's opaque - # `:error_exit_status` into `{:firecracker_exited, status}` so a crash - # report names the real exit code instead of hiding it. 
- daemon_opts = [ - log_output: :info, - log_prefix: "vm #{id} firecracker: ", - stderr_to_stdout: true, - exit_status_to_reason: &{:firecracker_exited, &1} - ] - - case MuonTrap.Daemon.start_link(cmd.binary, cmd.args, daemon_opts) do - {:ok, pid} -> - Logger.info("vm #{id}: jailer launched under MuonTrap (#{inspect(pid)})") - {:ok, pid} - - {:error, reason} = err -> - Logger.error("vm #{id}: jailer failed to launch: #{inspect(reason)}") - err - end + @spec start_link(Opts.t()) :: GenServer.on_start() + def start_link(%Opts{} = opts) do + GenServer.start_link(__MODULE__, opts) + end + + @impl true + @decorate with_span("Hyper.Node.FireVMM.Daemon.init", include: [:id]) + def init(%Opts{vm_id: id} = opts) do + # Trap exits so the linked MuonTrap's exit reaches `handle_info` (not a silent + # link kill) and so `terminate/2` runs on supervisor shutdown. + Process.flag(:trap_exit, true) + + with :ok <- reset_stale_jail(id), + {:ok, muontrap} <- launch(opts) do + {:ok, %__MODULE__{opts: opts, muontrap: muontrap}} + else + {:error, reason} -> {:stop, reason} + end + end + + # firecracker (the linked MuonTrap.Daemon) exited: stop with its reason so + # `Core`'s `:one_for_all` discards the controller too and cold-boots the pair. + @impl true + def handle_info({:EXIT, muontrap, reason}, %__MODULE__{muontrap: muontrap} = state) do + {:stop, reason, state} + end + + def handle_info(_msg, state), do: {:noreply, state} + + # Guarantee firecracker is dead and its jail cleared. MuonTrap cannot kill the + # setsid'd firecracker; the helper's `cgroup.kill` can. Best-effort: a failure + # here is logged, and the `Reaper` will retry, but it must not crash teardown. 
+ @impl true + @decorate with_span("Hyper.Node.FireVMM.Daemon.terminate", include: [:id]) + def terminate(_reason, %__MODULE__{opts: %Opts{vm_id: id}}) do + case clear_jail(id) do + :ok -> + :ok + + {:error, reason} -> + Logger.error("vm #{id}: teardown failed to clear jail: #{inspect(reason)}") + end + end + + @spec reset_stale_jail(Hyper.Vm.id()) :: :ok | {:error, term()} + defp reset_stale_jail(id), do: clear_jail(id) + + @spec clear_jail(Hyper.Vm.id()) :: :ok | {:error, term()} + defp clear_jail(id) do + SuidHelper.ChrootJail.remove(Jailer.chroot_dir(id), Jailer.cgroup_dir(id)) + end + + @spec launch(Opts.t()) :: {:ok, pid()} | {:error, term()} + defp launch(%Opts{vm_id: id} = opts) do + cmd = Jailer.command(opts) + + # Surface what the jailed process actually does: `log_output` routes the + # helper/jailer/firecracker stdout+stderr (guest serial console included) + # to the Logger, and `exit_status_to_reason` turns MuonTrap's opaque + # `:error_exit_status` into `{:firecracker_exited, status}` so a crash + # report names the real exit code instead of hiding it. + daemon_opts = [ + log_output: :info, + log_prefix: "vm #{id} firecracker: ", + stderr_to_stdout: true, + exit_status_to_reason: &{:firecracker_exited, &1} + ] + + case MuonTrap.Daemon.start_link(cmd.binary, cmd.args, daemon_opts) do + {:ok, pid} -> + Logger.info("vm #{id}: jailer launched under MuonTrap (#{inspect(pid)})") + {:ok, pid} + + {:error, reason} = err -> + Logger.error("vm #{id}: jailer failed to launch: #{inspect(reason)}") + err end end end diff --git a/lib/hyper/node/fire_vmm/jailer.ex b/lib/hyper/node/fire_vmm/jailer.ex index 355ad0d7..36683736 100644 --- a/lib/hyper/node/fire_vmm/jailer.ex +++ b/lib/hyper/node/fire_vmm/jailer.ex @@ -123,13 +123,29 @@ defmodule Hyper.Node.FireVMM.Jailer do end @doc """ - Host path of the VM's cgroup leaf (`/sys/fs/cgroup///`), the - cgroup the jailer creates for firecracker. 
Reconstructed (the jailer owns its + Host path of the cgroup dir holding every VM's leaf (`/sys/fs/cgroup/`). + Its immediate subdir names are vm_ids - the leaves the jailer creates for + firecracker - so listing it (directories only; the dir also holds cgroup control + files) enumerates the cgroup leaves on this host. + + Note the cgroup hierarchy has NO `` level: the jailer (cgroup v2, + `--parent-cgroup `) places firecracker directly at `/`, + unlike the chroot (`//`). Confirmed via + `/proc//cgroup` = `0:://`. + """ + @spec cgroup_parent_dir() :: Path.t() + def cgroup_parent_dir do + Path.join("/sys/fs/cgroup", Hyper.Config.parent_cgroup()) + end + + @doc """ + Host path of the VM's cgroup leaf (`/sys/fs/cgroup//`), the cgroup + the jailer creates for firecracker. Reconstructed (the jailer owns its placement) so a relaunch can clear the stale leaf left by a prior incarnation. """ @spec cgroup_dir(Hyper.Vm.id()) :: Path.t() def cgroup_dir(id) do - Path.join(["/sys/fs/cgroup", Hyper.Config.parent_cgroup(), exec_name(), id]) + Path.join(cgroup_parent_dir(), id) end @doc """ From 09c1ea889f5cfd1506bd20d1c08bff8fe0ab0384 Mon Sep 17 00:00:00 2001 From: Marko Vejnovic Date: Fri, 26 Jun 2026 17:06:26 +0000 Subject: [PATCH 37/46] feat(node): periodic Reaper to GC orphaned VM resources A SIGKILL'd BEAM runs no terminate/2, so a firecracker (plus its cgroup and hyper-rw- dm volume) can outlive its owner with no vm_id ever rebooting to clean it. Reaper is a per-node periodic, liveness-aware GC: each tick it diffs the live local VMs (supervisor children + routing) against the on-host per-VM cgroup leaves and hyper-rw-* dm volumes, and reaps the orphans via the helper's cgroup.kill teardown + dm remove. 
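The per-tick decision can be sketched as plain set arithmetic. The block below is an illustrative Rust rendering under stated assumptions, not the code in this patch (the real decision core is the Elixir `Hyper.Node.Reaper.Plan`); the function names mirror that module, while the `Ids` alias and the choice of `BTreeSet` are assumptions made for the sketch.

```rust
use std::collections::BTreeSet;

/// Set of vm_ids. The alias is an assumption of this sketch.
type Ids = BTreeSet<String>;

const RW_PREFIX: &str = "hyper-rw-";

/// vm_ids named by `hyper-rw-<vm_id>` dm devices. Any other dm name
/// (thinpool, base images) yields no candidate at all.
fn rw_ids<'a>(dm_names: impl IntoIterator<Item = &'a str>) -> Ids {
    dm_names
        .into_iter()
        .filter_map(|name| name.strip_prefix(RW_PREFIX))
        .map(str::to_owned)
        .collect()
}

/// Candidates this tick: (cgroup leaves ∪ rw vm_ids) minus the live set.
fn orphans(live: &Ids, cgroup_leaves: &Ids, rw: &Ids) -> Ids {
    cgroup_leaves
        .union(rw)
        .filter(|id| !live.contains(*id))
        .cloned()
        .collect()
}

/// Two-strike grace: reap only ids that were also orphans on the previous
/// tick. Returns `(to_reap, next_last)`, where `next_last` is `current`.
fn confirm(current: Ids, last: &Ids) -> (Ids, Ids) {
    let to_reap: Ids = current.intersection(last).cloned().collect();
    (to_reap, current)
}
```

With hypothetical ids: if `vm-a` is live, the cgroup leaves are `vm-a` and `vm-b`, and the dm list holds `hyper-rw-vm-c` plus `hyper-thinpool`, the first tick yields candidates `vm-b` and `vm-c` and reaps nothing. If the second tick still sees `vm-b` but no longer `vm-c`, only `vm-b` is reaped.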
Liveness-aware on purpose: unlike boot-time Reclaim (which clears every hyper-* before any owner starts), a periodic GC must never touch the live thinpool, base images, or a running VM - so it only ever considers hyper-rw-* keyed to a vm_id, and a two-strike grace (reap only an id orphaned across two consecutive ticks) makes reaping a live or mid-boot VM structurally impossible. The decision core (Reaper.Plan) is pure and property-tested: a live id is never a candidate, only twice-seen orphans reap, and thinpool/img names are never candidates. Mirrors the Hyper.Img.Db.Gc shell/pure-core/config trio. --- lib/hyper/node.ex | 10 +- lib/hyper/node/reaper.ex | 166 ++++++++++++++++++ lib/hyper/node/reaper/config.ex | 35 ++++ lib/hyper/node/reaper/plan.ex | 32 ++++ .../node/reaper/plan_properties_test.exs | 78 ++++++++ test/hyper/node/reaper/plan_test.exs | 65 +++++++ 6 files changed, 385 insertions(+), 1 deletion(-) create mode 100644 lib/hyper/node/reaper.ex create mode 100644 lib/hyper/node/reaper/config.ex create mode 100644 lib/hyper/node/reaper/plan.ex create mode 100644 test/hyper/node/reaper/plan_properties_test.exs create mode 100644 test/hyper/node/reaper/plan_test.exs diff --git a/lib/hyper/node.ex b/lib/hyper/node.ex index b4721167..1f50fe58 100644 --- a/lib/hyper/node.ex +++ b/lib/hyper/node.ex @@ -26,6 +26,11 @@ defmodule Hyper.Node do real-time monitors backing the soft budget (`Hyper.Node.Budget.Soft`). Lives here, not at the application root, because both are per-node and only meaningful while this node hosts VMs. + + * `Hyper.Node.Reaper` - a periodic, liveness-aware GC for per-VM host + resources (orphaned firecracker cgroups and `hyper-rw-*` dm volumes) stranded + by an unclean death whose vm_id never reboots. Started last so the VM + supervisor it consults for liveness is already up. 
""" use Supervisor @@ -61,7 +66,10 @@ defmodule Hyper.Node do Hyper.Node.Layer, Hyper.Node.Budget.Supervisor, {DynamicSupervisor, name: @vm_sup, strategy: :one_for_one}, - Hyper.Node.Img + Hyper.Node.Img, + # Last child: :one_for_one starts in order, so the VM supervisor and Img are + # up before the reaper's first tick can read their liveness. + Hyper.Node.Reaper ] Supervisor.init(children, strategy: :one_for_one) diff --git a/lib/hyper/node/reaper.ex b/lib/hyper/node/reaper.ex new file mode 100644 index 00000000..479a1726 --- /dev/null +++ b/lib/hyper/node/reaper.ex @@ -0,0 +1,166 @@ +defmodule Hyper.Node.Reaper do + @moduledoc """ + Per-node periodic, liveness-aware garbage collector for per-VM host resources + that an unclean BEAM death can strand: a firecracker cgroup leaf and a + `hyper-rw-` dm volume whose owning processes' `terminate/2` never ran and + whose vm_id never reboots (so `Hyper.Node.Reclaim`, which runs once at boot, and + the relaunch-time cleanup in the FireVMM path, never get a chance to clear it). + + Liveness is the whole point. The reaper consults two independent sources of + truth for "this vm is alive" (`Plan.orphans/3` removes their union from the + candidate set) and only ever touches `hyper-rw-*` dm names and per-VM cgroup + leaves - never `hyper-thinpool`, `hyper-img-*`, or a live VM's resources. A + candidate must also survive two consecutive ticks (`Plan.confirm/2`) before it + is reaped, so a VM caught mid-boot (resources present, not yet registered) is + given a grace tick rather than destroyed. + + The decision logic lives in the pure `Hyper.Node.Reaper.Plan`; this module is a + thin I/O adapter that gathers the inputs, calls the plan, and executes the + best-effort, idempotent removals. 
+ """ + + use GenServer + use OpenTelemetryDecorator + require Logger + + alias Hyper.Cluster.Routing + alias Hyper.Node.FireVMM.Jailer + alias Hyper.Node.Img + alias Hyper.Node.Reaper.{Config, Plan} + alias Hyper.SuidHelper.{ChrootJail, Dmsetup} + + @vm_sup Hyper.Node.VMSupervisor + + defstruct [:config, last_orphans: MapSet.new()] + + @type t :: %__MODULE__{config: Config.t(), last_orphans: MapSet.t(String.t())} + + @spec start_link(keyword()) :: GenServer.on_start() + def start_link(opts \\ []) do + GenServer.start_link(__MODULE__, opts, name: __MODULE__) + end + + @impl true + def init(opts) do + config = Keyword.get(opts, :config) || Config.load() + + if config.enabled do + schedule(config) + {:ok, %__MODULE__{config: config}} + else + Logger.info("reaper: disabled by config; not starting") + :ignore + end + end + + @impl true + def handle_info(:tick, state) do + state = sweep(state) + schedule(state.config) + {:noreply, state} + end + + # Ignore any unexpected message (port noise, stale timers) rather than crashing. + def handle_info(_msg, state), do: {:noreply, state} + + @spec schedule(Config.t()) :: reference() + defp schedule(config) do + Process.send_after(self(), :tick, Unit.Time.as_ms(config.interval)) + end + + @spec sweep(t()) :: t() + @decorate with_span("Hyper.Node.Reaper.sweep") + defp sweep(%__MODULE__{} = state) do + live = gather_live() + leaves = list_cgroup_leaves() + rw = Plan.rw_ids(list_rw_dm()) + + current = Plan.orphans(live, leaves, rw) + {to_reap, next} = Plan.confirm(current, state.last_orphans) + + Enum.each(to_reap, &reap_one/1) + %{state | last_orphans: next} + end + + # Over-counting "live" only defers a reap (safe); under-counting destroys a live + # VM (catastrophic). So union two independent liveness sources: the local VM + # supervisor's children and the cluster routing table's view of this node. 
+ @spec gather_live() :: MapSet.t(String.t()) + defp gather_live, do: MapSet.union(local_live(), routed_live()) + + @spec local_live() :: MapSet.t(String.t()) + defp local_live do + case Process.whereis(@vm_sup) do + nil -> + MapSet.new() + + _pid -> + @vm_sup + |> DynamicSupervisor.which_children() + |> Enum.map(fn {_, child, _, _} -> child end) + |> Enum.filter(&is_pid/1) + |> Enum.map(&Routing.id_for/1) + |> Enum.reject(&is_nil/1) + |> MapSet.new() + end + end + + @spec routed_live() :: MapSet.t(String.t()) + defp routed_live do + for {id, node} <- Routing.all(), node == node(), into: MapSet.new(), do: id + end + + @spec list_cgroup_leaves() :: [String.t()] + defp list_cgroup_leaves do + parent = Jailer.cgroup_parent_dir() + + case File.ls(parent) do + {:ok, names} -> + # The parent holds the per-VM leaf directories alongside cgroup control + # files (`cgroup.procs`, `cgroup.controllers`, ...); only the directories + # are vm_id leaves. + Enum.filter(names, &File.dir?(Path.join(parent, &1))) + + {:error, :enoent} -> + [] + + {:error, reason} -> + Logger.warning("reaper: could not list cgroup leaves: #{inspect(reason)}") + [] + end + end + + @spec list_rw_dm() :: [String.t()] + defp list_rw_dm do + case Dmsetup.list() do + {:ok, names} -> + names + + {:error, reason} -> + Logger.warning("reaper: could not list dm devices: #{inspect(reason)}") + [] + end + end + + @spec reap_one(String.t()) :: :ok + @decorate with_span("Hyper.Node.Reaper.reap_one", include: [:id]) + defp reap_one(id) do + Logger.warning("reaper: reaping orphan vm #{id}") + + log_result( + "chroot/cgroup", + id, + ChrootJail.remove(Jailer.chroot_dir(id), Jailer.cgroup_dir(id)) + ) + + log_result("dm volume", id, Dmsetup.remove(Img.Mutable.dm_name(id))) + :ok + end + + @spec log_result(String.t(), String.t(), :ok | {:error, term()}) :: :ok + defp log_result(_what, _id, :ok), do: :ok + + defp log_result(what, id, {:error, reason}) do + Logger.warning("reaper: removing #{what} for #{id} failed: 
#{inspect(reason)}") + end +end diff --git a/lib/hyper/node/reaper/config.ex b/lib/hyper/node/reaper/config.ex new file mode 100644 index 00000000..5c0da184 --- /dev/null +++ b/lib/hyper/node/reaper/config.ex @@ -0,0 +1,35 @@ +defmodule Hyper.Node.Reaper.Config do + @moduledoc """ + Configuration for the per-node resource reaper (`Hyper.Node.Reaper`). + + Every field has a default, so configuration is optional - set only what you want + to change. Durations are `Unit.Time` values, so (like `Hyper.Img.Db.Gc.Config`) + overrides belong in `config/runtime.exs`: + + config :hyper, Hyper.Node.Reaper.Config, + enabled: true, + interval: Unit.Time.s(30) + + Set `enabled: false` to turn the reaper off entirely - it then never starts. + + ## Fields + + * `enabled` - run the reaper at all (default `true`). + * `interval` - rest between reap ticks (default `60s`). The two-strike grace + means an orphan is reaped at most one interval after it is first seen. + """ + + @type t :: %__MODULE__{ + enabled: boolean(), + interval: Unit.Time.t() + } + + defstruct enabled: true, + interval: Unit.Time.s(60) + + @doc "Build the config from app env, filling any unset field with its default." + @spec load() :: t() + def load do + struct!(__MODULE__, Application.get_env(:hyper, __MODULE__, [])) + end +end diff --git a/lib/hyper/node/reaper/plan.ex b/lib/hyper/node/reaper/plan.ex new file mode 100644 index 00000000..17b6dde9 --- /dev/null +++ b/lib/hyper/node/reaper/plan.ex @@ -0,0 +1,32 @@ +defmodule Hyper.Node.Reaper.Plan do + @moduledoc """ + Pure reap-decision core for `Hyper.Node.Reaper`. No I/O. Every safety invariant + is a property of these functions: a live vm_id is never a candidate, only an + orphan seen on two consecutive ticks is reaped, and only `hyper-rw-*` dm names + yield candidates (so `hyper-thinpool` / `hyper-img-*` can never be reaped). + """ + + @rw_prefix "hyper-rw-" + + @doc "vm_ids of orphan rw-volumes from raw `dmsetup ls` names (only `hyper-rw-*`)." 
+ @spec rw_ids([String.t()]) :: [String.t()] + def rw_ids(dm_names) do + for name <- dm_names, + String.starts_with?(name, @rw_prefix), + do: String.replace_prefix(name, @rw_prefix, "") + end + + @doc "Candidate orphans this tick: (cgroup leaves ∪ rw vm_ids) minus the live set." + @spec orphans(MapSet.t(String.t()), [String.t()], [String.t()]) :: MapSet.t(String.t()) + def orphans(live, cgroup_leaves, rw_ids) do + cgroup_leaves + |> MapSet.new() + |> MapSet.union(MapSet.new(rw_ids)) + |> MapSet.difference(live) + end + + @doc "Two-strike grace: reap only ids that were also orphans last tick. Returns {to_reap, next_last}." + @spec confirm(MapSet.t(String.t()), MapSet.t(String.t())) :: + {MapSet.t(String.t()), MapSet.t(String.t())} + def confirm(current, last), do: {MapSet.intersection(current, last), current} +end diff --git a/test/hyper/node/reaper/plan_properties_test.exs b/test/hyper/node/reaper/plan_properties_test.exs new file mode 100644 index 00000000..6720a608 --- /dev/null +++ b/test/hyper/node/reaper/plan_properties_test.exs @@ -0,0 +1,78 @@ +defmodule Hyper.Node.Reaper.PlanPropertiesTest do + @moduledoc """ + Safety laws every `Hyper.Node.Reaper.Plan` decision must obey, generated over a + small shared id alphabet so live / cgroup-leaf / rw sets actually overlap: + + * a live vm_id is NEVER a reap candidate (the union of liveness sources is + removed from the orphan set); + * only an orphan seen on two consecutive ticks is reaped (`confirm/2` reaps a + subset of both current and last, and carries `current` forward); + * `hyper-thinpool` / `hyper-img-*` / any non-`hyper-rw-*` name never yields a + candidate, and `Mutable.dm_name(id)` round-trips back to exactly `id` (so a + future id-scheme change that breaks the strip fails loudly here). 
+ """ + use ExUnit.Case, async: true + use ExUnitProperties + + alias Hyper.Node.Img.Mutable + alias Hyper.Node.Reaper.Plan + + # A deliberately tiny alphabet so generated live/leaf/rw id sets collide often, + # exercising the difference and intersection rather than always being disjoint. + defp id, do: member_of(~w(a b c d e)) + + defp id_set, do: map(list_of(id()), &MapSet.new/1) + + defp id_list, do: list_of(id()) + + property "a live vm_id is never a reap candidate" do + check all( + live <- id_set(), + leaves <- id_list(), + rw <- id_list() + ) do + orphans = Plan.orphans(live, leaves, rw) + assert MapSet.disjoint?(orphans, live) + end + end + + property "only twice-seen orphans are reaped; current is carried forward" do + check all( + current <- id_set(), + last <- id_set() + ) do + {reap, next} = Plan.confirm(current, last) + + assert MapSet.subset?(reap, current) + assert MapSet.subset?(reap, last) + assert next == current + end + end + + property "rw_ids excludes thinpool, img, and non-rw junk" do + safe_dm = + member_of([ + "hyper-thinpool", + "hyper-img-abc-0", + "hyper-img-deadbeef-3", + "unrelated-device", + "cryptroot" + ]) + + check all(names <- list_of(safe_dm)) do + assert Plan.rw_ids(names) == [] + end + end + + property "Mutable.dm_name/1 round-trips through rw_ids for a real vm_id" do + check all( + id <- + map( + binary(min_length: 1, max_length: 16), + &("v" <> Base.encode32(&1, padding: false, case: :lower)) + ) + ) do + assert Plan.rw_ids([Mutable.dm_name(id)]) == [id] + end + end +end diff --git a/test/hyper/node/reaper/plan_test.exs b/test/hyper/node/reaper/plan_test.exs new file mode 100644 index 00000000..0021db98 --- /dev/null +++ b/test/hyper/node/reaper/plan_test.exs @@ -0,0 +1,65 @@ +defmodule Hyper.Node.Reaper.PlanTest do + use ExUnit.Case, async: true + + alias Hyper.Node.Reaper.Plan + + defp set(ids), do: MapSet.new(ids) + + describe "orphans/3" do + test "a cgroup-leaf-only orphan is a candidate" do + assert Plan.orphans(set([]), 
["dead"], []) == set(["dead"]) + end + + test "a dm-only orphan is a candidate" do + assert Plan.orphans(set([]), [], ["dead"]) == set(["dead"]) + end + + test "an id seen in both sources is a single candidate" do + assert Plan.orphans(set([]), ["dead"], ["dead"]) == set(["dead"]) + end + + test "an id present in live is never a candidate, even if it also has resources" do + assert Plan.orphans(set(["alive"]), ["alive"], ["alive"]) == set([]) + end + + test "only the non-live ids survive as candidates" do + assert Plan.orphans(set(["alive"]), ["alive", "dead"], ["alive", "gone"]) == + set(["dead", "gone"]) + end + end + + describe "confirm/2 two-strike grace" do + test "first tick reaps nothing (last is empty) but remembers the candidates" do + current = set(["x", "y"]) + {reap, next} = Plan.confirm(current, set([])) + + assert reap == set([]) + assert next == current + end + + test "second tick reaps the still-orphan ids" do + {_, last} = Plan.confirm(set(["x", "y"]), set([])) + {reap, _next} = Plan.confirm(set(["x", "y"]), last) + + assert reap == set(["x", "y"]) + end + + test "an id orphaned tick1 but live/absent tick2 is not reaped" do + {_, last} = Plan.confirm(set(["x"]), set([])) + {reap, _next} = Plan.confirm(set([]), last) + + assert reap == set([]) + end + + test "an id new in tick2 is not reaped (only one strike)" do + {_, last} = Plan.confirm(set(["x"]), set([])) + {reap, _next} = Plan.confirm(set(["x", "fresh"]), last) + + assert reap == set(["x"]) + end + end + + test "rw_ids/1 strips the hyper-rw- prefix and ignores thinpool/img names" do + assert Plan.rw_ids(["hyper-thinpool", "hyper-img-abc-0", "hyper-rw-vabc"]) == ["vabc"] + end +end From 988b026d3cb5848b93dc9c8de19f17855105a5d5 Mon Sep 17 00:00:00 2001 From: Marko Vejnovic Date: Fri, 26 Jun 2026 17:18:21 +0000 Subject: [PATCH 38/46] refactor(vm): extract Hyper.Vm.Id (type + generator) The VM id was the only generated id in the system (image ids are content hashes, user ids are pool integers), 
and it carries a real contract: a strict `[a-z2-7]` charset that is the intersection of firecracker's instance-id rules, dm/jailer naming, and path-component safety. Give it a home. * move `Hyper.gen_vm_id/0` -> `Hyper.Vm.Id.generate/0`, with the charset rationale as the module doc; * make `Hyper.Vm.Id.t()` the canonical id type and migrate every `Hyper.Vm.id()` reference to it (dropping the duplicate `@type id` alias on `Hyper.Vm`, which keeps `t :: pid()` for the VM handle); * move the id charset property out of jailer_test into a dedicated test/hyper/vm/id_test.exs. No behavior change. Hyper.Img.id/Hyper.Node.Users.id stay inline - they have no generator or charset contract to encapsulate. --- lib/hyper.ex | 25 +++------------------- lib/hyper/cluster/routing.ex | 6 +++--- lib/hyper/grpc/codec.ex | 8 +++---- lib/hyper/img/db/lease.ex | 4 ++-- lib/hyper/node.ex | 4 ++-- lib/hyper/node/fire_vmm.ex | 2 +- lib/hyper/node/fire_vmm/chroot_jail.ex | 2 +- lib/hyper/node/fire_vmm/client.ex | 2 +- lib/hyper/node/fire_vmm/daemon.ex | 4 ++-- lib/hyper/node/fire_vmm/jailer.ex | 8 +++---- lib/hyper/node/fire_vmm/state.ex | 8 +++---- lib/hyper/node/img.ex | 8 +++---- lib/hyper/node/img/mutable.ex | 4 ++-- lib/hyper/vm.ex | 1 - lib/hyper/vm/id.ex | 27 ++++++++++++++++++++++++ test/hyper/node/fire_vmm/jailer_test.exs | 12 +---------- test/hyper/vm/id_test.exs | 20 ++++++++++++++++++ 17 files changed, 81 insertions(+), 64 deletions(-) create mode 100644 lib/hyper/vm/id.ex create mode 100644 test/hyper/vm/id_test.exs diff --git a/lib/hyper.ex b/lib/hyper.ex index ce946a6a..9b70f131 100644 --- a/lib/hyper.ex +++ b/lib/hyper.ex @@ -18,7 +18,7 @@ defmodule Hyper do @spec create_vm(Hyper.Vm.Spec.t()) :: {:ok, Hyper.Vm.t()} | {:error, term()} def create_vm(%Hyper.Vm.Spec{} = spec) do with {:ok, arch} <- resolve_arch(spec.arch) do - vm_id = gen_vm_id() + vm_id = Hyper.Vm.Id.generate() spec = %{spec | arch: arch} instance_spec = Hyper.Vm.Instance.spec(spec.type) @@ -34,32 +34,13 @@ 
defmodule Hyper do end end - @doc """ - Generate a fresh VM id: a `v` prefix followed by lowercase base32 of 10 random - bytes, charset `[a-z2-7]`. - - Alphanumeric only - no `-`, `_`, or other punctuation. That is the intersection - of three independent constraints the id must satisfy at once: - - * firecracker rejects `_` in an instance id (`InvalidInstanceId`); - * dm/jailer names must not start with `-`; - * registry keys and chroot path components stay trivially safe. - - The previous base64url encoding emitted `-` and `_`, so it could produce ids - firecracker refused at boot (`Invalid char (_)`). - """ - @spec gen_vm_id() :: Hyper.Vm.id() - def gen_vm_id do - "v" <> Base.encode32(:crypto.strong_rand_bytes(10), padding: false, case: :lower) - end - @spec resolve_arch(Hyper.Vm.Instance.arch() | nil) :: {:ok, Hyper.Vm.Instance.arch()} | {:error, term()} defp resolve_arch(nil), do: Sys.Arch.current() defp resolve_arch(arch), do: {:ok, arch} @doc "Cluster-wide: which node currently runs `vm_id`? `nil` if unknown." - @spec whereis(Hyper.Vm.id()) :: node() | nil + @spec whereis(Hyper.Vm.Id.t()) :: node() | nil def whereis(vm_id), do: Hyper.Cluster.Routing.whereis(vm_id) @doc """ @@ -72,7 +53,7 @@ defmodule Hyper do died with its host, so "unknown" is the truthful answer. Only `:erpc`'s own transport failures are swallowed; a genuine fault in the lookup still raises. """ - @spec id(Hyper.Vm.t()) :: Hyper.Vm.id() | nil + @spec id(Hyper.Vm.t()) :: Hyper.Vm.Id.t() | nil def id(pid) when is_pid(pid) do :erpc.call(node(pid), Hyper.Cluster.Routing, :id_for, [pid]) catch diff --git a/lib/hyper/cluster/routing.ex b/lib/hyper/cluster/routing.ex index 7b478363..9491a132 100644 --- a/lib/hyper/cluster/routing.ex +++ b/lib/hyper/cluster/routing.ex @@ -49,7 +49,7 @@ defmodule Hyper.Cluster.Routing do end @doc "Which node currently runs `vm_id`? `nil` if unknown." 
- @spec whereis(Hyper.Vm.id()) :: node() | nil + @spec whereis(Hyper.Vm.Id.t()) :: node() | nil @decorate with_span("Hyper.Cluster.Routing.whereis", include: [:vm_id]) def whereis(vm_id) do case Horde.Registry.lookup(@name, {vm_id, :supervisor}) do @@ -63,7 +63,7 @@ defmodule Hyper.Cluster.Routing do replica via a registry match spec; intended to be called on the node that owns `pid` (see `Hyper.id/1`). """ - @spec id_for(pid()) :: Hyper.Vm.id() | nil + @spec id_for(pid()) :: Hyper.Vm.Id.t() | nil @decorate with_span("Hyper.Cluster.Routing.id_for") def id_for(pid) when is_pid(pid) do case Horde.Registry.select(@name, [ @@ -75,7 +75,7 @@ defmodule Hyper.Cluster.Routing do end @doc "Every VM the cluster currently knows about, paired with the node it runs on." - @spec all() :: [{Hyper.Vm.id(), node()}] + @spec all() :: [{Hyper.Vm.Id.t(), node()}] @decorate with_span("Hyper.Cluster.Routing.all") def all do @name diff --git a/lib/hyper/grpc/codec.ex b/lib/hyper/grpc/codec.ex index 650040a1..9da622a0 100644 --- a/lib/hyper/grpc/codec.ex +++ b/lib/hyper/grpc/codec.ex @@ -85,15 +85,15 @@ defmodule Hyper.Grpc.Codec do end @doc "Convert a domain result to an outbound response message, or an error to `GRPC.RPCError`." 
- @spec to_grpc({:created, Hyper.Vm.id(), node()}) :: CreateVmResponse.t() + @spec to_grpc({:created, Hyper.Vm.Id.t(), node()}) :: CreateVmResponse.t() def to_grpc({:created, vm_id, node}) when is_binary(vm_id), do: %CreateVmResponse{vm_id: vm_id, node: to_string(node)} - @spec to_grpc({:located, Hyper.Vm.id(), node()}) :: GetVmResponse.t() + @spec to_grpc({:located, Hyper.Vm.Id.t(), node()}) :: GetVmResponse.t() def to_grpc({:located, vm_id, node}), do: %GetVmResponse{vm_id: vm_id, node: to_string(node)} - @spec to_grpc({:vms, [{Hyper.Vm.id(), node()}]}) :: ListVmsResponse.t() + @spec to_grpc({:vms, [{Hyper.Vm.Id.t(), node()}]}) :: ListVmsResponse.t() def to_grpc({:vms, vms}), do: %ListVmsResponse{vms: Enum.map(vms, &vm/1)} @@ -117,7 +117,7 @@ defmodule Hyper.Grpc.Codec do def to_grpc({:exit, {:nodedown, _}}), do: rpc_error(:machine_unreachable) def to_grpc({:exit, reason}), do: rpc_error({:stop_failed, reason}) - @spec vm({Hyper.Vm.id(), node()}) :: Vm.t() + @spec vm({Hyper.Vm.Id.t(), node()}) :: Vm.t() defp vm({vm_id, node}), do: %Vm{vm_id: vm_id, node: to_string(node)} @spec instance_type(instance_enum()) :: {:ok, Hyper.Vm.Instance.t()} diff --git a/lib/hyper/img/db/lease.ex b/lib/hyper/img/db/lease.ex index 7c737406..979b5f84 100644 --- a/lib/hyper/img/db/lease.ex +++ b/lib/hyper/img/db/lease.ex @@ -49,7 +49,7 @@ defmodule Hyper.Img.Db.Lease do Upserts on `(node_id, vm_id)` - the same call both takes a fresh lease and heartbeats a live one. """ - @spec bump(Hyper.Img.id(), Hyper.Vm.id(), Unit.Time.t()) :: + @spec bump(Hyper.Img.id(), Hyper.Vm.Id.t(), Unit.Time.t()) :: {:ok, %__MODULE__{}} | {:error, Ecto.Changeset.t()} @decorate with_span("Hyper.Img.Db.Lease.bump", include: [:image_id, :vm_id]) def bump(image_id, vm_id, ttl) do @@ -72,7 +72,7 @@ defmodule Hyper.Img.Db.Lease do Release the lease issued to the given node_id and the given vm_id. Since each VM only ever uses one image, it is not necessary to specify the image id. 
""" - @spec release(Hyper.Vm.id()) :: :ok + @spec release(Hyper.Vm.Id.t()) :: :ok @decorate with_span("Hyper.Img.Db.Lease.release", include: [:vm_id]) def release(vm_id) do query = from(l in __MODULE__, where: l.node_id == ^to_string(node()) and l.vm_id == ^vm_id) diff --git a/lib/hyper/node.ex b/lib/hyper/node.ex index 1f50fe58..34275278 100644 --- a/lib/hyper/node.ex +++ b/lib/hyper/node.ex @@ -80,7 +80,7 @@ defmodule Hyper.Node do layer, resolve the kernel, and start the VM supervisor. The uid is freed and the mutable layer torn down automatically when the VM supervisor dies. """ - @spec start_image_vm(Hyper.Vm.id(), Hyper.Vm.Spec.t()) :: {:ok, pid()} | {:error, term()} + @spec start_image_vm(Hyper.Vm.Id.t(), Hyper.Vm.Spec.t()) :: {:ok, pid()} | {:error, term()} @decorate with_span("Hyper.Node.start_image_vm", include: [:vm_id, :spec]) def start_image_vm(vm_id, %Hyper.Vm.Spec{} = spec) do with {:ok, uid} <- Users.claim(), @@ -106,7 +106,7 @@ defmodule Hyper.Node do end @doc false - @spec build_opts(Hyper.Vm.id(), Hyper.Vm.Spec.t(), Users.id(), pid(), Path.t()) :: + @spec build_opts(Hyper.Vm.Id.t(), Hyper.Vm.Spec.t(), Users.id(), pid(), Path.t()) :: FireVMM.Opts.t() def build_opts(vm_id, %Hyper.Vm.Spec{} = spec, uid, mutable, kernel) do %FireVMM.Opts{ diff --git a/lib/hyper/node/fire_vmm.ex b/lib/hyper/node/fire_vmm.ex index f9e45fbc..22b069ac 100644 --- a/lib/hyper/node/fire_vmm.ex +++ b/lib/hyper/node/fire_vmm.ex @@ -34,7 +34,7 @@ defmodule Hyper.Node.FireVMM do defstruct [:vm_id, :uid, :gid, :type, :arch, :mutable, :kernel, :boot_args] @type t :: %__MODULE__{ - vm_id: Hyper.Vm.id(), + vm_id: Hyper.Vm.Id.t(), uid: Hyper.Node.Users.id(), gid: Hyper.Node.Users.id(), type: Hyper.Vm.Instance.t(), diff --git a/lib/hyper/node/fire_vmm/chroot_jail.ex b/lib/hyper/node/fire_vmm/chroot_jail.ex index edaa4f1d..cab7364c 100644 --- a/lib/hyper/node/fire_vmm/chroot_jail.ex +++ b/lib/hyper/node/fire_vmm/chroot_jail.ex @@ -27,7 +27,7 @@ defmodule Hyper.Node.FireVMM.ChrootJail 
do `uid:gid`), and return `cold` with its kernel + rootfs paths rewritten to their in-jail equivalents. Fails the boot if either artifact cannot be staged. """ - @spec stage(Hyper.Vm.id(), non_neg_integer(), non_neg_integer(), Cold.t()) :: + @spec stage(Hyper.Vm.Id.t(), non_neg_integer(), non_neg_integer(), Cold.t()) :: {:ok, Cold.t()} | {:error, term()} @decorate with_span("Hyper.Node.FireVMM.ChrootJail.stage", include: [:vm_id]) def stage(vm_id, uid, gid, %Cold{} = cold) do diff --git a/lib/hyper/node/fire_vmm/client.ex b/lib/hyper/node/fire_vmm/client.ex index 2ed991ce..78835fbf 100644 --- a/lib/hyper/node/fire_vmm/client.ex +++ b/lib/hyper/node/fire_vmm/client.ex @@ -65,7 +65,7 @@ defmodule Hyper.Node.FireVMM.Client do GenServer.start_link(__MODULE__, opts, gen_opts(opts.name)) end - @spec via(Hyper.Vm.id()) :: GenServer.name() + @spec via(Hyper.Vm.Id.t()) :: GenServer.name() def via(vm_id), do: Hyper.Cluster.Routing.via({vm_id, :client}) @doc "Run a generated operation against this VM's daemon, serialized." 
diff --git a/lib/hyper/node/fire_vmm/daemon.ex b/lib/hyper/node/fire_vmm/daemon.ex index 964a0923..882a6cca 100644 --- a/lib/hyper/node/fire_vmm/daemon.ex +++ b/lib/hyper/node/fire_vmm/daemon.ex @@ -89,10 +89,10 @@ defmodule Hyper.Node.FireVMM.Daemon do end end - @spec reset_stale_jail(Hyper.Vm.id()) :: :ok | {:error, term()} + @spec reset_stale_jail(Hyper.Vm.Id.t()) :: :ok | {:error, term()} defp reset_stale_jail(id), do: clear_jail(id) - @spec clear_jail(Hyper.Vm.id()) :: :ok | {:error, term()} + @spec clear_jail(Hyper.Vm.Id.t()) :: :ok | {:error, term()} defp clear_jail(id) do SuidHelper.ChrootJail.remove(Jailer.chroot_dir(id), Jailer.cgroup_dir(id)) end diff --git a/lib/hyper/node/fire_vmm/jailer.ex b/lib/hyper/node/fire_vmm/jailer.ex index 36683736..46b53288 100644 --- a/lib/hyper/node/fire_vmm/jailer.ex +++ b/lib/hyper/node/fire_vmm/jailer.ex @@ -111,13 +111,13 @@ defmodule Hyper.Node.FireVMM.Jailer do end @doc "Host path of the VM's per-VM jail dir (`//`)." - @spec chroot_dir(Hyper.Vm.id()) :: Path.t() + @spec chroot_dir(Hyper.Vm.Id.t()) :: Path.t() def chroot_dir(id) do Path.join([Hyper.Config.chroot_base(), exec_name(), id]) end @doc "Host path of the VM's chroot root (`///root`)." - @spec chroot_root(Hyper.Vm.id()) :: Path.t() + @spec chroot_root(Hyper.Vm.Id.t()) :: Path.t() def chroot_root(id) do Path.join(chroot_dir(id), "root") end @@ -143,7 +143,7 @@ defmodule Hyper.Node.FireVMM.Jailer do the jailer creates for firecracker. Reconstructed (the jailer owns its placement) so a relaunch can clear the stale leaf left by a prior incarnation. """ - @spec cgroup_dir(Hyper.Vm.id()) :: Path.t() + @spec cgroup_dir(Hyper.Vm.Id.t()) :: Path.t() def cgroup_dir(id) do Path.join(cgroup_parent_dir(), id) end @@ -155,7 +155,7 @@ defmodule Hyper.Node.FireVMM.Jailer do derive it independently and are guaranteed to agree. We do not control where the jailer places the socket, so the path is reconstructed here. 
""" - @spec host_socket(Hyper.Vm.id()) :: Path.t() + @spec host_socket(Hyper.Vm.Id.t()) :: Path.t() def host_socket(id) do Path.join([ Hyper.Config.chroot_base(), diff --git a/lib/hyper/node/fire_vmm/state.ex b/lib/hyper/node/fire_vmm/state.ex index fa87d9b4..eab53db5 100644 --- a/lib/hyper/node/fire_vmm/state.ex +++ b/lib/hyper/node/fire_vmm/state.ex @@ -62,7 +62,7 @@ defmodule Hyper.Node.FireVMM.State do :gen_statem.start_link(__MODULE__, opts, []) end - @spec stop(Hyper.Vm.id()) :: :ok + @spec stop(Hyper.Vm.Id.t()) :: :ok def stop(id) do :gen_statem.call(via(id), :stop) end @@ -162,7 +162,7 @@ defmodule Hyper.Node.FireVMM.State do # `:socket_pending` means firecracker has not created the socket yet, so we # keep waiting; a hard error is logged but also tolerated until the deadline # (the probe that follows would fail with EACCES anyway and drive the stop). - @spec ensure_api_granted(Hyper.Vm.id(), State.t()) :: + @spec ensure_api_granted(Hyper.Vm.Id.t(), State.t()) :: {:cont, State.t()} | {:wait, State.t(), term()} defp ensure_api_granted(_id, %{api_granted: true} = data), do: {:cont, data} @@ -184,7 +184,7 @@ defmodule Hyper.Node.FireVMM.State do # has lapsed - then fail the boot, surfacing `reason` rather than swallowing # it into a bare `:daemon_unready`. A persistent failure here points at the # host->jail socket (path or, more often, the grant/permission step above). - @spec keep_probing(Hyper.Vm.id(), State.t(), term()) :: + @spec keep_probing(Hyper.Vm.Id.t(), State.t(), term()) :: {:keep_state, State.t(), list()} | {:stop, term(), State.t()} defp keep_probing(id, data, reason) do if System.monotonic_time(:millisecond) >= data.boot_deadline do @@ -247,7 +247,7 @@ defmodule Hyper.Node.FireVMM.State do # Cold boot, issued through the Client and aborting at the first error: # machine-config -> boot-source -> each drive -> each NIC -> InstanceStart. 
- @spec apply_spec(Hyper.Vm.id(), BootSpec.Cold.t()) :: :ok | {:error, term()} + @spec apply_spec(Hyper.Vm.Id.t(), BootSpec.Cold.t()) :: :ok | {:error, term()} @decorate with_span("Hyper.Node.FireVMM.State.Configuring.apply_spec", include: [:id]) defp apply_spec(id, %BootSpec.Cold{} = cold) do via = Client.via(id) diff --git a/lib/hyper/node/img.ex b/lib/hyper/node/img.ex index 320d4a7f..3f2ed123 100644 --- a/lib/hyper/node/img.ex +++ b/lib/hyper/node/img.ex @@ -53,7 +53,7 @@ defmodule Hyper.Node.Img do end @doc "Create a per-VM mutable layer for `vm_id` over `img_id`." - @spec create_mutable(Hyper.Img.id(), Hyper.Vm.id()) :: {:ok, pid()} | {:error, term()} + @spec create_mutable(Hyper.Img.id(), Hyper.Vm.Id.t()) :: {:ok, pid()} | {:error, term()} def create_mutable(img_id, vm_id) do case DynamicSupervisor.start_child( @mutable_sup, @@ -74,7 +74,7 @@ defmodule Hyper.Node.Img do Serve `img` to `vm_id` for the duration of `callable`, holding a DB lease on the image (and transitively its whole blob chain) the whole time. """ - @spec with_image(Hyper.Img.id(), Hyper.Vm.id(), (-> result)) :: result | {:error, term()} + @spec with_image(Hyper.Img.id(), Hyper.Vm.Id.t(), (-> result)) :: result | {:error, term()} when result: var def with_image(img, vm_id, callable) do with_image_lease(img, vm_id, callable) @@ -84,7 +84,7 @@ defmodule Hyper.Node.Img do # even if `callable` raises. A background task re-bumps the lease for the whole # run, so a long-lived VM never lets its claim lapse. If the lease cannot be # taken, returns the error and never runs `callable`. - @spec with_image_lease(Hyper.Img.id(), Hyper.Vm.id(), (-> result)) :: result | {:error, term()} + @spec with_image_lease(Hyper.Img.id(), Hyper.Vm.Id.t(), (-> result)) :: result | {:error, term()} when result: var defp with_image_lease(img, vm_id, callable) do ttl = Db.Lease.default_ttl() @@ -104,7 +104,7 @@ defmodule Hyper.Node.Img do # Re-bump the lease forever at 1/3 of the TTL, until killed. 
Runs in a task for the # lifetime of `callable`; transient bump failures are swallowed so a DB hiccup # can't tear down the VM - the next tick retries. - @spec heartbeat(Hyper.Img.id(), Hyper.Vm.id(), Unit.Time.t()) :: no_return() + @spec heartbeat(Hyper.Img.id(), Hyper.Vm.Id.t(), Unit.Time.t()) :: no_return() defp heartbeat(img, vm_id, ttl) do Process.sleep(div(Unit.Time.as_ms(ttl), 3)) diff --git a/lib/hyper/node/img/mutable.ex b/lib/hyper/node/img/mutable.ex index 4dbab644..e70d8095 100644 --- a/lib/hyper/node/img/mutable.ex +++ b/lib/hyper/node/img/mutable.ex @@ -31,7 +31,7 @@ defmodule Hyper.Node.Img.Mutable do @enforce_keys [:img_id, :vm_id] defstruct [:img_id, :vm_id] - @type t :: %__MODULE__{img_id: Hyper.Img.id(), vm_id: Hyper.Vm.id()} + @type t :: %__MODULE__{img_id: Hyper.Img.id(), vm_id: Hyper.Vm.Id.t()} end defmodule State do @@ -137,7 +137,7 @@ defmodule Hyper.Node.Img.Mutable do end @doc false - @spec dm_name(Hyper.Vm.id()) :: String.t() + @spec dm_name(Hyper.Vm.Id.t()) :: String.t() def dm_name(vm_id), do: "hyper-rw-#{sanitize(vm_id)}" defp sanitize(id), do: String.replace(id, ~r/[^A-Za-z0-9._-]/, "_") diff --git a/lib/hyper/vm.ex b/lib/hyper/vm.ex index 0b99da8d..3001e41c 100644 --- a/lib/hyper/vm.ex +++ b/lib/hyper/vm.ex @@ -10,7 +10,6 @@ defmodule Hyper.Vm do @dialyzer {:nowarn_function, [fast_fork: 1, fork: 1]} @type t :: pid() - @type id :: String.t() @typedoc """ What a VM boots from: explicit, already-jail-visible artifact paths for a cold diff --git a/lib/hyper/vm/id.ex b/lib/hyper/vm/id.ex new file mode 100644 index 00000000..7c43d29e --- /dev/null +++ b/lib/hyper/vm/id.ex @@ -0,0 +1,27 @@ +defmodule Hyper.Vm.Id do + @moduledoc """ + A microVM id and its generator. + + An id is a `v` prefix followed by lowercase base32 of 10 random bytes, charset + `[a-z2-7]` - alphanumeric only, no `-`, `_`, or other punctuation. 
That charset + is the intersection of three independent constraints the id must satisfy at + once: + + * firecracker rejects `_` in an instance id (`InvalidInstanceId`); + * dm/jailer names must not start with `-`; + * registry keys and chroot path components stay trivially safe. + """ + + @type t :: String.t() + + @doc """ + Generate a fresh, random VM id (see the module doc for the charset contract). + + The previous base64url encoding emitted `-` and `_`, so it could produce ids + firecracker refused at boot (`Invalid char (_)`). + """ + @spec generate() :: t() + def generate do + "v" <> Base.encode32(:crypto.strong_rand_bytes(10), padding: false, case: :lower) + end +end diff --git a/test/hyper/node/fire_vmm/jailer_test.exs b/test/hyper/node/fire_vmm/jailer_test.exs index 1ca862fe..b0f03fc0 100644 --- a/test/hyper/node/fire_vmm/jailer_test.exs +++ b/test/hyper/node/fire_vmm/jailer_test.exs @@ -1,6 +1,6 @@ defmodule Hyper.Node.FireVMM.JailerTest do @moduledoc """ - Properties and examples for `Hyper.Node.FireVMM.Jailer.command/1`. + Examples for `Hyper.Node.FireVMM.Jailer.command/1`. Load-bearing invariant: the BEAM must never place a privileged binary path (firecracker, jailer) or lifecycle flags owned by the suidhelper (`--exec-file`, @@ -9,7 +9,6 @@ defmodule Hyper.Node.FireVMM.JailerTest do """ use ExUnit.Case, async: false - use ExUnitProperties alias Hyper.Node.FireVMM alias Hyper.Node.FireVMM.Jailer @@ -79,13 +78,4 @@ defmodule Hyper.Node.FireVMM.JailerTest do assert Enum.any?(args, &String.starts_with?(&1, "cpu.max=")) assert Enum.any?(args, &String.starts_with?(&1, "memory.max=")) end - - # firecracker rejects an instance id containing `_` (and dm/jailer names must - # not lead with `-`), so the id must be strictly alphanumeric. 
- property "gen_vm_id/0 produces only alphanumeric ids" do - check all(_ <- StreamData.constant(nil)) do - id = Hyper.gen_vm_id() - assert id =~ ~r/\A[A-Za-z0-9]+\z/ - end - end end diff --git a/test/hyper/vm/id_test.exs b/test/hyper/vm/id_test.exs new file mode 100644 index 00000000..5f82fb1e --- /dev/null +++ b/test/hyper/vm/id_test.exs @@ -0,0 +1,20 @@ +defmodule Hyper.Vm.IdTest do + @moduledoc """ + The charset contract of `Hyper.Vm.Id.generate/0`. The load-bearing invariant is + the refusal property: a generated id is *always* strictly alphanumeric, so it + can never carry a char that firecracker (`_`) or dm/jailer names (`-`) reject. + """ + + use ExUnit.Case, async: true + use ExUnitProperties + + alias Hyper.Vm.Id + + property "generate/0 produces a `v`-prefixed, strictly alphanumeric id" do + check all(_ <- StreamData.constant(nil)) do + id = Id.generate() + assert id =~ ~r/\A[A-Za-z0-9]+\z/ + assert String.starts_with?(id, "v") + end + end +end From 7811ed661c157d8162e233116bcedc886bd7907e Mon Sep 17 00:00:00 2001 From: Marko Vejnovic Date: Fri, 26 Jun 2026 17:24:37 +0000 Subject: [PATCH 39/46] test(suidhelper): tolerate shell-set PWD in jailer empty-env e2e The privileged e2e `execs_jailer_..._empty_env_as_root` asserted the recorder's /proc/self/environ was literally empty, but the recorder is a /bin/sh script and the shell self-sets PWD (and under bash _/SHLVL) on startup - so the assertion failed in CI even though the helper does execve the jailer with an empty envp. Prove the real property instead: no CALLER variable survives. The helper is spawned with HYPER_* config vars in its own env, so their absence in the recorder is the leak canary; allow only shell-set PWD/_/SHLVL. 
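For illustration only, the leak check described above boils down to roughly the following. This is a minimal standalone sketch, not the test itself: the function name `leaked_entries`, the `SHELL_SET` constant, and the sample variable `HYPER_WORK_DIR` are hypothetical, and the input is assumed to be a NUL-separated dump in the format of /proc/self/environ.

```rust
/// Variables a shell may set for itself on startup; these are tolerated.
/// (Hypothetical constant for this sketch; the test inlines the same names.)
const SHELL_SET: [&str; 3] = ["PWD", "_", "SHLVL"];

/// Split a NUL-separated environ dump and return every `KEY=VALUE` entry
/// whose key is not one of the shell-set names above.
fn leaked_entries(environ: &[u8]) -> Vec<String> {
    environ
        // Entries are separated (and terminated) by NUL bytes.
        .split(|&b| b == 0)
        // The trailing NUL produces one empty chunk; drop it.
        .filter(|entry| !entry.is_empty())
        .map(|entry| String::from_utf8_lossy(entry).into_owned())
        .filter(|entry| {
            // The key is everything before the first `=`.
            let key = entry.split('=').next().unwrap_or("");
            !SHELL_SET.contains(&key)
        })
        .collect()
}
```

An environ holding only shell-set names yields an empty list, while any caller variable (a `HYPER_*` key, `LD_PRELOAD`, ...) shows up in the result and would fail the assertion.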
--- native/suidhelper/tests/e2e/jailer.rs | 24 ++++++++++++++++++++---- 1 file changed, 20 insertions(+), 4 deletions(-) diff --git a/native/suidhelper/tests/e2e/jailer.rs b/native/suidhelper/tests/e2e/jailer.rs index 87084aef..c5921059 100644 --- a/native/suidhelper/tests/e2e/jailer.rs +++ b/native/suidhelper/tests/e2e/jailer.rs @@ -147,12 +147,28 @@ fn execs_jailer_with_canonical_argv_and_empty_env_as_root() { "helper handed the jailer a non-canonical argv", ); - // The execve envp must be empty: once ruid==0 a smuggled LD_PRELOAD would be - // honored, so nothing of the caller's environment may reach the root jailer. + // The helper execve's the jailer with an EMPTY envp (see src/tools/jailer.rs): + // once ruid==0 a smuggled LD_PRELOAD would be honored, so nothing of the + // caller's environment may reach the root jailer. The recorder is a `/bin/sh` + // script and the shell *self-sets* `PWD` (and, under bash, `_`/`SHLVL`) on + // startup, so a literally-empty environ is impossible to observe through it. + // We instead prove no CALLER variable survives: the helper is spawned with + // `HYPER_*` config vars in its own environment (see `run`), so their absence + // here is the leak canary - had the helper passed its environment through, + // they would appear alongside `PWD`. 
 let environ = fs::read(&env_rec).expect("recorded environ"); + let leaked: Vec<String> = environ + .split(|&b| b == 0) + .filter(|entry| !entry.is_empty()) + .map(|entry| String::from_utf8_lossy(entry).into_owned()) + .filter(|entry| { + let key = entry.split('=').next().unwrap_or(""); + !matches!(key, "PWD" | "_" | "SHLVL") + }) + .collect(); assert!( - environ.is_empty(), - "jailer inherited a non-empty environment: {environ:?}", + leaked.is_empty(), + "caller environment leaked to the jailer (only shell-set PWD/_/SHLVL allowed): {leaked:?}", ); } From b5a0c163486fd113522a7763a1e1171c362cbe72 Mon Sep 17 00:00:00 2001 From: Marko Vejnovic Date: Fri, 26 Jun 2026 17:27:58 +0000 Subject: [PATCH 40/46] style(node): wrap with_image_lease spec after Vm.Id.t() rename The Hyper.Vm.id() -> Hyper.Vm.Id.t() migration widened the @spec past the line limit; mix format wraps it. (CI format gate.) --- lib/hyper/node/img.ex | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/lib/hyper/node/img.ex b/lib/hyper/node/img.ex index 3f2ed123..53bf2ae6 100644 --- a/lib/hyper/node/img.ex +++ b/lib/hyper/node/img.ex @@ -84,7 +84,8 @@ defmodule Hyper.Node.Img do # even if `callable` raises. A background task re-bumps the lease for the whole # run, so a long-lived VM never lets its claim lapse. If the lease cannot be # taken, returns the error and never runs `callable`. 
- @spec with_image_lease(Hyper.Img.id(), Hyper.Vm.Id.t(), (-> result)) :: result | {:error, term()} + @spec with_image_lease(Hyper.Img.id(), Hyper.Vm.Id.t(), (-> result)) :: + result | {:error, term()} when result: var defp with_image_lease(img, vm_id, callable) do ttl = Db.Lease.default_ttl() From dfb6078963e5a7479f9a257b33dacf846ac6e6d8 Mon Sep 17 00:00:00 2001 From: Marko Vejnovic Date: Fri, 26 Jun 2026 18:10:12 +0000 Subject: [PATCH 41/46] refactor(config): make config.toml the single source of truth MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Two changes, one theme — stop duplicating host config and read /etc/hyper/config.toml at runtime so the node and the setuid helper can never drift. [tools] section: the helper's tool binaries (firecracker, jailer, dmsetup, losetup, blockdev) now live under a `[tools]` table instead of flat top-level keys. Rust models it as a dedicated `Tools` struct (device tools default, firecracker/jailer remain Option -> Unconfigured at use). Elixir reads firecracker/jailer from `[tools]`; `mix firecracker.install` prints the `[tools]` snippet. Runtime-read shared config: `parent_cgroup` and `uid_gid_range` were specified in BOTH config.toml (helper) and `config :hyper` (node) and had to be kept in sync by hand. The node now reads them from config.toml at runtime, with defaults matching the helper's `Config::default`. `layer_dir` is derived from `work_dir` (`/layers`) rather than a separate key. The `config :hyper, cgroup_parent/uid_gid_range/layer_dir` block is gone; an absent config.toml yields the same built-in defaults on both sides. 
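As a rough illustration of the `[tools]` shape described above (device tools carry a default, firecracker/jailer stay optional and surface as Unconfigured only when used), a sketch might look like the following. It is not the helper's actual definition: the `ToolError` type, the accessor names, and the bare-name default paths are assumptions made for this example, and TOML parsing is left out entirely.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical error type: a tool was needed but never configured.
#[derive(Debug, PartialEq)]
enum ToolError {
    Unconfigured(&'static str),
}

/// Sketch of a `[tools]` table: device tools always have a value,
/// firecracker/jailer have no default and must be set by the operator.
struct Tools {
    firecracker: Option<PathBuf>,
    jailer: Option<PathBuf>,
    dmsetup: PathBuf,
    losetup: PathBuf,
    blockdev: PathBuf,
}

impl Default for Tools {
    fn default() -> Self {
        Tools {
            firecracker: None,
            jailer: None,
            // Placeholder defaults for this sketch; the real defaults may differ.
            dmsetup: PathBuf::from("dmsetup"),
            losetup: PathBuf::from("losetup"),
            blockdev: PathBuf::from("blockdev"),
        }
    }
}

impl Tools {
    /// Resolve firecracker at the point of use; absent means Unconfigured.
    fn firecracker(&self) -> Result<&Path, ToolError> {
        self.firecracker
            .as_deref()
            .ok_or(ToolError::Unconfigured("firecracker"))
    }

    /// Same contract for the jailer.
    fn jailer(&self) -> Result<&Path, ToolError> {
        self.jailer.as_deref().ok_or(ToolError::Unconfigured("jailer"))
    }
}
```

Deferring the error to the accessor means an absent config file still yields a usable default value for everything that has a default, and only the launch path fails when firecracker or the jailer is missing.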
--- config/config.exs | 5 -- lib/hyper/config.ex | 95 +++++++++++++++--------- lib/mix/tasks/firecracker.install.ex | 1 + native/suidhelper/src/config.rs | 62 ++++++++++------ native/suidhelper/tests/e2e/argv.rs | 3 +- native/suidhelper/tests/e2e/config.rs | 4 +- native/suidhelper/tests/e2e/jailer.rs | 2 +- test/hyper/node/fire_vmm/jailer_test.exs | 6 +- 8 files changed, 111 insertions(+), 67 deletions(-) diff --git a/config/config.exs b/config/config.exs index 7c831bce..cb0bd469 100644 --- a/config/config.exs +++ b/config/config.exs @@ -21,11 +21,6 @@ config :libcluster, ] ] -config :hyper, - cgroup_parent: "hyper", - uid_gid_range: {900_000, 999_999}, - layer_dir: "/srv/hyper/layers" - if config_env() == :test do config :opentelemetry, traces_exporter: :none # No cluster formation during tests. diff --git a/lib/hyper/config.ex b/lib/hyper/config.ex index 0704c647..c4d37c02 100644 --- a/lib/hyper/config.ex +++ b/lib/hyper/config.ex @@ -1,21 +1,30 @@ defmodule Hyper.Config do @moduledoc """ - Host configuration, read from `config :hyper, ...` (see `config/config.exs`). - - Runtime values shared with the setuid helper (`native/suidhelper`) — `work_dir`, - `firecracker`, `jailer` — are read from `/etc/hyper/config.toml` (the single - source of truth for both sides) the first time they are needed, then cached in - `:persistent_term`. Everything else is compile-time. + Host configuration. + + Everything shared with the setuid helper (`native/suidhelper`) is read from the + single source of truth, `/etc/hyper/config.toml`, at runtime — never duplicated + in `config :hyper`. The node and the helper parse the same file, so they cannot + drift: `work_dir`, the `[tools]` binary paths (`firecracker`, `jailer`, ...), + `parent_cgroup`, and `[uid_gid_range]`. The file is read once on first access + and cached in `:persistent_term`; an absent file (local dev / CI) yields the + same built-in defaults the helper compiles in, so both sides still agree. 
+ + Node-only settings with no helper counterpart (`skopeo`/`umoci`/`mke2fs` paths, + `vmlinux`, the cluster topology) stay in `config :hyper`. """ - # The shared data-root config file, read by both this node and the setuid - # helper. Absent in local dev / CI, where `@dev_work_dir` is used instead. + # The shared config file, read by both this node and the setuid helper. Absent + # in local dev / CI, where the built-in defaults below are used instead. @config_path "/etc/hyper/config.toml" @dev_work_dir "/srv/hyper" - @parent_cgroup Application.compile_env(:hyper, :cgroup_parent, "hyper") - @uid_gid_range Application.compile_env!(:hyper, :uid_gid_range) - @layer_dir Application.compile_env!(:hyper, :layer_dir) + # Defaults for the helper-shared values, kept in lockstep with the helper's + # `Config::default` (native/suidhelper/src/config.rs) so an absent config.toml + # makes the node and the helper agree out of the box. + @default_parent_cgroup "hyper" + @default_uid_gid_range {900_000, 999_999} + @skopeo_path Application.compile_env(:hyper, :skopeo_path, "skopeo") @umoci_path Application.compile_env(:hyper, :umoci_path, nil) @mke2fs_path Application.compile_env(:hyper, :mke2fs_path, "mke2fs") @@ -36,49 +45,53 @@ defmodule Hyper.Config do def redist_dir, do: Path.join(work_dir(), "redist") @doc """ - Absolute path to the firecracker binary, as set in `#{@config_path}` under the - `firecracker` key. Raises if the key is absent — the operator must configure it; - there is no default. + Absolute path to the firecracker binary, from the `[tools]` table in + `#{@config_path}`. Raises if absent — the operator must configure it; there is + no default. For the launch path only. Pre-launch checks should use `firecracker_bin_configured/0` so a missing key returns a typed error rather than crashing. 
""" @spec firecracker_bin :: Path.t() - def firecracker_bin, do: fetch_bin!("firecracker") + def firecracker_bin, do: fetch_tool!("firecracker") @doc """ Non-raising form of `firecracker_bin/0`. Returns `{:ok, path}` when the - `firecracker` key is present in `#{@config_path}`, or `:error` when it is absent. + `[tools] firecracker` key is present in `#{@config_path}`, or `:error` otherwise. """ @spec firecracker_bin_configured :: {:ok, Path.t()} | :error - def firecracker_bin_configured, do: Map.fetch(config_toml(), "firecracker") + def firecracker_bin_configured, do: Map.fetch(tools(), "firecracker") @doc """ - Absolute path to the jailer binary, as set in `#{@config_path}` under the - `jailer` key. Raises if the key is absent — the operator must configure it; - there is no default. + Absolute path to the jailer binary, from the `[tools]` table in `#{@config_path}`. + Raises if absent — the operator must configure it; there is no default. For the launch path only. Pre-launch checks should use `jailer_bin_configured/0` so a missing key returns a typed error rather than crashing. """ @spec jailer_bin :: Path.t() - def jailer_bin, do: fetch_bin!("jailer") + def jailer_bin, do: fetch_tool!("jailer") @doc """ - Non-raising form of `jailer_bin/0`. Returns `{:ok, path}` when the `jailer` key - is present in `#{@config_path}`, or `:error` when it is absent. + Non-raising form of `jailer_bin/0`. Returns `{:ok, path}` when the + `[tools] jailer` key is present in `#{@config_path}`, or `:error` otherwise. """ @spec jailer_bin_configured :: {:ok, Path.t()} | :error - def jailer_bin_configured, do: Map.fetch(config_toml(), "jailer") + def jailer_bin_configured, do: Map.fetch(tools(), "jailer") - @spec fetch_bin!(String.t()) :: Path.t() - defp fetch_bin!(key) do - case Map.fetch(config_toml(), key) do + # The `[tools]` table (binary paths shared with the helper), or `%{}` when the + # file or table is absent. 
+ @spec tools :: map() + defp tools, do: Map.get(config_toml(), "tools", %{}) + + @spec fetch_tool!(String.t()) :: Path.t() + defp fetch_tool!(key) do + case Map.fetch(tools(), key) do {:ok, path} -> path :error -> - raise "#{@config_path}: key #{inspect(key)} is not set; " <> + raise "#{@config_path}: `[tools] #{key}` is not set; " <> "operator must configure it before starting the node" end end @@ -121,9 +134,11 @@ defmodule Hyper.Config do def chroot_base, do: Path.join(work_dir(), "jails") @doc """ - A name for the parent cgroup which is used as a supervision cgroup for all VMs. + Name of the parent cgroup used as a supervision cgroup for all VMs. Read from + `parent_cgroup` in `#{@config_path}` (shared with the helper), default `"hyper"`. """ - def parent_cgroup, do: @parent_cgroup + @spec parent_cgroup :: String.t() + def parent_cgroup, do: Map.get(config_toml(), "parent_cgroup", @default_parent_cgroup) @doc """ Path to the directory where all VM sockets are held. @@ -135,12 +150,20 @@ defmodule Hyper.Config do def socket_dir, do: Path.join(work_dir(), "socks") @doc """ - Range in which `Hyper` will attempt to allocate uid/gids. Whenever a VM is allocated, it will - get a fresh uid/gid pair in this range. It is absolutely critical that this range is not used - by any other process on the system, as that can risk security. + Range in which `Hyper` allocates uid/gids: each VM gets a fresh uid/gid pair in + this range. Critical that no other process on the system uses this range. + + Read from the `[uid_gid_range]` table (`min`/`max`) in `#{@config_path}` — the + same file the helper validates against, so the node only ever hands out uids the + helper will accept. Defaults to `#{inspect(@default_uid_gid_range)}` when absent. 
""" @spec uid_gid_range :: {integer(), integer()} - def uid_gid_range, do: @uid_gid_range + def uid_gid_range do + case Map.get(config_toml(), "uid_gid_range") do + %{"min" => min, "max" => max} -> {min, max} + _ -> @default_uid_gid_range + end + end @doc """ Location of all image layers on all nodes. @@ -151,9 +174,11 @@ defmodule Hyper.Config do Must be stable across all nodes, and must be a directory. If it does not exist, `Hyper.Node` will attempt to create one. + + Derived as `/layers`, so it follows `work_dir` from `#{@config_path}`. """ @spec layer_dir :: Path.t() - def layer_dir, do: @layer_dir + def layer_dir, do: Path.join(work_dir(), "layers") @doc "Path to the skopeo binary (used by `Hyper.Img.OciLoader` to pull OCI images)." def skopeo_path, do: @skopeo_path diff --git a/lib/mix/tasks/firecracker.install.ex b/lib/mix/tasks/firecracker.install.ex index 03ada2be..38bb7ad1 100644 --- a/lib/mix/tasks/firecracker.install.ex +++ b/lib/mix/tasks/firecracker.install.ex @@ -140,6 +140,7 @@ defmodule Mix.Tasks.Firecracker.Install do Then add to /etc/hyper/config.toml (file: root-owned, mode 0644): + [tools] firecracker = "#{fc}" jailer = "#{jailer}" """) diff --git a/native/suidhelper/src/config.rs b/native/suidhelper/src/config.rs index 76a7f221..104031dd 100644 --- a/native/suidhelper/src/config.rs +++ b/native/suidhelper/src/config.rs @@ -89,22 +89,44 @@ fn config_path() -> PathBuf { #[derive(Debug, Clone, Deserialize)] pub struct Config { work_dir: PathBuf, - #[serde(default = "default_dmsetup")] - dmsetup: PathBuf, - #[serde(default = "default_losetup")] - losetup: PathBuf, - #[serde(default = "default_blockdev")] - blockdev: PathBuf, - #[serde(default)] - firecracker: Option, #[serde(default)] - jailer: Option, + tools: Tools, #[serde(default = "default_parent_cgroup")] parent_cgroup: String, #[serde(default)] uid_gid_range: Option, } +/// Paths to the external binaries the helper runs, the `[tools]` table. 
+/// +/// The device tools (`dmsetup`, `losetup`, `blockdev`) carry built-in defaults; +/// `firecracker` and `jailer` have none and must be configured before any VM can +/// launch — their absence surfaces as [`BinError::Unconfigured`] at use time, not +/// at load. Every path is validated as a root-owned, correctly-named [`SafeBin`] +/// when accessed, never at parse time (the file is read unprivileged). A missing +/// `[tools]` table, or any missing key within it, falls back to these defaults. +#[derive(Debug, Clone, Deserialize)] +#[serde(default)] +pub struct Tools { + dmsetup: PathBuf, + losetup: PathBuf, + blockdev: PathBuf, + firecracker: Option<PathBuf>, + jailer: Option<PathBuf>, +} + +impl Default for Tools { + fn default() -> Self { + Self { + dmsetup: default_dmsetup(), + losetup: default_losetup(), + blockdev: default_blockdev(), + firecracker: None, + jailer: None, + } + } +} + // The default data root. Must match the Elixir node's `@dev_work_dir`, which it // uses when the same config file is absent, so both sides agree (see // `Hyper.Node.check_helper_base`). @@ -133,11 +155,7 @@ impl Default for Config { fn default() -> Self { Self { work_dir: default_work_dir(), - dmsetup: default_dmsetup(), - losetup: default_losetup(), - blockdev: default_blockdev(), - firecracker: None, - jailer: None, + tools: Tools::default(), parent_cgroup: default_parent_cgroup(), uid_gid_range: None, } @@ -174,24 +192,25 @@ impl Config { /// The validated `dmsetup` binary the helper will run. pub fn dmsetup(&self) -> Result, safe_bin::Error> { - SafeBin::from_path(&self.dmsetup) + SafeBin::from_path(&self.tools.dmsetup) } /// The validated `losetup` binary the helper will run. pub fn losetup(&self) -> Result, safe_bin::Error> { - SafeBin::from_path(&self.losetup) + SafeBin::from_path(&self.tools.losetup) } /// The validated `blockdev` binary the helper will run. 
pub fn blockdev(&self) -> Result, safe_bin::Error> { - SafeBin::from_path(&self.blockdev) + SafeBin::from_path(&self.tools.blockdev) } /// The Firecracker VMM binary, validated as root-owned and correctly named. /// Errors [`BinError::Unconfigured`] when absent from config — an operator - /// must set this key before any VM can be launched. + /// must set `[tools] firecracker` before any VM can be launched. pub fn firecracker(&self) -> Result, BinError> { - self.firecracker + self.tools + .firecracker .as_deref() .ok_or(BinError::Unconfigured("firecracker")) .and_then(|p| SafeBin::from_path(p).map_err(BinError::Bin)) @@ -199,9 +218,10 @@ impl Config { /// The Firecracker jailer binary, validated as root-owned and correctly named. /// Errors [`BinError::Unconfigured`] when absent from config — an operator - /// must set this key before any VM can be launched. + /// must set `[tools] jailer` before any VM can be launched. pub fn jailer(&self) -> Result, BinError> { - self.jailer + self.tools + .jailer .as_deref() .ok_or(BinError::Unconfigured("jailer")) .and_then(|p| SafeBin::from_path(p).map_err(BinError::Bin)) diff --git a/native/suidhelper/tests/e2e/argv.rs b/native/suidhelper/tests/e2e/argv.rs index dcda8db9..25c5981a 100644 --- a/native/suidhelper/tests/e2e/argv.rs +++ b/native/suidhelper/tests/e2e/argv.rs @@ -36,7 +36,8 @@ fn install_fake(dir: &Path, basename: &str, record: &Path, stdout_line: &str) -> /// caller argument. fn write_root_config(dir: &Path, bins: &[(&str, &Path)]) -> PathBuf { let p = dir.join("config.toml"); - let mut body = String::from("work_dir = \"/srv/hyper\"\n"); + // Every key here is a tool name, so they live under the `[tools]` table. 
+ let mut body = String::from("work_dir = \"/srv/hyper\"\n[tools]\n"); for (key, path) in bins { body.push_str(&format!("{key} = \"{}\"\n", path.display())); } diff --git a/native/suidhelper/tests/e2e/config.rs b/native/suidhelper/tests/e2e/config.rs index ef57c97b..f1fbfc77 100644 --- a/native/suidhelper/tests/e2e/config.rs +++ b/native/suidhelper/tests/e2e/config.rs @@ -158,7 +158,7 @@ fn jailer_unconfigured_when_absent() { fn jailer_basename_mismatch_rejected() { // The basename check in SafeBin::from_path precedes the stat, so we do not // need a real file — any absolute path with the wrong leaf name is enough. - let body = "work_dir = \"/srv/hyper\"\njailer = \"/usr/local/bin/not-jailer\"\n"; + let body = "work_dir = \"/srv/hyper\"\n[tools]\njailer = \"/usr/local/bin/not-jailer\"\n"; let config: Config = toml::from_str(body).unwrap(); let err = config .jailer() @@ -184,7 +184,7 @@ fn firecracker_and_jailer_return_ok_when_root_owned_as_root() { fs::set_permissions(p, fs::Permissions::from_mode(0o755)).unwrap(); } let body = format!( - "work_dir = \"/srv/hyper\"\nfirecracker = \"{}\"\njailer = \"{}\"\n", + "work_dir = \"/srv/hyper\"\n[tools]\nfirecracker = \"{}\"\njailer = \"{}\"\n", fc.display(), jr.display(), ); diff --git a/native/suidhelper/tests/e2e/jailer.rs b/native/suidhelper/tests/e2e/jailer.rs index c5921059..ba45468c 100644 --- a/native/suidhelper/tests/e2e/jailer.rs +++ b/native/suidhelper/tests/e2e/jailer.rs @@ -58,7 +58,7 @@ fn install_firecracker(dir: &Path) -> PathBuf { fn write_root_config(dir: &Path, jailer: &Path, firecracker: &Path) -> PathBuf { let p = dir.join("config.toml"); let body = format!( - "work_dir = \"/srv/hyper\"\njailer = \"{}\"\nfirecracker = \"{}\"\n", + "work_dir = \"/srv/hyper\"\n[tools]\njailer = \"{}\"\nfirecracker = \"{}\"\n", jailer.display(), firecracker.display(), ); diff --git a/test/hyper/node/fire_vmm/jailer_test.exs b/test/hyper/node/fire_vmm/jailer_test.exs index b0f03fc0..a744eb6c 100644 --- 
a/test/hyper/node/fire_vmm/jailer_test.exs +++ b/test/hyper/node/fire_vmm/jailer_test.exs @@ -20,8 +20,10 @@ defmodule Hyper.Node.FireVMM.JailerTest do # async: false because persistent_term is global state. setup do :persistent_term.put({Hyper.Config, :config_toml}, %{ - "firecracker" => "/usr/local/bin/firecracker", - "jailer" => "/usr/local/bin/jailer" + "tools" => %{ + "firecracker" => "/usr/local/bin/firecracker", + "jailer" => "/usr/local/bin/jailer" + } }) on_exit(fn -> :persistent_term.erase({Hyper.Config, :config_toml}) end) From 727d08e3587de6b3cf0f94868f5a3cd7859537d9 Mon Sep 17 00:00:00 2001 From: Marko Vejnovic Date: Fri, 26 Jun 2026 18:24:11 +0000 Subject: [PATCH 42/46] refactor(config): nest cgroup + uid_gid_range under [jails] table Move the parent cgroup and the uid/gid allocation band out of top-level config.toml keys into a [jails] table, matching the documented format: [jails] uid_gid_range = [900000, 999999] cgroup = "hyper" uid_gid_range is now a [min, max] array rather than a {min, max} sub-table; the Rust helper keeps the rich UidGidRange type via serde(from = "[u32; 2]"). Both the node and the setuid helper read the same [jails] table, so they still cannot drift. --- lib/hyper/config.ex | 21 ++++--- native/suidhelper/src/config.rs | 58 ++++++++++++++----- .../suidhelper/tests/config_uid_gid_range.rs | 2 +- native/suidhelper/tests/e2e/config.rs | 2 +- 4 files changed, 58 insertions(+), 25 deletions(-) diff --git a/lib/hyper/config.ex b/lib/hyper/config.ex index c4d37c02..994c1fe4 100644 --- a/lib/hyper/config.ex +++ b/lib/hyper/config.ex @@ -6,7 +6,7 @@ defmodule Hyper.Config do single source of truth, `/etc/hyper/config.toml`, at runtime — never duplicated in `config :hyper`. The node and the helper parse the same file, so they cannot drift: `work_dir`, the `[tools]` binary paths (`firecracker`, `jailer`, ...), - `parent_cgroup`, and `[uid_gid_range]`. 
The file is read once on first access + and the `[jails]` table (`cgroup`, `uid_gid_range`). The file is read once on first access and cached in `:persistent_term`; an absent file (local dev / CI) yields the same built-in defaults the helper compiles in, so both sides still agree. @@ -135,10 +135,15 @@ defmodule Hyper.Config do @doc """ Name of the parent cgroup used as a supervision cgroup for all VMs. Read from - `parent_cgroup` in `#{@config_path}` (shared with the helper), default `"hyper"`. + `[jails] cgroup` in `#{@config_path}` (shared with the helper), default `"hyper"`. """ @spec parent_cgroup :: String.t() - def parent_cgroup, do: Map.get(config_toml(), "parent_cgroup", @default_parent_cgroup) + def parent_cgroup, do: Map.get(jails(), "cgroup", @default_parent_cgroup) + + # The `[jails]` table (VM placement/confinement, shared with the helper), or + # `%{}` when the file or table is absent. + @spec jails :: map() + defp jails, do: Map.get(config_toml(), "jails", %{}) @doc """ Path to the directory where all VM sockets are held. @@ -153,14 +158,14 @@ defmodule Hyper.Config do Range in which `Hyper` allocates uid/gids: each VM gets a fresh uid/gid pair in this range. Critical that no other process on the system uses this range. - Read from the `[uid_gid_range]` table (`min`/`max`) in `#{@config_path}` — the - same file the helper validates against, so the node only ever hands out uids the - helper will accept. Defaults to `#{inspect(@default_uid_gid_range)}` when absent. + Read from `[jails] uid_gid_range` (a `[min, max]` array) in `#{@config_path}` — + the same file the helper validates against, so the node only ever hands out uids + the helper will accept. Defaults to `#{inspect(@default_uid_gid_range)}` when absent. 
""" @spec uid_gid_range :: {integer(), integer()} def uid_gid_range do - case Map.get(config_toml(), "uid_gid_range") do - %{"min" => min, "max" => max} -> {min, max} + case Map.get(jails(), "uid_gid_range") do + [min, max] -> {min, max} _ -> @default_uid_gid_range end end diff --git a/native/suidhelper/src/config.rs b/native/suidhelper/src/config.rs index 104031dd..41b00a75 100644 --- a/native/suidhelper/src/config.rs +++ b/native/suidhelper/src/config.rs @@ -3,10 +3,10 @@ //! //! ## UID/GID range divergence //! -//! Elixir keeps `compile_env` default `{900_000, 999_999}` that governs which -//! UIDs the node hands *out*; this helper reads `[uid_gid_range]` from -//! config.toml to decide which UIDs it *accepts* (default `{900_000, 999_999}` -//! when the key is absent). Operators narrowing the range must set **both**. +//! Elixir keeps a default `{900_000, 999_999}` that governs which UIDs the node +//! hands *out*; this helper reads `[jails] uid_gid_range` from config.toml to +//! decide which UIDs it *accepts* (default `{900_000, 999_999}` when the key is +//! absent). Operators narrowing the range must set **both**. use crate::util::safe_bin::{self, SafeBin}; use crate::util::safe_file::{self, IsRegularFile, OnlyRootWritable, RootOwner, SafeFile}; @@ -45,14 +45,22 @@ pub enum BinError { const CONFIG_PATHSTR: &str = "/etc/hyper/config.toml"; const INSECURE_CONFIG_PATH_ENV: &str = "HYPER_SETUIDHELPER_CONFIG_PATH"; -/// UID/GID allocation band, read from `[uid_gid_range]` in config.toml. -/// Controls which UIDs the helper *accepts* from the BEAM — see module docs. +/// UID/GID allocation band, read from `[jails] uid_gid_range` in config.toml as +/// a two-element `[min, max]` array. Controls which UIDs the helper *accepts* +/// from the BEAM — see module docs. 
#[derive(Debug, Clone, Copy, Deserialize)] +#[serde(from = "[u32; 2]")] pub struct UidGidRange { pub min: u32, pub max: u32, } +impl From<[u32; 2]> for UidGidRange { + fn from([min, max]: [u32; 2]) -> Self { + Self { min, max } + } +} + // Band defaults match Elixir's `compile_env` allocation defaults so that an // unconfigured helper and an unconfigured node agree out of the box. const DEFAULT_UID_GID: (u32, u32) = (900_000, 999_999); @@ -91,12 +99,32 @@ pub struct Config { work_dir: PathBuf, #[serde(default)] tools: Tools, - #[serde(default = "default_parent_cgroup")] - parent_cgroup: String, #[serde(default)] + jails: Jails, +} + +/// The `[jails]` table: how the helper places and confines each VM jail. +/// +/// `cgroup` is the parent cgroup the jailer nests every VM beneath (default +/// `"hyper"`). `uid_gid_range` is the `[min, max]` band of UIDs/GIDs the helper +/// accepts from the BEAM; absent means the built-in default. A missing `[jails]` +/// table, or any missing key within it, falls back to these defaults. +#[derive(Debug, Clone, Deserialize)] +#[serde(default)] +pub struct Jails { + cgroup: String, uid_gid_range: Option<UidGidRange>, } +impl Default for Jails { + fn default() -> Self { + Self { + cgroup: default_parent_cgroup(), + uid_gid_range: None, + } + } +} + /// Paths to the external binaries the helper runs, the `[tools]` table. /// /// The device tools (`dmsetup`, `losetup`, `blockdev`) carry built-in defaults; @@ -156,8 +184,7 @@ impl Default for Config { fn default() -> Self { Self { work_dir: default_work_dir(), tools: Tools::default(), - parent_cgroup: default_parent_cgroup(), - uid_gid_range: None, + jails: Jails::default(), } } } @@ -227,10 +254,10 @@ impl Config { .and_then(|p| SafeBin::from_path(p).map_err(BinError::Bin)) } - /// The jailer `--parent-cgroup` value. Defaults to `"hyper"`, matching the - /// Elixir node's `@parent_cgroup`. + /// The jailer `--parent-cgroup` value, from `[jails] cgroup`. Defaults to + /// `"hyper"`, matching the Elixir node's default. 
pub fn parent_cgroup(&self) -> &str { - &self.parent_cgroup + &self.jails.cgroup } /// The UID/GID band the helper accepts from the BEAM. Defaults to @@ -238,7 +265,8 @@ impl Config { /// A present range with min==0 or min>max is rejected at load time by /// [`Config::safe_load`], so this accessor is always total. pub fn uid_gid_range(&self) -> (u32, u32) { - self.uid_gid_range + self.jails + .uid_gid_range .map(|r| (r.min, r.max)) .unwrap_or(DEFAULT_UID_GID) } @@ -273,7 +301,7 @@ impl Config { return Err(LoadingError::Relative(path)); } - if let Some(r) = &config.uid_gid_range { + if let Some(r) = &config.jails.uid_gid_range { validate_uid_gid_range(r)?; } diff --git a/native/suidhelper/tests/config_uid_gid_range.rs b/native/suidhelper/tests/config_uid_gid_range.rs index 1f176e9f..1a6f4647 100644 --- a/native/suidhelper/tests/config_uid_gid_range.rs +++ b/native/suidhelper/tests/config_uid_gid_range.rs @@ -29,7 +29,7 @@ proptest! { fn valid_range_round_trips_via_toml(min in 1u32.., delta in 0u32..) { let max = min.saturating_add(delta); let body = format!( - "work_dir = \"/srv/hyper\"\n[uid_gid_range]\nmin = {min}\nmax = {max}\n" + "work_dir = \"/srv/hyper\"\n[jails]\nuid_gid_range = [{min}, {max}]\n" ); let config: Config = toml::from_str(&body).expect("valid TOML"); prop_assert_eq!(config.uid_gid_range(), (min, max)); diff --git a/native/suidhelper/tests/e2e/config.rs b/native/suidhelper/tests/e2e/config.rs index f1fbfc77..ea36b4e9 100644 --- a/native/suidhelper/tests/e2e/config.rs +++ b/native/suidhelper/tests/e2e/config.rs @@ -210,7 +210,7 @@ fn bad_uid_gid_range_exits_2_as_root() { // never receive because it skips its privilege drop when uid == 0. 
let p = write_root_config( tmp.path(), - "work_dir = \"/srv/hyper\"\n[uid_gid_range]\nmin = 0\nmax = 100\n", + "work_dir = \"/srv/hyper\"\n[jails]\nuid_gid_range = [0, 100]\n", ); let out = run_with_config(&p, &["sys-test"]); assert_eq!(out.status.code(), Some(2)); From 32d84ccada602618181178e834b9637db06e1ee9 Mon Sep 17 00:00:00 2001 From: Marko Vejnovic Date: Sat, 27 Jun 2026 02:07:46 +0000 Subject: [PATCH 43/46] feat(config): read node tool paths from [tools] skopeo/mke2fs/umoci/suidhelper now read from the [tools] table in /etc/hyper/config.toml (was compile-time/app-env). Adds guard-narrowed tool_path/optional_tool_path helpers; rewrites Umoci.bin/0 as a case. --- lib/hyper/config.ex | 72 +++++++++++++++++++++++-------- lib/hyper/img/oci_loader/umoci.ex | 18 ++++---- 2 files changed, 62 insertions(+), 28 deletions(-) diff --git a/lib/hyper/config.ex b/lib/hyper/config.ex index 994c1fe4..b39ae96b 100644 --- a/lib/hyper/config.ex +++ b/lib/hyper/config.ex @@ -10,8 +10,10 @@ defmodule Hyper.Config do and cached in `:persistent_term`; an absent file (local dev / CI) yields the same built-in defaults the helper compiles in, so both sides still agree. - Node-only settings with no helper counterpart (`skopeo`/`umoci`/`mke2fs` paths, - `vmlinux`, the cluster topology) stay in `config :hyper`. + The node's own tools (`skopeo`, `umoci`, `mke2fs`, `suidhelper`) share that same + `[tools]` table — the helper simply ignores the keys it does not recognise, so + one table serves both. Only `vmlinux` and the cluster topology remain in + `config :hyper`. """ # The shared config file, read by both this node and the setuid helper. 
Absent @@ -25,9 +27,13 @@ defmodule Hyper.Config do @default_parent_cgroup "hyper" @default_uid_gid_range {900_000, 999_999} - @skopeo_path Application.compile_env(:hyper, :skopeo_path, "skopeo") - @umoci_path Application.compile_env(:hyper, :umoci_path, nil) - @mke2fs_path Application.compile_env(:hyper, :mke2fs_path, "mke2fs") + # Defaults for the node-only `[tools]` binaries (skopeo/umoci/mke2fs/suidhelper). + # These bare keys live alongside the helper's own tools in the `[tools]` table; + # the helper ignores the keys it does not recognise, so the two sides share one + # table without colliding. + @default_skopeo "skopeo" + @default_mke2fs "mke2fs" + @default_suid_helper "/usr/local/bin/hyper-suidhelper" @doc """ Root work directory for this node. All firecracker paths derive from it. @@ -96,6 +102,27 @@ defmodule Hyper.Config do end end + # A `[tools]` path with a built-in default: the configured string, or `default` + # when the key is absent (or set to a non-string, treated as unset). The + # `is_binary/1` guard pins the success type to `String.t()` so the public + # accessors stay precisely typed for Dialyzer rather than widening to `any()`. + @spec tool_path(String.t(), String.t()) :: String.t() + defp tool_path(key, default) do + case Map.get(tools(), key) do + path when is_binary(path) -> path + _ -> default + end + end + + # A `[tools]` path with no default: the configured string, or `nil` when unset. + @spec optional_tool_path(String.t()) :: String.t() | nil + defp optional_tool_path(key) do + case Map.get(tools(), key) do + path when is_binary(path) -> path + _ -> nil + end + end + @spec config_toml :: map() defp config_toml do case :persistent_term.get({__MODULE__, :config_toml}, nil) do @@ -185,32 +212,39 @@ defmodule Hyper.Config do @spec layer_dir :: Path.t() def layer_dir, do: Path.join(work_dir(), "layers") - @doc "Path to the skopeo binary (used by `Hyper.Img.OciLoader` to pull OCI images)." 
- def skopeo_path, do: @skopeo_path + @doc """ + Path to the skopeo binary (used by `Hyper.Img.OciLoader` to pull OCI images). + Read from `[tools] skopeo` in `#{@config_path}`, default `#{@default_skopeo}` (on `PATH`). + """ + @spec skopeo_path :: String.t() + def skopeo_path, do: tool_path("skopeo", @default_skopeo) @doc """ Operator-configured path to the umoci binary, or `nil` (the default) to let - `Hyper.Img.OciLoader.Umoci` download and manage a pinned default. + `Hyper.Img.OciLoader.Umoci` download and manage a pinned default. Read from + `[tools] umoci` in `#{@config_path}`. """ - def umoci_path, do: @umoci_path - - @doc "Path to the mke2fs binary (used by `Hyper.Img.OciLoader` to build the ext4 rootfs)." - def mke2fs_path, do: @mke2fs_path + @spec umoci_path :: String.t() | nil + def umoci_path, do: optional_tool_path("umoci") - # Where `cargo xtask install` (via `mix suidhelper.install`) drops the helper. - @default_suid_helper "/usr/local/bin/hyper-suidhelper" + @doc """ + Path to the mke2fs binary (used by `Hyper.Img.OciLoader` to build the ext4 rootfs). + Read from `[tools] mke2fs` in `#{@config_path}`, default `#{@default_mke2fs}` (on `PATH`). + """ + @spec mke2fs_path :: String.t() + def mke2fs_path, do: tool_path("mke2fs", @default_mke2fs) @doc """ Path to the setuid-root device helper (`hyper-suidhelper`). The node runs unprivileged and routes every `losetup`/`dmsetup`/`blockdev` operation through it. - Defaults to `#{@default_suid_helper}`, the install path used by `mix - suidhelper.install`. Runtime config (host-specific), so an operator who - installs it elsewhere can override per node without recompiling. + Read from `[tools] suidhelper` in `#{@config_path}`, default `#{@default_suid_helper}` + (the install path used by `mix suidhelper.install`), so an operator who installs + it elsewhere can override it per node without recompiling. 
""" - @spec suid_helper :: Path.t() - def suid_helper, do: Application.get_env(:hyper, :suid_helper, @default_suid_helper) + @spec suid_helper :: String.t() + def suid_helper, do: tool_path("suidhelper", @default_suid_helper) @doc """ Directory for per-VM scratch (writable-layer COW) files. Must be node-local and diff --git a/lib/hyper/img/oci_loader/umoci.ex b/lib/hyper/img/oci_loader/umoci.ex index 06929b58..dae13585 100644 --- a/lib/hyper/img/oci_loader/umoci.ex +++ b/lib/hyper/img/oci_loader/umoci.ex @@ -5,9 +5,9 @@ defmodule Hyper.Img.OciLoader.Umoci do Two sources, in priority order (mirrors `Hyper.Node.Vmlinux`): - 1. An operator-configured path via `config :hyper, umoci_path: - "/path/to/umoci"` (`Hyper.Config.umoci_path/0`). If set, it wins and is - never downloaded. + 1. An operator-configured path via `[tools] umoci` in + `/etc/hyper/config.toml` (`Hyper.Config.umoci_path/0`). If set, it wins + and is never downloaded. 2. Otherwise the pinned static binary downloaded by `ensure_installed/0` into `Hyper.Config.umoci_install_dir/0` (`/redist/umoci`). """ @@ -57,13 +57,13 @@ defmodule Hyper.Img.OciLoader.Umoci do """ @spec bin() :: Path.t() def bin do - configured = Config.umoci_path() + case Config.umoci_path() do + nil -> + {:ok, arch} = Sys.Arch.current() + default_path(arch) - if configured != nil do - configured - else - {:ok, arch} = Sys.Arch.current() - default_path(arch) + configured -> + configured end end From 9b72969d6ad7348e6c77398d898559f7c57006eb Mon Sep 17 00:00:00 2001 From: Marko Vejnovic Date: Sat, 27 Jun 2026 02:07:46 +0000 Subject: [PATCH 44/46] feat(config): merge optional /etc/hyper/config.exs at runtime runtime.exs merges an optional operator config from /etc/hyper/config.exs (override path via HYPER_CONFIG) last, so its values win. Skipped under :test. 
--- config/runtime.exs | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/config/runtime.exs b/config/runtime.exs index c9ca89fb..45dbba68 100644 --- a/config/runtime.exs +++ b/config/runtime.exs @@ -38,3 +38,18 @@ if config_env() != :test do config :opentelemetry, traces_exporter: :none end end + +# Operator overrides from a well-known location. An optional Elixir config file +# at /etc/hyper/config.exs (override the path with HYPER_CONFIG) is merged in +# last, so its values win over every default set above. An absent file is a +# no-op -- the normal case in dev and CI. Skipped under :test so the suite never +# reads host state. +if config_env() != :test do + hyper_config = System.get_env("HYPER_CONFIG") || "/etc/hyper/config.exs" + + if File.exists?(hyper_config) do + for {app, kw} <- Config.Reader.read!(hyper_config, env: config_env()) do + config app, kw + end + end +end From 46dfd96fd54d90db871e55a60fe614de790a0ec4 Mon Sep 17 00:00:00 2001 From: Marko Vejnovic Date: Sat, 27 Jun 2026 02:07:46 +0000 Subject: [PATCH 45/46] docs: highlight cookbook code blocks via makeup_syntect Adds makeup_syntect + a docs alias step aliasing bash/sh fences to the shell grammar, so toml/bash/sh/python/rust/markdown blocks highlight. --- mix.exs | 38 ++++++++++++++++++++++++++++++++++++++ mix.lock | 3 +++ 2 files changed, 41 insertions(+) diff --git a/mix.exs b/mix.exs index 5c620b81..fe3f48f4 100644 --- a/mix.exs +++ b/mix.exs @@ -71,6 +71,10 @@ defmodule Hyper.MixProject do {:junit_formatter, "~> 3.4", only: :test, runtime: false}, {:dialyxir, "~> 1.4", only: [:dev], runtime: false}, {:ex_doc, "~> 0.34", only: :dev, runtime: false}, + # Syntect-backed Makeup lexer: covers the doc languages that have no + # dedicated Makeup lexer (markdown, toml, bash, sh, python). Elixir/erlang + # still use their native lexers; this fills the rest in one dep. 
+ {:makeup_syntect, "~> 0.1", only: :dev, runtime: false}, {:ecto_sql, "~> 3.13"}, {:grpc, "~> 1.0"}, {:grpc_server, "~> 1.0"}, @@ -109,6 +113,8 @@ defmodule Hyper.MixProject do extras: [ "README.md", "docs/cookbook/intro.md", + "docs/cookbook/install.md", + "docs/cookbook/config.md", "docs/cookbook/architecture.md", "docs/grpc.md" ], @@ -187,6 +193,32 @@ defmodule Hyper.MixProject do end # `mix check` - the strict gate. Runs fast checks first, slow ones (dialyzer) last. + # Give ```bash / ```sh real syntax highlighting in the docs. + # + # Two obstacles, both about *who registers the lexer last*: + # 1. makeup_syntect registers the shell grammar only under its raw syntect + # name "Shell-Unix-Generic" (ExDoc resolves fences by lexer name, not + # file extension), so ```bash / ```sh never reach it. + # 2. ExDoc itself registers a minimal `ExDoc.ShellLexer` for sh/bash/shell/ + # zsh (it only de-selects the `$ ` prompt; everything else is plain text) + # from `ExDoc.Application.start`, which runs during the `docs` task. + # + # So we start :ex_doc and :makeup_syntect here first (idempotent — the later + # `docs` task won't re-run their `start/2`), then register our shell aliases + # LAST so they win. Dev-only; runs as the `docs` alias's first step. + defp register_doc_lexers(_args) do + {:ok, _} = Application.ensure_all_started(:makeup_syntect) + {:ok, _} = Application.ensure_all_started(:ex_doc) + + Makeup.Registry.register_lexer(MakeupSyntect.Lexer, + options: [language: "Shell-Unix-Generic"], + names: ["bash", "sh", "shell", "zsh"], + extensions: [] + ) + + :ok + end + defp aliases do [ check: [ @@ -196,6 +228,12 @@ defmodule Hyper.MixProject do "test --warnings-as-errors", "dialyzer" ], + # makeup_syntect registers the shell grammar only under its raw syntect + # name "Shell-Unix-Generic", and ExDoc resolves fences by lexer *name* + # (not file extension), so ```bash / ```sh would fall back to plain text. 
+ # Alias them to the shell grammar before ExDoc runs (same VM, so the + # registration is visible to the highlighter). + docs: ["loadpaths", &register_doc_lexers/1, "docs"], # Force a regeneration of the Firecracker bindings (ignores staleness). "firecracker.gen": ["compile.firecracker_gen --force"], # Force a regeneration of the gRPC bindings (ignores staleness). diff --git a/mix.lock b/mix.lock index 06780d8a..8012573e 100644 --- a/mix.lock +++ b/mix.lock @@ -1,6 +1,7 @@ %{ "acceptor_pool": {:hex, :acceptor_pool, "1.0.1", "d88c2e8a0be9216cf513fbcd3e5a4beb36bee3ff4168e85d6152c6f899359cdb", [:rebar3], [], "hexpm", "f172f3d74513e8edd445c257d596fc84dbdd56d2c6fa287434269648ae5a421e"}, "bunt": {:hex, :bunt, "1.0.0", "081c2c665f086849e6d57900292b3a161727ab40431219529f13c4ddcf3e7a44", [:mix], [], "hexpm", "dc5f86aa08a5f6fa6b8096f0735c4e76d54ae5c9fa2c143e5a1fc7c1cd9bb6b5"}, + "castore": {:hex, :castore, "1.0.19", "6903cabdfd9d1af46454126e7c8385186659dd33ecfb74a885cae52221ad6109", [:mix], [], "hexpm", "3669e6cab13f54c2df26b3e6833745d647f35b6e30d8ddd5975df0d5c842ca98"}, "chatterbox": {:hex, :ts_chatterbox, "0.15.1", "5cac4d15dd7ad61fc3c4415ce4826fc563d4643dee897a558ec4ea0b1c835c9c", [:rebar3], [{:hpack, "~> 0.3.0", [hex: :hpack_erl, repo: "hexpm", optional: false]}], "hexpm", "4f75b91451338bc0da5f52f3480fa6ef6e3a2aeecfc33686d6b3d0a0948f31aa"}, "cowboy": {:hex, :cowboy, "2.16.1", "fa04080b602ff25c40a7700f2dc0152dbc1ba26b42093ae0fa9bb7a337d5a242", [:make, :rebar3], [{:cowlib, ">= 2.16.0 and < 3.0.0", [hex: :cowlib, repo: "hexpm", optional: false]}, {:ranch, ">= 1.8.0 and < 3.0.0", [hex: :ranch, repo: "hexpm", optional: false]}], "hexpm", "b8ea4dd317a043e3177ec840cfa3bcb47cfb41035d3abb24d954dc7d51def399"}, "cowlib": {:hex, :cowlib, "2.17.1", "3e6053016d1ab245730f0af688755476dcedb1c25ed8fb5751f59a2bfdc0c9af", [:make, :rebar3], [], "hexpm", "ff08bd17e6dd931445b18af77315b9b5fe052407110964ad2588c686b57b5e3f"}, @@ -38,6 +39,7 @@ "makeup": {:hex, :makeup, "1.2.1", 
"e90ac1c65589ef354378def3ba19d401e739ee7ee06fb47f94c687016e3713d1", [:mix], [{:nimble_parsec, "~> 1.4", [hex: :nimble_parsec, repo: "hexpm", optional: false]}], "hexpm", "d36484867b0bae0fea568d10131197a4c2e47056a6fbe84922bf6ba71c8d17ce"}, "makeup_elixir": {:hex, :makeup_elixir, "1.0.1", "e928a4f984e795e41e3abd27bfc09f51db16ab8ba1aebdba2b3a575437efafc2", [:mix], [{:makeup, "~> 1.0", [hex: :makeup, repo: "hexpm", optional: false]}, {:nimble_parsec, "~> 1.2.3 or ~> 1.3", [hex: :nimble_parsec, repo: "hexpm", optional: false]}], "hexpm", "7284900d412a3e5cfd97fdaed4f5ed389b8f2b4cb49efc0eb3bd10e2febf9507"}, "makeup_erlang": {:hex, :makeup_erlang, "1.1.0", "835f7e60792e08824cda445639555d7bf1bbbddb1b60b306e33cb6f6db24dc74", [:mix], [{:makeup, "~> 1.0", [hex: :makeup, repo: "hexpm", optional: false]}], "hexpm", "1cd6780fb1dd1a03979abaed0fe82712b0625118fd5257d3ebbf73f960c73c3c"}, + "makeup_syntect": {:hex, :makeup_syntect, "0.1.4", "e1230c9e0513c667b226b21c83eb182e1ab581f65af9441edab1f9ac626acba6", [:mix], [{:makeup, "~> 1.2", [hex: :makeup, repo: "hexpm", optional: false]}, {:rustler, "~> 0.37.1", [hex: :rustler, repo: "hexpm", optional: true]}, {:rustler_precompiled, "~> 0.8.2", [hex: :rustler_precompiled, repo: "hexpm", optional: false]}], "hexpm", "5b624a434d9665786b9a352a5f3502b6c98e1996ede9936b20035ec140daef70"}, "merkle_map": {:hex, :merkle_map, "0.2.2", "f36ff730cca1f2658e317a3c73406f50bbf5ac8aff54cf837d7ca2069a6e251c", [:mix], [], "hexpm", "383107f0503f230ac9175e0631647c424efd027e89ea65ab5ea12eeb54257aaf"}, "mime": {:hex, :mime, "2.0.7", "b8d739037be7cd402aee1ba0306edfdef982687ee7e9859bee6198c1e7e2f128", [:mix], [], "hexpm", "6171188e399ee16023ffc5b76ce445eb6d9672e2e241d2df6050f3c771e80ccd"}, "mint": {:hex, :mint, "1.9.0", "d6f534c2a3e98b2a8cc749b4796eb77e9e3af79a76f96e4c74035a827de0d318", [:mix], [{:castore, "~> 0.1.0 or ~> 1.0", [hex: :castore, repo: "hexpm", optional: true]}, {:hpax, "~> 0.1.1 or ~> 0.2.0 or ~> 1.0", [hex: :hpax, repo: "hexpm", optional: false]}], 
"hexpm", "007154c7d8c43916aed3c93afd1f11aebbaa9c5ff4b7ba55ebe0d17ee0296042"}, @@ -57,6 +59,7 @@ "protobuf": {:hex, :protobuf, "0.17.0", "39e24e43c9648e148feba16ed51100b5b2028ea900b55460377b0476f6e10613", [:mix], [{:jason, "~> 1.2", [hex: :jason, repo: "hexpm", optional: true]}], "hexpm", "ca6c91f6f63e2c147b47f03eefd10b80538aa6fc55ff4b12b795efb786b0152f"}, "ranch": {:hex, :ranch, "2.2.0", "25528f82bc8d7c6152c57666ca99ec716510fe0925cb188172f41ce93117b1b0", [:make, :rebar3], [], "hexpm", "fa0b99a1780c80218a4197a59ea8d3bdae32fbff7e88527d7d8a4787eff4f8e7"}, "req": {:hex, :req, "0.6.2", "b9b2024f35bcf60a92cc8cad2eaaf9d4e7aace463ff74be1afe5986830184413", [:mix], [{:brotli, "~> 0.3.1", [hex: :brotli, repo: "hexpm", optional: true]}, {:ezstd, "~> 1.0", [hex: :ezstd, repo: "hexpm", optional: true]}, {:finch, "~> 0.21", [hex: :finch, repo: "hexpm", optional: false]}, {:jason, "~> 1.0", [hex: :jason, repo: "hexpm", optional: false]}, {:mime, "~> 2.0.6 or ~> 2.1", [hex: :mime, repo: "hexpm", optional: false]}, {:nimble_csv, "~> 1.0", [hex: :nimble_csv, repo: "hexpm", optional: true]}, {:plug, "~> 1.0", [hex: :plug, repo: "hexpm", optional: true]}], "hexpm", "cc9cd30a2ddd04989929b887178e1610c940456d962c6c3a52df6146d2eef9bf"}, + "rustler_precompiled": {:hex, :rustler_precompiled, "0.8.4", "700a878312acfac79fb6c572bb8b57f5aae05fe1cf70d34b5974850bbf2c05bf", [:mix], [{:castore, "~> 0.1 or ~> 1.0", [hex: :castore, repo: "hexpm", optional: false]}, {:rustler, "~> 0.23", [hex: :rustler, repo: "hexpm", optional: true]}], "hexpm", "3b33d99b540b15f142ba47944f7a163a25069f6d608783c321029bc1ffb09514"}, "ssl_verify_fun": {:hex, :ssl_verify_fun, "1.1.7", "354c321cf377240c7b8716899e182ce4890c5938111a1296add3ec74cf1715df", [:make, :mix, :rebar3], [], "hexpm", "fe4c190e8f37401d30167c8c405eda19469f34577987c76dde613e838bbc67f8"}, "stream_data": {:hex, :stream_data, "1.3.0", "bde37905530aff386dea1ddd86ecbf00e6642dc074ceffc10b7d4e41dfd6aac9", [:mix], [], "hexpm", 
"3cc552e286e817dca43c98044c706eec9318083a1480c52ae2688b08e2936e3c"}, "telemetry": {:hex, :telemetry, "1.4.2", "a0cb522801dffb1c49fe6e30561badffc7b6d0e180db1300df759faa22062855", [:rebar3], [], "hexpm", "928f6495066506077862c0d1646609eed891a4326bee3126ba54b60af61febb1"}, From ed6ad5d180f8a0a95fa04083906c9f47f9540a6e Mon Sep 17 00:00:00 2001 From: Marko Vejnovic Date: Sat, 27 Jun 2026 02:07:46 +0000 Subject: [PATCH 46/46] docs(cookbook): document [tools]/[jails], node tools, user config --- docs/cookbook/config.md | 278 +++++++++++++++++++++++++++++++++++++++ docs/cookbook/install.md | 248 ++++++++++++++++++++++++++++++++++ docs/cookbook/intro.md | 210 ++--------------------------- 3 files changed, 539 insertions(+), 197 deletions(-) create mode 100644 docs/cookbook/config.md create mode 100644 docs/cookbook/install.md diff --git a/docs/cookbook/config.md b/docs/cookbook/config.md new file mode 100644 index 00000000..cf59b7cf --- /dev/null +++ b/docs/cookbook/config.md @@ -0,0 +1,278 @@ +# Configuration + +Configuring `Hyper` is done through four layers, in priority: + + 1. Runtime `/etc/hyper/config.exs` is the canonical Elixir way to configure + the system. This allows you to inject arbitrary code to configure `Hyper`. + 2. `Hyper` will fall back to reading `/etc/hyper/config.toml` at runtime, on + bootup, on each node. + 3. `Hyper` will use its compile-time configuration through `config.ex`. + 4. `Hyper` will use defaults. + +**Note that not all layers allow all configuration fields to be tweaked.** This +is usually done for security. + +## Configuration Files + +### `/etc/hyper/config.exs` + +The `config.exs` file is exlusively used by the unprivileged `hyper` +application. The purpose of this file is to allow you to load configuration +values at runtime. If you are using a secrets manager, this is the right place +to load the secrets. + +### `/etc/hyper/config.toml` + +The `/etc/hyper/config.toml` file is used for static configuration. 
Unlike +`config.exs`, it is used by both `Hyper` and `hyper-suidhelper`, which means +that it can impact the behavior of a process running under `root`. + +### Compile-Time Config + +The compile-time configuration is generally used to fine-tune the performance +of Hyper. You likely do not need to edit most of the configuration fields +exposed by this file for day-to-day usage, but they are available for you to +tweak. + +## Configuration Fields + +### Tool Configuration + +Hyper relies on a large number of external tools, all configured under the +`[tools]` table in `/etc/hyper/config.toml`. + +#### Privileged tools (run by the setuid helper) + +| Tool | Required | Default | `/etc/hyper/config.toml` | +|------|----------|---------|--------------------------| +| `firecracker` | Yes | - | `tools.firecracker` | +| `jailer` | Yes | - | `tools.jailer` | +| `dmsetup` | No | `/usr/sbin/dmsetup` | `tools.dmsetup` | +| `losetup` | No | `/usr/sbin/losetup` | `tools.losetup` | +| `blockdev` | No | `/usr/sbin/blockdev` | `tools.blockdev` | + +> #### Requirements {: .info} +> +> - These paths **can only** be configured through `/etc/hyper/config.toml`. +> Both `Hyper` and `hyper-suidhelper` rely on these paths being identical. +> +> - The paths **must** be given as absolute paths. +> - The basename **must** match the configuration key, e.g. `firecracker` must have +> a path such as `/foo/bar/firecracker`. +> - The tools must be owned by the `root` user. +> - The tools must be writable exclusively by `root`. 
+ +#### Node tools (run by the unprivileged node) + +| Tool | Required | Default | `/etc/hyper/config.toml` | +|------|----------|---------|--------------------------| +| `skopeo` | No | `skopeo` (on `PATH`) | `tools.skopeo` | +| `mke2fs` | No | `mke2fs` (on `PATH`) | `tools.mke2fs` | +| `umoci` | No | downloaded + cached under `/redist/umoci` | `tools.umoci` | +| `suidhelper` | No | `/usr/local/bin/hyper-suidhelper` | `tools.suidhelper` | + +> #### These are not privileged {: .info} +> +> The node runs these directly as the unprivileged hyper user, so — unlike the +> privileged tools above — they carry **no** root-ownership or basename +> requirement. `skopeo`/`mke2fs` default to the bare name resolved on `PATH`; +> leave `umoci` unset to let Hyper download and cache a pinned release. They +> share the one `[tools]` table with the privileged binaries — the helper simply +> ignores the keys it does not own. + +## The shared file: `/etc/hyper/config.toml` + +> #### Security {: .error} +> +> This file **must** be owned by `root` and be neither group- nor +> world-writable (e.g. mode `0644`). The setuid helper refuses to start +> otherwise — a present-but-untrusted file is treated as operator +> misconfiguration and is fatal (exit `2`), never silently ignored. + +### Root keys + +| Key | Type | Default | Meaning | +|-----|------|---------|---------| +| `work_dir` | string (absolute path) | `/srv/hyper` | Root of all node-local runtime state. Every other directory is derived from it. Must be an absolute path. Strongly recommended to sit on an NVMe drive. 
| + +The following directories are derived from `work_dir` and are **not** +independently configurable: + +| Path | Purpose | +|------|---------| +| `/jails` | Per-VM chroot directories | +| `/socks` | Per-VM control/gRPC sockets | +| `/scratch` | Per-VM copy-on-write writable layers | +| `/layers` | Read-only image layer store | +| `/redist` | Node-downloaded binaries (`vmlinux`, `umoci`) | + +### `[jails]` — confinement + +| Key | Type | Default | Meaning | +|-----|------|---------|---------| +| `cgroup` | string | `"hyper"` | Parent cgroup under which every VM cgroup is nested (passed to the jailer as `--parent-cgroup`). The operator must create `/sys/fs/cgroup/` and enable subtree control. | +| `uid_gid_range` | `[min, max]` | `[900000, 999999]` | UID/GID band each VM jail is allocated from. `min` must be `>= 1` and `<= max`; `min = 0` is rejected (uid 0 is root, and the jailer skips its privilege drop for uid 0). | + +> #### `uid_gid_range` is enforced on both sides {: .warning} +> +> The node only hands out UIDs in this range, and the helper only *accepts* +> UIDs in this range. Because both read the same file, narrowing the band is a +> single edit here — no second place to keep in sync. Nothing else on the host +> may use UIDs/GIDs in this range. + +### Complete example + +```toml +# Root of all node-local state. Strongly prefer an NVMe-backed mount. +work_dir = "/srv/hyper" + +# External binaries. The privileged ones (firecracker..blockdev) must be +# root-owned, not group/world-writable, absolute, and named exactly as their +# key; the node tools (skopeo/mke2fs/umoci/suidhelper) have no such requirement. 
+[tools] +firecracker = "/opt/firecracker/firecracker" # required; basename must be 'firecracker' +jailer = "/opt/firecracker/jailer" # required; basename must be 'jailer' +# dmsetup = "/usr/sbin/dmsetup" # optional (default shown) +# losetup = "/usr/sbin/losetup" # optional (default shown) +# blockdev = "/usr/sbin/blockdev" # optional (default shown) +# skopeo = "skopeo" # optional node tool (default shown) +# mke2fs = "mke2fs" # optional node tool (default shown) +# umoci = "/usr/bin/umoci" # optional; omit to auto-download +# suidhelper = "/usr/local/bin/hyper-suidhelper" # optional (default shown) + +[jails] +cgroup = "hyper" # default +uid_gid_range = [900000, 999999] # default +``` + +The minimal file is just `work_dir` plus the two required tools — everything +else defaults. + +## Node-only configuration (`config :hyper`) + +These have no helper counterpart and stay in `config :hyper`. The node's tool +paths (`skopeo`, `mke2fs`, `umoci`, `suidhelper`) used to live here but now read +from the `[tools]` table above — see [Tool Configuration](#tool-configuration). + +### Guest kernels + +| Key | Where read | Type | Default | Meaning | +|-----|-----------|------|---------|---------| +| `vmlinux` | runtime | `%{arch => path}` | `%{}` | Per-architecture guest kernel images, keyed by `Sys.Arch.t()`. The operator places kernels on the host and points these at them. | + +```elixir +config :hyper, + vmlinux: %{x86_64: "/srv/hyper/redist/vmlinux/vmlinux-x86_64"} +``` + +### Resource budget — `Hyper.Node.Config.Budget` + +The per-node resource budget. **Required**: the node refuses to boot if it is +absent. Set it in `config/runtime.exs`. Use the `Unit.*` quantities, never bare +numbers. + +| Key | Type | Meaning | +|-----|------|---------| +| `mem_max` | `Unit.Information.t()` | Hard memory cap for this node. | +| `disk_max` | `Unit.Information.t()` | Hard disk cap for this node. 
| +| `cpu_max_load` | float `0.0..1.0` | CPU-utilization fraction above which the node is considered full. | +| `disk_bw_cap` | `Unit.Bandwidth.t()` | Absolute disk throughput capacity. | +| `disk_bw_max_load` | float `0.0..1.0` | Fraction of `disk_bw_cap` past which disk is saturated. | +| `net_bw_cap` | `Unit.Bandwidth.t()` | Absolute network throughput capacity. | +| `net_bw_max_load` | float `0.0..1.0` | Fraction of `net_bw_cap` past which network is saturated. | + +```elixir +config :hyper, Hyper.Node.Config.Budget, + mem_max: Unit.Information.gib(4), + disk_max: Unit.Information.gib(4), + cpu_max_load: 0.8, + disk_bw_cap: Unit.Bandwidth.gibps(1), + disk_bw_max_load: 0.8, + net_bw_cap: Unit.Bandwidth.gibps(1), + net_bw_max_load: 0.8 +``` + +### gRPC server — `Hyper.Grpc.Config` + +The public gRPC interface. **Disabled by default.** + +| Key | Type | Default | Meaning | +|-----|------|---------|---------| +| `enabled` | boolean | `false` | Whether the server starts. | +| `port` | port number | `50051` | Listen port. | +| `cred` | `GRPC.Credential.t()` \| `nil` | `nil` | TLS credential, or `nil` for plaintext. | +| `adapter_opts` | keyword | `[]` | Forwarded to the server adapter, e.g. `[ip: {0, 0, 0, 0}]`. | + +```elixir +config :hyper, Hyper.Grpc.Config, + enabled: true, + port: 50_051, + cred: GRPC.Credential.new(ssl: [certfile: "/path/cert.pem", keyfile: "/path/key.pem"]) +``` + +> #### Co-located nodes {: .info} +> +> Every node binds `:port`. Running multiple nodes on one host requires giving +> each a distinct port. Build the TLS credential where you load your keys +> (e.g. `config/runtime.exs`); Hyper never reads the filesystem on your behalf. + +### Layer garbage collector — `Hyper.Img.Db.Gc.Config` + +A cluster-wide singleton that prunes unreferenced image layers. Every field has +a default; set only what you change. Durations are `Unit.Time` values, so +overrides belong in `config/runtime.exs`. Set `enabled: false` to never start it. 
+ +| Key | Type | Default | Meaning | +|-----|------|---------|---------| +| `enabled` | boolean | `true` | Run the collector at all. | +| `batch_size` | `pos_integer` | `200` | Rows per keyset page (smaller = finer pause granularity). | +| `batch_pause` | `Unit.Time.t()` | `100ms` | Pause between pages within a sweep. | +| `sweep_interval` | `Unit.Time.t()` | `60s` | Rest between completed sweeps. | +| `acquire_interval` | `Unit.Time.t()` | `5s` | How often a standby retries to become the active singleton. | +| `retry` | `Unit.Time.t()` | `60s` | Backoff when the medium or DB is unavailable. | +| `statement_timeout` | `Unit.Time.t()` | `5s` | Cap on each GC DB statement so it can't pin a backend. | +| `grace_period` | `Unit.Time.t()` | `1h` | Never prune a blob younger than this (protects a row whose file is still being published). | + +```elixir +config :hyper, Hyper.Img.Db.Gc.Config, + enabled: true, + sweep_interval: Unit.Time.s(30), + grace_period: Unit.Time.s(60 * 60) +``` + +### Orphaned-resource reaper — `Hyper.Node.Reaper.Config` + +A per-node sweeper that reclaims orphaned firecracker cgroups and `hyper-rw-*` +device-mapper volumes left behind by unclean BEAM deaths. Uses a two-strike +grace period (an orphan must be seen on two consecutive ticks before it is +reaped). Set `enabled: false` to never start it. + +| Key | Type | Default | Meaning | +|-----|------|---------|---------| +| `enabled` | boolean | `true` | Run the reaper at all. | +| `interval` | `Unit.Time.t()` | `60s` | Rest between reap ticks. | + +```elixir +config :hyper, Hyper.Node.Reaper.Config, + enabled: true, + interval: Unit.Time.s(30) +``` + +### Telemetry (OpenTelemetry) + +Tracing is configured in `config/runtime.exs` from environment variables: + +| Variable | Effect | +|----------|--------| +| `HONEYCOMB_API_KEY` | Export to `https://api.honeycomb.io` with this key. | +| `OTEL_EXPORTER_OTLP_ENDPOINT` | If `HONEYCOMB_API_KEY` is unset, export to this OTLP/HTTP endpoint (e.g. 
a local Collector), no auth header. | + +If neither is set, tracing is disabled. + +### Database and cluster topology + +The image-metadata database (`Hyper.Img.Db.Repo`, a standard Ecto/PostgreSQL +repo) and the cluster topology (`:libcluster`) are configured in +`config/config.exs` like any Elixir app. PostgreSQL is a required runtime +dependency: the node will not boot without a reachable instance. See +[Installation](install.md) for connection setup. diff --git a/docs/cookbook/install.md b/docs/cookbook/install.md new file mode 100644 index 00000000..8b22c877 --- /dev/null +++ b/docs/cookbook/install.md @@ -0,0 +1,248 @@ +# Quick Start + +This document provides the quickest start available to get Hyper running. + +## Configuration + +Before you can use `Hyper`, you must do a large amount of configuration. The +following guide must be applied on every node you run `Hyper` on. + +Before proceeding, ensure you meet all of these hard requirements: + +| Requirement | Test | +|-------------|------| +| [KVM](https://linux-kvm.org/page/Main_Page) available | `stat /dev/kvm` exits with status 0. | +| You have root access through `sudo`. | - | +| Your machine has cgroup v2 | `stat -fc %T /sys/fs/cgroup` prints `cgroup2fs`. | + +### OS Packages + + + +### Ubuntu + +You can install the required packages by running: + +```sh +sudo apt update && sudo apt install -y \ + coreutils \ + e2fsprogs \ + libc-bin \ + linux-modules-extra-$(uname -r) \ + lvm2 \ + skopeo \ + util-linux +``` + +### Rocky + +You can install the required packages by running: + +```sh +sudo dnf install -y \ + coreutils \ + e2fsprogs \ + glibc-common \ + kernel-modules-extra-$(uname -r) \ + lvm2 \ + skopeo \ + util-linux +``` + +> #### Untested {: .warning} +> +> Rocky has not been tested, but should work. + + + +### Device Mapper Config + +Hyper relies on `dm-snapshot` and `dm-thin` to build COW filesystems. 
Load the +modules and confirm the targets are present: + +```sh +sudo modprobe dm_snapshot dm_thin_pool loop +sudo dmsetup targets # must list snapshot, thin, and thin-pool +``` + +> #### Persistent Config {: .warning} +> +> Loading modules via `modprobe` is ephemeral and will be reset on next boot. +> To make your config persistent: +> +> ```sh +> printf 'dm_snapshot\ndm_thin_pool\nloop\n' \ +> | sudo tee /etc/modules-load.d/hyper.conf +> ``` + +### PostgreSQL + +Hyper needs a **PostgreSQL** server reachable from every node - it is the image +database and the only stateful external dependency. + +For local development the quickest path is Docker. The connection details below +match the defaults in `config/config.exs` (`Hyper.Img.Db.Repo`): + +```sh +docker run -d --name hyper-pg \ + -e POSTGRES_USER=postgres \ + -e POSTGRES_PASSWORD=postgres \ + -e POSTGRES_DB=hyper_dev \ + -p 5432:5432 \ + postgres:16 +``` + +> #### Persistence {: .warning} +> +> Note that the example container should not be used in production -- it will +> not come back up on its own after a reboot. +> +> We strongly suggest using a managed PostgreSQL instance. The following +> commonly used options are available: +> +> - [AWS RDS](https://aws.amazon.com/rds/postgresql/) if you're in the AWS +> ecosystem. +> - [GCP CloudSQL](https://cloud.google.com/sql) if you're in the GCP +> ecosystem. +> +> The author uses GCP. + +### Configuration + +It is mandatory that you create an `/etc/hyper/config.toml` file on every node. +A reasonable starting point is: + +```toml +# The working directory for hyper. Hyper will create a directory tree in this +# directory and running images, sockets and scratch space will be created in +# this directory. We **strongly** encourage this be mounted on an NVMe drive. +work_dir = "/srv/hyper" + +# Paths to every external binary hyper uses. All paths must be absolute. 
+# +# The privileged binaries the setuid helper runs (firecracker, jailer, dmsetup, +# losetup, blockdev) must be root-owned and not group/world writable -- the +# helper refuses them otherwise. The node-run tools (skopeo, umoci, mke2fs) have +# no such requirement. +[tools] +# **required**. basename **must** be 'firecracker'. +firecracker = "/opt/firecracker/firecracker" + +# **required**. basename **must** be 'jailer'. +jailer = "/opt/firecracker/jailer" + +# optional -- privileged device tools, default to /usr/sbin/. +# dmsetup = "/usr/sbin/dmsetup" +# losetup = "/usr/sbin/losetup" +# blockdev = "/usr/sbin/blockdev" + +# optional -- node-run tools. skopeo/mke2fs default to the name on PATH; omit +# umoci to let hyper download and cache a pinned release. +# skopeo = "skopeo" +# mke2fs = "mke2fs" +# umoci = "/usr/bin/umoci" +# suidhelper = "/usr/local/bin/hyper-suidhelper" + +[jails] +# The valid range of user/group IDs in which new VMs will be spawned. Hyper +# will create new VM jails for each VM within the given range. +uid_gid_range = [900000, 999999] +# optional +cgroup = "hyper" +``` + +> #### Security {: .error} +> +> This file **must** be owned by `root`, not group and not world writable. +> `Hyper` will refuse to boot otherwise. + +For more details on configuring and tuning Hyper, we suggest you see the +[configuration guide](config.md). + +### Cgroups + +Hyper uses cgroups to impose limits on each VM. Each VM has its own cgroup, +which is spawned ephemerally, for the lifetime of the VM. These cgroups are all +managed by a parent cgroup which you must create. 
You can name this cgroup +whatever you like, as long as it matches the `jails.cgroup` value in +`/etc/hyper/config.toml`: + +```sh +sudo mkdir -p /sys/fs/cgroup/hyper +``` + +You must enable the `cpu` and `memory` controllers on the subtree: + +```sh +echo '+cpu +memory' | sudo tee /sys/fs/cgroup/hyper/cgroup.subtree_control +``` + +> #### Security {: .error} +> +> Note that Hyper does not manage the `cgroup` with its user -- instead it +> delegates to `hyper-suidhelper`, which is why `/sys/fs/cgroup/hyper` should +> be `root:root` owned. + +> #### Persistence {: .warning} +> +> The configuration, as given, will not survive reboots. To persist it, you can +> use `systemd-tmpfiles`: +> +> ```sh +> echo 'd /sys/fs/cgroup/hyper 0755 root root -' \ +> | sudo tee /etc/tmpfiles.d/hyper-cgroup.conf +> ``` + +### User Configuration + +Hyper must **not** run as `root`, and you should not run it as your login user +either. Instead, give it a dedicated, unprivileged system user. The BEAM runs +as this user; every operation that genuinely needs root is routed through the +setuid helper (see [SUID Helper](#suid-helper)), so the node itself never holds +privilege. + +Create the user as a system account with no login shell: + +```sh +sudo useradd --system --shell /usr/sbin/nologin --home-dir /srv/hyper hyper +``` + +Start Hyper as this user (for example `sudo -u hyper ...`, or `User=hyper` in a +systemd unit). The rest of this section covers the few permissions it needs, +and the ones it deliberately does **not**. + +#### Working directory + +The node builds its entire on-disk tree (`jails`, `socks`, `scratch`, `layers`, +`redist`) under `work_dir` (from `/etc/hyper/config.toml`, default `/srv/hyper`) +**as this user**. It must therefore own that directory: + +```sh +sudo mkdir -p /srv/hyper +sudo chown hyper:hyper /srv/hyper +``` + +## Installation + +### SUID Helper + +Hyper does not run as `root`. Running Hyper as root is considered unsafe and an +anti-pattern. 
Unfortunately, Hyper needs root for certain classes of system +operations. This is achieved through a side-car binary called +`hyper-suidhelper`, which you must install manually. + +> #### Versioning {: .warning} +> +> The `hyper-suidhelper` binary is versioned together with +> `Hyper`: if the `hyper-suidhelper` and `Hyper` versions do not match, +> Hyper will fail to boot. + +Installing this binary can be done by downloading it from the [GitHub Releases +page](https://github.com/harmont-dev/hyper/releases) and executing: + +```sh +sudo install -o root -g root -m 4755 \ + path/to/downloaded/hyper-suidhelper \ + /usr/local/bin/hyper-suidhelper +``` + diff --git a/docs/cookbook/intro.md b/docs/cookbook/intro.md index 57aa4ff0..04500feb 100644 --- a/docs/cookbook/intro.md +++ b/docs/cookbook/intro.md @@ -13,183 +13,6 @@ of the aforementioned systems. The absolute best way to understand `Hyper` and how it works is to play around with it. -## Getting Started - -The absolute best way to get started with `Hyper` is to play with it. - -### Requirements - -#### External services - -Hyper needs a **PostgreSQL** server reachable from every node - it is the image -database and the only stateful external dependency. - -For local development the quickest path is Docker. The connection details below -match the defaults in `config/config.exs` (`Hyper.Img.Db.Repo`): - -```sh -docker run -d --name hyper-pg \ - -e POSTGRES_USER=postgres \ - -e POSTGRES_PASSWORD=postgres \ - -e POSTGRES_DB=hyper_dev \ - -p 5432:5432 \ - postgres:16 -``` - -Once it is up, create and migrate the schema (the repo is not in `ecto_repos`, -so pass it with `-r`): - -```sh -mix ecto.create -r Hyper.Img.Db.Repo -mix ecto.migrate -r Hyper.Img.Db.Repo -``` - -The container is ephemeral; `docker start hyper-pg` brings it back after a -reboot. To point Hyper at an existing server instead, override the -`Hyper.Img.Db.Repo` block in your `config.exs`. 
- -#### System binaries - -These are used by the unprivileged node directly; each must be on the node's -`PATH` (the bracketed override is the `config :hyper` key you can set if the -binary lives elsewhere): - - - [`skopeo`](https://github.com/containers/skopeo) - pulls OCI images - (`skopeo_path`) - - [`e2fsprogs`](https://github.com/tytso/e2fsprogs) - provides `mke2fs`, which - builds the ext4 rootfs (`mke2fs_path`) - - `du`, `getent` (from **coreutils** and **glibc**) - rootfs sizing and user - resolution. Present on essentially every distro. - -The privileged device binaries - `losetup`, `blockdev` (from **util-linux**) -and `dmsetup` (from **lvm2** / device-mapper) - are run only by the setuid -helper, never named by the unprivileged caller. Their paths therefore live in -the helper's own config, `/etc/hyper/config.toml`, and default to -`/usr/sbin/{losetup,blockdev,dmsetup}`. - -**The config file must exist** to set `firecracker` and `jailer` (no built-in -defaults for those). The device-tool paths (`dmsetup`, `losetup`, `blockdev`) -and `work_dir` do have built-in defaults, so if you only need those defaults -and are not running VMs you may omit the file entirely. When the file is -present it must be root-owned and not group/other-writable, or the helper -refuses to start (a present-but-untrusted file is treated as an attack signal, -unlike a missing one): - -```toml -# /etc/hyper/config.toml (root-owned, mode 0644) -work_dir = "/srv/hyper" - -# REQUIRED - no default. Each must be an absolute path to a root-owned, -# non-group/world-writable binary named exactly "firecracker" or "jailer" -# (the helper validates the basename). Run `mix firecracker.install` to -# download the pinned release and print these values. -firecracker = "/opt/firecracker/firecracker" -jailer = "/opt/firecracker/jailer" - -# Optional device-tool overrides; default to /usr/sbin/{dmsetup,losetup,blockdev}. -# Each must be root-owned and not group/world-writable. 
-dmsetup = "/usr/sbin/dmsetup" -losetup = "/usr/sbin/losetup" -blockdev = "/usr/sbin/blockdev" - -# Optional. Governs which uid/gid values the helper accepts when launching the -# jailer. Must satisfy min > 0 and min <= max. Defaults to {900000, 999999}. -# If you narrow this range, set the same bounds in `config :hyper, uid_gid_range:` -# so the node hands out only uids the helper will accept. -[uid_gid_range] -min = 900000 -max = 999999 -``` - -`dmsetup` (lvm2) is frequently *not* installed by default - check that one -first. - -#### Kernel features - -The host kernel must provide: - - - **KVM** - `/dev/kvm` must exist and be accessible to the per-VM users (see - the `uid_gid_range` configuration). - - **cgroup v2** - the unified hierarchy mounted at `/sys/fs/cgroup`. v1-only - hosts are not supported. - - **device-mapper targets** `snapshot`, `thin`, and `thin-pool` - from the - `dm_snapshot` (provides `snapshot`) and `dm_thin_pool` (provides `thin` and - `thin-pool`) modules. Hyper refuses to start without all three; on boot it - fails with `{:missing_dm_targets, [...]}` listing whichever are absent. - - **loop devices** - the `loop` module, used to attach layer images as block - devices. - -Load the modules and confirm the targets are present: - -```sh -sudo modprobe dm_snapshot dm_thin_pool loop -sudo dmsetup targets # must list snapshot, thin, and thin-pool -``` - -If `modprobe` reports the module is missing, the running kernel lacks it - -minimal cloud images often strip device-mapper. On Debian/Ubuntu, install the -extra modules for the running kernel, then load them: - -```sh -sudo apt-get install -y linux-modules-extra-$(uname -r) -sudo modprobe dm_snapshot dm_thin_pool loop -``` - -Make the modules load on every boot: - -```sh -printf 'dm_snapshot\ndm_thin_pool\nloop\n' | sudo tee /etc/modules-load.d/hyper.conf -``` - -#### Privileged setup - - - The **setuid-root device helper** (`hyper-suidhelper`) must be installed. 
- Run `mix suidhelper.install`, which builds, stamps, and places it - setuid-root on `PATH`. Every privileged operation (losetup, dmsetup, mknod, - chroot jails) routes through it; the BEAM itself runs unprivileged. - - The final `sudo install` step runs without a controlling terminal (Mix - captures the nested `cargo` output), so on a typical `tty_tickets` sudo - setup it cannot prompt for a password. If it fails, the build has already - stamped the binary -- just run the copy yourself: - - ```sh - sudo install -o root -g root -m 4755 \ - native/suidhelper/target/release/hyper-suidhelper \ - /usr/local/bin/hyper-suidhelper - ``` - - A **parent cgroup** named by `cgroup_parent` (default `hyper`) must exist - under the cgroup-v2 hierarchy; Hyper creates each VM's cgroup beneath it and - fails to boot with `:missing_parent_cgroup` if it is absent. Create it and - delegate the `cpu` and `memory` controllers so the per-VM cgroups can set - `cpu.max` / `memory.max`: - - ```sh - sudo mkdir -p /sys/fs/cgroup/hyper - echo '+cpu +memory' | sudo tee /sys/fs/cgroup/hyper/cgroup.subtree_control - ``` - - If that last write errors, the root hierarchy is not delegating those - controllers down yet - enable them there first, then retry the line above: - - ```sh - echo '+cpu +memory' | sudo tee /sys/fs/cgroup/cgroup.subtree_control - ``` - - The cgroup hierarchy is memory-backed, so `/sys/fs/cgroup/hyper` does **not** - survive a reboot. Re-create it each boot, or persist it with - `systemd-tmpfiles`: - - ```sh - echo 'd /sys/fs/cgroup/hyper 0755 root root -' \ - | sudo tee /etc/tmpfiles.d/hyper-cgroup.conf - ``` - - The host UID/GID range must be free for Hyper to allocate per-VM users - from. The node's range is set by `uid_gid_range` in `config :hyper`; the - helper independently reads `[uid_gid_range]` from `/etc/hyper/config.toml` - (see below) and only accepts jailer `--uid`/`--gid` within that range. - Keep the two in sync. 
- #### Auto-redistributed `umoci` and the guest `vmlinux` kernels are downloaded, checksum-verified, and @@ -207,31 +30,24 @@ snippet to paste in. ### Configuration -Running `Hyper` is involved and requires a large number of pre-requisites. The -configuration of `:hyper` can be done by creating a `config :hyper` entry in -your `config.exs`. Refer to the given snippet for details on each -configuration. +Almost all host configuration — `work_dir`, the `[tools]` binary paths +(`firecracker`, `jailer`, `dmsetup`, ...), and the `[jails]` table (`cgroup`, +`uid_gid_range`) — lives in `/etc/hyper/config.toml` (the single source of +truth shared with the setuid helper, shown above), and every node-local path +(`jails`, `socks`, `scratch`, `layers`) is derived from `work_dir`. None of it is +repeated in `config :hyper`. + +The node's own tool paths (`skopeo`, `mke2fs`, `umoci`, `suidhelper`) now live in +the `[tools]` table of `/etc/hyper/config.toml` alongside the privileged binaries, +so `config :hyper` holds only the per-architecture guest kernels (each with a +default, so the block may be omitted): ```elixir config :hyper, - # You must create a parent cgroup on your system. Continue reading for - # further details. - cgroup_parent: "hyper", - jailer_chroot_base: "/srv/hyper/jails", - socket_dir: "/srv/hyper/socks", - scratch_dir: "/srv/hyper/scratch", - # Must match the [uid_gid_range] table in /etc/hyper/config.toml so the node - # hands out only uids the helper will accept. - uid_gid_range: {900_000, 999_999}, - layer_dir: "/srv/hyper/layers" + # Per-architecture guest kernel images placed on the host. + vmlinux: %{x86_64: "/srv/hyper/redist/vmlinux/vmlinux-x86_64"} ``` -The `firecracker` and `jailer` binary paths are **not** set here — they are read -from `/etc/hyper/config.toml` (the single source of truth shared with the setuid -helper). See the `config.toml` example above. - - - ### Usage