chore: Get a VM running#31

Draft
markovejnovic wants to merge 47 commits into main from chore/get-a-vm-running

@markovejnovic

Contributor

No description provided.

Mix spawns subprocesses in their own session with no controlling terminal
(erl_child_setup calls setsid), so the nested `sudo install` could never
prompt for a password and always failed with "a terminal is required".

Split the task: `cargo xtask stamp` builds + stamps unprivileged, then the
privileged copy runs via sudo only when it is already non-interactive
(`sudo -n` succeeds); otherwise it prints the exact command to run by hand.
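A minimal sketch of the gate described above, assuming the probe is `sudo -n true`. The names `InstallStep`, `sudo_is_noninteractive`, and `plan_install` are hypothetical and not taken from the xtask or the Mix task; the privileged argv is supplied by the caller.

```rust
use std::process::{Command, Stdio};

/// What to do with the privileged copy step (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum InstallStep {
    /// sudo will not prompt: run this argv.
    Run(Vec<String>),
    /// sudo would need a terminal: show the operator this command line.
    PrintManual(String),
}

/// True when `sudo -n true` succeeds, i.e. sudo needs no password prompt.
/// A spawn failure (for example, sudo not installed) counts as false.
pub fn sudo_is_noninteractive() -> bool {
    Command::new("sudo")
        .args(["-n", "true"])
        .stdin(Stdio::null())
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .status()
        .map(|status| status.success())
        .unwrap_or(false)
}

/// Pure decision: run via `sudo -n`, or print the command to run by hand.
pub fn plan_install(noninteractive: bool, privileged_argv: &[&str]) -> InstallStep {
    if noninteractive {
        let mut argv = vec!["sudo".to_string(), "-n".to_string()];
        argv.extend(privileged_argv.iter().map(|s| s.to_string()));
        InstallStep::Run(argv)
    } else {
        InstallStep::PrintManual(format!("sudo {}", privileged_argv.join(" ")))
    }
}
```

Keeping the decision separate from the probe means the branch that prints the manual command can be tested without sudo being present.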
The unprivileged node used to pass `--bin <path>` to the setuid helper for
every losetup/dmsetup/blockdev op. Letting the caller name the binary the
helper escalates to run is a needless trust hole, even with SafeBin checks.

Move the paths into the helper-owned config (/etc/hyper/config.toml) with
sane defaults (/usr/sbin/{losetup,dmsetup,blockdev}); the helper validates
each as a SafeBin (absolute, root-owned, non-writable, exact basename) at
dispatch, as the real uid, before acquiring root. The `--bin` argument is
gone from both sides.
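The four SafeBin properties listed above can be sketched as a pure check over facts gathered from a single stat. This is an illustration only: the function names and error type are hypothetical, "non-writable" is assumed here to mean not group- or other-writable, and the real helper's uid handling (validating as the real uid before acquiring root) is not modelled.

```rust
use std::path::Path;

#[derive(Debug, PartialEq)]
pub enum SafeBinError {
    NotAbsolute,
    WrongBasename,
    NotRootOwned,
    Writable,
}

/// Pure check: absolute path, exact basename, owned by uid 0,
/// and (assumption) no group/other write bits in the mode.
pub fn check_safe_bin(
    path: &Path,
    owner_uid: u32,
    mode: u32,
    expected_basename: &str,
) -> Result<(), SafeBinError> {
    if !path.is_absolute() {
        return Err(SafeBinError::NotAbsolute);
    }
    if path.file_name().and_then(|n| n.to_str()) != Some(expected_basename) {
        return Err(SafeBinError::WrongBasename);
    }
    if owner_uid != 0 {
        return Err(SafeBinError::NotRootOwned);
    }
    if mode & 0o022 != 0 {
        return Err(SafeBinError::Writable);
    }
    Ok(())
}

/// Gathers owner and mode from disk, then applies the pure check.
pub fn check_safe_bin_on_disk(
    path: &Path,
    expected_basename: &str,
) -> std::io::Result<Result<(), SafeBinError>> {
    use std::os::unix::fs::MetadataExt;
    let meta = std::fs::metadata(path)?;
    Ok(check_safe_bin(path, meta.uid(), meta.mode(), expected_basename))
}
```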

Also:
- Add a `dmsetup targets` op so the dm-target readiness probe runs through
  the helper (it opens /dev/mapper/control, which needs root) instead of
  shelling dmsetup directly as the BEAM user.
- An absent config file now falls back to the built-in defaults (trusted,
  compiled into the root-owned binary); a present-but-untrusted file stays
  fatal. Drop the Elixir-side *_path config and per-tool presence checks.
- Add :mix to the dialyzer PLT so the Mix tasks resolve Mix.raise/shell.

Document the Docker Postgres setup matching config.exs defaults, the now
config-sourced (and optional) device-binary paths, and the sudo/tty caveat
for mix suidhelper.install.

Spell out modprobe + dmsetup targets verify, the linux-modules-extra
fallback for stripped cloud kernels, and persisting via modules-load.d -
the missing_dm_targets boot failure points here.

mkdir under the cgroup-v2 hierarchy, enable cpu/memory in subtree_control
(root-level fallback), and persist via systemd-tmpfiles since cgroupfs is
memory-backed. The :missing_parent_cgroup boot failure points here.

Budget.Advertiser advertises on init, which calls Hyper.Node.Layer.active/0
and selects on Hyper.Node.Layer.Registry. That registry is owned by
Hyper.Node.Layer, which was ordered after Budget.Supervisor - so boot crashed
with `unknown registry: Hyper.Node.Layer.Registry`. Order Layer first.

ThinPool traps exits (for terminate/2 device teardown), so the normal close
of each System.cmd port arrived as an unhandled {:EXIT, port, :normal} and
logged as an error. Add a handle_info clause to ignore normal exits.

Tracing defaulted to Honeycomb's endpoint but only sent the auth header when
HONEYCOMB_API_KEY was set, so keyless runs 401'd on every batch. Export only
when a key or a custom OTEL endpoint is configured; otherwise disable the
exporter.

An unclean shutdown (SIGKILL, so terminate/2 never ran) leaves hyper-thinpool
behind; the next boot failed with "device-mapper: create ioctl ... Device or
resource busy". Best-effort remove any pre-existing pool before rebuilding,
ahead of zero_metadata so a still-live pool is never corrupted.

Hyper.Img.OciLoader.test_system/0 (checks skopeo/umoci/mke2fs) existed but was
never called from Hyper.Node.test_system, so a missing skopeo only surfaced at
image-load time. Wire it in after Umoci.ensure_installed so the node refuses to
boot when an OCI loader tool is absent, like every other host requirement.

Layer.Server, Img.Mutable and Img.Server all trap exits (for terminate/2
device teardown) and shell privileged commands through System.cmd, which links
a transient port to the process. The port's normal close arrived as
{:EXIT, port, :normal} with no matching handle_info clause - Layer.Server
crash-looped on it while mounting a layer, which surfaced to create_vm as
:no_capacity. Add the same ignore-clause already in ThinPool to each.

@github-actions

github-actions Bot commented Jun 25, 2026

Test Results

371 tests  +75   370 ✅ +75   6s ⏱️ +3s
 64 suites + 9     1 💤 ± 0 
  2 files   ± 0     0 ❌ ± 0 

Results for commit 293293b. ± Comparison against base commit 9ee5de4.

This pull request removes 1 and adds 76 tests. Note that renamed tests count towards both.
hyper-suidhelper::e2e_config ‑ missing_config_exits_2
Elixir.Hyper.Node.FireVMM.JailerTest ‑ test args contain --id, --uid, --gid with the opts values
Elixir.Hyper.Node.FireVMM.JailerTest ‑ test args do not contain privileged flags owned by the suidhelper
Elixir.Hyper.Node.FireVMM.JailerTest ‑ test args end with --api-sock /api.socket
Elixir.Hyper.Node.FireVMM.JailerTest ‑ test args include --cgroup cpu.max and memory.max for :micro type
Elixir.Hyper.Node.FireVMM.JailerTest ‑ test args start with the jailer subcommand
Elixir.Hyper.Node.FireVMM.JailerTest ‑ test binary is the suid helper
Elixir.Hyper.Node.Reaper.PlanPropertiesTest ‑ property Mutable.dm_name/1 round-trips through rw_ids for a real vm_id
Elixir.Hyper.Node.Reaper.PlanPropertiesTest ‑ property a live vm_id is never a reap candidate
Elixir.Hyper.Node.Reaper.PlanPropertiesTest ‑ property only twice-seen orphans are reaped; current is carried forward
Elixir.Hyper.Node.Reaper.PlanPropertiesTest ‑ property rw_ids excludes thinpool, img, and non-rw junk
…

♻️ This comment has been updated with latest results.

DynamicSupervisor.start_child rejected the FireVMM child spec with
{:invalid_child_spec, ...} because the map used a :vm_id key where a supervisor
child spec requires :id - so no VM could ever boot. Use :id.

Add @spec child_spec(Opts.t()) :: Supervisor.child_spec(). child_spec/1 is a
plain def (not a typed @callback), so without a spec dialyzer never compared
the returned map against the child-spec contract and the typo passed the gate.
With the spec, the bad key fails dialyzer as invalid_contract.

place/3 reduced every candidate failure into a blanket {:error, :no_capacity},
so a real boot error on the only candidate was indistinguishable from genuine
lack of capacity. Log the actual refusal reason at each candidate.

An unclean shutdown (SIGKILL / :erlang.halt, where terminate/2 never runs)
leaves hyper dm devices and loop devices behind. The next boot then crashed in
ThinPool.init with "device-mapper: create ioctl on hyper-thinpool: Device or
resource busy", and a leaked thin volume held the pool open so the previous
remove-the-pool-only reclaim could not clear it.

Add read-only `dmsetup ls` and `losetup --list` ops to the suidhelper, and a
Hyper.Node.Reclaim pass that runs once before any device GenServer starts: it
removes every hyper-prefixed dm device leaf-first (retrying leftovers until a
pass clears nothing new, so stacked snapshots and the pool-under-volume case
resolve) then detaches loop devices backing files under the data dirs. Replaces
ThinPool.init's pool-only reclaim. Best-effort; logs and continues.
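The retry-until-no-progress loop described above might look like the following sketch. It is written in Rust for illustration even though Hyper.Node.Reclaim is an Elixir module; `reclaim_until_stable` and the `try_remove` callback are hypothetical names, and the callback stands in for the helper's dm remove op.

```rust
/// Repeatedly tries to remove every remaining device and stops once a full
/// pass removes nothing new. `try_remove` returns true on success and is
/// expected to fail while another device still holds the target open.
/// Returns the devices that could not be removed (best-effort leftovers).
pub fn reclaim_until_stable<F>(devices: &[String], mut try_remove: F) -> Vec<String>
where
    F: FnMut(&str) -> bool,
{
    let mut remaining: Vec<String> = devices.to_vec();
    loop {
        let before = remaining.len();
        // Keep only the devices whose removal failed in this pass.
        remaining.retain(|d| !try_remove(d.as_str()));
        if remaining.is_empty() || remaining.len() == before {
            return remaining;
        }
    }
}
```

Because a busy parent simply fails and is retried on the next pass, leaves go first without the caller needing to know the dependency graph; that is how the pool-under-volume case resolves.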
Extends the suidhelper config with four new fields needed before the
jailer exec path can be wired up:

- firecracker / jailer: Option<PathBuf> — no built-in default; accessors
  return BinError::Unconfigured when absent so callers get a clear
  "operator must configure this" signal rather than a path-validation error.
- parent_cgroup: String — defaults to "hyper", matching Elixir's
  @parent_cgroup (operators narrowing must keep both in sync).
- uid_gid_range: Option<UidGidRange> — total accessor returning
  (900_000, 999_999) when absent; a present range with min==0 or min>max
  is fatal at safe_load time, consistent with the "present-but-untrusted
  is fatal" trust model.

Adds BinError (Unconfigured | Bin) so firecracker/jailer accessors have a
richer error type than the existing tool accessors (which can never be
Unconfigured). Adds validate_uid_gid_range as a public pure function so
the refusal contract can be property-tested without touching the file system.

New test target config_uid_gid_range exercises: absent → default,
valid range round-trips via TOML, min==0 always rejected, min>max always
rejected. Extends e2e/config with: Unconfigured on absent keys, basename
mismatch rejected, Ok on root-owned correct-name binaries (root-guarded),
bad uid_gid_range binary exit 2 (root-guarded).
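The refusal contract for the range can be sketched as a pure function. Only the name `validate_uid_gid_range`, the default `(900_000, 999_999)`, and the two rejection rules come from the commit message; the signature, the error type, and `effective_range` are assumptions made for this example.

```rust
/// Default band used when the config has no uid_gid_range.
pub const DEFAULT_UID_GID_RANGE: (u32, u32) = (900_000, 999_999);

#[derive(Debug, PartialEq)]
pub enum RangeError {
    /// min == 0 would allow handing out root's uid/gid.
    MinIsZero,
    /// min > max describes an empty band.
    MinAboveMax,
}

/// Pure validation: no file system access, so it can be property-tested.
pub fn validate_uid_gid_range(min: u32, max: u32) -> Result<(u32, u32), RangeError> {
    if min == 0 {
        return Err(RangeError::MinIsZero);
    }
    if min > max {
        return Err(RangeError::MinAboveMax);
    }
    Ok((min, max))
}

/// Absent falls back to the default; present-but-invalid is an error.
pub fn effective_range(configured: Option<(u32, u32)>) -> Result<(u32, u32), RangeError> {
    match configured {
        None => Ok(DEFAULT_UID_GID_RANGE),
        Some((min, max)) => validate_uid_gid_range(min, max),
    }
}
```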
The BEAM no longer names a privileged binary path.  `Jailer.command/1`
now launches `hyper-suidhelper jailer` with only the id, uid/gid, cgroup
flags, and api-sock; the helper reads firecracker/jailer binary paths,
chroot base, parent cgroup, and cgroup version from its trusted
/etc/hyper/config.toml, re-acquires root, and execve's the jailer (same
pid, so MuonTrap owns the lifetime).

Changes:
- config.ex: add config_toml/0 (file → persistent_term cache); rewrite
  work_dir/0 over it; add firecracker_bin/0 + jailer_bin/0 via
  fetch_bin!/1 (raises when key absent); remove firecracker_install_dir/0.
- provider.ex: deleted.
- jailer.ex: rewire command/1 and exec_name/0; drop Provider alias.
- daemon.ex: note that supervised process is the helper (execve → jailer).
- node.ex: replace Provider.ensure_installed() with check_firecracker_bins/0.
- hyper.ex: prefix gen_vm_id/0 with "v" so ids never start with "-".
- jailer_test.exs: new pure test suite; load-bearing assertion that args
  contain no --exec-file / --chroot-base-dir / -- flags.

…r/jailer

Downloads the pinned Firecracker v1.16.0 release via Redist.Targz.install/3
(download → SHA-256 verify → extract), copies the version-stamped binaries to
bare-basename paths (<prefix>/firecracker and <prefix>/jailer) required by the
suidhelper's SafeBin validator, and prints the config snippets the operator
needs to paste.

…nstall

Remove firecracker/jailer from the auto-redistributed list; operators now
install them via `mix firecracker.install [--prefix <dir>]`. Update
/etc/hyper/config.toml example to show the required firecracker/jailer keys
(no default, root-owned + non-world-writable, bare basenames validated by the
helper) and the optional [uid_gid_range] table. Update `config :hyper` snippet
to drop the auto-download TODOs and point at the install-produced paths.
Document that uid_gid_range in config :hyper and [uid_gid_range] in
config.toml must be kept in sync.

The Elixir node was fetching TOML keys "firecracker_bin"/"jailer_bin" while
the setuid helper (and the TOML example in docs) uses "firecracker"/"jailer".
A correctly-installed host therefore crashed on launch with an unset-key
error even when the file was present.

- config.ex: fix fetch_bin! keys to "firecracker"/"jailer"; add non-raising
  firecracker_bin_configured/0 and jailer_bin_configured/0 returning
  {:ok, path} | :error for use by pre-launch checks.
- node.ex: rewrite check_firecracker_bins/0 to use the non-raising accessors
  so a missing key returns {:error, :firecracker_not_configured} instead of
  raising, honouring the @spec contract.
- firecracker.install.ex: drop the dead "config :hyper, firecracker_bin:"
  snippet from print_config/1 — nothing reads Application env for these paths.
- docs/cookbook/intro.md: remove firecracker_bin/jailer_bin from config :hyper
  block; clarify paths live only in /etc/hyper/config.toml.
- jailer_test.exs: align stub map keys to "firecracker"/"jailer" (the bug
  that masked this); add positive --cgroup assertion for :micro type.
- Cargo.toml: remove redundant toml dev-dependency (already a normal dep).

The task runs unprivileged, so the binaries land owned by the invoking
user. SafeBin in the suidhelper refuses any jailer/firecracker not owned
by root, so every launch fails closed until they are chowned. Print the
exact chown/chmod commands (with the real installed paths) instead of a
vague 'ensure root-owned' note.

The supervised process is MuonTrap.Daemon, so route the jailed
process's stdout+stderr (guest serial console included) to the Logger
via log_output, and map the exit status into {:firecracker_exited,
status} so a crash names the real exit code instead of the opaque
:error_exit_status. Add explicit launch/launch-failure log lines keyed
by vm id, so a boot loop is visible without reading console scrollback.

The :awaiting_api probe swallowed the describe_instance error and
stopped with a bare :daemon_unready, hiding why a healthy-looking
firecracker is unreachable. Log the last probe error on deadline and
carry it in the stop reason ({:daemon_unready, reason}), so a host->jail
socket permission/path problem is diagnosable instead of looking like an
unexplained 5s restart loop.

… the node user

The jailer drops firecracker to a per-VM uid/gid and chroots it, so the
API socket it creates at <jail>/root/api.socket is owned by that per-VM
id, mode 0755. Connecting a unix socket needs write permission, so the
unprivileged node controller gets EACCES on connect().

grant-api confines the socket path under JAIL_BASE (SafePath + O_NOFOLLOW
walk, fd-relative on the pinned root dir), verifies the leaf is a real
socket via fstatat(AT_SYMLINK_NOFOLLOW) (a planted file/symlink is
refused, never touched), then chowns it to the helper's caller
(getuid/getgid, the real=caller ids inside the privileged scope) and
chmods 0660. A not-yet-created socket is reported Pending, not an error.
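The "real socket, never follow a symlink, missing means Pending" classification can be approximated with the standard library. This is a simplified sketch: it checks by path with `symlink_metadata` (an lstat) rather than fd-relative `fstatat` on a pinned directory, so it does not reproduce the helper's TOCTOU protection, and `SocketState` and `classify_api_socket` are hypothetical names.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::FileTypeExt;
use std::path::Path;

#[derive(Debug, PartialEq)]
pub enum SocketState {
    /// The leaf exists and is a unix socket.
    Ready,
    /// The leaf does not exist yet; not an error.
    Pending,
    /// The leaf exists but is a file, symlink, or anything else: refuse.
    NotASocket,
}

/// Classifies the leaf without following a symlink at the final component.
pub fn classify_api_socket(path: &Path) -> io::Result<SocketState> {
    match fs::symlink_metadata(path) {
        Ok(meta) if meta.file_type().is_socket() => Ok(SocketState::Ready),
        Ok(_) => Ok(SocketState::NotASocket),
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(SocketState::Pending),
        Err(e) => Err(e),
    }
}
```

Only a `Ready` result would proceed to the chown and chmod; a symlink pointing at a genuine socket is still refused because the link itself is what gets inspected.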

Adds SafeDir::stat and SafeDir::chmod fd-relative primitives.

… can connect

AwaitingApi now hands the jailed API socket to the node user (via the new
chroot-jail grant-api op) before each readiness probe. firecracker
creates the socket owned by the per-VM uid, so the controller gets EACCES
until the helper chowns it; granting first lets the probe (and every
later API call) connect.

The grant runs once (tracked by State.api_granted); :socket_pending and
transient grant errors keep the controller waiting until the existing
boot deadline rather than crashing, via a shared deadline-aware
keep_probing path.

gen_vm_id used Base.url_encode64, which emits - and _. firecracker
rejects _ in an instance id (InvalidInstanceId / "Invalid char (_)"),
so any id containing one crash-looped the jailer at boot. Switch to
lowercase base32 ([a-z2-7], alphanumeric only) - the intersection of the
firecracker, dm/jailer, and registry-key constraints. Strengthen the
property from "no leading -" to "strictly alphanumeric".
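To illustrate why lowercase base32 satisfies the `[a-z2-7]` constraint, here is a small unpadded encoder plus a charset check, sketched in Rust rather than the project's Elixir. The function names are hypothetical, and the random bytes are assumed to come from the caller since the standard library has no random source.

```rust
/// RFC 4648 base32 alphabet, lowercased: every symbol is in [a-z2-7].
const ALPHABET: &[u8; 32] = b"abcdefghijklmnopqrstuvwxyz234567";

/// Encodes bytes as lowercase base32 without '=' padding.
pub fn encode_base32_lower(bytes: &[u8]) -> String {
    let mut out = String::with_capacity((bytes.len() * 8 + 4) / 5);
    let mut buffer: u32 = 0; // pending bits, fewer than 5 between bytes
    let mut bits: u32 = 0; // number of valid low bits in `buffer`
    for &b in bytes {
        buffer = (buffer << 8) | u32::from(b);
        bits += 8;
        while bits >= 5 {
            bits -= 5;
            out.push(ALPHABET[((buffer >> bits) & 0x1f) as usize] as char);
        }
        buffer &= (1u32 << bits) - 1; // drop the bits already emitted
    }
    if bits > 0 {
        // Left-align the final partial group, zero-filling the low bits.
        out.push(ALPHABET[((buffer << (5 - bits)) & 0x1f) as usize] as char);
    }
    out
}

/// The strengthened property: non-empty and strictly within [a-z2-7].
pub fn is_valid_vm_id(id: &str) -> bool {
    !id.is_empty() && id.bytes().all(|c| matches!(c, b'a'..=b'z' | b'2'..=b'7'))
}
```

Any byte string encodes to a valid id under this check, which is what makes the property hold for every generated value rather than most of them.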
Starting a process with a {:via, Horde.Registry, _} name makes OTP run
gen:get_proc_name right after start: it calls whereis_name immediately
after the synchronous register. Horde materialises the name into local
ETS only asynchronously (DeltaCRDT diff loop), so under registry churn
the read loses the race and startup aborts with
{:process_not_registered_via, Horde.Registry}. The crash-storm from the
bad vm_id flooded the CRDT and tipped this over, killing FireVMM.State.

Add Routing.register_self/1 and have the supervisor, client, and state
machine register from their own init (started unnamed). Names stay
cluster-resolvable via via/1 once the diff propagates - callers already
tolerate that lag. Core starts unnamed entirely (resolved by nobody).

grant-api chowned the API socket to the caller, but the jailer leaves
<id>/root as 0700 owned by the per-VM uid. connect() needs search (+x)
on every ancestor, so the unprivileged node still got EACCES traversing
into root - the socket's owner was irrelevant. The op now also opens that
one directory to the caller's group: owner stays the per-VM uid
(firecracker needs it), chgrp to the caller's gid, chmod 0710 (owner rwx,
group --x traverse-not-list, other none). Unrelated users stay locked
out; only the socket and its parent's group/mode move.

Add SafeDir::chmod_self/chgrp_self (fchmod/fchown on the pinned root fd,
not by name - TOCTOU-safe). Extend the grant test to assert root is
chgrp'd to the caller and chmod'd 0710.

markovejnovic force-pushed the chore/get-a-vm-running branch from c9ebaaf to 4196153 on June 26, 2026 02:45

The rootfs device handed to chroot-jail staging is /dev/mapper/hyper-rw-<id>,
a symlink into the dm farm. SafeFile<IsBlockDevice> opens it O_NOFOLLOW (it must
never follow an attacker-supplied symlink), so the dm symlink itself was rejected
with "file is not of the required type" and the guest never booted.

Resolve the device to its real /dev/dm-N node via canonicalize before the open.
Safe because the BlockDev lexical guard already proved the name is a hyper-owned
dm device and /dev/mapper is a root-owned 0755 dir an unprivileged caller cannot
redirect; loop nodes canonicalize to themselves. The re-opened target is still
verified IsBlockDevice.

Both staging and config-apply failures end the boot via a :one_for_all
supervisor restart, so without an explicit log the reason vanished into the
restart cycle and the VM just appeared to relaunch for no visible cause. Log
the reason at each failure path before stopping.

chroot-jail remove rmdir'd the per-VM cgroup leaf and swallowed ENOTEMPTY as
success, so a firecracker still alive in the cgroup was never killed and leaked
- holding its rootfs dm device and loop devices, wedging the host (a restart
storm left 344 orphaned firecrackers that needed sudo pkill to clear).

Write "1" to the leaf's cgroup.kill (v2; the jailer runs --cgroup-version 2)
before rmdir, SIGKILLing the whole subtree regardless of session - the jailer
setsid's firecracker out of the process group, so MuonTrap's group-kill misses
it but cgroup.kill does not. Reorder so the cgroup teardown (the kill) runs
before the chroot removal, and retry the rmdir while the killed cgroup drains
(EBUSY/ENOTEMPTY tolerated; a persistent busy is left for a later sweep rather
than failing a relaunch). Adds SafeDir::write_file (O_WRONLY|O_NOFOLLOW, no
O_CREAT) for the pseudo-file write, fd-relative on the pinned leaf dir.
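A path-based sketch of the kill-then-rmdir order, assuming Linux errno values. It is not the helper's implementation: the real op is fd-relative and opens with O_NOFOLLOW, which this example omits, and `kill_and_remove_leaf` with its retry parameters is a hypothetical name and shape.

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, Write};
use std::path::Path;
use std::thread::sleep;
use std::time::Duration;

// Linux errno values, assumed for this sketch.
const EBUSY: i32 = 16;
const ENOTEMPTY: i32 = 39;

/// Writes "1" to <leaf>/cgroup.kill (cgroup v2), then retries rmdir while
/// the killed cgroup drains. Returns Ok(true) once the leaf is gone and
/// Ok(false) when it is still busy after `attempts` tries, leaving it for a
/// later sweep.
pub fn kill_and_remove_leaf(leaf: &Path, attempts: u32, pause: Duration) -> io::Result<bool> {
    // No create(true): the pseudo-file must already exist.
    let mut kill = OpenOptions::new().write(true).open(leaf.join("cgroup.kill"))?;
    kill.write_all(b"1")?;
    drop(kill);

    for _ in 0..attempts {
        match fs::remove_dir(leaf) {
            Ok(()) => return Ok(true),
            Err(e) if e.kind() == io::ErrorKind::NotFound => return Ok(true),
            // Tolerated while processes in the cgroup are still exiting.
            Err(e) if matches!(e.raw_os_error(), Some(EBUSY) | Some(ENOTEMPTY)) => sleep(pause),
            Err(e) => return Err(e),
        }
    }
    Ok(false)
}
```

The kill comes first so that the rmdir only ever races against processes that are already dying, instead of treating a populated cgroup as successfully removed.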
Two bugs let a stopped VM's firecracker survive:

1. Jailer.cgroup_dir/1 computed /sys/fs/cgroup/<parent>/<exec>/<id>, but the
   jailer (cgroup v2) places firecracker at <parent>/<id> with no <exec> level
   (confirmed via /proc/<pid>/cgroup = 0::/<parent>/<id>). So the teardown's
   cgroup remove targeted a path that never existed - a silent no-op. Drop the
   <exec> level; add Jailer.cgroup_parent_dir/0 (the dir whose subdirs are the
   vm_id leaves) for the reaper to enumerate.

2. Daemon only cleared the jail on (re)start, never on a final stop, and relied
   on MuonTrap's port-close to kill firecracker - which it cannot, since the
   jailer setsid's firecracker into its own session. A graceful VM stop thus
   left firecracker running, holding its dm/loop devices. Make Daemon a
   trap_exit GenServer whose terminate/2 runs the helper's cgroup.kill teardown,
   so firecracker is guaranteed dead before its mutable dm layer is torn down.
   A linked MuonTrap.Daemon exit still stops the server so Core cold-boots.
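The corrected layout from item 1 can be shown as a path builder next to a parser for the `0::/<parent>/<id>` line of /proc/<pid>/cgroup. This is a Rust illustration of the layout only; the Elixir functions are Jailer.cgroup_dir/1 and Jailer.cgroup_parent_dir/0, and the names below are hypothetical.

```rust
use std::path::{Path, PathBuf};

/// Per-VM leaf under cgroup v2: <root>/<parent>/<id>, with no <exec> level.
pub fn cgroup_leaf_dir(cgroup_root: &Path, parent: &str, vm_id: &str) -> PathBuf {
    cgroup_root.join(parent).join(vm_id)
}

/// Extracts (parent, id) from the v2 entry of /proc/<pid>/cgroup.
/// Returns None when there is no "0::/" line or it lacks both levels.
pub fn parse_v2_cgroup(proc_cgroup: &str) -> Option<(String, String)> {
    let path = proc_cgroup.lines().find_map(|l| l.strip_prefix("0::/"))?;
    let (parent, id) = path.rsplit_once('/')?;
    if parent.is_empty() || id.is_empty() {
        return None;
    }
    Some((parent.to_string(), id.to_string()))
}
```

Comparing the parsed pair against the built path is the same check the commit describes doing by hand to confirm where the jailer actually places firecracker.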
A SIGKILL'd BEAM runs no terminate/2, so a firecracker (plus its cgroup and
hyper-rw-<id> dm volume) can outlive its owner with no vm_id ever rebooting to
clean it. Reaper is a per-node periodic, liveness-aware GC: each tick it diffs
the live local VMs (supervisor children + routing) against the on-host per-VM
cgroup leaves and hyper-rw-* dm volumes, and reaps the orphans via the helper's
cgroup.kill teardown + dm remove.

Liveness-aware on purpose: unlike boot-time Reclaim (which clears every hyper-*
before any owner starts), a periodic GC must never touch the live thinpool,
base images, or a running VM - so it only ever considers hyper-rw-* keyed to a
vm_id, and a two-strike grace (reap only an id orphaned across two consecutive
ticks) makes reaping a live or mid-boot VM structurally impossible.

The decision core (Reaper.Plan) is pure and property-tested: a live id is never
a candidate, only twice-seen orphans reap, and thinpool/img names are never
candidates. Mirrors the Hyper.Img.Db.Gc shell/pure-core/config trio.
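The two-strike decision can be sketched as a pure function over sets. This is a Rust illustration of the idea, not Reaper.Plan itself: the struct, the function signatures, and the choice to carry all current orphans forward are assumptions.

```rust
use std::collections::BTreeSet;

#[derive(Debug)]
pub struct Plan {
    /// Ids orphaned on this tick and on the previous one: reap now.
    pub reap: BTreeSet<String>,
    /// Ids orphaned on this tick: remembered for the next tick.
    pub carry: BTreeSet<String>,
}

/// Extracts vm ids from dm names of the form hyper-rw-<id>; the thinpool,
/// image devices, and unrelated names never yield an id.
pub fn rw_ids<'a>(dm_names: impl IntoIterator<Item = &'a str>) -> BTreeSet<String> {
    dm_names
        .into_iter()
        .filter_map(|name| name.strip_prefix("hyper-rw-"))
        .filter(|id| !id.is_empty())
        .map(str::to_string)
        .collect()
}

/// Two-strike plan: an id is reaped only if it is on the host, not live,
/// and was already an orphan on the previous tick.
pub fn plan(
    live: &BTreeSet<String>,
    on_host: &BTreeSet<String>,
    prev_orphans: &BTreeSet<String>,
) -> Plan {
    let orphans: BTreeSet<String> = on_host.difference(live).cloned().collect();
    let reap = orphans.intersection(prev_orphans).cloned().collect();
    Plan { reap, carry: orphans }
}
```

A VM that is mid-boot on one tick appears as an orphan at most once, so the grace period gives it a full tick to show up in the live set before anything is removed.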
The VM id was the only generated id in the system (image ids are content
hashes, user ids are pool integers), and it carries a real contract: a strict
`[a-z2-7]` charset that is the intersection of firecracker's instance-id rules,
dm/jailer naming, and path-component safety. Give it a home.

  * move `Hyper.gen_vm_id/0` -> `Hyper.Vm.Id.generate/0`, with the charset
    rationale as the module doc;
  * make `Hyper.Vm.Id.t()` the canonical id type and migrate every
    `Hyper.Vm.id()` reference to it (dropping the duplicate `@type id` alias on
    `Hyper.Vm`, which keeps `t :: pid()` for the VM handle);
  * move the id charset property out of jailer_test into a dedicated
    test/hyper/vm/id_test.exs.

No behavior change. Hyper.Img.id/Hyper.Node.Users.id stay inline - they have no
generator or charset contract to encapsulate.

The privileged e2e `execs_jailer_..._empty_env_as_root` asserted the recorder's
/proc/self/environ was literally empty, but the recorder is a /bin/sh script and
the shell self-sets PWD (and under bash _/SHLVL) on startup - so the assertion
failed in CI even though the helper does execve the jailer with an empty envp.

Prove the real property instead: no CALLER variable survives. The helper is
spawned with HYPER_* config vars in its own env, so their absence in the
recorder is the leak canary; allow only shell-set PWD/_/SHLVL.
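The revised assertion amounts to parsing the recorder's NUL-separated environ and rejecting every name outside the shell-set allow-list. A minimal sketch, assuming the test reads the raw bytes of /proc/self/environ; `leaked_vars` is a hypothetical helper name.

```rust
/// Variables a /bin/sh (or bash) recorder sets on its own at startup.
const SHELL_SET: [&str; 3] = ["PWD", "_", "SHLVL"];

/// Returns the names of all variables in a NUL-separated environ dump that
/// the shell did not set itself, i.e. anything inherited from the caller.
pub fn leaked_vars(environ: &[u8]) -> Vec<String> {
    environ
        .split(|&b| b == 0)
        .filter(|entry| !entry.is_empty())
        .map(|entry| {
            // The name is everything before the first '='.
            let name = entry.split(|&b| b == b'=').next().unwrap_or(entry);
            String::from_utf8_lossy(name).into_owned()
        })
        .filter(|name| !SHELL_SET.contains(&name.as_str()))
        .collect()
}
```

An empty result is the passing condition; a surviving HYPER_* name in the list is the canary that the helper passed the caller's environment through.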
The Hyper.Vm.id() -> Hyper.Vm.Id.t() migration widened the @spec past the line
limit; mix format wraps it. (CI format gate.)

Two changes, one theme: stop duplicating host config and read /etc/hyper/config.toml
at runtime so the node and the setuid helper can never drift.

[tools] section: the helper's tool binaries (firecracker, jailer, dmsetup,
losetup, blockdev) now live under a `[tools]` table instead of flat top-level
keys. Rust models it as a dedicated `Tools` struct (device tools default,
firecracker/jailer remain Option -> Unconfigured at use). Elixir reads
firecracker/jailer from `[tools]`; `mix firecracker.install` prints the `[tools]`
snippet.

Runtime-read shared config: `parent_cgroup` and `uid_gid_range` were specified
in BOTH config.toml (helper) and `config :hyper` (node) and had to be kept in
sync by hand. The node now reads them from config.toml at runtime, with defaults
matching the helper's `Config::default`. `layer_dir` is derived from `work_dir`
(`<work_dir>/layers`) rather than a separate key. The `config :hyper,
cgroup_parent/uid_gid_range/layer_dir` block is gone; an absent config.toml
yields the same built-in defaults on both sides.

Move the parent cgroup and the uid/gid allocation band out of top-level
config.toml keys into a [jails] table, matching the documented format:

    [jails]
    uid_gid_range = [900000, 999999]
    cgroup = "hyper"

uid_gid_range is now a [min, max] array rather than a {min, max} sub-table;
the Rust helper keeps the rich UidGidRange type via serde(from = "[u32; 2]").
Both the node and the setuid helper read the same [jails] table, so they
still cannot drift.
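The array-to-struct conversion that `serde(from = "[u32; 2]")` relies on is an ordinary `From` impl. The sketch below shows only that impl using the standard library; the field names and the `contains` helper are assumptions, and the serde attribute itself is left out so the example has no external dependencies.

```rust
/// Rich range type kept by the helper (field names are assumed here).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct UidGidRange {
    pub min: u32,
    pub max: u32,
}

/// With serde, `#[serde(from = "[u32; 2]")]` on the struct would route
/// deserialization of `uid_gid_range = [min, max]` through this impl.
impl From<[u32; 2]> for UidGidRange {
    fn from([min, max]: [u32; 2]) -> Self {
        UidGidRange { min, max }
    }
}

impl UidGidRange {
    /// True when `id` falls inside the inclusive band.
    pub fn contains(&self, id: u32) -> bool {
        self.min <= id && id <= self.max
    }
}
```

Because `From` is infallible, the min==0 and min>max refusals have to run as a separate validation step after deserialization.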
skopeo/mke2fs/umoci/suidhelper now read from the [tools] table in
/etc/hyper/config.toml (was compile-time/app-env). Adds guard-narrowed
tool_path/optional_tool_path helpers; rewrites Umoci.bin/0 as a case.

runtime.exs merges an optional operator config from /etc/hyper/config.exs
(override path via HYPER_CONFIG) last, so its values win. Skipped under :test.

Adds makeup_syntect + a docs alias step aliasing bash/sh fences to the
shell grammar, so toml/bash/sh/python/rust/markdown blocks highlight.

origin/main (#32) refactored Hyper.Config into Hyper.Cfg.* modules
(Cfg.Tools/Jails/Dirs/Budget/Otel/Toml/...) and deleted config.ex. This
branch had independent VM-boot work on the same files. Resolution:

- config.ex: accept main's deletion; rewrite all Hyper.Config.X callers to
  the Hyper.Cfg.{Tools,Jails,Dirs,Img}.Y equivalents (jailer, node, reclaim,
  dmsetup).
- suid_helper/{blockdev,dmsetup,losetup}, jailer: keep THIS branch's
  helper-routed architecture (no --bin; the merged Rust helper reads tool
  paths from config and owns the privileged jailer flags).
- node.ex: keep check_firecracker_bins + dropped FireVMM.Provider (branch),
  but load budget via Hyper.Cfg.Budget (main).
- fire_vmm/provider.ex: stays deleted (branch dropped Provider).
- umoci.ex, runtime.exs, config.md: take main (already carry these features,
  refactored; config.md is main's comprehensive restructure).
- jailer_test: stub via Hyper.Cfg.Toml.put_cache/reload.

Gate green: compile --warnings-as-errors, format, 250 tests, dialyzer 0.