…md-cli
Adds a `vision` built-in tool that analyses images and videos locally using
MiniCPM-V 4.6 — a 1.3B multimodal model running via llama.cpp's
llama-mtmd-cli, with no cloud API required.
## What's new
**Tool (`vision`)**
- Accepts images (JPEG, PNG, GIF, WebP, BMP) and videos (MP4, MOV, AVI,
MKV, WebM)
- Videos: ffprobe reads duration, ffmpeg extracts N evenly-spaced frames,
all frames sent as a multi-image call to the model (configurable via
`video_frames`, default 8)
- Security: O_NOFOLLOW open (symlink protection), danger.CheckOperation
classification, all output wrapped in wrapUntrusted() with provenance tag
- Setup instructions in every error path (missing binary, missing model,
missing mmproj, missing ffmpeg)
**Docker (`docker/Dockerfile`)**
- New `minicpm` multi-stage build: downloads pre-built llama-mtmd-cli
(llama.cpp b9549) for amd64/arm64 from the official GitHub release, then
fetches MiniCPM-V-4_6-Q4_K_M.gguf (529 MB) and mmproj-model-f16.gguf
(1.1 GB) from HuggingFace into /usr/local/share/minicpm-v/models/
- Overridable via --build-arg MINICPM_QUANT=Q8_0 and LLAMA_VERSION
- Runtime stage copies binary + models; no new runtime deps (libstdc++6
already present for whisper)
**Config (`internal/config/loader.go`)**
- New VisionConfig struct: ModelsDir, BinaryPath, VideoFrames
- Wired into FileConfig, ResolvedConfig, resolveVision(), mergeFile()
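
A rough sketch of the config shape and the zero-frames backfill, assuming the JSON keys `models_dir`, `binary_path`, and `video_frames` used elsewhere in this PR. `resolveVisionSketch` is a hypothetical stand-in for `resolveVision()`; treating negative values like zero is an assumption, and path auto-detection is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// VisionConfig mirrors the three fields named above. The JSON tags are
// assumed from the config keys in this PR.
type VisionConfig struct {
	ModelsDir   string `json:"models_dir"`
	BinaryPath  string `json:"binary_path"`
	VideoFrames int    `json:"video_frames"`
}

// defaultVideoFrames is the documented default for video_frames.
const defaultVideoFrames = 8

// resolveVisionSketch backfills VideoFrames when it is unset (zero) or
// negative and leaves every other field untouched. Hypothetical helper.
func resolveVisionSketch(in VisionConfig) VisionConfig {
	out := in
	if out.VideoFrames <= 0 {
		out.VideoFrames = defaultVideoFrames
	}
	return out
}

func main() {
	// A config file that sets only models_dir; video_frames is omitted.
	raw := []byte(`{"vision": {"models_dir": "~/.odek/minicpm-v/models"}}`)
	var file struct {
		Vision VisionConfig `json:"vision"`
	}
	if err := json.Unmarshal(raw, &file); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", resolveVisionSketch(file.Vision))
}
```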
**Tests**
- 13 tests in cmd/odek/vision_tool_test.go: empty path, invalid JSON, file
not found, symlink rejected, missing binary, missing model, missing mmproj,
mock happy-path image (4 extensions), custom prompt, mock happy-path video
(with mock ffprobe+ffmpeg via PATH override), missing ffmpeg fallback,
schema shape
- 3 tests in internal/config/vision_test.go: resolveVision defaults,
zero-frames backfill, custom values round-trip
**Docs**
- docs/CHEATSHEET.md: new Image & Video Understanding section with config
snippet and field reference
- docs/SECURITY.md: vision added to untrusted-content table, always-untrusted
list, and skills provenance gate paragraph
- docs/CONFIG.md + docs/TELEGRAM.md: smart-previews bullet updated
- docker/README.md: new Image & video understanding (out of the box) section
- README.md: vision added to external-content ingestion list
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
## Summary

- `vision` built-in tool that analyses images and videos locally using MiniCPM-V 4.6 (1.3B multimodal model) via `llama-mtmd-cli`; no cloud API, no external service
- `minicpm` multi-stage build that downloads a pre-built `llama-mtmd-cli` binary (llama.cpp b9549, amd64/arm64) and fetches the model GGUF (Q4_K_M, 529 MB) + vision projector (mmproj, 1.1 GB) from HuggingFace at build time
- `VisionConfig` added to the config system; all existing `builtinTools()` call sites updated; `vision` registered as a built-in tool alongside `transcribe`

## What the tool does

- Images: `llama-mtmd-cli` in single-turn mode
- Videos: `ffprobe` reads duration → `ffmpeg` extracts N evenly-spaced frames → multi-image call with all frames

Output is wrapped in `wrapUntrusted()` (same as `transcribe`) and is classified as always-untrusted in the security model.

## Config

`{ "vision": { "models_dir": "~/.odek/minicpm-v/models", "binary_path": "/usr/local/bin/llama-mtmd-cli", "video_frames": 8 } }`

All fields are optional; the Docker image path `/usr/local/share/minicpm-v/models/` is auto-detected.

## Docker build args

- `MINICPM_QUANT`: model quantisation, overridable via `--build-arg` (e.g. `MINICPM_QUANT=Q8_0`); the default build fetches Q4_K_M
- `LLAMA_VERSION`: llama.cpp release to download; the default build uses b9549

## Test plan

- `go test ./... -count=1`
- `TestVision_*` (13 tests): unit + mock-binary happy paths for images and video
- `TestResolveVision_*` (3 tests): config resolver defaults and custom values
- `docker compose --profile restricted up --build` (requires internet access to HuggingFace during build)
- Image input: `vision` returns a description
- Video input: `frames > 0` in the result

🤖 Generated with Claude Code