
feat(vision): add vision tool using MiniCPM-V 4.6 (1.3B) via llama-mtmd-cli #22

Merged
jkyberneees merged 2 commits into main from feat/minicpm-v-vision-tool on Jun 7, 2026


@jkyberneees
Contributor

Summary

  • Adds a vision built-in tool that analyses images and videos locally using MiniCPM-V 4.6 (1.3B multimodal model) via llama-mtmd-cli — no cloud API, no external service
  • Docker image gains a new minicpm multi-stage build that downloads a pre-built llama-mtmd-cli binary (llama.cpp b9549, amd64/arm64) and fetches the model GGUF (Q4_K_M, 529 MB) + vision projector (mmproj, 1.1 GB) from HuggingFace at build time
  • New VisionConfig added to the config system; all existing builtinTools() call sites updated; vision registered as a built-in tool alongside transcribe

What the tool does

| Input | Behaviour |
| --- | --- |
| Image (JPEG, PNG, GIF, WebP, BMP) | Base64-encodes the file and sends it to llama-mtmd-cli in single-turn mode |
| Video (MP4, MOV, AVI, MKV, WebM) | ffprobe reads duration → ffmpeg extracts N evenly-spaced frames → multi-image call with all frames |

Output is wrapped in wrapUntrusted() (same as transcribe) — classified as always-untrusted in the security model.
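The image/video split and the "N evenly-spaced frames" step can be sketched as follows. This is a minimal illustration, not the code from this PR: the function names `isVideo` and `frameTimestamps` are hypothetical, and sampling at the midpoint of each slice (rather than at slice boundaries) is an assumption.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// videoExts mirrors the video formats listed in the table above.
var videoExts = map[string]bool{
	".mp4": true, ".mov": true, ".avi": true, ".mkv": true, ".webm": true,
}

// isVideo reports whether path carries one of the listed video extensions.
// The comparison is case-insensitive and looks at the extension only.
func isVideo(path string) bool {
	return videoExts[strings.ToLower(filepath.Ext(path))]
}

// frameTimestamps splits a clip of the given duration (seconds) into n equal
// slices and returns the midpoint of each slice, so no frame lands exactly on
// the first or last instant of the clip.
func frameTimestamps(duration float64, n int) []float64 {
	if duration <= 0 || n <= 0 {
		return nil
	}
	step := duration / float64(n)
	ts := make([]float64, n)
	for i := range ts {
		ts[i] = step * (float64(i) + 0.5)
	}
	return ts
}

func main() {
	fmt.Println(isVideo("clip.MP4"), isVideo("photo.jpeg")) // true false
	fmt.Println(frameTimestamps(16, 8))                     // [1 3 5 7 9 11 13 15]
}
```

Each timestamp would then be handed to ffmpeg to extract one frame; that invocation is left out here because the exact flags used by the tool are not shown in this PR description.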

Config

{
  "vision": {
    "models_dir": "~/.odek/minicpm-v/models",
    "binary_path": "/usr/local/bin/llama-mtmd-cli",
    "video_frames": 8
  }
}

All fields are optional — the Docker image path /usr/local/share/minicpm-v/models/ is auto-detected.

Docker build args

# Use a higher-quality quantization (default: Q4_K_M)
docker compose --profile restricted build \
  --build-arg MINICPM_QUANT=Q8_0
docker compose --profile restricted up

# Pin a different llama.cpp release (default: b9549)
docker compose --profile restricted build \
  --build-arg LLAMA_VERSION=b9600

Test plan

  • All 21 packages pass: go test ./... -count=1
  • TestVision_* (13 tests) — unit + mock-binary happy paths for images and video
  • TestResolveVision_* (3 tests) — config resolver defaults and custom values
  • Docker image builds: docker compose --profile restricted up --build (requires internet access to HuggingFace during build)
  • Smoke test: pass a local JPEG to the agent and verify vision returns a description
  • Smoke test video: pass a short MP4 and verify frames > 0 in the result

🤖 Generated with Claude Code

…md-cli

Adds a `vision` built-in tool that analyses images and videos locally using
MiniCPM-V 4.6 — a 1.3B multimodal model running via llama.cpp's
llama-mtmd-cli, with no cloud API required.

## What's new

**Tool (`vision`)**
- Accepts images (JPEG, PNG, GIF, WebP, BMP) and videos (MP4, MOV, AVI,
  MKV, WebM)
- Videos: ffprobe reads duration, ffmpeg extracts N evenly-spaced frames,
  all frames sent as a multi-image call to the model (configurable via
  `video_frames`, default 8)
- Security: O_NOFOLLOW open (symlink protection), danger.CheckOperation
  classification, all output wrapped in wrapUntrusted() with provenance tag
- Setup instructions in every error path (missing binary, missing model,
  missing mmproj, missing ffmpeg)

**Docker (`docker/Dockerfile`)**
- New `minicpm` multi-stage build: downloads pre-built llama-mtmd-cli
  (llama.cpp b9549) for amd64/arm64 from the official GitHub release, then
  fetches MiniCPM-V-4_6-Q4_K_M.gguf (529 MB) and mmproj-model-f16.gguf
  (1.1 GB) from HuggingFace into /usr/local/share/minicpm-v/models/
- Overridable via --build-arg MINICPM_QUANT=Q8_0 and LLAMA_VERSION
- Runtime stage copies binary + models; no new runtime deps (libstdc++6
  already present for whisper)

**Config (`internal/config/loader.go`)**
- New VisionConfig struct: ModelsDir, BinaryPath, VideoFrames
- Wired into FileConfig, ResolvedConfig, resolveVision(), mergeFile()
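A rough sketch of the resolver's default handling, under stated assumptions: the struct fields and JSON keys come from this PR, but `resolveVisionSketch`, the `fileConfig` wrapper, and the rule "any value of zero or less becomes 8" are illustrative guesses based on the documented default and the "zero-frames backfill" test name.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// VisionConfig uses the field names from the PR; the JSON tags match the
// config snippet in the PR description.
type VisionConfig struct {
	ModelsDir   string `json:"models_dir"`
	BinaryPath  string `json:"binary_path"`
	VideoFrames int    `json:"video_frames"`
}

// fileConfig is a hypothetical stand-in for the project's FileConfig.
type fileConfig struct {
	Vision VisionConfig `json:"vision"`
}

const defaultVideoFrames = 8 // documented default for video_frames

// resolveVisionSketch parses the config and backfills VideoFrames when it is
// missing or non-positive. Other fields are passed through unchanged.
func resolveVisionSketch(raw []byte) (VisionConfig, error) {
	var fc fileConfig
	if err := json.Unmarshal(raw, &fc); err != nil {
		return VisionConfig{}, err
	}
	if fc.Vision.VideoFrames <= 0 {
		fc.Vision.VideoFrames = defaultVideoFrames
	}
	return fc.Vision, nil
}

func main() {
	cfg, err := resolveVisionSketch([]byte(`{"vision": {"binary_path": "/usr/local/bin/llama-mtmd-cli"}}`))
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", cfg) // VideoFrames is backfilled to 8
}
```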

**Tests**
- 13 tests in cmd/odek/vision_tool_test.go: empty path, invalid JSON, file
  not found, symlink rejected, missing binary, missing model, missing mmproj,
  mock happy-path image (4 extensions), custom prompt, mock happy-path video
  (with mock ffprobe+ffmpeg via PATH override), missing ffmpeg fallback,
  schema shape
- 3 tests in internal/config/vision_test.go: resolveVision defaults,
  zero-frames backfill, custom values round-trip

**Docs**
- docs/CHEATSHEET.md: new Image & Video Understanding section with config
  snippet and field reference
- docs/SECURITY.md: vision added to untrusted-content table, always-untrusted
  list, and skills provenance gate paragraph
- docs/CONFIG.md + docs/TELEGRAM.md: smart-previews bullet updated
- docker/README.md: new Image & video understanding (out of the box) section
- README.md: vision added to external-content ingestion list

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@cloudflare-workers-and-pages

cloudflare-workers-and-pages Bot commented Jun 7, 2026

Deploying with Cloudflare Workers

| Status | Name | Latest Commit | Preview URL | Updated (UTC) |
| --- | --- | --- | --- | --- |
| ✅ Deployment successful! | odek | be67d47 | Commit Preview URL, Branch Preview URL | Jun 07 2026, 02:03 PM |

…on hash

Replace `resolve/main/` with `resolve/<sha>/` (78e02f0) so Docker builds
are reproducible — a future model update on the main branch won't silently
change the binary image.
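The rewrite described in this commit amounts to a single path substitution. The sketch below assumes a hypothetical helper `pinRevision` and uses a placeholder repository URL, since the real model URL is not shown here; the commit hash is the one quoted in the verification certificate for this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// pinRevision replaces the moving "main" revision in a Hugging Face resolve
// URL with a fixed commit SHA. URLs without "/resolve/main/" are returned
// unchanged.
func pinRevision(url, sha string) string {
	return strings.Replace(url, "/resolve/main/", "/resolve/"+sha+"/", 1)
}

func main() {
	// Placeholder URL for illustration only; not the repository used by the PR.
	url := "https://huggingface.co/example-org/example-model/resolve/main/model.gguf"
	fmt.Println(pinRevision(url, "78e02f066e9819a60573b78a4275df8a0c27f698"))
}
```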

vprotocol auto-repair: finding D001 (Axis 2.6 Dependency Integrity).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jkyberneees
Contributor Author

vprotocol v5.2.7 — Verification Certificate

PR: feat/minicpm-v-vision-tool · head be67d47 · 21 files · +839/-19 LOC
Generator: Claude Sonnet 4.6 (claude-sonnet-4-6) · Class: GeneratedCode + GeneratedTests
Run date: 2026-06-07


Nine-Axis Results

| Axis | Verdict | Notes |
| --- | --- | --- |
| 2.1 Semantic Correctness | | All 13 tests pass; error paths explicit |
| 2.2 Behavioral Contract | ⚠️ | No formal spec; the PR description is the contract |
| 2.3 Security Surface | | O_NOFOLLOW, danger.CheckOperation, wrapUntrusted, exec.Command (no shell injection) |
| 2.4 Structural Integrity | | Mirrors transcribe_tool.go; compile-time interface check |
| 2.5 Behavioral Exploration | ⚠️ | Extension-only video detection; no image size guard (acceptable v1 risk) |
| 2.6 Dependency Integrity | (repaired) | llama-mtmd-cli pinned to b9549; HuggingFace URLs pinned to commit 78e02f0 (was resolve/main/) |
| 2.7 Generator Provenance | ⚠️ | Code + tests: same model, same session → correlated blind spots possible |
| 2.8 Adversarial Surface | | User inputs flow to exec.Command args; output wrapped as untrusted |
| 2.9 Documentation Coverage | | CHEATSHEET, SECURITY, CONFIG, TELEGRAM, docker/README, README all updated |
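The "exec.Command (no shell injection)" note in the results above rests on a property of Go's `os/exec`: arguments are passed to the child process as an argument vector and are never interpreted by a shell. A minimal demonstration, assuming an `echo` binary is available on PATH; the helper `echoArg` is illustrative and unrelated to the tool's own code.

```go
package main

import (
	"fmt"
	"os/exec"
)

// echoArg passes arg to the echo binary as a single argv entry. No shell is
// involved, so metacharacters such as ';' are treated as plain text.
func echoArg(arg string) (string, error) {
	out, err := exec.Command("echo", arg).Output()
	return string(out), err
}

func main() {
	out, err := echoArg("photo.jpg; echo INJECTED")
	if err != nil {
		panic(err)
	}
	// The second "echo" is printed literally instead of being executed.
	fmt.Printf("%q\n", out)
}
```

This guards against shell metacharacters only; a user-supplied argument that begins with `-` could still be read as an option by the called program.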

η Score

| Signal | Weight | Value |
| --- | --- | --- |
| m (mutation kill rate) | 0.34 | 0.60 (mock-based tests; no mutation runner) |
| o (oracle agreement) | 0.24 | 0.20 (no independent spec) |
| b (branch coverage) | 0.14 | 0.75 (Docker auto-detect path untested) |
| f (fuzz survival) | 0.09 | 1.00 |
| s (SAST clean) | 0.04 | 1.00 (go vet: clean) |
| t (static depth) | 0.10 | 1.00 |
| d (doc coverage) | 0.05 | 1.00 |

η_raw = 0.637 · ρ_penalty = 0.255 (same-family code+tests) · η = 0.382
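The figures above are consistent with η_raw being the weighted sum of the seven signals and η being η_raw minus ρ_penalty. That reading is inferred from the numbers rather than from a vprotocol specification, so treat the following as a sanity check under that assumption; the names `signal`, `certificateSignals`, and `weightedSum` are illustrative.

```go
package main

import "fmt"

// signal is one row of the η score table: a weight and an observed value.
type signal struct {
	name   string
	weight float64
	value  float64
}

// certificateSignals returns the rows reported in the certificate.
func certificateSignals() []signal {
	return []signal{
		{"m", 0.34, 0.60},
		{"o", 0.24, 0.20},
		{"b", 0.14, 0.75},
		{"f", 0.09, 1.00},
		{"s", 0.04, 1.00},
		{"t", 0.10, 1.00},
		{"d", 0.05, 1.00},
	}
}

// weightedSum returns the sum of weight*value over all signals.
func weightedSum(signals []signal) float64 {
	total := 0.0
	for _, s := range signals {
		total += s.weight * s.value
	}
	return total
}

func main() {
	const rhoPenalty = 0.255 // same-family code+tests penalty from the certificate
	raw := weightedSum(certificateSignals())
	eta := raw - rhoPenalty // assumed combination rule
	fmt.Printf("eta_raw = %.3f, eta = %.3f\n", raw, eta)
	// prints: eta_raw = 0.637, eta = 0.382
}
```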


Verdict: HumanReviewRequired

η 0.382 < 0.80 threshold; ρ 0.255 exceeds 0.20 correlated-generator threshold.
ΔDebt = 0.53 h (Low). Ci_estimated: true.

Requires independent human review before merge.
Focus areas: Axis 2.7 (correlated tests), Axis 2.5 (extension-only video detection), Axis 2.2 (no formal spec).


Auto-Repairs Applied

D001 — Dependency Integrity (commit be67d47)
Pinned the HuggingFace model download URLs from resolve/main/ to resolve/78e02f066e9819a60573b78a4275df8a0c27f698/ in both docker/Dockerfile and the manual install instructions in vision_tool.go. The model files fetched at build time no longer change when the upstream repository is updated.


Generated by vprotocol v5.2.7 auto-repair mode

@jkyberneees jkyberneees merged commit 33ec4a8 into main Jun 7, 2026
7 checks passed