feat: PuLID-Flux identity-injection support #1595
Open
RapidMark wants to merge 1 commit into
Adds PuLID-Flux identity injection to the Flux denoise path: a pulid.hpp module, the id-embedding threaded through flux.hpp and stable-diffusion.cpp, CLI flags in examples/common, and scripts/pulid_extract_id.py to produce the embedding. The id-embedding is stored as a gguf container (a single fp16 tensor) and loaded through the same gguf_init_from_file path as the pulid_ca weights, so there's no bespoke binary header.
Adds PuLID-Flux identity injection to the Flux denoise loop (works on CUDA / Vulkan / HIP / Metal). Given a single source portrait, generated images preserve the source person's face across arbitrary scenes and prompts. Pure-ggml implementation (every op has a backend kernel), so it's cross-vendor by construction.
This PR has three stages, all folded in below: (1) the original feature (#1542), (2) the gguf id-embedding rework (this PR, supersedes #1542), and (3) a rebase onto current master after the #1615 src-layout refactor.
1. The feature (original submission, #1542)
Mirrors the reference ToTheBeginning/PuLID (encoders_transformer.py + flux/model.py) and the PuLID v0.9.1 hook schedule (every 2nd of the double blocks, every 4th of the single blocks).
What's included
- src/model/adapter/pulid.hpp: PuLIDPerceiverAttentionCA, the cross-attention module (Q from image tokens, K/V from the ID embedding). Pure-ggml graph; runs on CPU / CUDA / Vulkan / Metal without backend-specific code.
- src/model/diffusion/flux.hpp: adds the pulid_ca.<i> child blocks to Flux (constructed conditionally when PuLID weights are present), inserts the cross-attention between transformer blocks at the reference intervals (every 2nd double, every 4th single), and threads the identity embedding + weight through forward / forward_orig / compute / build_graph. skip_layers + PuLID is explicitly refused (would misalign the hook schedule).
- src/stable-diffusion.cpp: loads the pulid_ca.* weights via model_loader under the existing model.diffusion_model. prefix so they bind to the new blocks naturally, and loads the id-embedding, wrapping it as a sd::Tensor<float> passed via DiffusionParams.
- include/stable-diffusion.h: public API, namely sd_pulid_params_t (per-generation embedding path + weight), pulid_weights_path on sd_ctx_params_t, and pulid_params on sd_img_gen_params_t.
- examples/common/common.{cpp,h}: three CLI flags, --pulid-weights, --pulid-id-embedding, and --pulid-id-weight.
- src/model/diffusion/model.hpp: extends DiffusionParams to carry the embedding + weight; FluxModel::compute forwards both.
- docs/pulid.md: usage, embedding format, supported PuLID versions (v0.9.0 / v0.9.1; v1.1 deferred), memory-budget notes, and the three-way SHA-256 falsification recipe.
- scripts/pulid_extract_id.py: reference precompute tool that produces the id-embedding from a source portrait.
Why split extraction from injection
PuLID-Flux's identity extractor is a stack of three large PyTorch models (ArcFace + EVA-CLIP-L + IDFormer). Porting all three to C++/ggml would add thousands of lines for code that runs once per source person. By making sd.cpp consume a precomputed embedding, the C++ surface stays small (~600 lines), the heavy ML stack runs once on any PyTorch backend, and PuLID is decoupled from active development on insightface / EVA-CLIP / IDFormer.
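The hook schedule from the feature description (every 2nd double block, every 4th single block) can be illustrated with a short Python sketch. It assumes the reference rule of firing when the block index is divisible by the interval, and it assumes the usual Flux depths of 19 double and 38 single blocks. The function names are hypothetical and do not appear in this PR.

```python
def pulid_hook_indices(num_blocks, interval):
    """Block indices at which a PuLID cross-attention hook fires.

    Assumption: a hook fires when index % interval == 0, which is how
    the reference flux/model.py is read here.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    return [i for i in range(num_blocks) if i % interval == 0]


def pulid_schedule(depth_double=19, depth_single=38):
    """Map each hooked block to the pulid_ca.<i> module it consumes.

    Modules are consumed in order: double-block hooks first, then
    single-block hooks. The default depths are assumed Flux values.
    """
    schedule = {}
    ca_idx = 0
    for i in pulid_hook_indices(depth_double, 2):
        schedule[("double", i)] = ca_idx
        ca_idx += 1
    for i in pulid_hook_indices(depth_single, 4):
        schedule[("single", i)] = ca_idx
        ca_idx += 1
    return schedule


schedule = pulid_schedule()
print(len(schedule))             # number of pulid_ca modules the schedule needs
print(schedule[("double", 18)])  # module index used at the last double-block hook
print(schedule[("single", 0)])   # module index used at the first single-block hook
```

Because the modules are consumed sequentially in this sketch, removing a hooked block would shift every later block onto a different pulid_ca.<i>. That is one plausible reading of why combining skip_layers with PuLID is refused.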
Verification
The three-way SHA-256 falsification recipe in docs/pulid.md distinguishes "wired but inert" from "actively altering the trajectory":
- A: no --pulid-* flags
- B: --pulid-id-weight 0.0
- C: --pulid-id-weight 1.0
Verified on three backends from the same source:
- (… -DSD_VULKAN=ON): A == B byte-identical, A != C, C preserves identity.
- (… --backend "diffusion=vulkan1"): A == B, A != C, C visually equivalent to the AMD output at the same seed (different bytes per the usual cross-backend nondeterminism).
- (… -DSD_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86 build, CUDA 13.2): A == B byte-identical, A != C, C preserves identity.
The PerceiverAttentionCA pure-ggml graph runs unchanged across all three, with no backend-specific conditionals.
Measurements
Per-image sampling at 512×512 / 4 steps / Flux Schnell Q4 + PuLID:
- (… -DSD_CUDA=ON build)
batch_count=3 confirms long-lived-process amortization: per-image sampling drops from 19.6 s (cold) to ~11 s (warm) as the model stays resident across iterations. Tested with Flux Schnell Q4_K_S at 512²/4 and Flux Dev Q4_K_S at 768²/20. At 1024² the VAE decode needs a large single compute buffer that can exceed a consumer card's max-allocation limit, so use --vae-tiling (or route the VAE to a roomier backend); this is existing sd.cpp behavior, not PuLID-specific, but documented in docs/pulid.md since PuLID users hit it.
Not yet supported (in docs/pulid.md)
- (… id_adapter_attn_layers.*). Follow-up.
- --true-cfg negative branch: PuLID only injects on the positive conditioning path in the reference; this matches.
Backward compatibility
Non-PuLID generations are unaffected: PuLID is only constructed when the loader sees a pulid_ca.* tensor. A regression run without --pulid-* flags is byte-identical to pre-patch.
2. Update: gguf id-embedding (this PR, supersedes #1542)
In the original submission the id-embedding was a bespoke PULIDV01 32-byte-header binary with a hand-rolled parser. Per @Green-Sky's review on #1542, this PR reworks it into a standard GGUF container: a single fp16 tensor pulid_id of shape [2048, 32], loaded via gguf_init_from_file exactly like the pulid_ca.* weights. The custom header and parser are gone: one fewer on-disk format, and the embedding loads through the same path as every other tensor. scripts/pulid_extract_id.py writes the gguf; docs/pulid.md documents the gguf layout.
Opened fresh rather than force-pushing #1542 because that branch had drifted onto an old master and its history was tangled; a clean rebase was easier to review.
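For orientation on the GGUF container described in this section, here is a minimal Python sketch of what a reader sees at the start of such a file. It relies only on the published GGUF header layout (the magic bytes GGUF, then a little-endian uint32 version, a uint64 tensor count, and a uint64 metadata key-value count). It is not the PR's loader, which goes through gguf_init_from_file; the helper names and the synthetic header are illustrative assumptions.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_FMT = "<4sIQQ"  # magic, version, tensor_count, metadata_kv_count


def read_gguf_header(buf):
    """Parse the fixed-size GGUF header from the start of buf."""
    size = struct.calcsize(HEADER_FMT)
    if len(buf) < size:
        raise ValueError("buffer too short for a GGUF header")
    magic, version, n_tensors, n_kv = struct.unpack_from(HEADER_FMT, buf, 0)
    if magic != GGUF_MAGIC:
        raise ValueError("not a GGUF file: bad magic %r" % magic)
    return {"version": version, "tensor_count": n_tensors, "metadata_kv_count": n_kv}


def fp16_payload_bytes(shape):
    """Bytes needed for an fp16 tensor of the given shape (2 bytes per element)."""
    n = 1
    for dim in shape:
        n *= dim
    return n * 2


# Synthetic header for illustration only: version 3, one tensor, no metadata.
fake_header = struct.pack(HEADER_FMT, GGUF_MAGIC, 3, 1, 0)
header = read_gguf_header(fake_header)
payload = fp16_payload_bytes((2048, 32))
print(header, payload)
```

A file in the earlier bespoke format would fail the magic check in this sketch, which is the practical benefit of having a single on-disk format: one reader can reject anything that is not GGUF before looking at tensor data.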
3. Follow-up — rebased onto the src/ layout refactor (#1615)
Master landed the #1615 "reorganize src model layout" refactor while this PR was open. This rebase re-homes PuLID onto the new tree, no functional change:
- src/pulid.hpp → src/model/adapter/pulid.hpp (includes updated to core/ggml_extend.hpp + model/common/block.hpp).
- src/flux.hpp → src/model/diffusion/flux.hpp. FluxParams renamed FluxConfig, weight-based config now in FluxConfig::detect_from_weights; PuLID auto-detection (pulid_ca. → pulid_enabled) moved into it.
- DiffusionParams plumbing follows src/diffusion_model.hpp → src/model/diffusion/model.hpp; includes updated to the new src/{core,model}/... paths.
The feature, hook schedule, and CLI/API surface are unchanged from stages 1–2.
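The auto-detection rule in this section (a pulid_ca. tensor present means pulid_enabled) amounts to a prefix scan over tensor names. A hedged Python sketch follows; it is not the body of FluxConfig::detect_from_weights, and the function name, the example tensor names, and the way the model.diffusion_model. prefix is stripped are assumptions made for illustration.

```python
DIFFUSION_PREFIX = "model.diffusion_model."
PULID_PREFIX = "pulid_ca."


def detect_pulid(tensor_names):
    """Report whether PuLID weights are present and how many pulid_ca.<i> blocks exist."""
    indices = set()
    for name in tensor_names:
        # Strip the diffusion-model prefix if the loader has kept it.
        if name.startswith(DIFFUSION_PREFIX):
            name = name[len(DIFFUSION_PREFIX):]
        if not name.startswith(PULID_PREFIX):
            continue
        # The component after "pulid_ca." is expected to be the block index.
        head = name[len(PULID_PREFIX):].split(".", 1)[0]
        if head.isdigit():
            indices.add(int(head))
    return {"pulid_enabled": bool(indices), "num_ca_blocks": len(indices)}


# Made-up tensor names, used only to exercise the scan.
example_names = [
    "model.diffusion_model.double_blocks.0.img_attn.qkv.weight",
    "model.diffusion_model.pulid_ca.0.to_q.weight",
    "model.diffusion_model.pulid_ca.0.to_kv.weight",
    "model.diffusion_model.pulid_ca.1.to_q.weight",
]
detected = detect_pulid(example_names)
print(detected)
```

Deriving the flag from the weights rather than from a CLI switch keeps non-PuLID checkpoints on the unmodified path, which matches the backward-compatibility claim earlier in the description.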
Validation (rebased branch)
- (… sd-cli + sd-server link).
- … --diffusion-fa / --vae-tiling. Embedding loaded through gguf_init_from_file (PuLID id-embedding: loaded [2048, 32] type=f16), identity injected across all steps, source identity preserved in a scene unrelated to the reference photo.
Usage
The .pulidembd is a gguf with one fp16 tensor pulid_id [2048,32] (produce it with scripts/pulid_extract_id.py).
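To check a setup end to end, the three-way recipe from the Verification section can be scripted. This is a minimal sketch that reads the recipe as three runs at the same seed (A without any --pulid-* flags, B with --pulid-id-weight 0.0, C with --pulid-id-weight 1.0) and compares the SHA-256 of their outputs. The function names and verdict strings are hypothetical, the byte strings below stand in for real image files, and docs/pulid.md remains the authoritative recipe.

```python
import hashlib


def sha256_hex(data):
    """Hex SHA-256 digest of a bytes object (for files: open(path, 'rb').read())."""
    return hashlib.sha256(data).hexdigest()


def three_way_verdict(a, b, c):
    """Classify the outputs of runs A (baseline), B (weight 0.0), C (weight 1.0)."""
    ha, hb, hc = sha256_hex(a), sha256_hex(b), sha256_hex(c)
    if ha != hb:
        return "regression: weight 0.0 does not reproduce the baseline"
    if ha == hc:
        return "inert: weight 1.0 did not change the output"
    return "active: weight 0.0 matches the baseline and weight 1.0 alters the output"


# Stand-in byte strings; real use would pass the bytes of the three images.
baseline = b"pixels-seed-42"
zero_weight = b"pixels-seed-42"
full_weight = b"pixels-seed-42-with-identity"
verdict = three_way_verdict(baseline, zero_weight, full_weight)
print(verdict)
```

The hashes only establish that the injection changes the result. Whether run C actually preserves the source identity still needs a visual check, as in the validation notes.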