Merge dev changes #1
- Add try/except fallback with recover=True for malformed XML in fix_tables()
- Use OxmlElement with qn() instead of etree.SubElement for new grid columns
- Remove unused _cached_jinja_env variables
- Clean up redundant comments
performance: improve performance for large template files
Delete the module-level logger and several logger.warning calls in docxtpl/template.py. They were added while debugging and are no longer needed.
Improve documentation in map_tree to explain the optimization: the code swaps the entire <w:body> via root.remove() + root.insert() to avoid per-child lxml operations that add up to O(n); the swap itself is effectively O(1) on the document root. Clarify that the body's index is preserved so element order (body before sectPr) remains intact, and spell out the fallback behavior (child-by-child copy) if the body isn't a direct child or if remove/insert fails. Add additional safety and explanatory comments.
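A minimal sketch of the swap described above, for illustration only. It uses the standard-library xml.etree.ElementTree rather than lxml, and the helper name swap_body and the un-namespaced tag names are assumptions, not the code in map_tree.

```python
import xml.etree.ElementTree as ET


def swap_body(root, new_body, body_tag="body"):
    """Replace the body child of root with new_body at the same index.

    Returns False when the body is not a direct child, which is where
    a caller would fall back to a child-by-child copy.
    """
    for index, child in enumerate(list(root)):
        if child.tag == body_tag:
            root.remove(child)
            # Re-inserting at the saved index keeps the body ahead of
            # any sibling that followed it (sectPr in this toy tree).
            root.insert(index, new_body)
            return True
    return False


root = ET.fromstring("<document><body><p>old</p></body><sectPr/></document>")
new_body = ET.fromstring("<body><p>new</p></body>")
swapped = swap_body(root, new_body)
order = [child.tag for child in root]
```

The point of saving the index before calling remove() is that element order survives the swap without touching any of the body's children.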
Enhance header/footer processing by detecting Jinja tags split across Word XML runs: check both intact tags (_JINJA_PATTERN) and open-tag fragments (_RE_JINJA_OPEN) when scanning part XML. Use a generator to iterate part XML strings once, and keep the existing exception fallback to unconditionally render headers/footers if the fast-path check fails (e.g. malformed XML). Also add clarifying comments about properties and footnotes skipping behaviour and make minor comment style fixes.
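To make the split-tag case concrete, here is a hedged sketch of such a pre-scan. The two regexes are hypothetical stand-ins for _JINJA_PATTERN and _RE_JINJA_OPEN (the real patterns in the PR may differ), and needs_rendering is an illustrative name.

```python
import re

# Hypothetical stand-in: any intact Jinja opening delimiter.
JINJA_PATTERN = re.compile(r"\{\{|\{%|\{#")
# Hypothetical stand-in: a lone "{" followed only by XML tags and then
# "{", "%" or "#", i.e. an opening delimiter split across Word runs.
RE_JINJA_OPEN = re.compile(r"\{(?:<[^>]*>)+[{%#]")


def needs_rendering(xml_parts):
    """Return True if any part may contain a Jinja tag.

    xml_parts can be a generator, so each part's XML is produced and
    scanned at most once, and scanning stops at the first hit.
    """
    return any(
        JINJA_PATTERN.search(xml) or RE_JINJA_OPEN.search(xml)
        for xml in xml_parts
    )


split_tag = "<w:r><w:t>{</w:t></w:r><w:r><w:t>{ name }}</w:t></w:r>"
plain = "<w:r><w:t>Page 1</w:t></w:r>"
```

In split_tag the two braces never sit next to each other, so only the open-fragment pattern catches it.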
Add a fast-path to DocxTemplate.resolve_listing that returns the input XML unchanged when no Listing special characters are present. The check looks for tab, newline, bell and form-feed ("\t", "\n", "\a", "\f") and avoids running the heavier resolution logic in the common case, improving performance without changing behavior.
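The fast path amounts to a membership test before the expensive work. The sketch below assumes a hypothetical slow_path callable standing in for the existing resolution logic; only the four characters come from the commit message.

```python
LISTING_CHARS = ("\t", "\n", "\a", "\f")


def resolve_listing_fast(xml, slow_path):
    """Return xml unchanged unless it contains a Listing special character."""
    if not any(ch in xml for ch in LISTING_CHARS):
        return xml
    return slow_path(xml)


calls = []


def fake_slow_path(xml):
    # Hypothetical stand-in for the heavier resolution logic.
    calls.append(xml)
    return xml.replace("\n", "<w:br/>")


unchanged = resolve_listing_fast("<w:t>plain</w:t>", fake_slow_path)
resolved = resolve_listing_fast("a\nb", fake_slow_path)
```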
Introduce pre-compiled regex patterns (_RE_TAG_STRIP and _RE_COMMENT_STRIP) to strip surrounding <w:y> tags from template tags like {%y ...%}, {{y ...}} and comments {#y ...#}. Replace repeated re.sub loops with iteration over these patterns to avoid recompiling the same regexes on every call, reduce code duplication, and improve performance/maintainability.
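As a rough illustration of the compile-once idea, the sketch below builds one pattern per Word element at import time and reuses them on every call. The regex is a simplified assumption that handles only {%y ...%} block tags; it is not the actual _RE_TAG_STRIP, and the names here are hypothetical.

```python
import re

_WORD_TAGS = ("tr", "tc", "p", "r")


def _compile_strip(y):
    # Matches <w:y ...> ... {%y expr %} ... </w:y> without crossing
    # into another <w:y> element, capturing the tag expression.
    open_tag = rf"<w:{y}[ >]"
    return re.compile(
        open_tag
        + rf"(?:(?!{open_tag}).)*?"
        + re.escape("{%" + y + " ")
        + r"(.*?)"
        + re.escape("%}")
        + rf".*?</w:{y}>",
        re.DOTALL,
    )


# Compiled once at import time instead of on every call.
_RE_BLOCK_STRIP = tuple(_compile_strip(y) for y in _WORD_TAGS)


def strip_block_tags(xml):
    """Replace each <w:y> element wrapping {%y ...%} with a plain {% ... %}."""
    for pattern in _RE_BLOCK_STRIP:
        xml = pattern.sub(r"{% \1%}", xml)
    return xml


stripped = strip_block_tags("<w:p><w:r><w:t>{%p if x %}</w:t></w:r></w:p>")
```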
Clean up docxtpl/template.py by removing unused imports: functools, logging, and Template from jinja2. Keeps Environment and meta from jinja2 and does not change runtime behavior; this reduces linter warnings and unnecessary dependencies.
Update comment in docxtpl/template.py to clarify the fallback behavior when processing headers and footers. The comment now explains the fallback guards against unexpected part structure (e.g. blob is None or missing attributes) rather than implying it handles malformed XML; malformed XML would still fail in build_headers_footers_xml. No functional change.
performance: further performance optimizations for large documents
Avoid calling python-docx per-image by generating a CT_Inline-based XML template once and using str.format() to fill sentinels (keeping compatibility with installed python-docx). Add caching of generated image XML per (part, descriptor, width, height) to skip repeated I/O, SHA1 work and header parsing. Use package.get_or_add_image_part and relate_to with RT.IMAGE, compute scaled_dimensions, assign shape_id from docx_ids_index, and xml-escape filenames. Also add a _image_cache dict on DocxTemplate and adjust hyperlink handling to use the local part variable.
Add an O(1) SHA1 index for image parts and a fast _get_or_add_image_part helper on DocxTemplate to avoid python-docx's O(n) linear scan and repeated SHA1 recomputation. Initialize the index in the constructor (_init_image_parts_index), seed it from existing image parts, and maintain a sequential partname counter to prevent partname collisions. Update InlineImage to call tpl._get_or_add_image_part (which returns (image_part, image)) instead of package.get_or_add_image_part, and use the returned Image object. This improves performance and reduces redundant SHA1 work when inserting/looking up images.
Replace the SHA1-based image-part index with a descriptor-keyed cache (_image_descriptor_index) to deduplicate images by file-path (O(1)) and avoid expensive SHA1 hashing. For string path descriptors the cache is used to return existing (image_part, image) tuples; non-string descriptors (e.g. file-like objects) fall back to always creating a new part. Keeps sequential partname assignment and appends new ImagePart to the package; caches the result for string descriptors. This improves performance when adding many images (e.g. large photos) by eliminating repeated SHA1 computation.
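The descriptor-keyed cache can be sketched as below. ImageIndex, its loader callback and the partname format are hypothetical; the sketch shows only the caching rule (string paths are cached, everything else always creates a new part) and the sequential counter.

```python
import io


class ImageIndex:
    """Toy descriptor-keyed index; not the DocxTemplate implementation."""

    def __init__(self, start=0):
        self._by_descriptor = {}
        self._counter = start

    def get_or_add(self, descriptor, loader):
        cacheable = isinstance(descriptor, str)
        if cacheable and descriptor in self._by_descriptor:
            return self._by_descriptor[descriptor]
        # Sequential numbering avoids reusing a partname.
        self._counter += 1
        entry = (f"/word/media/image{self._counter}.png", loader(descriptor))
        if cacheable:
            self._by_descriptor[descriptor] = entry
        return entry


loads = []


def fake_loader(descriptor):
    # Hypothetical stand-in for reading and parsing the image.
    loads.append(descriptor)
    return b"image-bytes"


index = ImageIndex()
first = index.get_or_add("logo.png", fake_loader)
second = index.get_or_add("logo.png", fake_loader)
stream = io.BytesIO(b"raw")
third = index.get_or_add(stream, fake_loader)
fourth = index.get_or_add(stream, fake_loader)
```

Keying on the path string trades content-based deduplication for speed: two different paths to identical bytes would produce two parts in this sketch.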
Cache only the expensive image metadata (rId, dimensions, filename) per (part, descriptor, width, height) instead of the full inline XML. A fresh shape_id is now assigned for every insertion so drawing IDs remain unique (important for headers/footers/footnotes which aren't renumbered by fix_docpr_ids()). This preserves performance benefits (avoids repeated image part lookup, hashing and header parsing) while preventing duplicate drawing IDs; cx/cy are stored as ints and filename is xml-escaped when cached.
Use id() for non-hashable image descriptors (e.g. file-like objects) when building the image cache key to avoid TypeError on dict lookup. Also escape double quotes in image filenames for XML attribute usage by passing a mapping to xml_escape so quotes become &quot;. Cache semantics and per-insertion shape_id assignment are otherwise unchanged.
Avoid using len() of image parts to pick the next image partname index, which could collide when numbering is non-contiguous. Instead scan existing image partnames (using partname.baseURI when available, otherwise str(partname)), extract numeric suffixes with a regex (/image(\d+)\.), track the maximum index, and set the image part counter to that max. This ensures new image partnames won't reuse an already-present index.
Replace conditional use of partname.baseURI with a direct str(partname) conversion when iterating image parts. This makes the code rely on a consistent string representation for part names (used by the /imageN.ext regex) and avoids depending on the presence of a baseURI attribute across different part implementations.
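A small sketch of the partname scan from the two commits above. The regex mirrors the /image(\d+)\. pattern quoted in the message; the function name and sample partnames are illustrative assumptions.

```python
import re

_RE_IMAGE_INDEX = re.compile(r"/image(\d+)\.")


def max_image_index(partnames):
    """Return the highest N found in any /imageN.ext partname, or 0."""
    highest = 0
    for partname in partnames:
        match = _RE_IMAGE_INDEX.search(str(partname))
        if match:
            highest = max(highest, int(match.group(1)))
    return highest


# image1 was deleted, so numbering is non-contiguous.
existing = ["/word/media/image2.png", "/word/media/image3.jpeg", "/word/document.xml"]
counter = max_image_index(existing)
next_index = counter + 1
```

With two image parts present, a len()-based scheme could pick index 3 and collide with image3.jpeg, whereas the max-based counter moves on to 4.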
Replace the hardcoded docx_ids_index initialization with a routine that scans all package parts (body, headers, footers, footnotes) for wp:docPr elements and sets the counter above the maximum found id (minimum 1000). This prevents id collisions when inserting new drawings into parts that were not renumbered by fix_docpr_ids. The new method is called during initialization and safely skips non-XML or unreadable parts.
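One way to picture the scan is the sketch below, which searches raw part blobs with a regex. Whether the PR parses the XML or uses a regex is not stated, so the regex, the function name and the sample blobs are all assumptions.

```python
import re

# Assumed pattern: the id attribute of a wp:docPr element in raw bytes.
_RE_DOCPR_ID = re.compile(rb'<wp:docPr\b[^>]*?\bid="(\d+)"')


def initial_docpr_counter(blobs, minimum=1000):
    """Return a starting id above every wp:docPr id found, at least minimum."""
    highest = 0
    for blob in blobs:
        if not isinstance(blob, bytes):
            # Skip parts without a readable blob.
            continue
        for match in _RE_DOCPR_ID.finditer(blob):
            highest = max(highest, int(match.group(1)))
    return max(minimum, highest + 1)


body = b'<wp:docPr id="5" name="Picture 1"/>'
header = b'<wp:docPr name="Logo" id="2041"/>'
start = initial_docpr_counter([body, None, header])
small = initial_docpr_counter([body])
```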
Treat image.filename == None (e.g., BytesIO/file-like descriptors) as an empty string before calling xml_escape so XML attribute generation matches python-docx behavior. Added a clarifying comment and ensure the escaped filename is stored in the cache to avoid None-related issues when rendering.
Only build and use a cache key when the image_descriptor is hashable. Previously id() was used for non-hashable descriptors (e.g. file-like objects), which could risk aliasing after GC and lead to incorrect deduplication. Now the code attempts to construct a cache key with the descriptor and falls back to skipping caching for unhashable descriptors; cache entries are only read/written when a valid cache_key exists. Filename normalization and per-insertion shape_id behavior are unchanged.
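The hashability gate can be sketched as follows. make_cache_key is a hypothetical helper; the key layout follows the (part, descriptor, width, height) tuple described in the earlier commits, with id(part) for the part as in the reviewed diff.

```python
def make_cache_key(part, descriptor, width, height):
    """Return a usable cache key, or None when the descriptor is unhashable."""
    key = (id(part), descriptor, width, height)
    try:
        hash(key)
    except TypeError:
        # Unhashable descriptor: the caller skips both cache read and write.
        return None
    return key


part = object()
path_key = make_cache_key(part, "photo.jpg", 100, None)
list_key = make_cache_key(part, ["not", "hashable"], 100, None)
```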
performance: image processing optimizations
🤖 Augment PR Summary
Summary: This PR merges a set of performance-oriented changes for rendering docx templates and inserting inline images.
Technical Notes: Introduces module-level cached Jinja2 environments and descriptor-based image-part dedup to reduce repeated parsing and template compilation.
# For unhashable descriptors (file-like objects), skip caching
# entirely - using id() would risk aliasing after GC.
try:
    cache_key = (id(part), image_descriptor, self.width, self.height)
Caching is currently gated only on whether cache_key is hashable, but many file-like descriptors (e.g., io.BytesIO / file handles) are hashable, so they can still be cached even though the comment says caching is skipped. If the same stream object is reused or mutated between insertions within a render, the cache can return a stale (rId, cx/cy, filename) tuple and the wrong image or dimensions get inserted.
Severity: medium
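The concern is easy to reproduce in isolation: io.BytesIO uses the default identity-based hash, so a hashability check does not filter it out. A possible stricter gate, shown here as an assumption rather than the fix adopted in the PR, is to cache only path-like descriptors.

```python
import io
import os

stream = io.BytesIO(b"fake image bytes")

try:
    hash(stream)
    stream_is_hashable = True
except TypeError:
    stream_is_hashable = False


def is_cacheable(descriptor):
    """Hypothetical gate: cache only str or os.PathLike descriptors."""
    return isinstance(descriptor, (str, os.PathLike))
```

Because the stream's hash does not depend on its contents, rewriting the buffer between insertions would leave the cache key unchanged.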
xml = inline.xml

# Replace sentinel values with format string placeholders
xml = xml.replace(str(_SHAPE_ID), "{shape_id}")
The XML template generation relies on plain .replace() for the numeric sentinels (_SHAPE_ID, _CX_INT, _CY_INT), so any accidental collision (e.g., the same 9-digit value appearing in another attribute, or two sentinels matching each other) could substitute {shape_id}/{cx}/{cy} in the wrong place and yield invalid image XML.
Severity: low
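One defensive option, sketched under assumptions, is to verify how often each sentinel occurs before substituting. The sentinel values, expected counts and function name below are hypothetical; the real template may contain each sentinel a different number of times.

```python
def substitute_sentinels(xml, sentinels):
    """Swap numeric sentinels for str.format() placeholders.

    sentinels maps sentinel text to (placeholder name, expected count).
    All counts are checked before anything is replaced.
    """
    for token, (name, expected) in sentinels.items():
        found = xml.count(token)
        if found != expected:
            raise ValueError(
                f"sentinel for {name!r} found {found} times, expected {expected}"
            )
    for token, (name, _expected) in sentinels.items():
        xml = xml.replace(token, "{" + name + "}")
    return xml


SENTINELS = {"111111111": ("shape_id", 1), "222222222": ("cx", 2)}
source = (
    '<wp:extent cx="222222222"/><wp:docPr id="111111111"/>'
    '<a:ext cx="222222222"/>'
)
template = substitute_sentinels(source, SENTINELS)
filled = template.format(shape_id=1001, cx=914400)
```

Failing loudly at template-build time is cheaper than discovering malformed image XML after rendering.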
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 177822b39c
self._JINJA_PATTERN.search(xml)
or self._RE_JINJA_OPEN.search(xml)
Preserve rendering for custom header delimiters
When callers pass a jinja_env with non-default delimiters (for example variable_start_string='[[') and place those tags in a header or footer, this hard-coded default-delimiter pre-scan returns false and render() never calls build_headers_footers_xml, leaving the header/footer placeholders unrendered. The body is still rendered with the supplied environment, so this is a regression from the previous behavior of always rendering header/footer parts with jinja_env.
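A possible direction, sketched here as an assumption rather than the PR's fix, is to derive the pre-scan pattern from the environment's delimiters. A jinja2.Environment exposes block_start_string, variable_start_string and comment_start_string; the sketch takes them as plain strings so it stays self-contained, and build_tag_prescan is a hypothetical name.

```python
import re


def build_tag_prescan(block_start="{%", variable_start="{{", comment_start="{#"):
    """Compile a pre-scan regex matching any of the given opening delimiters."""
    starts = (block_start, variable_start, comment_start)
    return re.compile("|".join(re.escape(start) for start in starts))


default_scan = build_tag_prescan()
custom_scan = build_tag_prescan(variable_start="[[")
header_xml = "<w:t>[[ title ]]</w:t>"
```

This covers only intact delimiters; tags split across Word runs would still need a fragment check built from the same delimiters.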
No description provided.