Merge dev changes by chrismaddalena · Pull Request #1 · GhostManager/python-docx-template

chrismaddalena · 2026-06-05T22:28:07Z

No description provided.

- Add try/except fallback with recover=True for malformed XML in fix_tables()
- Use OxmlElement with qn() instead of etree.SubElement for new grid columns
- Remove unused _cached_jinja_env variables
- Clean up redundant comments

performance: improve performance for large template files

…emplate

Delete the module-level logger and several logger.warning calls in docxtpl/template.py. These were added while debugging and should be removed.

Improve documentation in map_tree to explain the optimization: the code swaps the entire <w:body> via root.remove() + root.insert() to avoid O(n) per-child lxml operations, which is effectively O(1) on the document root. Clarify that the body's index is preserved so element order (body before sectPr) remains intact, and spell out the fallback behavior (child-by-child copy) if the body isn't a direct child or if remove/insert fails. Add additional safety and explanatory comments.
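
The swap-with-fallback idea described above can be sketched roughly as follows. This is a minimal illustration, not the code in map_tree: it uses the standard library's xml.etree.ElementTree as a stand-in for lxml, and the function name swap_body and the un-namespaced tag names are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def swap_body(root, new_body):
    """Replace root's <body> child with new_body, keeping its position."""
    old_body = root.find("body")  # a bare tag path only matches direct children
    if old_body is not None:
        index = list(root).index(old_body)  # remember the body's slot
        root.remove(old_body)
        root.insert(index, new_body)  # same slot, so sibling order is kept
        return "swapped"
    # Fallback: the body is nested deeper, so copy its children one by one.
    nested = root.find(".//body")
    if nested is None:
        raise ValueError("no body element found")
    for child in list(nested):
        nested.remove(child)
    for child in list(new_body):
        nested.append(child)
    return "copied"


root = ET.fromstring("<document><body><p>old</p></body><sectPr/></document>")
result = swap_body(root, ET.fromstring("<body><p>new</p></body>"))
```

Because the new body is inserted at the index the old one occupied, the sibling that followed it stays after it.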

Enhance header/footer processing by detecting Jinja tags split across Word XML runs: check both intact tags (_JINJA_PATTERN) and open-tag fragments (_RE_JINJA_OPEN) when scanning part XML. Use a generator to iterate part XML strings once, and keep the existing exception fallback to unconditionally render headers/footers if the fast-path check fails (e.g. malformed XML). Also add clarifying comments about properties and footnotes skipping behaviour and make minor comment style fixes.
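
A rough sketch of the two-pattern pre-scan, assuming simplified regexes. JINJA_INTACT and JINJA_OPEN below are hypothetical stand-ins for _JINJA_PATTERN and _RE_JINJA_OPEN, whose real definitions are not shown in this PR text.

```python
import re

# Hypothetical stand-ins for _JINJA_PATTERN and _RE_JINJA_OPEN.
# JINJA_INTACT matches a complete {{ ... }}, {% ... %} or {# ... #} tag.
JINJA_INTACT = re.compile(r"\{\{.*?\}\}|\{%.*?%\}|\{#.*?#\}", re.DOTALL)
# JINJA_OPEN matches "{" + XML tags + "{", "%" or "#", i.e. an opening
# delimiter that Word split across two runs.
JINJA_OPEN = re.compile(r"\{(?:<[^>]*>)+[{%#]")


def needs_render(xml_parts):
    """Return True if any part's XML may contain Jinja syntax.

    xml_parts can be a generator, so each XML string is produced once.
    """
    return any(
        JINJA_INTACT.search(xml) or JINJA_OPEN.search(xml)
        for xml in xml_parts
    )


plain = "<w:p><w:r><w:t>Page 1</w:t></w:r></w:p>"
split = "<w:r><w:t>{</w:t></w:r><w:r><w:t>{ name }}</w:t></w:r>"
```

Here `split` contains no intact tag, so only the open-fragment pattern catches it.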

Add a fast-path to DocxTemplate.resolve_listing that returns the input XML unchanged when no Listing special characters are present. The check looks for tab, newline, bell and form-feed ("\t", "\n", "\a", "\f") and avoids running the heavier resolution logic in the common case, improving performance without changing behavior.
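
The fast path amounts to a membership check before the expensive work. A minimal sketch, assuming the heavier logic is passed in as a callable (slow_path is a hypothetical parameter; the real method is an instance method on DocxTemplate):

```python
# The four characters named in the commit message.
_LISTING_CHARS = ("\t", "\n", "\a", "\f")


def resolve_listing(xml, slow_path):
    """Return xml unchanged unless it contains a Listing special character."""
    if not any(ch in xml for ch in _LISTING_CHARS):
        return xml  # common case: nothing to resolve
    return slow_path(xml)


calls = []


def demo_slow_path(xml):
    # Hypothetical stand-in for the real resolution logic.
    calls.append(xml)
    return xml.replace("\n", "<w:br/>")
```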

Introduce pre-compiled regex patterns (_RE_TAG_STRIP and _RE_COMMENT_STRIP) to strip surrounding <w:y> tags from template tags like {%y ...%}, {{y ...}} and comments {#y ...#}. Replace repeated re.sub loops with iteration over these patterns to avoid recompiling the same regexes on every call, reduce code duplication, and improve performance/maintainability.
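
To illustrate the precompilation idea, here is a simplified sketch. The regexes are an approximation written for this example and are not copied from docxtpl/template.py; only the names _RE_TAG_STRIP and _RE_COMMENT_STRIP come from the commit message.

```python
import re

_TAGS = ("tr", "tc", "p", "r")

# Simplified patterns (an assumption, not the library's exact regexes).
# "Y" is replaced by each tag name; group 1 is the opening delimiter and
# group 2 is the tag body up to and including the closing delimiter.
_TAG_TEMPLATE = r"<w:Y[ >](?:(?!<w:Y[ >]).)*(\{%|\{\{)Y ([^}%]*(?:%\}|\}\})).*?</w:Y>"
_COMMENT_TEMPLATE = r"<w:Y[ >](?:(?!<w:Y[ >]).)*(\{#)Y ([^}#]*#\}).*?</w:Y>"

# Compiled once at import time instead of on every call.
_RE_TAG_STRIP = [re.compile(_TAG_TEMPLATE.replace("Y", y), re.DOTALL) for y in _TAGS]
_RE_COMMENT_STRIP = [
    re.compile(_COMMENT_TEMPLATE.replace("Y", y), re.DOTALL) for y in _TAGS
]


def strip_surrounding_tags(xml):
    """Replace <w:y>...{%y ...%}...</w:y> with the bare {% ... %} tag."""
    for pattern in _RE_TAG_STRIP + _RE_COMMENT_STRIP:
        xml = pattern.sub(r"\1 \2", xml)
    return xml
```

For example, a paragraph containing `{%p if show %}` collapses to `{% if show %}`, while a paragraph with an ordinary `{{ name }}` is left alone.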

Clean up docxtpl/template.py by removing unused imports: functools, logging, and Template from jinja2. Keeps Environment and meta from jinja2 and does not change runtime behavior; this reduces linter warnings and unnecessary dependencies.

Update comment in docxtpl/template.py to clarify the fallback behavior when processing headers and footers. The comment now explains the fallback guards against unexpected part structure (e.g. blob is None or missing attributes) rather than implying it handles malformed XML; malformed XML would still fail in build_headers_footers_xml. No functional change.

performance: further performance optimizations for large documents

Avoid calling python-docx per-image by generating a CT_Inline-based XML template once and using str.format() to fill sentinels (keeping compatibility with installed python-docx). Add caching of generated image XML per (part, descriptor, width, height) to skip repeated I/O, SHA1 work and header parsing. Use package.get_or_add_image_part and relate_to with RT.IMAGE, compute scaled_dimensions, assign shape_id from docx_ids_index, and xml-escape filenames. Also add a _image_cache dict on DocxTemplate and adjust hyperlink handling to use the local part variable.
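
The sentinel approach can be sketched as below. The sentinel values, the sample XML and the helper name build_template are assumptions for illustration; in the PR the sample XML comes from python-docx rather than a hand-written string. The brace-escaping step is an extra precaution added here and is not mentioned in the commit message.

```python
# Hypothetical sentinel values: unlikely to appear elsewhere in the XML,
# and none is a substring of another.
_SHAPE_ID = 987654321
_CX = 123456789
_CY = 135792468


def build_template(sample_xml):
    """Turn XML rendered once with sentinel values into a str.format() template."""
    # Escape literal braces first so str.format() leaves them alone.
    xml = sample_xml.replace("{", "{{").replace("}", "}}")
    xml = xml.replace(str(_SHAPE_ID), "{shape_id}")
    xml = xml.replace(str(_CX), "{cx}")
    xml = xml.replace(str(_CY), "{cy}")
    return xml


sample = (
    '<wp:inline><wp:extent cx="123456789" cy="135792468"/>'
    '<wp:docPr id="987654321" name="Picture 987654321"/></wp:inline>'
)
template = build_template(sample)
rendered = template.format(shape_id=1001, cx=100, cy=200)
```

The template is built once; each later insertion costs only one str.format() call.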

Add an O(1) SHA1 index for image parts and a fast _get_or_add_image_part helper on DocxTemplate to avoid python-docx's O(n) linear scan and repeated SHA1 recomputation. Initialize the index in the constructor (_init_image_parts_index), seed it from existing image parts, and maintain a sequential partname counter to prevent partname collisions. Update InlineImage to call tpl._get_or_add_image_part (which returns (image_part, image)) instead of package.get_or_add_image_part, and use the returned Image object. This improves performance and reduces redundant SHA1 work when inserting/looking up images.

Replace the SHA1-based image-part index with a descriptor-keyed cache (_image_descriptor_index) to deduplicate images by file-path (O(1)) and avoid expensive SHA1 hashing. For string path descriptors the cache is used to return existing (image_part, image) tuples; non-string descriptors (e.g. file-like objects) fall back to always creating a new part. Keeps sequential partname assignment and appends new ImagePart to the package; caches the result for string descriptors. This improves performance when adding many images (e.g. large photos) by eliminating repeated SHA1 computation.
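
A minimal sketch of a descriptor-keyed index, assuming a hypothetical loader callable in place of the real image-part creation and a fixed .png extension for the generated partnames. The class name ImagePartIndex is also an assumption.

```python
import io


class ImagePartIndex:
    """Deduplicate image parts by file path; never cache file-like objects."""

    def __init__(self, loader, start=0):
        self._loader = loader  # hypothetical callable: descriptor -> image object
        self._by_path = {}     # path string -> (partname, image)
        self._counter = start  # sequential partname counter

    def get_or_add(self, descriptor):
        is_path = isinstance(descriptor, str)
        if is_path and descriptor in self._by_path:
            return self._by_path[descriptor]  # O(1) hit, no hashing of file contents
        self._counter += 1
        partname = "/word/media/image%d.png" % self._counter
        entry = (partname, self._loader(descriptor))
        if is_path:
            self._by_path[descriptor] = entry
        return entry


loads = []


def demo_loader(descriptor):
    loads.append(descriptor)
    return object()


index = ImagePartIndex(demo_loader, start=3)
first = index.get_or_add("logo.png")
second = index.get_or_add("logo.png")
stream = io.BytesIO(b"fake image bytes")
third = index.get_or_add(stream)
fourth = index.get_or_add(stream)
```

The trade-off is that two different paths pointing at identical bytes are stored twice, which the SHA1 index would have merged.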

Cache only the expensive image metadata (rId, dimensions, filename) per (part, descriptor, width, height) instead of the full inline XML. A fresh shape_id is now assigned for every insertion so drawing IDs remain unique (important for headers/footers/footnotes which aren't renumbered by fix_docpr_ids()). This preserves performance benefits (avoids repeated image part lookup, hashing and header parsing) while preventing duplicate drawing IDs; cx/cy are stored as ints and filename is xml-escaped when cached.
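
The split between cached metadata and per-insertion IDs might look like the sketch below. The class, the template string and _load_metadata are hypothetical stand-ins; the real code obtains rId and dimensions from python-docx.

```python
import itertools


class InlineImageRenderer:
    # Hypothetical, heavily shortened inline-image template.
    TEMPLATE = (
        '<wp:extent cx="{cx}" cy="{cy}"/>'
        '<wp:docPr id="{shape_id}" name="{filename}"/>'
        '<a:blip r:embed="{rId}"/>'
    )

    def __init__(self, first_id=1000):
        self._ids = itertools.count(first_id)  # fresh drawing ID per insertion
        self._cache = {}                       # key -> metadata dict
        self.lookups = 0                       # counts expensive lookups

    def _load_metadata(self, descriptor, width, height):
        # Stand-in for the expensive part: image lookup and header parsing.
        self.lookups += 1
        return {
            "rId": "rId%d" % self.lookups,
            "cx": int(width),
            "cy": int(height),
            "filename": descriptor,
        }

    def render(self, part_id, descriptor, width, height):
        key = (part_id, descriptor, width, height)
        meta = self._cache.get(key)
        if meta is None:
            meta = self._cache[key] = self._load_metadata(descriptor, width, height)
        # Only the shape_id changes between insertions of the same image.
        return self.TEMPLATE.format(shape_id=next(self._ids), **meta)


renderer = InlineImageRenderer()
xml_a = renderer.render(1, "logo.png", 100, 50)
xml_b = renderer.render(1, "logo.png", 100, 50)
```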

Use id() for non-hashable image descriptors (e.g. file-like objects) when building the image cache key to avoid TypeError on dict lookup. Also escape double quotes in image filenames for XML attribute usage by passing a mapping to xml_escape so quotes become &quot;. Cache semantics and per-insertion shape_id assignment are otherwise unchanged.

Avoid using len() of image parts to pick the next image partname index, which could collide when numbering is non-contiguous. Instead scan existing image partnames (using partname.baseURI when available, otherwise str(partname)), extract numeric suffixes with a regex (/image(\d+)\.), track the maximum index, and set the image part counter to that max. This ensures new image partnames won't reuse an already-present index.

Replace conditional use of partname.baseURI with a direct str(partname) conversion when iterating image parts. This makes the code rely on a consistent string representation for part names (used by the /imageN.ext regex) and avoids depending on the presence of a baseURI attribute across different part implementations.
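
A short sketch of the max-index scan, using the /image(\d+)\. regex quoted above and str(partname) as described. The function name is an assumption.

```python
import re

_RE_IMAGE_INDEX = re.compile(r"/image(\d+)\.")


def highest_image_index(partnames):
    """Return the largest N found in any /imageN.ext partname, or 0."""
    highest = 0
    for partname in partnames:
        match = _RE_IMAGE_INDEX.search(str(partname))
        if match:
            highest = max(highest, int(match.group(1)))
    return highest


existing = [
    "/word/media/image1.png",
    "/word/media/image7.jpeg",
    "/word/document.xml",
]
counter = highest_image_index(existing)
next_partname = "/word/media/image%d.png" % (counter + 1)
```

With these partnames a len()-based counter would pick index 3 next and, after a few more insertions, would reach the existing image7; starting from the maximum avoids that.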

Replace the hardcoded docx_ids_index initialization with a routine that scans all package parts (body, headers, footers, footnotes) for wp:docPr elements and sets the counter above the maximum found id (minimum 1000). This prevents id collisions when inserting new drawings into parts that were not renumbered by fix_docpr_ids. The new method is called during initialization and safely skips non-XML or unreadable parts.
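
The scan can be sketched with the standard library as follows. This is an illustration under assumptions: parts are represented as raw blobs, xml.etree.ElementTree replaces lxml, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace URI that the wp: prefix maps to in WordprocessingML drawings.
WP_NS = "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"


def initial_docpr_id(part_blobs, floor=1000):
    """Return a starting ID above every wp:docPr id found, and at least floor."""
    highest = 0
    for blob in part_blobs:
        if not isinstance(blob, (bytes, str)):
            continue  # e.g. a part whose blob is None
        try:
            root = ET.fromstring(blob)
        except ET.ParseError:
            continue  # non-XML part such as an image
        for element in root.iter("{%s}docPr" % WP_NS):
            try:
                highest = max(highest, int(element.get("id", "")))
            except ValueError:
                pass  # missing or non-numeric id attribute
    return max(floor, highest + 1)


body = '<w:document xmlns:w="urn:w" xmlns:wp="%s"><wp:docPr id="5"/></w:document>' % WP_NS
header = '<w:hdr xmlns:w="urn:w" xmlns:wp="%s"><wp:docPr id="2041"/></w:hdr>' % WP_NS
png = b"\x89PNG\r\n\x1a\n"
```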

Treat image.filename == None (e.g., BytesIO/file-like descriptors) as an empty string before calling xml_escape so XML attribute generation matches python-docx behavior. Add a clarifying comment and ensure the escaped filename is stored in the cache to avoid None-related issues when rendering.
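
Together with the earlier quote-escaping change, the filename handling can be sketched like this. xml.sax.saxutils.escape is the standard-library function; the wrapper name escape_filename is an assumption.

```python
from xml.sax.saxutils import escape as xml_escape


def escape_filename(filename):
    """Escape a filename for use inside a double-quoted XML attribute."""
    # None (e.g. a BytesIO descriptor with no name) becomes an empty string.
    # escape() handles &, < and > by default; the mapping adds double quotes.
    return xml_escape(filename or "", {'"': "&quot;"})


escaped = escape_filename('a"b<c>&.png')
attribute = 'name="%s"' % escaped
```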

Only build and use a cache key when the image_descriptor is hashable. Previously id() was used for non-hashable descriptors (e.g. file-like objects), which could risk aliasing after GC and lead to incorrect deduplication. Now the code attempts to construct a cache key with the descriptor and falls back to skipping caching for unhashable descriptors; cache entries are only read/written when a valid cache_key exists. Filename normalization and per-insertion shape_id behavior are unchanged.

performance: image processing optimizations

augmentcode · 2026-06-05T22:32:36Z

🤖 Augment PR Summary

Summary: This PR merges a set of performance-oriented changes for rendering docx templates and inserting inline images.

Changes:

- Inline images: derive a one-time inline-image XML template from python-docx and render it via str.format(), plus cache per-(part, descriptor, width, height) lookups.
- Add image-part indexing and global wp:docPr ID initialization to reduce collisions across body/headers/footers/footnotes.
- Optimize patch_xml() by precompiling regex patterns and reducing repeated re.compile()/re.sub() overhead.
- Skip header/footer rendering work when no Jinja syntax is detected.
- Speed up table fixing by preferring parse_xml() with a permissive fallback parser.
- Add early-exit in listing resolution when no special characters are present.
- Update Poetry metadata (package name/description/authors).

Technical Notes: Introduces module-level cached Jinja2 environments and descriptor-based image-part dedup to reduce repeated parsing and template compilation.

augmentcode

Review completed. 2 suggestions posted.

augmentcode · 2026-06-05T22:32:37Z

+        # For unhashable descriptors (file-like objects), skip caching
+        # entirely — using id() would risk aliasing after GC.
+        try:
+            cache_key = (id(part), image_descriptor, self.width, self.height)


cache_key caching is currently gated on hashability, but many file-like descriptors (e.g., io.BytesIO / file handles) are hashable, so they may still get cached despite the comment saying caching is skipped. If the same stream object is reused/mutated between insertions within a render, this can reuse a stale (rId, cx/cy, filename) tuple and insert the wrong image/dimensions.

Severity: medium
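
The point about hashability is easy to check, and one possible way to address it is to cache only string path descriptors. This is a sketch of that option, not necessarily the fix the PR adopts; both helper names are assumptions.

```python
import io


def is_hashable(obj):
    """Return True if obj can be used as a dict key."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


def make_cache_key(part, descriptor, width, height):
    """Return a cache key for path descriptors, or None to skip caching."""
    if not isinstance(descriptor, str):
        return None  # streams can be mutated or reused, so never cache them
    return (id(part), descriptor, width, height)


stream = io.BytesIO(b"fake image bytes")
part = object()
```

io.BytesIO objects hash by identity, so a try/except around the key construction does not exclude them; an explicit type check does.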

augmentcode · 2026-06-05T22:32:37Z

+    xml = inline.xml
+
+    # Replace sentinel values with format string placeholders
+    xml = xml.replace(str(_SHAPE_ID), "{shape_id}")


The XML template generation relies on plain .replace() for the numeric sentinels (_SHAPE_ID, _CX_INT, _CY_INT), so any accidental collision (e.g., the same 9-digit value appearing in another attribute, or two sentinels matching each other) could substitute {shape_id}/{cx}/{cy} in the wrong place and yield invalid image XML.

Severity: low
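
One way to guard against such collisions is to count occurrences before replacing. A minimal sketch, assuming a hypothetical helper replace_sentinel and that the expected number of occurrences of each sentinel is known in advance:

```python
def replace_sentinel(xml, sentinel, placeholder, expected_count=1):
    """Replace sentinel with placeholder, failing loudly on a surprise count."""
    text = str(sentinel)
    found = xml.count(text)
    if found != expected_count:
        raise ValueError(
            "sentinel %s found %d times, expected %d" % (text, found, expected_count)
        )
    return xml.replace(text, placeholder)


ok = replace_sentinel('<wp:extent cx="123456789"/>', 123456789, "{cx}")

try:
    replace_sentinel('<e cx="123456789" cy="123456789"/>', 123456789, "{cx}")
    collision_detected = False
except ValueError:
    collision_detected = True
```

Since the template is generated once, the check adds no per-image cost.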

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 177822b39c

chatgpt-codex-connector · 2026-06-05T22:34:00Z

+                    self._JINJA_PATTERN.search(xml)
+                    or self._RE_JINJA_OPEN.search(xml)


Preserve rendering for custom header delimiters

When callers pass a jinja_env with non-default delimiters (for example variable_start_string='[[') and place those tags in a header or footer, this hard-coded default-delimiter pre-scan returns false and render() never calls build_headers_footers_xml, leaving the header/footer placeholders unrendered. The body is still rendered with the supplied environment, so this is a regression from the previous behavior of always rendering header/footer parts with jinja_env.
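
One possible direction is to derive the pre-scan pattern from the environment's own delimiters. The sketch below uses a small dataclass as a stand-in for jinja2.Environment (the three attribute names match the real Environment attributes; the dataclass and function are hypothetical). It covers intact opening delimiters only, not delimiters split across runs.

```python
import re
from dataclasses import dataclass


@dataclass
class Delimiters:
    """Stand-in exposing the same attribute names as jinja2.Environment."""

    block_start_string: str = "{%"
    variable_start_string: str = "{{"
    comment_start_string: str = "{#"


def build_prescan_pattern(env):
    """Compile a regex matching any of the environment's opening delimiters."""
    starts = (
        env.block_start_string,
        env.variable_start_string,
        env.comment_start_string,
    )
    return re.compile("|".join(re.escape(start) for start in starts))


default_pattern = build_prescan_pattern(Delimiters())
custom_pattern = build_prescan_pattern(Delimiters(variable_start_string="[["))
header_xml = "<w:t>[[ company ]]</w:t>"
```

With the default delimiters the header above looks tag-free and would be skipped; with the custom pattern it is detected and rendered.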

Bonggoprasetyanto and others added 27 commits December 24, 2025 13:57

perf: parse_xml + body mutation optimization

5ea2340

Fix poetry configuration - add required fields

2dd1a29

fix: improve XML handling and cleanup code

ec0b7e1

Small comment clean-up

e455da7

Merge pull request elapouya#630 from start-software/develop

3727096

performance: improve performance for large template files

perf: optimize body replacement and header/footer processing in DocxT…

e0fb809

…emplate

Remove logging warnings in template.py

c82d2a4

Remove unused imports from template.py

c042ae2

Merge pull request #1 from start-software/performance-optimizations

a0564e1

performance: further performance optimizations for large documents

Merge pull request elapouya#637 from start-software/develop

10079b1

performance: further performance optimizations for large documents

Merge pull request elapouya#2 from start-software/image-optimizations

47ca344

performance: image processing optimizations

Merge pull request elapouya#638 from start-software/develop

177822b

performance: image processing optimizations

augmentcode Bot reviewed Jun 5, 2026

chatgpt-codex-connector Bot reviewed Jun 5, 2026

Merge dev changes #1
chrismaddalena wants to merge 27 commits into master from dev

4 participants
