
fix(ingestion): exclude venv/node_modules/cache dirs from analyze + all retrieval APIs#255

Merged
theagenticguy merged 1 commit into main from fix/exclude-venv-vendor-dirs
Jun 26, 2026


@theagenticguy

Owner

Problem

codehub analyze could ingest virtualenv, vendored-dependency, and tool-cache content into the index. The single source of truth for unconditional directory exclusion — HARDCODED_IGNORES in packages/ingestion/src/pipeline/gitignore.ts — listed .venv but not the bare venv, nor common Python/JS/build/cache directories (.tox, .mypy_cache, .pytest_cache, .ruff_cache, bower_components, .pnpm-store, .yarn, .gradle, .parcel-cache, .cache, .idea, .vscode).

Because every retrieval surface is store-backed (query, context, impact, sql, pack all read store.sqlite, which is built at scan time), whatever scan ingested leaked into all of them. The fix belongs at scan time, at the one list — no per-tool guards needed.

What changed

  • HARDCODED_IGNORES extended. The scan walker matches each path segment's exact name (hardcoded.has(name) in phases/scan.ts), so a bare name excludes that directory at any depth, not just the repo root. Added: venv, bower_components, .pnpm-store, .yarn, .tox, .mypy_cache, .pytest_cache, .ruff_cache, .gradle, .parcel-cache, .cache, .idea, .vscode. The list is now grouped and commented by category.
  • Deliberately NOT hardcoded: vendor, env, out, bin, obj. These commonly hold first-party source — this repo itself keeps source under packages/ingestion/src/pipeline/phases/vendor/graphty-leiden.ts. A hardcoded ignore can't be re-included via .gitignore !-negation, so hardcoding vendor would make OpenCodeHub stop indexing its own code. These are left to the repo's own .gitignore, where real vendored deps are already excluded.
  • .gitignore honoring on analyze was already correct (loadGitignoreChain + layered shouldIgnore wired into the scan walk). This PR adds end-to-end regression coverage for it.
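
The exact-segment matching and its precedence over `.gitignore` can be illustrated with a minimal sketch. This is not the code in `phases/scan.ts`: `isExcluded`, `GitignoreCheck`, and `neverIgnored` are hypothetical names, and the set below is only a subset of the names this PR mentions.

```typescript
// Illustrative subset of the hardcoded names; the real list lives in
// packages/ingestion/src/pipeline/gitignore.ts.
const HARDCODED_IGNORES: ReadonlySet<string> = new Set([
  ".venv",
  "venv",
  ".tox",
  ".mypy_cache",
  ".cache",
]);

// Hypothetical stand-in for the layered .gitignore check:
// returns true when the repo's own rules ignore the path.
type GitignoreCheck = (relPath: string) => boolean;

function isExcluded(relPath: string, gitignored: GitignoreCheck): boolean {
  const segments = relPath.split("/").filter((s) => s.length > 0);
  // Exact-name match on every segment, so depth does not matter.
  // This runs before the .gitignore layer, which is why a `!` negation
  // there can never re-include a hardcoded name.
  if (segments.some((name) => HARDCODED_IGNORES.has(name))) return true;
  return gitignored(relPath);
}

// A .gitignore layer that ignores nothing (as if everything were negated).
const neverIgnored: GitignoreCheck = () => false;

console.log(isExcluded("services/api/venv/lib/site.py", neverIgnored)); // true
console.log(isExcluded("src/venv_utils.ts", neverIgnored)); // false
console.log(
  isExcluded("packages/ingestion/src/pipeline/phases/vendor/graphty-leiden.ts", neverIgnored),
); // false
```

The last call shows why `vendor` stays out of the set: with it hardcoded, that first-party file would be dropped regardless of what `.gitignore` says.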

Why retrieval needs no separate change

Audited all 28 MCP tools. Every path that returns file paths/content is safe-by-construction once scan exclusion is correct:

  • Store-backed tools (query, context, impact, sql, detect_changes, pack default engine) only emit nodes the scan kept.
  • query snippet reads touch the filesystem only for nodes already in the store.
  • Group tools (group_contracts, group_cross_repo_links) and list_findings_delta read persisted .codehub registry/SARIF JSON, not arbitrary repo files.
  • The MCP scan tool spawns external scanners that receive HARDCODED_IGNORES via the wrappers (same source of truth).
  • The only FS-walking retrieval path is the opt-in legacy repomix pack engine, which relies on repomix's own node_modules/.gitignore defaults (out of scope; slated for removal in M7).

Tests

  • scan.test.ts: asserts every HARDCODED_IGNORES dir is skipped at root and nested; that venv/.venv/node_modules never appear in scan output; and that a user-.gitignore'd directory is excluded end-to-end through the scan phase.
  • gitignore.test.ts: unit guard pinning the required names, asserting vendor is absent, and that entries are bare segments (no globs/slashes/dupes).
  • Full suite green: ingestion 633, scanners 94, CLI 343 — 0 failures. Whole-repo biome lint clean (686 files). banned-strings PASS.
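
As a rough sketch of what the "bare segments" guard in `gitignore.test.ts` checks, the validator below flags entries that contain slashes, glob characters, or duplicates. `findInvalidEntries` is a hypothetical helper written for illustration; the actual test may be structured differently.

```typescript
// Hypothetical validator: returns a description of each entry that is not a
// bare directory name (path separator, glob character, empty, or duplicate).
function findInvalidEntries(entries: readonly string[]): string[] {
  const problems: string[] = [];
  const seen = new Set<string>();
  for (const entry of entries) {
    if (entry.length === 0) {
      problems.push("(empty): empty entry");
    }
    if (entry.includes("/") || entry.includes("\\")) {
      problems.push(`${entry}: contains a path separator`);
    }
    if (/[*?\[\]!]/.test(entry)) {
      problems.push(`${entry}: contains a glob character`);
    }
    if (seen.has(entry)) {
      problems.push(`${entry}: duplicate`);
    }
    seen.add(entry);
  }
  return problems;
}

// Names taken from this PR; a clean list yields no problems.
console.log(findInvalidEntries(["venv", ".tox", ".pnpm-store"])); // []
// A glob pattern with a slash is reported twice, once per rule it breaks.
console.log(findInvalidEntries(["**/cache"]).length); // 2
```

Because the scan walker compares whole segment names, an entry such as `**/cache` would never match anything, so a guard of this kind keeps silent no-ops out of the list.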

Docs

  • README "Design choices worth knowing" gains a "First-party source only" row documenting the exclusion contract and the ambiguous names (vendor, env, out, bin, obj) that are deliberately left out of the hardcoded list.

🤖 Generated with Claude Code

…ll retrieval

`HARDCODED_IGNORES` listed `.venv` but not the bare `venv`, nor common
Python/JS/build/cache directories, so `codehub analyze` ingested
virtualenv, vendored-dependency, and tool-cache content into the index.
Every retrieval surface (`query`, `context`, `impact`, `sql`, `pack`) is
store-backed, so that content leaked into all of them.

Extend the single source of truth — the scan walker matches each path
segment's exact name (`hardcoded.has(name)`), so a bare name excludes the
directory at any depth. Added: `venv`, `bower_components`, `.pnpm-store`,
`.yarn`, `.tox`, `.mypy_cache`, `.pytest_cache`, `.ruff_cache`, `.gradle`,
`.parcel-cache`, `.cache`, `.idea`, `.vscode`.

Deliberately NOT hardcoded: `vendor`, `env`, `out`, `bin`, `obj` — these
commonly hold first-party source (this repo keeps source under
`packages/ingestion/src/pipeline/phases/vendor/`). A hardcoded ignore
can't be re-included via `.gitignore !`-negation, so these are left to the
repo's own `.gitignore`. `.gitignore` honoring on analyze was already
correct (`loadGitignoreChain` + layered `shouldIgnore` in the scan walk);
this change adds end-to-end regression coverage for it.

Tests: scan-level fixtures assert every `HARDCODED_IGNORES` dir is skipped
at root and nested, that venv/node_modules never appear in output, and
that a user-`.gitignore`'d dir is excluded end-to-end; a unit guard pins
the required names and asserts `vendor` is absent. Docs: README "Design
choices" row documents the exclusion contract.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@theagenticguy theagenticguy merged commit 881d925 into main Jun 26, 2026
38 checks passed
@theagenticguy theagenticguy deleted the fix/exclude-venv-vendor-dirs branch June 26, 2026 15:17
@github-actions github-actions Bot mentioned this pull request Jun 26, 2026
theagenticguy pushed a commit that referenced this pull request Jun 26, 2026
🤖 Automated release via release-please
---


<details><summary>root: 0.10.1</summary>

## [0.10.1](root-v0.10.0...root-v0.10.1) (2026-06-26)


### Bug Fixes

* **ingestion:** exclude venv/node_modules/cache dirs from analyze + all retrieval APIs ([#255](#255)) ([881d925](881d925))
</details>

<details><summary>cli: 0.10.1</summary>

## [0.10.1](cli-v0.10.0...cli-v0.10.1) (2026-06-26)


### Chores

* **cli:** Synchronize opencodehub versions
</details>

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>