fix(ingestion): exclude venv/node_modules/cache dirs from analyze + all retrieval APIs #255
Merged
…ll retrieval

`HARDCODED_IGNORES` listed `.venv` but not the bare `venv`, nor common Python/JS/build/cache directories, so `codehub analyze` ingested virtualenv, vendored-dependency, and tool-cache content into the index. Every retrieval surface (`query`, `context`, `impact`, `sql`, `pack`) is store-backed, so that content leaked into all of them.

Extend the single source of truth: the scan walker matches each path segment's exact name (`hardcoded.has(name)`), so a bare name excludes the directory at any depth.

Added: `venv`, `bower_components`, `.pnpm-store`, `.yarn`, `.tox`, `.mypy_cache`, `.pytest_cache`, `.ruff_cache`, `.gradle`, `.parcel-cache`, `.cache`, `.idea`, `.vscode`.

Deliberately NOT hardcoded: `vendor`, `env`, `out`, `bin`, `obj`. These commonly hold first-party source (this repo keeps source under `packages/ingestion/src/pipeline/phases/vendor/`). A hardcoded ignore can't be re-included via `.gitignore !`-negation, so these are left to the repo's own `.gitignore`.

`.gitignore` honoring on analyze was already correct (`loadGitignoreChain` + layered `shouldIgnore` in the scan walk); this change adds end-to-end regression coverage for it.

Tests: scan-level fixtures assert every `HARDCODED_IGNORES` dir is skipped at root and nested, that venv/node_modules never appear in output, and that a user-`.gitignore`'d dir is excluded end-to-end; a unit guard pins the required names and asserts `vendor` is absent.

Docs: README "Design choices" row documents the exclusion contract.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
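The segment-name matching described in the commit message can be sketched as below. This is a minimal illustration, not the code in `phases/scan.ts`: the set holds only a hypothetical subset of the real list, `isHardcodedIgnored` and `shouldSkip` are made-up names, and the sketch splits a whole relative path where the real walker checks each directory name as it descends. The ordering of the two layers in `shouldSkip` is also an assumption, chosen to show why a `.gitignore` negation cannot re-include a hardcoded name.

```typescript
// Hypothetical subset of the ignore list; the real set lives in gitignore.ts.
const hardcodedIgnores: ReadonlySet<string> = new Set([
  "node_modules",
  ".venv",
  "venv",
  ".mypy_cache",
  ".cache",
]);

// True when any segment of a POSIX-style relative path exactly equals a
// hardcoded name. Whole-segment matching is what makes a bare name apply
// at any depth without also matching look-alike names.
function isHardcodedIgnored(relPath: string): boolean {
  return relPath.split("/").some((segment) => hardcodedIgnores.has(segment));
}

// Assumed layering: the hardcoded check runs first and short-circuits, so
// the .gitignore layer (including any "!" negation) is never consulted for
// a path that already matched a hardcoded name.
function shouldSkip(
  relPath: string,
  gitignoreMatches: (p: string) => boolean,
): boolean {
  if (isHardcodedIgnored(relPath)) {
    return true;
  }
  return gitignoreMatches(relPath);
}

console.log(isHardcodedIgnored("services/api/venv/lib/site.py")); // true
console.log(isHardcodedIgnored("src/venv_utils.py")); // false
```

Because matching is by exact segment, a path such as `packages/ingestion/src/pipeline/phases/vendor/graphty-leiden.ts` stays indexed as long as `vendor` is kept out of the set.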
theagenticguy pushed a commit that referenced this pull request Jun 26, 2026
🤖 Automated release via release-please

---

<details><summary>root: 0.10.1</summary>

## [0.10.1](root-v0.10.0...root-v0.10.1) (2026-06-26)

### Bug Fixes

* **ingestion:** exclude venv/node_modules/cache dirs from analyze + all retrieval APIs ([#255](#255)) ([881d925](881d925))

</details>

<details><summary>cli: 0.10.1</summary>

## [0.10.1](cli-v0.10.0...cli-v0.10.1) (2026-06-26)

### Chores

* **cli:** Synchronize opencodehub versions

</details>

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
## Problem

`codehub analyze` could ingest virtualenv, vendored-dependency, and tool-cache content into the index. The single source of truth for unconditional directory exclusion, `HARDCODED_IGNORES` in `packages/ingestion/src/pipeline/gitignore.ts`, listed `.venv` but not the bare `venv`, nor common Python/JS/build/cache directories (`.tox`, `.mypy_cache`, `.pytest_cache`, `.ruff_cache`, `bower_components`, `.pnpm-store`, `.yarn`, `.gradle`, `.parcel-cache`, `.cache`, `.idea`, `.vscode`).

Because every retrieval surface is store-backed (`query`, `context`, `impact`, `sql`, and `pack` all read `store.sqlite`, which is built at scan time), whatever scan ingested leaked into all of them. The fix belongs at scan time, at the one list; no per-tool guards are needed.

## What changed

* `HARDCODED_IGNORES` extended. The scan walker matches each path segment's exact name (`hardcoded.has(name)` in `phases/scan.ts`), so a bare name excludes that directory at any depth, not just the repo root. Added: `venv`, `bower_components`, `.pnpm-store`, `.yarn`, `.tox`, `.mypy_cache`, `.pytest_cache`, `.ruff_cache`, `.gradle`, `.parcel-cache`, `.cache`, `.idea`, `.vscode`. The list is now grouped and commented by category.
* Deliberately not hardcoded: `vendor`, `env`, `out`, `bin`, `obj`. These commonly hold first-party source; this repo itself keeps source under `packages/ingestion/src/pipeline/phases/vendor/graphty-leiden.ts`. A hardcoded ignore can't be re-included via `.gitignore` `!`-negation, so hardcoding `vendor` would make OpenCodeHub stop indexing its own code. These are left to the repo's own `.gitignore`, where real vendored deps are already excluded.
* `.gitignore` honoring on analyze was already correct (`loadGitignoreChain` + layered `shouldIgnore` wired into the scan walk). This PR adds end-to-end regression coverage for it.

## Why retrieval needs no separate change

Audited all 28 MCP tools. Every path that returns file paths/content is safe-by-construction once scan exclusion is correct:

* `query`, `context`, `impact`, `sql`, `detect_changes`, and `pack` (default engine) only emit nodes the scan kept.
* `query` snippet reads touch the filesystem only for nodes already in the store.
* `group_contracts`, `group_cross_repo_links`, and `list_findings_delta` read persisted `.codehub` registry/SARIF JSON, not arbitrary repo files.
* The `scan` tool spawns external scanners that receive `HARDCODED_IGNORES` via the wrappers (same source of truth).
* The `repomix` pack engine relies on repomix's own `node_modules`/`.gitignore` defaults (out of scope; slated for removal in M7).

## Tests

* `scan.test.ts`: asserts every `HARDCODED_IGNORES` dir is skipped at root and nested; that `venv`/`.venv`/`node_modules` never appear in scan output; and that a user-`.gitignore`'d directory is excluded end-to-end through the scan phase.
* `gitignore.test.ts`: unit guard pinning the required names, asserting `vendor` is absent, and checking that entries are bare segments (no globs/slashes/dupes).
* `banned-strings`: PASS.

## Docs

* README "Design choices" row documents the exclusion contract.

🤖 Generated with Claude Code
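For readers who want a concrete picture of the `gitignore.test.ts` unit guard described under Tests, here is a rough sketch of the three properties it is said to check: required names present, `vendor` absent, entries bare and unique. The helper `checkIgnoreList` and its message strings are hypothetical and written as a plain function rather than in the repo's test framework, which the PR does not name. The list passed in is assembled only from names mentioned in this PR, not copied from the real `HARDCODED_IGNORES`.

```typescript
// Hypothetical guard: returns human-readable violations; an empty array
// means the list satisfies all three properties.
function checkIgnoreList(
  entries: readonly string[],
  required: readonly string[],
  forbidden: readonly string[],
): string[] {
  const problems: string[] = [];
  const seen = new Set<string>();

  for (const entry of entries) {
    if (seen.has(entry)) {
      problems.push(`duplicate entry: ${entry}`);
    }
    seen.add(entry);
    // A bare segment contains no path separators and no glob metacharacters.
    if (/[\/\\*?\[\]!]/.test(entry)) {
      problems.push(`not a bare segment: ${entry}`);
    }
  }
  for (const name of required) {
    if (!seen.has(name)) {
      problems.push(`missing required: ${name}`);
    }
  }
  for (const name of forbidden) {
    if (seen.has(name)) {
      problems.push(`must not be hardcoded: ${name}`);
    }
  }
  return problems;
}

// Names this PR says were added, plus two it says must be skipped.
const added = [
  "venv", "bower_components", ".pnpm-store", ".yarn", ".tox",
  ".mypy_cache", ".pytest_cache", ".ruff_cache", ".gradle",
  ".parcel-cache", ".cache", ".idea", ".vscode",
];
const notHardcoded = ["vendor", "env", "out", "bin", "obj"];
const list = [".venv", "node_modules", ...added];

console.log(checkIgnoreList(list, added, notHardcoded)); // []
```

Pinning the forbidden names alongside the required ones means a later, well-meaning addition of `vendor` fails the guard instead of silently dropping first-party source from the index.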