Terminal ETL tool for BitVanes — zero-trust document chunking for RAG. Runs the same pipeline as the web app, natively. Features both an interactive TUI (ratatui) and headless mode (CI/CD piping).
cargo install --git https://github.com/BitVanes/cli.git
Download the latest binary from GitHub Releases:
# Linux/macOS
curl -L https://github.com/BitVanes/cli/releases/latest/download/bitvanes-v0.1.0-x86_64-linux.tar.gz | tar xz
sudo mv bitvanes /usr/local/bin/
# Verify
bitvanes --version
# Single file → JSON output
bitvanes -i document.md -o chunks.json
# Directory (recursive) → Arrow IPC
bitvanes -i ./docs/ -f markdown -m 512 -o chunks.arrow
# Multiple PII patterns
bitvanes -i ./docs/ --scrub email,ssn,credit_card -o chunks.csv
Run bitvanes with no arguments to launch the terminal UI:
bitvanes
Four screens (press ? anywhere for full keybinds):
- File Browser — navigate directories, select files with Space
- Config — adjust format, tokenizer, max tokens, PII patterns
- Results — scrollable chunk preview table with stats; save to a path
- Help — full keybind reference
Keybinds:
- ↑↓/jk — navigate / scroll
- Enter — open directory / toggle selection / process
- Space — select file
- Tab — cycle screens (Browser → Config → Results)
- m — cycle max tokens (128 → 256 → 512 → 1024)
- t — cycle tokenizer
- e/s/a — toggle email/ssn/aws PII scrubbing (Config screen)
- s — save chunks to the output path (Results screen)
- e — edit the output path; type a new name, Enter confirms, Esc cancels. Output format is inferred from the extension: .json (default), .csv, .arrow
- b — back to file browser
- ? — toggle help
- q/Esc — quit
Processing runs on a background thread, so the UI stays responsive on large
files (a spinner shows progress). CLI flags (--format, --max-tokens,
--tokenizer, --scrub, --config) are honoured when launching the TUI.
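The background-thread pattern described above can be illustrated in a few lines. This is a language-neutral sketch written in Python, not the CLI's Rust code: a worker thread runs the job and hands its result back through a queue, while the foreground loop keeps drawing spinner frames. `run_in_background`, `job`, and `on_tick` are hypothetical names.

```python
import itertools
import queue
import threading


def run_in_background(job, on_tick, poll_seconds=0.05):
    """Run job() on a worker thread; call on_tick(frame) while waiting."""
    results = queue.Queue(maxsize=1)

    def worker():
        # Send either the result or the exception back to the caller.
        try:
            results.put(("ok", job()))
        except Exception as exc:
            results.put(("err", exc))

    threading.Thread(target=worker, daemon=True).start()

    # The foreground loop never blocks longer than poll_seconds,
    # so it stays free to redraw (here: advance a spinner frame).
    for frame in itertools.cycle("|/-\\"):
        try:
            status, value = results.get(timeout=poll_seconds)
        except queue.Empty:
            on_tick(frame)
            continue
        if status == "err":
            raise value
        return value
```

A real TUI would also handle key events inside the loop; the point here is only that the slow work never runs on the thread that draws the screen.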
Export a profile from the BitVanes web app, then replay it identically:
bitvanes -c profile.json -i ./documents/ -o output.arrow
Read a document from stdin and write chunks to stdout with -o -:
cat doc.md | bitvanes -f markdown -m 256 -o - | python process.py
bitvanes [OPTIONS]
Input:
  -i, --input <PATH>       File or directory (recursive scan)
  -c, --config <FILE>      Profile JSON from web export
      --no-tui             Headless mode
Pipeline:
  -f, --format <FORMAT>    markdown | text | html | json | pdf (auto-detected)
  -t, --tokenizer <NAME>   cl100k_base | o200k_base | r50k_base | ...
  -m, --max-tokens <N>     Max tokens per chunk
      --scrub <PATTERNS>   Comma-separated PII patterns
Output:
  -o, --output <FILE>      .arrow | .csv | .json | - (stdout)
| Format | Extension | Use case |
|---|---|---|
| Arrow IPC | .arrow | DuckDB / Polars / LanceDB direct ingestion |
| CSV | .csv | Spreadsheet import, human inspection |
| JSON | .json | Python / Node RAG pipelines |
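For the JSON route, a consumer script like the process.py in the piping example might look roughly like this. It is a minimal sketch that assumes the output is a single JSON array of chunk objects; the chunk field names are not documented here, so the script only counts chunks and reports whichever keys it finds. `load_chunks` and `summarize` are hypothetical helper names.

```python
import json
import sys


def load_chunks(stream):
    """Parse chunks from a text stream (ASSUMPTION: one JSON array of objects)."""
    text = stream.read()
    if not text.strip():
        return []  # nothing was piped in
    data = json.loads(text)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of chunks")
    return data


def summarize(chunks):
    """Return (chunk count, sorted field names seen across all chunks)."""
    keys = set()
    for chunk in chunks:
        if isinstance(chunk, dict):
            keys.update(chunk)
    return len(chunks), sorted(keys)


# Only read stdin when something is actually piped in.
if __name__ == "__main__" and not sys.stdin.isatty():
    count, fields = summarize(load_chunks(sys.stdin))
    print(f"{count} chunks; fields: {', '.join(fields) or 'none'}")
```

Running it once against real output is a quick way to discover the actual field names before building anything on top of them.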
| Format | Extensions | Parser |
|---|---|---|
| Markdown | .md | pulldown-cmark |
| Text | .txt | Paragraph-based splitting |
| HTML | .html | scraper (html5ever) |
| JSON | .json | Structural (one chunk per object/leaf) |
| PDF | .pdf | pdf-extract (native, text-layer only) |
Format is auto-detected from file extension. Override with --format.
Scanned/image-only PDFs have no extractable text layer and are reported
as invalid input (OCR is out of scope; the web app uses PDF.js).
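The detection rule amounts to an extension lookup with an explicit override. The sketch below illustrates that logic in Python; it is not the CLI's code. The case-insensitive match and the error raised for unknown extensions are assumptions, and `detect_format` is a hypothetical name.

```python
from pathlib import Path

# Extension → format, taken from the input format table.
EXTENSION_FORMATS = {
    ".md": "markdown",
    ".txt": "text",
    ".html": "html",
    ".json": "json",
    ".pdf": "pdf",
}


def detect_format(path, override=None):
    """Return the input format for path; an explicit override always wins."""
    if override is not None:
        return override  # plays the role of --format
    suffix = Path(path).suffix.lower()  # ASSUMPTION: case-insensitive match
    try:
        return EXTENSION_FORMATS[suffix]
    except KeyError:
        # ASSUMPTION: unknown extensions are an error rather than a fallback.
        raise ValueError(f"cannot infer format from {path!r}; pass --format") from None
```

For example, `detect_format("docs/guide.md")` yields `"markdown"`, while `detect_format("notes.txt", override="markdown")` mirrors passing --format on the command line.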
- 6 OpenAI tokenizers: cl100k_base, o200k_base, r50k_base, p50k_base, p50k_edit, o200k_harmony (all embedded at compile time)
- 7 PII patterns: email, SSN, phone, credit card (Luhn), AWS keys, GitHub PATs, JWTs
- Structural context: heading ancestry preserved per chunk
- Parallel processing: rayon multi-core for directory batch processing
- Profile replay: byte-for-byte identical output to the web app
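The credit card pattern in the list above is marked as Luhn-checked. How BitVanes wires that check into scrubbing is not described here; the sketch below shows only the standard Luhn checksum, which lets a scrubber skip digit runs that merely look like card numbers. `luhn_valid` is a hypothetical name.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in number pass the Luhn checksum."""
    # Ignore spaces, dashes, and any other separators.
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False

    total = 0
    # Walk from the rightmost digit; double every second one.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0
```

With this check, the well-known test number `4111 1111 1111 1111` passes, while changing its last digit to 2 makes it fail.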
git clone https://github.com/BitVanes/cli.git
cd cli
cargo build --release
./target/release/bitvanes --help
The CLI depends on bitvanes-core via a git dependency (tag v0.1.1, with
the ipc, csv, cli-pdf, and parallel features). No manual checkout of the
core repo is needed.
MIT OR Apache-2.0