bitvanes-cli

Terminal ETL tool for BitVanes: zero-trust document chunking for RAG. It runs the same pipeline as the web app natively, and offers both an interactive TUI (built on ratatui) and a headless mode for CI/CD and piping.

Install

From source

cargo install --git https://github.com/BitVanes/cli.git

From release

Download the latest binary from GitHub Releases:

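# Linux x86_64 (this asset is a Linux build; pick the matching asset on other platforms)
# The file name embeds the version, so adjust it to match the latest release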
curl -L https://github.com/BitVanes/cli/releases/latest/download/bitvanes-v0.1.0-x86_64-linux.tar.gz | tar xz
sudo mv bitvanes /usr/local/bin/

# Verify
bitvanes --version

Usage

Process a file or directory

# Single file → JSON output
bitvanes -i document.md -o chunks.json

# Directory (recursive) → Arrow IPC
bitvanes -i ./docs/ -f markdown -m 512 -o chunks.arrow

# Multiple PII patterns
bitvanes -i ./docs/ --scrub email,ssn,credit_card -o chunks.csv

Interactive TUI mode

Run bitvanes with no arguments to launch the terminal UI:

bitvanes

Four screens (press ? anywhere for full keybinds):

File Browser — navigate directories, select files with Space
Config — adjust format, tokenizer, max tokens, PII patterns
Results — scrollable chunk preview table with stats; save to a path
Help — full keybind reference

Keybinds:

↑↓ / jk — navigate / scroll
Enter — open directory / toggle selection / process
Space — select file
Tab — cycle screens (Browser → Config → Results)
m — cycle max tokens (128 → 256 → 512 → 1024)
t — cycle tokenizer
e/s/a — toggle email/ssn/aws PII scrubbing (Config screen)
s — save chunks to the output path (Results screen)
e — edit the output path; type a new name, Enter confirms, Esc cancels. Output format is inferred from the extension: .json (default), .csv, .arrow
b — back to file browser
? — toggle help
q / Esc — quit

Processing runs on a background thread, so the UI stays responsive on large files (a spinner shows progress). CLI flags (--format, --max-tokens, --tokenizer, --scrub, --config) are honoured when launching the TUI.

Profile replay (from web app)

Export a profile from the BitVanes web app, then replay it identically:

bitvanes -c profile.json -i ./documents/ -o output.arrow

Unix piping

cat doc.md | bitvanes -f markdown -m 256 -o - | python process.py
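
The downstream process.py in the pipe above is not shown in this README. A minimal sketch of such a consumer follows, assuming that `-o -` writes a single JSON document to stdout and that it is either an array of chunk objects or an object wrapping such an array. The function name `summarize` is hypothetical and the real output schema is not documented here, so inspect actual output before relying on any field.

```python
import json
import sys


def summarize(payload):
    """Return (chunk_count, sorted field names) for a decoded JSON payload.

    Assumption: the payload is a list of chunk objects, or a dict that
    holds such a list under one of its keys. The real schema may differ.
    """
    if isinstance(payload, dict):
        # Treat the first list-valued entry as the chunk list, if any.
        lists = [value for value in payload.values() if isinstance(value, list)]
        chunks = lists[0] if lists else [payload]
    elif isinstance(payload, list):
        chunks = payload
    else:
        chunks = [payload]
    fields = sorted({key for c in chunks if isinstance(c, dict) for key in c})
    return len(chunks), fields


def main(stream=sys.stdin):
    # Do nothing useful when run without a pipe or with empty input.
    if stream.isatty():
        print("expected JSON on stdin", file=sys.stderr)
        return
    text = stream.read()
    if not text.strip():
        print("no input on stdin", file=sys.stderr)
        return
    count, fields = summarize(json.loads(text))
    print(f"{count} chunks; fields: {', '.join(fields)}")


if __name__ == "__main__":
    main()
```

Saved as process.py, this prints a one-line summary of whatever arrives on stdin, which is a quick way to confirm the pipe works before writing real processing logic.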

All options

bitvanes [OPTIONS]

Input:
  -i, --input <PATH>         File or directory (recursive scan)
  -c, --config <FILE>        Profile JSON from web export
      --no-tui               Headless mode

Pipeline:
  -f, --format <FORMAT>      markdown | text | html | json | pdf (auto-detected)
  -t, --tokenizer <NAME>     cl100k_base | o200k_base | r50k_base | ...
  -m, --max-tokens <N>       Max tokens per chunk
      --scrub <PATTERNS>     Comma-separated PII patterns

Output:
  -o, --output <FILE>        .arrow | .csv | .json | - (stdout)

Output formats

| Format | Extension | Use case |
| --- | --- | --- |
| Arrow IPC | `.arrow` | DuckDB / Polars / LanceDB direct ingestion |
| CSV | `.csv` | Spreadsheet import, human inspection |
| JSON | `.json` | Python / Node RAG pipelines |

Supported document formats

| Format | Extensions | Parser |
| --- | --- | --- |
| Markdown | `.md` | `pulldown-cmark` |
| Text | `.txt` | Paragraph-based splitting |
| HTML | `.html` | `scraper` (html5ever) |
| JSON | `.json` | Structural (one chunk per object/leaf) |
| PDF | `.pdf` | `pdf-extract` (native, text-layer only) |

Format is auto-detected from file extension. Override with --format. Scanned/image-only PDFs have no extractable text layer and are reported as invalid input (OCR is out of scope; the web app uses PDF.js).
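
The detection rule described above (extension decides, `--format` overrides) can be sketched in a few lines of Python. This is an illustration, not the CLI's code: the name `detect_format`, the case-insensitive match, and the `None` result for unknown extensions are all assumptions of the sketch.

```python
from pathlib import Path

# Extension-to-format table taken from the README; values match --format.
EXTENSION_FORMATS = {
    ".md": "markdown",
    ".txt": "text",
    ".html": "html",
    ".json": "json",
    ".pdf": "pdf",
}


def detect_format(path, override=None):
    """Pick a document format for `path`.

    An explicit override (the --format value) wins. Otherwise the file
    extension decides. Lowercasing the extension and returning None for
    unknown extensions are assumptions; the CLI may behave differently.
    """
    if override is not None:
        return override
    return EXTENSION_FORMATS.get(Path(path).suffix.lower())
```

For example, `detect_format("docs/guide.md")` yields `"markdown"`, while `detect_format("notes.txt", override="markdown")` follows the override.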

Features

6 OpenAI tokenizers: cl100k_base, o200k_base, r50k_base, p50k_base, p50k_edit, o200k_harmony (all embedded at compile time)
7 PII patterns: email, SSN, phone, credit card (Luhn), AWS keys, GitHub PATs, JWTs
Structural context: heading ancestry preserved per chunk
Parallel processing: rayon multi-core for directory batch processing
Profile replay: byte-for-byte identical output to the web app
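
The credit card pattern in the list above is described as Luhn-checked. As an illustration only (this is not the code bitvanes ships, and `luhn_valid` is a hypothetical name), the Luhn checksum can be sketched in Python as follows. Ignoring separators such as spaces and hyphens is a choice made for this sketch.

```python
def luhn_valid(number):
    """Return True if the ASCII digits in `number` pass the Luhn checksum."""
    # Keep only ASCII digits so "4111 1111 1111 1111" and "4111-..." both work.
    digits = [int(ch) for ch in number if ch.isascii() and ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0
```

A checksum like this filters out most random 16-digit strings, which is why it is commonly paired with a digit-pattern match to reduce false positives.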

Build from source

git clone https://github.com/BitVanes/cli.git
cd cli
cargo build --release
./target/release/bitvanes --help

The CLI depends on bitvanes-core via a git dependency (tag v0.1.1, with the ipc, csv, cli-pdf, and parallel features), so no manual checkout of the core repo is needed.

License

MIT OR Apache-2.0
