BGI - Big Indexer
<!-- mcp-name: io.github.ahmedxuhri/bigindexer -->
BGI is a static architecture analysis tool for large codebases.
It groups code units by behavioral role and emits explicit architectural boundaries.
Project domain: bigindexer.com
Use via MCP Registry
Big Indexer is published in the MCP Registry as io.github.ahmedxuhri/bigindexer.
pip install bigindexer==0.1.3
bgi mcp --graph bgi-graph.json --fuse-graph fuse-graph.json

Validation: https://bigindexer.com/validation
What problem this solves
Most architecture graphs fail at scale in two ways:
- too many noisy edges
- giant clusters that collapse unrelated components together
BGI is built to keep both under control, so the output remains usable on large repos.
What you can do with it
- "Where should this boundary be before we refactor?"
  BGI groups units by behavioral role (COV tokens + DRS clustering) so likely component boundaries are visible.
- "Which subsystem coupling is risky?"
  BGI surfaces high-coupling seams and fuse-boundary signals between clusters so integration risk is easier to spot.
- "How do we plug architecture data into automation?"
  BGI emits machine-readable artifacts (bgi-graph.json, fuse-graph.json) plus optional human context (bigindexer.md).
- "How do we make AI changes less random?"
  MCP tools (task_fingerprint, behavioral_twins, twin_context) ground prompts in in-repo behavior patterns.
- "Can I run this automatically on PRs as a live example?"
  Yes: use the dedicated action repo ahmedxuhri/bigindexer-pr-risk-bot to auto-comment PRs with blast radius, seams, and risk hints.
30-second demo
Run BGI on the included fixture repo:
git clone https://github.com/ahmedxuhri/bigindexer
cd bigindexer
pip install -e .
bgi scan tests/fixtures --lang python --out /tmp/bgi-example.json
head -50 /tmp/bgi-example.json

Observed result on this repository:
- units: 12
- edges: 14
- clusters: 2
- max cluster in sample: 6 units
One produced edge looks like:
{
"source": "auth_module.py::AuthService::__init__",
"target": "auth_module.py::AuthService::__del__",
"key": "COV.INIT",
"lock": "COV.TEARDOWN",
"type": "HARD"
}

Why this matters: instead of raw syntax references only, you get behavioral relationships plus cluster structure that can drive architecture decisions.
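As a rough illustration of consuming edge records like the one above, the following Python sketch counts key-lock pairs and picks out cross-file edges. It assumes only the five fields shown in the sample edge. The second edge, the unit ids in it, and the helper names `file_of` and `is_cross_file` are hypothetical and exist only for this example; the layout of the full bgi-graph.json file is not assumed here.

```python
from collections import Counter

# The first edge is the sample shown above. The second edge is made up
# for illustration and does not come from a real BGI run.
edges = [
    {
        "source": "auth_module.py::AuthService::__init__",
        "target": "auth_module.py::AuthService::__del__",
        "key": "COV.INIT",
        "lock": "COV.TEARDOWN",
        "type": "HARD",
    },
    {
        "source": "store.py::Repo::load",   # hypothetical unit id
        "target": "api.py::Handler::save",  # hypothetical unit id
        "key": "COV.FETCH",
        "lock": "COV.PERSIST",
        "type": "HARD",
    },
]


def file_of(unit_id: str) -> str:
    """Return the file part of a 'file::Class::method' style unit id."""
    return unit_id.split("::", 1)[0]


def is_cross_file(edge: dict) -> bool:
    """True when source and target units live in different files."""
    return file_of(edge["source"]) != file_of(edge["target"])


# Count how often each (key, lock) token pair occurs.
pair_counts = Counter((e["key"], e["lock"]) for e in edges)
cross_file = [e for e in edges if is_cross_file(e)]

print(pair_counts)
print(len(cross_file), "cross-file edge(s)")
```

The same two helpers could be pointed at real scan output once the edge list has been loaded from the artifact.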
Plain-English glossary
| BGI term | Plain meaning |
|---|---|
| COV token | A behavior label for a unit (for example: FETCH, PERSIST, AUTHENTICATE) |
| Key-Lock edge | A behavioral connection between two units with complementary roles |
| DRS cluster | A unit-level grouping by behavioral role, mostly intra-file in practice. File-level architectural components are better expressed via the BGI edge graph or the fuse-graph boundary signal (see the external benchmark below) |
| Fuse edge / fuse event | A refused merge because cluster growth hit the cap; treated as boundary signal |
| Spectral masks | Scope rules that limit where matching is allowed (global, directory, file) |
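To make the "spectral masks" row more concrete, here is a minimal Python sketch of a scope check with the three scopes named in the glossary (global, directory, file). The function name `may_match` and the reading of "directory" as "same parent directory" are assumptions made for this example, not BGI's actual rules.

```python
from pathlib import PurePosixPath


def may_match(path_a: str, path_b: str, scope: str) -> bool:
    """Decide whether two units may be matched under a given scope.

    Hypothetical helper: 'directory' is read here as 'same parent
    directory', which is an assumption, not BGI's documented behavior.
    """
    if scope == "global":
        return True
    if scope == "directory":
        return PurePosixPath(path_a).parent == PurePosixPath(path_b).parent
    if scope == "file":
        return path_a == path_b
    raise ValueError(f"unknown scope: {scope!r}")


print(may_match("pkg/a.py", "pkg/b.py", "directory"))
print(may_match("pkg/a.py", "pkg/b.py", "file"))
```

A narrower scope means fewer candidate pairs, which is how scoping keeps noisy edges in check.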
Architecture in one view
Source files
->
Gate 1: fingerprint unit behavior (COV tokens)
->
Gate 2: create behavioral edges with scoped matching
->
Gate 3: cluster with hard size cap + boundary emission
->
Artifacts: bgi-graph.json, fuse-graph.json, bigindexer.md, optional routes/graphml/html

Core approach:
- TOKEN-CENSUS - classify token frequency per repo.
- SPECTRAL-MASKS - restrict match scope by token frequency.
- FUSE-MAP - cap cluster growth and record refused merges.
- MASK-4-GATE-3 - use import proximity as clustering signal.
- WATER-CLOCK + .scm - single-pass query extraction path in Gate 1.
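The FUSE-MAP idea (cap cluster growth and record refused merges) can be sketched with a size-capped union-find. This is an illustrative stand-in, not BGI's implementation: the class name `CappedUnionFind`, the integer unit ids, and the cap value are all assumptions.

```python
class CappedUnionFind:
    """Union-find that refuses merges which would exceed a size cap.

    Hypothetical sketch: refused merges are recorded as 'fuse events',
    mirroring the glossary definition, but the data layout is assumed.
    """

    def __init__(self, n: int, cap: int) -> None:
        self.parent = list(range(n))
        self.size = [1] * n
        self.cap = cap
        self.fuse_events: list[tuple[int, int]] = []

    def find(self, x: int) -> int:
        # Path halving keeps the trees shallow.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> bool:
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return True  # already in the same cluster
        if self.size[ra] + self.size[rb] > self.cap:
            # Refused merge: keep it as a boundary signal.
            self.fuse_events.append((a, b))
            return False
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return True


uf = CappedUnionFind(n=5, cap=3)
for a, b in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    uf.union(a, b)

print(uf.fuse_events)  # [(2, 3)]
```

With a cap of 3, units 0 to 2 form one cluster, the merge across (2, 3) is refused and recorded, and units 3 and 4 form a second cluster.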
Why BGI is different from common alternatives
| Capability | LSP / SCIP index | Call-graph + generic community detection | BGI |
|---|---|---|---|
| Fast symbol lookup | Strong | Medium | Available (Phase 6 index) |
| Behavioral token model | No | Usually no | Yes |
| Hard-bounded clustering | No | Usually no | Yes (unit-level) |
| First-class boundary artifact | No | Usually no | Yes (fuse-graph.json) |
| Scope-constrained edge generation | Limited | Rare | Yes (spectral masks) |
External head-to-head benchmark (Louvain on BGI's edges vs Louvain on raw imports, scored against package layout): BGI's edges win on Python (django F1 0.38 vs 0.29, MoJoFM 0.45 vs 0.34) and currently tie/lose on Go due to lower cross-file edge density on tier-2 scanners. Full results and methodology in docs/VALIDATION_EVIDENCE.md.
Evidence (current, verifiable)
Large-repo scale evidence
Comparable kubernetes sample (go comparable mode, 162,917 units):
- Gate 1: 141.964s
- Gate 2: 67.261s (historical comparable baseline: 138.869s)
- Gate 3: 9.359s
- Total: 218.584s
- Max cluster: 1.113%
- Fuse events: 0
Artifact: output/validation/kubernetes-optionb-controlled-median-v21.json
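The figures above can be cross-checked with a few lines of Python. The sketch uses only the numbers quoted in this section; reading "Max cluster: 1.113%" as a share of all units is an assumption, and the variable names are illustrative.

```python
# Per-gate timings in seconds, copied from the list above.
gate_seconds = {"gate1": 141.964, "gate2": 67.261, "gate3": 9.359}
units = 162_917

total = sum(gate_seconds.values())
print(f"total: {total:.3f}s")  # total: 218.584s

# Gate 2 against its historical comparable baseline.
gate2_speedup = 138.869 / gate_seconds["gate2"]
print(f"gate 2 speedup vs baseline: {gate2_speedup:.2f}x")

# Assumption: the max-cluster percentage is relative to the unit count.
max_cluster_units = units * 1.113 / 100
print(f"largest cluster: about {max_cluster_units:.0f} units")
```

The three gate timings sum to the reported total, and Gate 2 runs at roughly half its historical baseline time.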
Quality guard evidence (beyond raw speed)
- Gate 2 scope safety tests block invalid cross-scope merges (see tests/test_gate2.py).
- Gate 3 tests verify no legacy namespace over-merge without import evidence (see tests/test_gate3.py).
- Current full suite status: python3 -m pytest tests/ -x -q (project baseline target remains passing).
Evidence summary
- Current published validation set: 100 scored runs across 5 repos and 3 models.
- Full 20-run post-shipment benchmark refresh for BGI-TWIN context (task → COV → top-3 twins + seam + rubric) is complete: actionability 4.75/5 (p04 slice: 4.8/5), boundary 1.0, hallucinations 0.
- Independent-model replication is now complete on azure/gpt-4o (20 runs) and gemini/auto (20 runs): GPT-4o actionability 4.85/5, Gemini actionability 4.25/5, both with zero hallucinations; Gemini boundary 0.95 reflects one genuine django/p02 miss.
- Still missing: labeled precision/recall benchmark on an external corpus and head-to-head quantitative benchmark vs external tools on the same labeled dataset.
Language support tiers (explicit)
BGI does not treat all languages equally; support is tiered:
- Query-backed (.scm): python, typescript, tsx, javascript, go, rust, java, csharp, php, ruby, kotlin, scala
- Tree-sitter scanner + rule path: c, lua, elixir
- Generic regex fallback by extension: swift, r, dart, bash, nim, zig, haskell, ocaml, fsharp, clojure, erlang, matlab, vb, crystal, cobol, groovy
Use this as a reliability signal: query-backed and dedicated scanner tiers are stronger than generic fallback.
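If you want to act on the tier list in automation (for example, to flag scans that fell back to regex), a simple lookup is enough. The sketch below copies the language lists from above; the tier labels (`query-backed`, `scanner`, `regex-fallback`, `unlisted`) and the function `tier_of` are names invented for this example, not identifiers from BGI.

```python
# Language lists copied from the tiers above; tier labels are illustrative.
TIERS = {
    "query-backed": [
        "python", "typescript", "tsx", "javascript", "go", "rust",
        "java", "csharp", "php", "ruby", "kotlin", "scala",
    ],
    "scanner": ["c", "lua", "elixir"],
    "regex-fallback": [
        "swift", "r", "dart", "bash", "nim", "zig", "haskell", "ocaml",
        "fsharp", "clojure", "erlang", "matlab", "vb", "crystal",
        "cobol", "groovy",
    ],
}

# Invert the mapping once so lookups are a single dict access.
LANG_TIER = {lang: tier for tier, langs in TIERS.items() for lang in langs}


def tier_of(lang: str) -> str:
    """Return the tier label for a language, or 'unlisted' if not in the list."""
    return LANG_TIER.get(lang.lower(), "unlisted")


for lang in ("python", "lua", "zig"):
    print(lang, "->", tier_of(lang))
```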
Cross-file edge density caveat: the language tiers above describe parser quality. A separate axis is cross-file behavioral edge density — how many key-lock pairs the scanner produces that link units in different files. Tier-1 (.scm-backed) languages produce dense cross-file edges. Tier-2 scanner-backed languages currently produce sparser cross-file edges because their token mix is dominated by structural tokens (INTAKE/OUTPUT/CONDITIONAL/LOOP) that gate-2 deliberately scopes to same-file to prevent O(N²) noise. The user-visible MCP product (boundary detection, twin retrieval, AI-assistant context) still works on tier-2 languages — see the validation evidence — but cluster-recovery benchmarks against import-graph baselines reflect this density gap. Concrete numbers in docs/VALIDATION_EVIDENCE.md.
Limitations and non-goals
- BGI is static analysis; it does not ingest runtime traces.
- Cross-file semantic resolution is heuristic and language-dependent.
- Cluster-size health is measured; full external precision/recall is not yet published.
- Shared-host benchmarking introduces variance; decisions should use controlled medians.
Install
pip install -e .

Quickstart commands
# scan
bgi scan /path/to/repo --lang auto --out bgi-graph.json
# optional outputs
bgi scan /path/to/repo --lang auto \
--fuse-graph fuse-graph.json \
--routes routes.json \
--graphml graph.graphml \
--html
# incremental
bgi scan /path/to/repo --lang auto --incremental --cache .bgi-cache.json
# diff
bgi diff /path/before /path/after --lang auto --out diff.json
# run MCP server over generated artifacts
bgi mcp --graph bgi-graph.json --fuse-graph fuse-graph.json

Example MCP usage pattern (from your client prompt):
Use MCP tool twin_context for:
"Add endpoint that validates input and persists data."
Return top twin candidate, seam suggestion, and rubric checklist.

Telemetry
BGI ships with opt-in, off-by-default anonymous telemetry. To enable:
export BGI_TELEMETRY=1
bgi mcp --graph bgi-graph.json --fuse-graph fuse-graph.json

What's collected when enabled: BGI version, OS, repo size bucket, and a 12-char hash of your repo's git remote (so we can deduplicate "same repo seen twice" without ever knowing which repo). What's never collected: file paths, source code, repo names, user identity, or IP addresses. Full schema and disable instructions in docs/TELEMETRY.md.
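To illustrate how a short hash can deduplicate repos without revealing them, here is a minimal Python sketch. The choice of SHA-256, the input normalization, and the function name `remote_fingerprint` are assumptions for this example; the scheme BGI actually uses is described in docs/TELEMETRY.md.

```python
import hashlib


def remote_fingerprint(remote_url: str) -> str:
    """Return a 12-character hex fingerprint of a git remote URL.

    Assumed scheme (SHA-256, truncated); not necessarily what BGI does.
    """
    digest = hashlib.sha256(remote_url.strip().encode("utf-8")).hexdigest()
    return digest[:12]


# Hypothetical remote used only for demonstration.
a = remote_fingerprint("https://example.com/org/repo.git")
b = remote_fingerprint("https://example.com/org/repo.git")
print(a == b, len(a))  # True 12
```

The same remote always yields the same fingerprint, so repeat sightings can be merged, while the truncated digest does not expose the URL itself.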
Documentation map
- MEMORANDUM.md - design contracts and invariants
- docs/LANGUAGE_SUPPORT.md - language implementation details
- docs/CONTRIBUTING_LANGUAGES.md - language contribution guide
- docs/INDEX_SCHEMA.md - interactive index schema
- docs/QUERY_PLANNER.md - query planner scoring
- docs/MCP_SETUP.md - MCP server setup and usage
- docs/MCP_WITH_CONTINUE.md - 5-minute Continue + BGI walkthrough
- docs/TELEMETRY.md - opt-in telemetry: what we collect and how to disable
- https://bigindexer.com/validation - public validation evidence
- docs/MCP_QUICKSTART_DEMO.md - 5-minute demo walkthrough
- docs/MCP_EXAMPLE_TRANSCRIPTS.md - real-world MCP tool invocation examples
- docs/MCP_REAL_TRANSCRIPT.md - unedited transcript from FastAPI analysis
- scripts/mcp-demo.sh - automated demo script for multiple CLIs and repositories
License and Copyright
- License: Apache License 2.0 (LICENSE)
- Contributor terms: Developer Certificate of Origin (DCO) enforced on pull requests