Architectural Decisions – Intelligence Engine
ADR-001: Python as Primary Language
Decision: Use Python 3.12+ as the primary backend language.
Rationale:
- py-tree-sitter works natively with no WASM overhead
- KuzuDB has excellent Python bindings
- LanceDB, sentence-transformers, and FastMCP are all Python-first
- FastAPI provides high-performance async web serving
- Consistent with the broader tooling ecosystem
Alternatives considered: Node.js (better for browser-based tools, but introduces WASM overhead for Tree-sitter and lacks native KuzuDB bindings).
ADR-002: NetworkX First, Then KuzuDB
Decision: Start with NetworkX for graph storage, migrate to KuzuDB when the schema stabilized.
Rationale:
- Zero extra dependency for the initial MVP
- Faster iteration while exploring entity/relationship schemas
- KuzuDB migration path was clear (same Cypher query language)
Outcome: Both backends are maintained. KuzuDB is the default (Cypher support, persistence, domain-scoped tables). NetworkX remains as a fallback for environments where KuzuDB is unavailable.
ADR-003: Option B – Separate Entity Tables Per Domain
Decision: Create separate Entity_<domain> and Rel_<domain>_* tables for each domain, rather than a single Entity table with a domain column.
Rationale:
- Schema safety – Domain-specific properties (e.g.,
complexityfor code,latitudefor archaeology) live in their own table columns without NULL pollution across domains - Query performance – Cypher queries on
Entity_codescan only code entities, not the entire entity set - Independent evolution – Adding properties to one domain never affects another domain’s table schema
- Clear separation – Each domain’s graph is structurally isolated, reducing accidental cross-domain queries
Trade-offs:
- Cross-domain queries require explicit JOINs or UNION across tables
- More tables in KuzuDB (one per domain + one per relationship type per domain)
- Schema migration requires per-domain handling
Alternatives considered:
- Option A (single Entity table): Simpler to implement but leads to wide, sparse tables as domains accumulate domain-specific columns
- Option C (separate databases per domain): Too isolated – prevents cross-domain queries entirely
ADR-004: (project, domain) as Routing Key
Decision: All operations are routed by the tuple (project, domain).
Rationale:
- A single project can contain entities from multiple domains (e.g., code + documentation)
- Domain-specific extractors, search profiles, and health checks are selected by this key
- Shared mode uses
projectcolumn filtering; domain selects the correct table set - The web UI exposes this as a project dropdown with domain selector
ADR-005: Schema Versioning from Day One
Decision: Every domain schema includes a version field. The indexer records schema_version in project metadata at index time.
Rationale:
- Enables detection of stale indexes (schema changed but project not re-indexed)
- Future: automatic re-indexing when schema version changes
- Future: schema migration tooling for breaking changes
- Low cost to implement, high value for maintainability
Current behavior: The indexer compares stored schema_version against the current schema. If they differ, the project is flagged as stale in ie_status output.
ADR-006: 3-Way RRF Fusion for Search
Decision: Combine BM25, semantic, and graph search using Reciprocal Rank Fusion with fixed weights (0.35, 0.40, 0.25).
Rationale:
- BM25 excels at exact identifier matching (function names, class names)
- Semantic search handles natural language queries and conceptual similarity
- Graph search adds structural context (callers, imports, related entities)
- RRF is simple, robust, and doesn’t require training data
- Fixed weights work well in practice; could be made configurable per domain via search profiles
Formula: score = sum(1 / (k + rank_i)) across strategies, k = 60.
ADR-007: Shared Database Mode as Optional
Decision: Support both per-project (isolated) and shared (multi-tenant) storage modes, with a migration tool to switch between them.
Rationale:
- Per-project mode is simpler and works for most use cases
- Shared mode enables cross-project Cypher queries and global analysis
- Making it optional avoids forcing the complexity of multi-tenant filtering on users who don’t need it
- The migration tool handles bidirectional conversion with verification
Shared mode implementation:
- KuzuDB: single database with
projectcolumn on all entity/relationship tables - LanceDB: single vector store with
projectcolumn - BM25: remains per-project (in-memory, rebuilt on load)
ADR-008: Archaeology as Validation Domain
Decision: Use archaeology as the first non-code domain to validate the domain generalization architecture.
Rationale:
- Maximally different from code – no AST, no Tree-sitter, completely different entity and relationship semantics
- Requires custom extractor (YAML/JSON input), proving the extractor abstraction works
- Has rich relationship types (spatial, temporal, material) that exercise the schema system
- Real-world domain with genuine analytical value, not a toy example
- If archaeology works, any structured domain can be added
Outcome: Successfully validated:
- YAML schema definition drives table creation
- Custom extractor module loading works
- Domain-specific properties, relationships, and search profiles function correctly
- Web UI handles non-code entities without code-specific assumptions
ADR-009: Read-Only Serving with Input Sanitization
Decision: The web server and MCP server open KuzuDB in read_only=True mode. All Cypher input is validated to block write operations.
Rationale:
- Multiple readers can coexist without lock contention
- Prevents accidental or malicious data modification through the API
- Write operations (indexing, migration) require stopping the server first – an acceptable trade-off for personal use
- Cypher validation strips comments and strings before checking for write keywords (DELETE, CREATE, SET, DROP, MERGE, ALTER, COPY, REMOVE, DETACH)
ADR-010: AI Data Preservation Across Re-Indexing
Decision: Extract AI-generated data (summaries, Q&A history) before re-indexing and restore it afterward.
Rationale:
- LLM summaries are expensive to regenerate (API cost + time)
- Q&A history represents accumulated knowledge that shouldn’t be lost
- Re-indexing destroys and recreates graph entities, which would lose all AI annotations
- The extraction/restoration approach is simple and reliable – no schema changes needed
Implementation: ai_overlay.py extracts AI data keyed by entity identity before rebuild, then matches and restores after the new graph is populated.