Architectural Decisions – Intelligence Engine

ADR-001: Python as Primary Language

Decision: Use Python 3.12+ as the primary backend language.

Rationale:

  • py-tree-sitter works natively with no WASM overhead
  • KuzuDB has excellent Python bindings
  • LanceDB, sentence-transformers, and FastMCP are all Python-first
  • FastAPI provides high-performance async web serving
  • Consistent with the broader tooling ecosystem

Alternatives considered: Node.js (better for browser-based tools, but introduces WASM overhead for Tree-sitter and lacks native KuzuDB bindings).


ADR-002: NetworkX First, Then KuzuDB

Decision: Start with NetworkX for graph storage, then migrate to KuzuDB once the schema stabilizes.

Rationale:

  • Zero extra dependency for the initial MVP
  • Faster iteration while exploring entity/relationship schemas
  • KuzuDB migration path was clear (same Cypher query language)

Outcome: Both backends are maintained. KuzuDB is the default (Cypher support, persistence, domain-scoped tables). NetworkX remains as a fallback for environments where KuzuDB is unavailable.
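
The fallback described in the outcome could be sketched as follows. This is a minimal illustration, not the project's actual selection logic: the function name select_graph_backend and the injectable is_available probe are assumptions made for the example.

```python
import importlib.util


def select_graph_backend(is_available=None):
    """Return "kuzu" when the kuzu package can be found, else "networkx".

    `is_available` is a hypothetical hook (module name -> bool) so the
    choice can be exercised without either package installed.
    """
    if is_available is None:
        # Default probe: look the module up without importing it.
        def is_available(name):
            return importlib.util.find_spec(name) is not None

    return "kuzu" if is_available("kuzu") else "networkx"
```

Probing with find_spec avoids paying the import cost of the backend that is not selected.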


ADR-003: Option B – Separate Entity Tables Per Domain

Decision: Create separate Entity_<domain> and Rel_<domain>_* tables for each domain, rather than a single Entity table with a domain column.

Rationale:

  • Schema safety – Domain-specific properties (e.g., complexity for code, latitude for archaeology) live in their own table columns without NULL pollution across domains
  • Query performance – Cypher queries on Entity_code scan only code entities, not the entire entity set
  • Independent evolution – Adding properties to one domain never affects another domain’s table schema
  • Clear separation – Each domain’s graph is structurally isolated, reducing accidental cross-domain queries

Trade-offs:

  • Cross-domain queries require explicit JOINs or UNION across tables
  • More tables in KuzuDB (one per domain + one per relationship type per domain)
  • Schema migration requires per-domain handling

Alternatives considered:

  • Option A (single Entity table): Simpler to implement but leads to wide, sparse tables as domains accumulate domain-specific columns
  • Option C (separate databases per domain): Too isolated – prevents cross-domain queries entirely
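
To make the Option B layout concrete, the sketch below generates KuzuDB DDL strings for per-domain tables. The id and name columns, the column types, and the CALLS relationship name are assumptions chosen for illustration; only the Entity_<domain> and Rel_<domain>_* naming comes from the decision above.

```python
def node_table_ddl(domain, properties):
    """Build a CREATE NODE TABLE statement for one domain's entity table."""
    # Assumed common columns; the real schema may differ.
    columns = ["id STRING", "name STRING"]
    # Domain-specific columns live only in this domain's table.
    columns += [f"{name} {kuzu_type}" for name, kuzu_type in properties.items()]
    return (
        f"CREATE NODE TABLE Entity_{domain}"
        f"({', '.join(columns)}, PRIMARY KEY (id))"
    )


def rel_table_ddl(domain, rel_type):
    """Build a CREATE REL TABLE statement scoped to a single domain."""
    table = f"Entity_{domain}"
    return f"CREATE REL TABLE Rel_{domain}_{rel_type}(FROM {table} TO {table})"


code_ddl = node_table_ddl("code", {"complexity": "INT64"})
arch_ddl = node_table_ddl("archaeology", {"latitude": "DOUBLE"})
```

Here code_ddl carries a complexity column and no latitude column, which is the NULL-free separation the rationale describes.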

ADR-004: (project, domain) as Routing Key

Decision: All operations are routed by the tuple (project, domain).

Rationale:

  • A single project can contain entities from multiple domains (e.g., code + documentation)
  • Domain-specific extractors, search profiles, and health checks are selected by this key
  • Shared mode uses project column filtering; domain selects the correct table set
  • The web UI exposes this as a project dropdown with a domain selector
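
A rough sketch of routing by (project, domain) is shown below. The RoutingKey type, the EXTRACTORS registry, and the extractor names are hypothetical placeholders; the sketch only illustrates that the domain selects the table set and extractor while the project acts as a filter.

```python
from typing import NamedTuple


class RoutingKey(NamedTuple):
    project: str
    domain: str


# Hypothetical registry; the extractor names are placeholders.
EXTRACTORS = {"code": "tree_sitter_extractor", "docs": "markdown_extractor"}


def route(key: RoutingKey) -> dict:
    """Resolve the components selected by a (project, domain) key."""
    if key.domain not in EXTRACTORS:
        raise KeyError(f"unknown domain: {key.domain}")
    return {
        "extractor": EXTRACTORS[key.domain],       # chosen by domain
        "entity_table": f"Entity_{key.domain}",    # chosen by domain
        "project_filter": key.project,             # used in shared mode
    }
```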

ADR-005: Schema Versioning from Day One

Decision: Every domain schema includes a version field. The indexer records schema_version in project metadata at index time.

Rationale:

  • Enables detection of stale indexes (schema changed but project not re-indexed)
  • Future: automatic re-indexing when schema version changes
  • Future: schema migration tooling for breaking changes
  • Low cost to implement, high value for maintainability

Current behavior: The indexer compares stored schema_version against the current schema. If they differ, the project is flagged as stale in ie_status output.
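
The staleness check amounts to a version comparison. The sketch below assumes project metadata is a dictionary holding a schema_version key and treats a missing version as stale; both are assumptions for illustration, not confirmed details of the indexer.

```python
def is_stale(project_metadata: dict, current_schema_version: str) -> bool:
    """Return True when the index was built against a different schema version."""
    stored = project_metadata.get("schema_version")
    # A missing version is treated as stale (assumption for this sketch).
    return stored != current_schema_version
```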


ADR-006: Hybrid Search with Reciprocal Rank Fusion

Decision: Combine BM25, semantic, and graph search using Reciprocal Rank Fusion with fixed weights (0.35, 0.40, and 0.25, respectively).

Rationale:

  • BM25 excels at exact identifier matching (function names, class names)
  • Semantic search handles natural language queries and conceptual similarity
  • Graph search adds structural context (callers, imports, related entities)
  • RRF is simple, robust, and doesn’t require training data
  • Fixed weights work well in practice; they could be made configurable per domain via search profiles

Formula: score = sum(w_i / (k + rank_i)) across strategies, where w_i is the weight of strategy i, rank_i is the result's rank in that strategy's list, and k = 60.
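
One way to implement the fusion is sketched below, assuming each strategy's weight multiplies its reciprocal-rank term and that ranks start at 1. The fuse function and the strategy keys are illustrative names, not the engine's API.

```python
from collections import defaultdict

K = 60
WEIGHTS = {"bm25": 0.35, "semantic": 0.40, "graph": 0.25}


def fuse(rankings, weights=WEIGHTS, k=K):
    """Fuse per-strategy ranked ID lists into one list sorted by score.

    `rankings` maps a strategy name to entity IDs ordered best-first.
    """
    scores = defaultdict(float)
    for strategy, ranked_ids in rankings.items():
        for rank, entity_id in enumerate(ranked_ids, start=1):
            # Weighted reciprocal-rank contribution of this strategy.
            scores[entity_id] += weights[strategy] / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


fused = fuse({"bm25": ["a", "b"], "semantic": ["b", "a"], "graph": ["b"]})
```

In this example "b" ranks first because it appears in all three lists, even though BM25 alone prefers "a".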


ADR-007: Shared Database Mode as Optional

Decision: Support both per-project (isolated) and shared (multi-tenant) storage modes, with a migration tool to switch between them.

Rationale:

  • Per-project mode is simpler and works for most use cases
  • Shared mode enables cross-project Cypher queries and global analysis
  • Making it optional avoids forcing the complexity of multi-tenant filtering on users who don’t need it
  • The migration tool handles bidirectional conversion with verification

Shared mode implementation:

  • KuzuDB: single database with project column on all entity/relationship tables
  • LanceDB: single vector store with project column
  • BM25: remains per-project (in-memory, rebuilt on load)
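
In shared mode, every query needs a project filter. The sketch below builds a parameterized Cypher query under that assumption; scoped_query and the e.name return column are hypothetical, and passing project=None stands in for per-project mode, where no filter is needed.

```python
def scoped_query(domain, project=None):
    """Return (cypher, params) for listing entity names in one domain."""
    if not domain.isidentifier():
        # The domain is interpolated into a table name, so restrict it.
        raise ValueError("domain must be a plain identifier")
    cypher = f"MATCH (e:Entity_{domain})"
    params = {}
    if project is not None:
        # Shared mode: filter on the project column via a query parameter.
        cypher += " WHERE e.project = $project"
        params["project"] = project
    return cypher + " RETURN e.name", params
```

Passing the project as a parameter rather than interpolating it keeps user-supplied values out of the query text.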

ADR-008: Archaeology as Validation Domain

Decision: Use archaeology as the first non-code domain to validate the domain generalization architecture.

Rationale:

  • Maximally different from code – no AST, no Tree-sitter, completely different entity and relationship semantics
  • Requires custom extractor (YAML/JSON input), proving the extractor abstraction works
  • Has rich relationship types (spatial, temporal, material) that exercise the schema system
  • Real-world domain with genuine analytical value, not a toy example
  • If archaeology works, any structured domain can be added

Outcome: The following were successfully validated:

  • YAML schema definition drives table creation
  • Custom extractor module loading works
  • Domain-specific properties, relationships, and search profiles function correctly
  • Web UI handles non-code entities without code-specific assumptions
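
Custom extractor loading can be done with the standard library alone. The sketch below assumes an extractor is a Python file exposing a function named extract; that convention and the load_extractor helper are assumptions for illustration.

```python
import importlib.util
from pathlib import Path


def load_extractor(module_path, function_name="extract"):
    """Load a Python file by path and return its extractor function."""
    path = Path(module_path)
    spec = importlib.util.spec_from_file_location(path.stem, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load extractor from {module_path}")
    module = importlib.util.module_from_spec(spec)
    # Execute the module body so its functions are defined.
    spec.loader.exec_module(module)
    return getattr(module, function_name)
```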

ADR-009: Read-Only Serving with Input Sanitization

Decision: The web server and MCP server open KuzuDB in read_only=True mode. All Cypher input is validated to block write operations.

Rationale:

  • Multiple readers can coexist without lock contention
  • Prevents accidental or malicious data modification through the API
  • Write operations (indexing, migration) require stopping the server first – an acceptable trade-off for personal use
  • Cypher validation strips comments and strings before checking for write keywords (DELETE, CREATE, SET, DROP, MERGE, ALTER, COPY, REMOVE, DETACH)
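
The validation step could look roughly like the sketch below: remove comments and string literals, then search what remains for write keywords. The regular expressions and the is_read_only name are assumptions, not the server's actual validator. The check is deliberately conservative, so an identifier spelled exactly like a keyword would also be rejected.

```python
import re

WRITE_KEYWORDS = (
    "DELETE", "CREATE", "SET", "DROP", "MERGE",
    "ALTER", "COPY", "REMOVE", "DETACH",
)

# Comments and string literals, which may legitimately contain keywords.
_NOISE = re.compile(
    r"/\*.*?\*/"             # block comments
    r"|//[^\n]*"             # line comments
    r"|'(?:\\.|[^'\\])*'"    # single-quoted strings
    r'|"(?:\\.|[^"\\])*"',   # double-quoted strings
    re.DOTALL,
)
_WRITE = re.compile(r"\b(?:" + "|".join(WRITE_KEYWORDS) + r")\b", re.IGNORECASE)


def is_read_only(cypher):
    """Return True when no write keyword remains after stripping noise."""
    stripped = _NOISE.sub(" ", cypher)
    return _WRITE.search(stripped) is None
```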

ADR-010: AI Data Preservation Across Re-Indexing

Decision: Extract AI-generated data (summaries, Q&A history) before re-indexing and restore it afterward.

Rationale:

  • LLM summaries are expensive to regenerate (API cost + time)
  • Q&A history represents accumulated knowledge that shouldn’t be lost
  • Re-indexing destroys and recreates graph entities, which would lose all AI annotations
  • The extraction/restoration approach is simple and reliable – no schema changes needed

Implementation: ai_overlay.py extracts AI data keyed by entity identity before rebuild, then matches and restores after the new graph is populated.
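
The extract-then-restore flow can be sketched with plain dictionaries. This is not the contents of ai_overlay.py: the field names summary and qa_history, the (path, name) identity key, and both function names are assumptions made for the example.

```python
AI_FIELDS = ("summary", "qa_history")  # assumed names for AI-generated fields


def entity_identity(entity):
    """Hypothetical stable identity that survives a rebuild."""
    return (entity["path"], entity["name"])


def extract_overlay(entities):
    """Collect AI-generated fields keyed by entity identity."""
    overlay = {}
    for entity in entities:
        ai_data = {f: entity[f] for f in AI_FIELDS if f in entity}
        if ai_data:
            overlay[entity_identity(entity)] = ai_data
    return overlay


def restore_overlay(entities, overlay):
    """Copy saved AI fields onto matching rebuilt entities; return the match count."""
    restored = 0
    for entity in entities:
        ai_data = overlay.get(entity_identity(entity))
        if ai_data:
            entity.update(ai_data)
            restored += 1
    return restored
```

Entities that no longer exist after the rebuild simply find no match, so their overlay entries are left unused.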

