# SmallLLMExecutor (v0.3.27)

Orchestrate 3B-8B quantized language models via MCP self-description layers for cost-optimized, edge-capable task execution.
## Overview
The SmallLLMExecutor is SPINE’s 6th executor type, designed for tasks where a full-size flagship model is overkill. It wraps small, fast models (CodeLlama 7B, Qwen2.5-Coder 3B, Phi-3.5, DeepSeek-Coder) and provides them with structured MCP context to compensate for their limited capabilities.
```mermaid
graph TD
    subgraph SLE["SmallLLMExecutor"]
        subgraph MCP["MCP Self-Description Layers"]
            L0["L0: Instructions — server identity, tool guide"]
            L1["L1: Schema — tool parameter reference"]
            L2["L2: Resources — fetched from MCP servers"]
            L3["L3: Prompts — workflow steps from MCP servers"]
        end
        LLM["Small LLM 3B-8B params<br/>Ollama local or Anthropic Haiku API"]
        EXEC["TOOL_CALL: format — MCP Execution<br/>via MCPSessionPool persistent"]
    end
    MCP --> LLM --> EXEC
    style SLE fill:#0f172a,stroke:#334155,color:#e2e8f0
    style MCP fill:#1e293b,stroke:#7c3aed,color:#e2e8f0
    style L0 fill:#7c3aed,stroke:#6d28d9,color:#fff
    style L1 fill:#7c3aed,stroke:#6d28d9,color:#fff
    style L2 fill:#7c3aed,stroke:#6d28d9,color:#fff
    style L3 fill:#7c3aed,stroke:#6d28d9,color:#fff
    style LLM fill:#2563eb,stroke:#1d4ed8,color:#fff
    style EXEC fill:#0d9488,stroke:#0f766e,color:#fff
```
## MCP Self-Description Layers
The key insight: small models achieve significantly better tool-usage accuracy when given structured context rather than raw tool schemas alone. SPINE provides this context in four layers:
| Layer | Content | Token Budget |
|---|---|---|
| L0 | Server identity, tool selection guide, workflow | ~1024 |
| L1 | Tool parameter reference (compact schema) | ~512 |
| L2 | Resources fetched from MCP servers | ~1024 |
| L3 | Workflow prompts from MCP servers | ~1024 |
Total context budget: ~4096 tokens (configurable per layer).
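The per-layer budgeting can be sketched as follows. This is a minimal illustration, not SPINE's actual implementation: `LAYER_BUDGETS`, `truncate_to_budget`, and `build_context` are hypothetical names, and token counting is approximated at ~4 characters per token rather than using the model's tokenizer.

```python
# Illustrative sketch: assemble the four MCP context layers under
# per-layer token budgets, truncating each layer's content to fit.
LAYER_BUDGETS = {
    "l0_instructions": 1024,
    "l1_schema": 512,
    "l2_resources": 1024,
    "l3_prompts": 1024,
}

CHARS_PER_TOKEN = 4  # rough heuristic; a real implementation would tokenize

def truncate_to_budget(text: str, token_budget: int) -> str:
    """Clip text to an approximate token budget."""
    max_chars = token_budget * CHARS_PER_TOKEN
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + "\n[...truncated to fit layer budget]"

def build_context(layers: dict) -> str:
    """Concatenate L0-L3 layers, each clipped to its own budget."""
    parts = []
    for name, budget in LAYER_BUDGETS.items():
        content = layers.get(name, "")
        if content:
            parts.append(truncate_to_budget(content, budget))
    return "\n\n".join(parts)
```

Clipping each layer independently keeps a single oversized layer (e.g. a large fetched resource in L2) from crowding out the instructions and schema the model needs most.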
## Configuration
```python
from spine.orchestrator.executors.small_llm_executor import SmallLLMExecutor, SmallLLMConfig

config = SmallLLMConfig(
    model_name="qwen2.5-coder:3b",
    provider="ollama",  # "ollama" | "anthropic"
    base_url="http://localhost:11434",
    max_context_tokens=4096,
    mcp_servers=["research-agent-mcp", "evaluation-mcp"],
    temperature=0.1,
)

executor = SmallLLMExecutor(config)
result = executor.execute(task, project_path)
```
## Supported Providers
| Provider | Models | Caching |
|---|---|---|
| Ollama (local) | CodeLlama 7B, Qwen2.5-Coder 3B, Phi-3.5, DeepSeek-Coder | KV cache |
| Anthropic (API) | Haiku 4.5 | Prompt caching |
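As an illustration of the local-provider path, a minimal Ollama-backed generation call might look like the sketch below. The request body follows Ollama's public `/api/generate` endpoint; the function names are hypothetical (not SPINE internals), and the Anthropic path (Messages API with prompt caching) is omitted.

```python
# Illustrative sketch of a non-streaming completion against a local
# Ollama server; not SPINE's actual provider implementation.
import json
import urllib.request

def build_ollama_payload(model: str, prompt: str, temperature: float) -> dict:
    """Request body for a single non-streaming Ollama completion."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,                         # one complete response
        "options": {"temperature": temperature},
        "keep_alive": "5m",                      # keep model (and KV cache) loaded between calls
    }

def generate_ollama(base_url: str, model: str, prompt: str,
                    temperature: float = 0.1) -> str:
    """POST to a local Ollama server and return the generated text."""
    data = json.dumps(build_ollama_payload(model, prompt, temperature)).encode()
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```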
## CLI Usage
```bash
# Run with SmallLLMExecutor
python -m spine.orchestrator run --project /path --executor small-llm

# Combined with Dynamic Routing
python -m spine.orchestrator run --project /path \
    --executor router \
    --route ANALYSIS:small-llm \
    --route CODE:subagent
```
## Scenario Template
```yaml
# _templates/scenarios/small-llm-mcp-task.yaml
global:
  operator: "SPINE SmallLLMExecutor"

command:
  task: "${TASK_DESCRIPTION}"
  context:
    background: "${MCP_INSTRUCTIONS}"   # L0
    references:
      - "${TOOL_SCHEMA}"                # L1
      - "${CONTEXT_RESOURCES}"          # L2
  constraints:
    format: "${WORKFLOW_PROMPT}"        # L3

token_budget:
  l0_instructions: 1024
  l1_schema: 512
  l2_resources: 1024
  l3_prompts: 1024
```
## MCPSessionPool Integration (v0.3.28)
SmallLLMExecutor uses MCPSessionPool for persistent MCP connections instead of spawning a new subprocess per tool call:
- Before (v0.3.27): ~110-220ms overhead per MCP tool call (subprocess spawn)
- After (v0.3.28): Near-zero overhead (persistent connection via background event loop)
See MCP Session Pool for details.
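The reuse-instead-of-respawn idea behind this change can be sketched as below. This is an illustrative pool, not MCPSessionPool's actual API; `connect` stands in for whatever spawns and handshakes an MCP server session.

```python
# Illustrative sketch: cache one live session per MCP server so the
# subprocess-spawn cost is paid once, not on every tool call.
import asyncio

class SessionPool:
    def __init__(self):
        self._sessions = {}        # server name -> live session
        self._lock = asyncio.Lock()

    async def get(self, server: str, connect):
        """Return a cached session, connecting once on first use."""
        async with self._lock:
            if server not in self._sessions:
                # One-time cost (the ~110-220 ms spawn) paid here,
                # instead of per tool call.
                self._sessions[server] = await connect(server)
            return self._sessions[server]
```

Running the pool on a background event loop, as the release note describes, lets synchronous executor code reuse these sessions without blocking on connection setup.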
## Key Design Decisions
- **Implements Executor interface**: transparent to the AgenticLoop and composes with Dynamic Routing
- **Token budget management**: each L0-L3 layer has a configurable budget; content is truncated to fit
- **TOOL_CALL: format**: a simple text format (`TOOL_CALL: tool_name(param=value)`) instead of JSON `tool_use`, optimized for parsing small-model output
- **Graceful degradation**: falls back to direct execution if MCP context is unavailable
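A parser for the `TOOL_CALL:` text format above could look like the following sketch. The exact grammar SPINE accepts is not documented here; this assumes comma-separated `key=value` pairs with optional quoting.

```python
# Illustrative parser for lines like: TOOL_CALL: search(query=foo, limit=5)
import re

TOOL_CALL_RE = re.compile(r"^TOOL_CALL:\s*(\w+)\((.*)\)\s*$")
PARAM_RE = re.compile(r"(\w+)=(\"[^\"]*\"|'[^']*'|[^,]+)")

def parse_tool_call(line: str):
    """Return (tool_name, params) or None if the line is not a tool call."""
    m = TOOL_CALL_RE.match(line.strip())
    if not m:
        return None
    name, arg_str = m.groups()
    params = {}
    for key, raw in PARAM_RE.findall(arg_str):
        params[key] = raw.strip().strip("\"'")  # drop surrounding quotes, if any
    return name, params
```

A plain-text format like this is much easier for a 3B-8B model to emit reliably than nested JSON, and a failed regex match gives a clean signal for retry or fallback.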