Lab 03: Simple Verification

Verify AI outputs before accepting them as complete.

Objectives

By the end of this lab, you will:

  • Understand why verification matters
  • Implement basic verification patterns
  • Handle verification failures with retry
  • Know when to accept “good enough”

Prerequisites

  • Lab 02 completed (external state)
  • Understanding of the loop pattern

The Core Problem

AI models can be confidently wrong. A loop without verification accepts whatever comes back:

# Without verification - dangerous!
result = ai.complete(task)
mark_complete(task, result)  # Did it actually succeed?

Consider these failure modes:

  • Task: “Write a function that returns the sum of two numbers”
  • Result: A function that returns the product (wrong but plausible)
  • Task: “List 5 US state capitals”
  • Result: Lists 4 capitals, or includes a city that isn’t a capital

Verification catches these errors before they compound.

The Core Concept

result = ai.complete(task)

if verify(result, task.criteria):
    mark_complete(task)
else:
    mark_for_retry(task)

Verification creates a quality gate between “AI produced output” and “task is done.”


Step 1: Define Verification Criteria

Extend the task schema to include verification criteria:

{
  "tasks": [
    {
      "id": 1,
      "description": "Write a haiku about loops",
      "status": "pending",
      "result": null,
      "criteria": {
        "type": "haiku",
        "requirements": ["exactly 3 lines", "5-7-5 syllable pattern"]
      },
      "attempts": 0,
      "max_attempts": 3
    }
  ]
}

Step 2: Create the Verifier

Create verifier.py:

"""
Verifier - Lab 03

Verifies AI outputs against task criteria.
Uses Claude to evaluate results for quality and correctness.
"""

import anthropic


client = anthropic.Anthropic()


def verify_result(task: dict, result: str) -> dict:
    """
    Verify a result against task criteria.

    Returns:
        {
            "passed": bool,
            "confidence": float (0-1),
            "feedback": str
        }
    """
    criteria = task.get("criteria", {})

    # Build verification prompt
    prompt = f"""You are a quality verification assistant. Your job is to determine if a result meets the specified criteria.

TASK: {task['description']}

CRITERIA:
{format_criteria(criteria)}

RESULT TO VERIFY:
{result}

Evaluate whether the result meets ALL criteria. Be strict but fair.

Respond in this exact format:
PASSED: yes or no
CONFIDENCE: a number from 0.0 to 1.0
FEEDBACK: brief explanation of your evaluation
"""

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=300,
        messages=[{"role": "user", "content": prompt}]
    )

    return parse_verification_response(response.content[0].text)


def format_criteria(criteria: dict) -> str:
    """Format criteria dictionary for the prompt."""
    if not criteria:
        return "- Result should reasonably complete the task"

    lines = []
    if "type" in criteria:
        lines.append(f"- Type: {criteria['type']}")
    if "requirements" in criteria:
        for req in criteria["requirements"]:
            lines.append(f"- {req}")

    return "\n".join(lines) if lines else "- Result should reasonably complete the task"


def parse_verification_response(text: str) -> dict:
    """Parse Claude's verification response."""
    lines = text.strip().split("\n")

    passed = False
    confidence = 0.5
    feedback = ""

    for line in lines:
        line = line.strip()
        if line.upper().startswith("PASSED:"):
            value = line.split(":", 1)[1].strip().lower()
            passed = value in ["yes", "true", "1"]
        elif line.upper().startswith("CONFIDENCE:"):
            try:
                confidence = float(line.split(":", 1)[1].strip())
                confidence = max(0.0, min(1.0, confidence))  # Clamp to 0-1
            except ValueError:
                confidence = 0.5
        elif line.upper().startswith("FEEDBACK:"):
            feedback = line.split(":", 1)[1].strip()

    return {
        "passed": passed,
        "confidence": confidence,
        "feedback": feedback
    }


def quick_verify(result: str, expected_type: str) -> bool:
    """
    Quick heuristic verification without AI.
    Use for simple checks before calling Claude.
    """
    if expected_type == "haiku":
        lines = [l for l in result.strip().split("\n") if l.strip()]
        return len(lines) == 3

    if expected_type == "list":
        # Check if result contains bullet points or numbers
        return any(c in result for c in ["-", "•", "1.", "1)"])

    if expected_type == "code":
        # Check for code indicators
        return any(kw in result for kw in ["def ", "function ", "class ", "const ", "let ", "var "])

    # Default: assume valid
    return True
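
To sanity-check the verifier on its own, here is a minimal usage sketch (it assumes ANTHROPIC_API_KEY is set in your environment; the task dict mirrors the Step 1 schema):

from verifier import verify_result, quick_verify

task = {
    "description": "Write a haiku about loops",
    "criteria": {"type": "haiku", "requirements": ["exactly 3 lines"]}
}
result = "Loops spin round and round\nState carried between each turn\nWork completes at last"

print(quick_verify(result, "haiku"))  # heuristic only: True if exactly 3 non-empty lines
print(verify_result(task, result))    # e.g. {"passed": True, "confidence": 0.9, "feedback": "..."}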

Step 3: Update State Manager

Add retry tracking to state_manager.py:

def add_task(self, description: str, criteria: dict = None) -> dict:
    """Add a new task with optional verification criteria."""
    task = {
        "id": len(self.tasks) + 1,
        "description": description,
        "status": "pending",
        "result": None,
        "criteria": criteria or {},
        "attempts": 0,
        "max_attempts": 3,
        "verification": None  # Will hold verification result
    }
    self.tasks.append(task)
    self.save()
    return task

def increment_attempts(self, task_id: int):
    """Increment attempt counter for a task."""
    for task in self.tasks:
        if task["id"] == task_id:
            task["attempts"] = task.get("attempts", 0) + 1
            self.save()
            return

def mark_failed(self, task_id: int, reason: str):
    """Mark a task as permanently failed."""
    for task in self.tasks:
        if task["id"] == task_id:
            task["status"] = "failed"
            task["failure_reason"] = reason
            self.save()
            return

def mark_for_retry(self, task_id: int, feedback: str):
    """Reset task to pending for retry."""
    for task in self.tasks:
        if task["id"] == task_id:
            task["status"] = "pending"
            task["last_feedback"] = feedback
            self.save()
            return

Step 4: Update the Main Loop

Create loop_with_verification.py:

"""
Loop with Verification - Lab 03

Demonstrates verified task completion with retry logic.
"""

import anthropic
from state_manager import StateManager
from verifier import verify_result, quick_verify


client = anthropic.Anthropic()


def complete_task(task: dict) -> str:
    """Send task to Claude, including any feedback from previous attempts."""
    prompt = task["description"]

    # Include feedback from failed verification attempts
    if task.get("last_feedback"):
        prompt += f"\n\nPrevious attempt feedback: {task['last_feedback']}"
        prompt += "\nPlease address this feedback in your response."

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text


def process_task(state: StateManager, task: dict) -> bool:
    """
    Process a single task with verification.
    Returns True if task completed successfully.
    """
    task_id = task["id"]
    attempt = task.get("attempts", 0) + 1
    max_attempts = task.get("max_attempts", 3)

    print(f"\n[Task {task_id}] Attempt {attempt}/{max_attempts}")
    print(f"  Description: {task['description'][:50]}...")

    # Increment attempt counter
    state.increment_attempts(task_id)
    state.mark_in_progress(task_id)

    # Generate result
    result = complete_task(task)
    print(f"  Generated result ({len(result)} chars)")

    # Quick heuristic check first (cheap)
    expected_type = task.get("criteria", {}).get("type")
    if expected_type and not quick_verify(result, expected_type):
        print(f"  ✗ Failed quick verification (not a valid {expected_type})")
        handle_failure(state, task, f"Result doesn't appear to be a valid {expected_type}")
        return False

    # Full AI verification (more expensive)
    print(f"  Verifying with AI...")
    verification = verify_result(task, result)

    if verification["passed"]:
        print(f"  ✓ Verified! (confidence: {verification['confidence']:.0%})")
        state.mark_complete(task_id, result)
        return True
    else:
        print(f"  ✗ Verification failed: {verification['feedback']}")
        handle_failure(state, task, verification["feedback"])
        return False


def handle_failure(state: StateManager, task: dict, feedback: str):
    """Handle a failed verification - retry or mark failed."""
    task_id = task["id"]
    attempts = task.get("attempts", 0)
    max_attempts = task.get("max_attempts", 3)

    if attempts >= max_attempts:
        print("  Max attempts reached. Marking as failed.")
        state.mark_failed(task_id, f"Failed after {attempts} attempts: {feedback}")
    else:
        print("  Marking for retry...")
        state.mark_for_retry(task_id, feedback)


def main():
    state = StateManager("state.json")

    # Load sample tasks if empty
    if len(state.tasks) == 0:
        print("Loading sample tasks with verification criteria...")

        state.add_task(
            "Write a haiku about programming",
            criteria={
                "type": "haiku",
                "requirements": [
                    "exactly 3 lines",
                    "follows 5-7-5 syllable pattern approximately",
                    "relates to programming or coding"
                ]
            }
        )

        state.add_task(
            "List exactly 5 benefits of version control",
            criteria={
                "type": "list",
                "requirements": [
                    "exactly 5 items",
                    "each item describes a benefit",
                    "items are about version control (git, etc.)"
                ]
            }
        )

        state.add_task(
            "Write a Python function that checks if a number is prime",
            criteria={
                "type": "code",
                "requirements": [
                    "valid Python syntax",
                    "function named is_prime or similar",
                    "returns True for prime, False otherwise",
                    "handles edge cases (0, 1, negative)"
                ]
            }
        )

    # Show stats
    stats = state.get_stats()
    print(f"\nTasks: {stats['completed']} completed, "
          f"{stats['pending']} pending, "
          f"{stats.get('failed', 0)} failed")

    # Process tasks
    while state.has_pending():
        task = state.get_next()
        process_task(state, task)

    # Final report
    print("\n" + "=" * 50)
    print("FINAL REPORT")
    print("=" * 50)

    for task in state.tasks:
        status_icon = {
            "completed": "✓",
            "failed": "✗",
            "pending": "○"
        }.get(task["status"], "?")

        print(f"\n{status_icon} Task {task['id']}: {task['description'][:40]}...")
        print(f"  Status: {task['status']}")
        print(f"  Attempts: {task.get('attempts', 0)}")

        if task["status"] == "failed":
            print(f"  Reason: {task.get('failure_reason', 'Unknown')}")


if __name__ == "__main__":
    main()

Step 5: Run and Observe

python loop_with_verification.py

Example output:

Loading sample tasks with verification criteria...

Tasks: 0 completed, 3 pending, 0 failed

[Task 1] Attempt 1/3
  Description: Write a haiku about programming...
  Generated result (47 chars)
  Verifying with AI...
  ✓ Verified! (confidence: 95%)

[Task 2] Attempt 1/3
  Description: List exactly 5 benefits of version control...
  Generated result (312 chars)
  Verifying with AI...
  ✗ Verification failed: Listed 6 benefits instead of 5
  Marking for retry...

[Task 2] Attempt 2/3
  Description: List exactly 5 benefits of version control...
  Generated result (245 chars)
  Verifying with AI...
  ✓ Verified! (confidence: 92%)

[Task 3] Attempt 1/3
  Description: Write a Python function that checks if a n...
  Generated result (421 chars)
  ✗ Failed quick verification (expected code)
  Marking for retry...

[Task 3] Attempt 2/3
  Description: Write a Python function that checks if a n...
  Generated result (523 chars)
  Verifying with AI...
  ✓ Verified! (confidence: 98%)

==================================================
FINAL REPORT
==================================================

✓ Task 1: Write a haiku about programming...
  Status: completed
  Attempts: 1

✓ Task 2: List exactly 5 benefits of version contro...
  Status: completed
  Attempts: 2

✓ Task 3: Write a Python function that checks if a ...
  Status: completed
  Attempts: 2

Understanding the Code

The Verification Pipeline

Generate Result → Quick Check → AI Verification → Accept/Retry
       ↓              ↓               ↓              ↓
   (Claude)      (Heuristic)     (Claude)      (Decision)

  1. Quick Check: Fast, cheap heuristics (line count, format checks)
  2. AI Verification: Detailed evaluation against criteria
  3. Decision: Accept, retry, or fail based on results

Two-Tier Verification

# Tier 1: Quick (free, fast)
if not quick_verify(result, expected_type):
    handle_failure(...)
    return

# Tier 2: AI (costs tokens, more accurate)
verification = verify_result(task, result)

This saves API costs by catching obvious failures early.

Feedback Loop

When verification fails, we pass feedback to the next attempt:

if task.get("last_feedback"):
    prompt += f"\nPrevious attempt feedback: {task['last_feedback']}"

This helps Claude learn from its mistakes within the same task.
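
For example, a retried haiku task might send a prompt like this (illustrative):

Write a haiku about programming

Previous attempt feedback: Only 2 lines; a haiku needs exactly 3.
Please address this feedback in your response.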


When to Verify

Scenario                         Verification Level
Low stakes (fun facts)           Quick check only
Medium stakes (content)          AI verification
High stakes (code, data)         AI + tests/execution
Critical (financial, medical)    Human review
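
One way to encode this table in the loop is a small dispatcher (a sketch; "stakes" is a hypothetical task field, not part of the Lab 03 schema):

def choose_verification(stakes: str) -> str:
    """Map task stakes to a verification level (sketch)."""
    return {
        "low": "quick",        # heuristics only
        "medium": "ai",        # quick check + AI verification
        "high": "ai+tests",    # AI verification plus executing tests
        "critical": "human",   # queue for human review
    }.get(stakes, "ai")        # default: AI verification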

Exercises

Exercise 1: Add Test Execution

For code tasks, actually run the generated code:

def verify_code(code: str, test_cases: list) -> bool:
    """Execute code and verify against test cases."""
    # Hint: Use exec() carefully with restricted globals
    pass
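
One possible solution sketch. The (function_name, args, expected) test-case format is an assumption (the exercise leaves it open), and restricting __builtins__ limits accidents but exec() is not a real sandbox; use a subprocess or container for untrusted code:

def verify_code(code: str, test_cases: list) -> bool:
    """Execute generated code and check (function_name, args, expected) cases."""
    safe_globals = {"__builtins__": {"range": range, "int": int, "len": len, "abs": abs}}
    try:
        exec(code, safe_globals)  # define the function(s)
    except Exception:
        return False  # syntax error or crash at definition time

    for func_name, args, expected in test_cases:
        func = safe_globals.get(func_name)
        if not callable(func):
            return False  # expected function was never defined
        try:
            if func(*args) != expected:
                return False
        except Exception:
            return False  # runtime error counts as failure
    return True

# Usage: verify_code(result, [("is_prime", (7,), True), ("is_prime", (1,), False)])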

Exercise 2: Confidence Thresholds

Modify the loop to accept results below 90% confidence only after 2+ attempts:

if verification["confidence"] >= 0.9:
    accept()
elif verification["confidence"] >= 0.7 and attempts >= 2:
    accept()  # Good enough after retries
else:
    retry()
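
Inside process_task, this could replace the simple passed/failed branch (a sketch using the lab's existing helpers):

attempts = task.get("attempts", 0)
confidence = verification["confidence"]

if verification["passed"] and confidence >= 0.9:
    state.mark_complete(task_id, result)  # confident: accept immediately
elif verification["passed"] and confidence >= 0.7 and attempts >= 2:
    state.mark_complete(task_id, result)  # "good enough" after retries
else:
    handle_failure(state, task, verification["feedback"])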

Exercise 3: Verification Cost Tracking

Track how many verification API calls are made and estimate cost:

verification_calls = 0
estimated_cost = verification_calls * 0.003  # ~$0.003 per call
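
A minimal approach is a module-level counter in verifier.py (a sketch; the $0.003 figure is the rough per-call estimate above, not a quoted price):

# In verifier.py:
VERIFICATION_CALLS = 0

def verify_result(task: dict, result: str) -> dict:
    global VERIFICATION_CALLS
    VERIFICATION_CALLS += 1
    ...  # existing implementation

# At the end of main() in loop_with_verification.py:
import verifier
print(f"Verification calls: {verifier.VERIFICATION_CALLS}")
print(f"Estimated cost: ${verifier.VERIFICATION_CALLS * 0.003:.2f}")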

Checkpoint

Before moving on, verify:

  • Your loop retries failed verifications
  • Feedback from failures is passed to retry attempts
  • Tasks are marked failed after max attempts
  • You understand the two-tier verification approach

Key Takeaway

Verification turns “probably correct” into “verified correct.”

Without verification, you’re trusting that every AI response is perfect. With verification:

  • Errors are caught before they compound
  • Retries improve quality
  • Failed tasks are clearly identified
  • You know what actually succeeded

Get the Code

Full implementation: 8me/src/tier1-ralph-loop/

The Tier 1 implementation includes verification via tool calling (Lab 05) and circuit breakers (Lab 06).



This site uses Just the Docs, a documentation theme for Jekyll.