Lab 07: Confidence Retry
Use AI confidence scores for smart retry decisions that balance quality and cost.
Objectives
By the end of this lab, you will:
- Understand confidence-based decision making
- Implement adaptive retry thresholds
- Balance quality vs cost tradeoffs
- Track and optimize retry behavior
Prerequisites
- Lab 06 completed (circuit breakers)
- Understanding of tool calling (Lab 05)
The Problem: Fixed Thresholds
Simple retry logic uses fixed thresholds:
# Fixed threshold - too rigid
if confidence >= 0.9:
accept()
else:
retry() # Even at 0.89? Even after 5 attempts?
Problems with this approach:
- 0.89 vs 0.90: Arbitrary cutoff rejects good results
- Diminishing returns: 5th retry rarely beats 4th
- Cost blindness: Retries cost money regardless of improvement chance
- No context: Same threshold for easy and hard tasks
The Solution: Adaptive Thresholds
# Adaptive - smarter decisions
if confidence >= 0.95:
accept() # High confidence = accept immediately
elif confidence >= 0.80 and attempts >= 2:
accept() # Good enough after retries
elif confidence >= 0.70 and attempts >= 3:
accept() # Acceptable after many retries
elif attempts >= max_attempts:
accept_or_fail() # Decide based on minimum threshold
else:
retry() # Try to improve
Step 1: Create the Confidence Manager
Create confidence_manager.py:
"""
Confidence Manager - Lab 07
Adaptive confidence-based retry decisions.
"""
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple, List
from enum import Enum
class Decision(Enum):
"""Possible decisions for a result."""
ACCEPT = "accept"
RETRY = "retry"
FAIL = "fail"
ACCEPT_WITH_WARNING = "accept_with_warning"
@dataclass
class ConfidenceConfig:
"""Configuration for confidence-based decisions."""
# Immediate accept threshold
high_confidence: float = 0.95
# Accept after retries thresholds
medium_confidence: float = 0.80
medium_confidence_min_attempts: int = 2
low_confidence: float = 0.70
low_confidence_min_attempts: int = 3
# Minimum acceptable (below = fail)
minimum_confidence: float = 0.50
# Maximum attempts before forced decision
max_attempts: int = 5
# Cost-aware settings
cost_per_attempt: float = 0.01 # Estimated $ per retry
max_cost_per_task: float = 0.10
@dataclass
class AttemptRecord:
"""Record of a single attempt."""
attempt_number: int
confidence: float
result_preview: str
cost: float = 0.0
@dataclass
class TaskConfidenceState:
"""Tracks confidence history for a task."""
task_id: str
attempts: List[AttemptRecord] = field(default_factory=list)
total_cost: float = 0.0
final_decision: Optional[Decision] = None
final_confidence: Optional[float] = None
@property
def attempt_count(self) -> int:
return len(self.attempts)
@property
def best_confidence(self) -> float:
if not self.attempts:
return 0.0
return max(a.confidence for a in self.attempts)
@property
def latest_confidence(self) -> float:
if not self.attempts:
return 0.0
return self.attempts[-1].confidence
@property
def is_improving(self) -> bool:
"""Check if confidence is trending upward."""
if len(self.attempts) < 2:
return True # Assume can improve
return self.attempts[-1].confidence > self.attempts[-2].confidence
class ConfidenceManager:
"""
Manages confidence-based retry decisions.
Usage:
manager = ConfidenceManager()
while True:
result, confidence = execute_task(task)
decision, reason = manager.evaluate(task_id, confidence, result)
if decision in (Decision.ACCEPT, Decision.ACCEPT_WITH_WARNING):
    complete_task(result)
    break
elif decision == Decision.RETRY:
    continue
else:  # Decision.FAIL
    fail_task()
    break
"""
def __init__(self, config: Optional[ConfidenceConfig] = None):
self.config = config or ConfidenceConfig()
self.task_states: Dict[str, TaskConfidenceState] = {}
def evaluate(
self,
task_id: str,
confidence: float,
result_preview: str = "",
cost: float = 0.0
) -> Tuple[Decision, str]:
"""
Evaluate whether to accept, retry, or fail.
Args:
task_id: Unique task identifier
confidence: AI's confidence in the result (0-1)
result_preview: First ~50 chars of result (for logging)
cost: Cost of this attempt in dollars
Returns:
Tuple of (Decision, reason_string)
"""
# Get or create task state
if task_id not in self.task_states:
self.task_states[task_id] = TaskConfidenceState(task_id=task_id)
state = self.task_states[task_id]
# Record this attempt
state.attempts.append(AttemptRecord(
attempt_number=state.attempt_count + 1,
confidence=confidence,
result_preview=result_preview[:50],
cost=cost
))
state.total_cost += cost
# Make decision
decision, reason = self._make_decision(state, confidence)
# Record final decision if terminal
if decision in (Decision.ACCEPT, Decision.ACCEPT_WITH_WARNING, Decision.FAIL):
state.final_decision = decision
state.final_confidence = confidence
return decision, reason
def _make_decision(
self,
state: TaskConfidenceState,
confidence: float
) -> Tuple[Decision, str]:
"""Core decision logic."""
attempts = state.attempt_count
config = self.config
# Check 1: High confidence = immediate accept
if confidence >= config.high_confidence:
return Decision.ACCEPT, f"High confidence ({confidence:.0%})"
# Check 2: Medium confidence after some attempts
if confidence >= config.medium_confidence and attempts >= config.medium_confidence_min_attempts:
return Decision.ACCEPT, f"Good confidence ({confidence:.0%}) after {attempts} attempts"
# Check 3: Low confidence after many attempts
if confidence >= config.low_confidence and attempts >= config.low_confidence_min_attempts:
return Decision.ACCEPT_WITH_WARNING, f"Acceptable ({confidence:.0%}) after {attempts} attempts"
# Check 4: Max attempts reached
if attempts >= config.max_attempts:
if confidence >= config.minimum_confidence:
return Decision.ACCEPT_WITH_WARNING, f"Max attempts reached, accepting ({confidence:.0%})"
else:
return Decision.FAIL, f"Max attempts reached, confidence too low ({confidence:.0%})"
# Check 5: Cost limit reached
if state.total_cost >= config.max_cost_per_task:
if confidence >= config.minimum_confidence:
return Decision.ACCEPT_WITH_WARNING, f"Cost limit reached, accepting ({confidence:.0%})"
else:
return Decision.FAIL, f"Cost limit reached, confidence too low ({confidence:.0%})"
# Check 6: Not improving after multiple attempts
if attempts >= 3 and not state.is_improving:
if confidence >= config.low_confidence:
return Decision.ACCEPT_WITH_WARNING, f"Not improving, accepting best ({confidence:.0%})"
# Continue trying if still below acceptable
# Default: retry
return Decision.RETRY, f"Confidence ({confidence:.0%}) below threshold, retrying"
def get_state(self, task_id: str) -> Optional[TaskConfidenceState]:
"""Get the confidence state for a task."""
return self.task_states.get(task_id)
def get_stats(self) -> dict:
"""Get overall statistics."""
if not self.task_states:
return {"tasks": 0}
total_attempts = sum(s.attempt_count for s in self.task_states.values())
total_cost = sum(s.total_cost for s in self.task_states.values())
decisions = [s.final_decision for s in self.task_states.values() if s.final_decision]
accepts = sum(1 for d in decisions if d in (Decision.ACCEPT, Decision.ACCEPT_WITH_WARNING))
fails = sum(1 for d in decisions if d == Decision.FAIL)
return {
"tasks": len(self.task_states),
"total_attempts": total_attempts,
"avg_attempts": round(total_attempts / len(self.task_states), 2),
"total_cost": round(total_cost, 4),
"accepts": accepts,
"fails": fails,
"acceptance_rate": round(accepts / len(decisions) * 100, 1) if decisions else 0
}
def reset(self):
"""Reset all state."""
self.task_states.clear()
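Before wiring the manager into the loop, you can watch the decision ladder work on its own. Here is a minimal standalone sketch; the confidence values are simulated, not real model output:
from confidence_manager import ConfidenceManager, Decision

manager = ConfidenceManager()

# Simulate three attempts at one task with slowly improving confidence
for confidence in [0.62, 0.74, 0.83]:
    decision, reason = manager.evaluate(
        task_id="demo-1",
        confidence=confidence,
        result_preview="simulated output",
        cost=0.01,
    )
    print(f"{confidence:.0%} -> {decision.value}: {reason}")
    if decision != Decision.RETRY:
        break

print(manager.get_stats())
With the default config, the first two attempts come back RETRY and the third is accepted by the medium-confidence rule (0.83 >= 0.80 with at least 2 attempts).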
Step 2: Create Confidence Strategies
Add different strategies to confidence_manager.py:
class QualityFirstStrategy(ConfidenceConfig):
"""Prioritize quality over speed/cost."""
def __init__(self):
super().__init__(
high_confidence=0.98,
medium_confidence=0.90,
medium_confidence_min_attempts=3,
low_confidence=0.85,
low_confidence_min_attempts=4,
minimum_confidence=0.75,
max_attempts=7,
max_cost_per_task=0.25
)
class CostFirstStrategy(ConfidenceConfig):
"""Prioritize cost over perfect quality."""
def __init__(self):
super().__init__(
high_confidence=0.85,
medium_confidence=0.70,
medium_confidence_min_attempts=1,
low_confidence=0.60,
low_confidence_min_attempts=2,
minimum_confidence=0.50,
max_attempts=3,
max_cost_per_task=0.05
)
class BalancedStrategy(ConfidenceConfig):
"""Balance quality and cost (default)."""
def __init__(self):
super().__init__(
high_confidence=0.95,
medium_confidence=0.80,
medium_confidence_min_attempts=2,
low_confidence=0.70,
low_confidence_min_attempts=3,
minimum_confidence=0.50,
max_attempts=5,
max_cost_per_task=0.10
)
class TaskTypeStrategy:
"""Select strategy based on task type."""
STRATEGIES = {
"code": QualityFirstStrategy(), # Code needs high accuracy
"creative": CostFirstStrategy(), # Creative is subjective
"factual": QualityFirstStrategy(), # Facts must be correct
"summary": BalancedStrategy(), # Summaries can vary
"default": BalancedStrategy()
}
@classmethod
def get_config(cls, task_type: str) -> ConfidenceConfig:
return cls.STRATEGIES.get(task_type, cls.STRATEGIES["default"])
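A quick usage sketch: pick a config by task type, then hand it to the manager:
# Select thresholds by task type
config = TaskTypeStrategy.get_config("code")  # returns the QualityFirstStrategy instance
confidence_mgr = ConfidenceManager(config)
print(f"Immediate accept at {config.high_confidence:.0%}")  # 98%
Note that STRATEGIES maps to shared instances, so treat the returned config as read-only; mutating it would affect every task of that type.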
Step 3: Integrate with the Loop
Create loop_with_confidence.py:
"""
Loop with Confidence Retry - Lab 07
Demonstrates adaptive confidence-based retry decisions.
"""
from task_manager import TaskManager
from executor import execute_task
from circuit_breaker import CircuitBreaker, CircuitBreakerConfig
from confidence_manager import (
ConfidenceManager,
Decision,
TaskTypeStrategy
)
def get_task_type(task) -> str:
"""Determine task type from criteria or description."""
criteria_type = task.criteria.get("type", "") if task.criteria else ""
if criteria_type in ["code", "function", "script"]:
return "code"
elif criteria_type in ["haiku", "poem", "story", "creative"]:
return "creative"
elif criteria_type in ["fact", "factual", "definition"]:
return "factual"
elif criteria_type in ["summary", "overview"]:
return "summary"
# Infer from description
desc_lower = task.description.lower()
if any(kw in desc_lower for kw in ["write code", "function", "implement"]):
return "code"
elif any(kw in desc_lower for kw in ["haiku", "poem", "story", "creative"]):
return "creative"
return "default"
def process_task(
manager: TaskManager,
task,
confidence_mgr: ConfidenceManager,
breaker: CircuitBreaker
) -> bool:
"""Process a task with confidence-based retry."""
task_id = task.id
print(f"\n[{task_id}] {task.description[:50]}...")
# Determine strategy based on task type
task_type = get_task_type(task)
print(f" Task type: {task_type}")
while True:
# Check circuit breaker
if not breaker.allow_continue():
print(f" ⚠️ Circuit breaker tripped: {breaker.trip_reason}")
manager.fail(task_id, f"Circuit breaker: {breaker.trip_reason}")
return False
manager.start(task_id)
# Execute task
result = execute_task(task.to_dict())
if result["status"] != "completed":
print(f" ✗ Execution failed: {result.get('reason', 'Unknown')}")
breaker.record_failure(result.get("reason", ""))
manager.fail(task_id, result.get("reason", "Execution failed"))
return False
confidence = result["confidence"]
output = result["result"]
# Estimate cost
cost = 0.01 # Simplified estimate
# Evaluate with confidence manager
decision, reason = confidence_mgr.evaluate(
task_id=task_id,
confidence=confidence,
result_preview=output[:50],
cost=cost
)
state = confidence_mgr.get_state(task_id)
print(f" Attempt {state.attempt_count}: {confidence:.0%} confidence")
print(f" Decision: {decision.value} - {reason}")
if decision == Decision.ACCEPT:
print(f" ✓ Accepted!")
breaker.record_success(output)
manager.complete(task_id, output)
return True
elif decision == Decision.ACCEPT_WITH_WARNING:
print(f" ⚠️ Accepted with warning")
breaker.record_success(output)
manager.complete(task_id, output)
return True
elif decision == Decision.FAIL:
print(f" ✗ Failed")
breaker.record_failure("Confidence too low")
manager.fail(task_id, reason)
return False
else: # RETRY
print(f" ↻ Retrying...")
breaker.record_failure("Low confidence")
manager.retry(task_id, reason)
# Loop continues
def main():
manager = TaskManager("tasks.json")
# Circuit breaker
breaker = CircuitBreaker(CircuitBreakerConfig(
max_iterations=50,
max_consecutive_failures=5
))
# Confidence manager with balanced strategy
confidence_mgr = ConfidenceManager()
# Create sample tasks
if not manager.tasks:
manager.create(
"Write a haiku about Python programming",
criteria={"type": "creative"}
)
manager.create(
"Write a Python function that calculates fibonacci numbers",
criteria={"type": "code"}
)
manager.create(
"What is the capital of France?",
criteria={"type": "factual"}
)
manager.create(
"Summarize the benefits of version control in 2 sentences",
criteria={"type": "summary"}
)
print("=" * 60)
print("CONFIDENCE-BASED RETRY DEMO")
print("=" * 60)
while manager.has_pending() and breaker.allow_continue():
task = manager.get_next()
process_task(manager, task, confidence_mgr, breaker)
# Final report
print("\n" + "=" * 60)
print("FINAL REPORT")
print("=" * 60)
conf_stats = confidence_mgr.get_stats()
print(f"\nConfidence Stats:")
print(f" Tasks processed: {conf_stats['tasks']}")
print(f" Total attempts: {conf_stats['total_attempts']}")
print(f" Avg attempts/task: {conf_stats['avg_attempts']}")
print(f" Acceptance rate: {conf_stats['acceptance_rate']}%")
print(f" Estimated cost: ${conf_stats['total_cost']:.4f}")
print(f"\nTask Results:")
for task in manager.get_all():
icon = {"completed": "✓", "failed": "✗"}.get(task.status, "?")
state = confidence_mgr.get_state(task.id)
attempts = state.attempt_count if state else 0
conf = f"{state.final_confidence:.0%}" if state and state.final_confidence else "N/A"
print(f" {icon} {task.description[:40]}... ({attempts} attempts, {conf})")
if __name__ == "__main__":
main()
Understanding Adaptive Confidence
The Decision Tree
┌─────────────────┐
│ New Result │
│ confidence=X │
└────────┬────────┘
│
┌──────────────┼──────────────┐
│ │ │
▼ ▼ ▼
X >= 0.95 X >= 0.80 X >= 0.70
│ attempts≥2 attempts≥3
│ │ │
▼ ▼ ▼
ACCEPT ACCEPT ACCEPT*
(* warning)
│ │ │
└──────────────┴──────┬───────┘
│
▼
attempts >= max?
/ \
/ \
▼ ▼
X >= minimum? RETRY
/ \
▼ ▼
ACCEPT* FAIL
Why Adaptive Works
| Fixed Threshold | Adaptive Threshold |
|---|---|
| Rejects 89% confidence | Accepts 89% after 2 attempts |
| Same threshold for all tasks | Different thresholds per task type |
| Ignores attempt count | Lowers bar after retries |
| Ignores cost | Stops when cost limit reached |
| Binary accept/reject | Graduated accept/warn/retry/fail |
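The difference is easy to demonstrate with the manager from Step 1. In this sketch (the confidence trace is invented for illustration), a fixed 0.90 bar rejects every attempt, while the adaptive manager accepts on the second:
from confidence_manager import ConfidenceManager, Decision

trace = [0.72, 0.81, 0.79]

# Fixed threshold: every score below 0.90 is a rejection
fixed_rejections = sum(1 for c in trace if c < 0.90)  # all 3

# Adaptive: 0.81 on attempt 2 clears the medium-confidence rule
manager = ConfidenceManager()
for c in trace:
    decision, reason = manager.evaluate("compare-1", c)
    if decision != Decision.RETRY:
        break

print(f"fixed: {fixed_rejections} rejections; adaptive: {decision.value} ({reason})")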
Cost-Quality Tradeoff
Quality ▲
│ ┌─────────────────────┐
100% │ │ Quality First │ $$$
│ │ (code, facts) │
│ └─────────────────────┘
│ ┌───────────────┐
85% │ │ Balanced │ $$
│ │ (default) │
│ └───────────────┘
│ ┌─────────┐
70% │ │ Cost │ $
│ │ First │
└─────────────────┴─────────┴──────▶ Cost
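To see where each strategy sits on this curve, print its knobs (a trivial sketch using the classes from Step 2):
from confidence_manager import (
    BalancedStrategy,
    CostFirstStrategy,
    QualityFirstStrategy,
)

for strategy in (QualityFirstStrategy(), BalancedStrategy(), CostFirstStrategy()):
    print(f"{type(strategy).__name__}: immediate accept at {strategy.high_confidence:.0%}, "
          f"up to {strategy.max_attempts} attempts, ${strategy.max_cost_per_task:.2f} cost cap")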
Exercises
Exercise 1: Confidence Decay
Implement confidence that decays with time since generation:
def adjust_for_staleness(confidence: float, age_seconds: float) -> float:
"""Reduce confidence for old results."""
decay_rate = 0.01  # per hour
# Implement decay
pass
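If you get stuck, one possible shape is exponential decay. This is a sketch, not the only answer; the 1%-per-hour rate is an assumed parameter:
import math

def adjust_for_staleness(confidence: float, age_seconds: float) -> float:
    """Reduce confidence for old results (exponential decay sketch)."""
    decay_rate = 0.01  # assumed: ~1% relative loss per hour
    age_hours = age_seconds / 3600
    # Exponential decay keeps the result in [0, confidence]
    return confidence * math.exp(-decay_rate * age_hours)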
Exercise 2: Ensemble Confidence
Use multiple models and aggregate confidence:
def ensemble_confidence(results: List[Tuple[str, float]]) -> float:
"""Combine multiple model results into aggregate confidence."""
# If models agree, higher confidence
# If models disagree, lower confidence
pass
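One possible sketch: average the scores and penalize disagreement via their spread. Note this only compares confidence values; checking whether the result strings themselves agree is the harder part, left to you:
from statistics import mean, pstdev
from typing import List, Tuple

def ensemble_confidence(results: List[Tuple[str, float]]) -> float:
    """Combine multiple model results into aggregate confidence (sketch)."""
    confidences = [c for _, c in results]  # assumes at least one result
    # High spread across models = disagreement = trust the average less
    spread = pstdev(confidences) if len(confidences) > 1 else 0.0
    return max(0.0, mean(confidences) - spread)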
Exercise 3: Confidence Calibration
Track actual success rate vs reported confidence to calibrate:
class CalibratedConfidenceManager:
"""Adjust AI confidence based on historical accuracy."""
def calibrate(self, reported: float, actual_success: bool):
"""Update calibration based on outcome."""
pass
def adjusted_confidence(self, reported: float) -> float:
"""Get calibrated confidence."""
pass
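One possible sketch buckets reported confidence into tenths and, once enough outcomes accumulate, replaces the model's number with the observed success rate (the 10-sample minimum is an arbitrary choice):
from collections import defaultdict

class CalibratedConfidenceManager:
    """Adjust AI confidence based on historical accuracy (bucket sketch)."""

    def __init__(self):
        # bucket index (0-9) -> list of True/False outcomes
        self.outcomes = defaultdict(list)

    def _bucket(self, reported: float) -> int:
        return min(int(reported * 10), 9)

    def calibrate(self, reported: float, actual_success: bool):
        """Update calibration based on outcome."""
        self.outcomes[self._bucket(reported)].append(actual_success)

    def adjusted_confidence(self, reported: float) -> float:
        """Get calibrated confidence: the observed success rate for this bucket."""
        history = self.outcomes[self._bucket(reported)]
        if len(history) < 10:  # too little data; trust the reported score
            return reported
        return sum(history) / len(history)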
Checkpoint
Before moving on, verify:
- Different strategies produce different accept thresholds
- Adaptive thresholds lower with more attempts
- Cost tracking works correctly
- Task type affects strategy selection
- You understand the quality-cost tradeoff
Key Takeaway
Confidence scores enable smart retry decisions.
With adaptive confidence:
- Accept early when confidence is high
- Accept eventually when good enough after retries
- Fail fast when confidence never reaches minimum
- Control costs by limiting retry spend
- Match strategy to task (strict for code, lenient for creative)
Get the Code
Full implementation: 8me/src/tier1-ralph-loop/