Lab 07: Confidence Retry

Use AI confidence scores for smart retry decisions that balance quality and cost.

Objectives

By the end of this lab, you will:

  • Understand confidence-based decision making
  • Implement adaptive retry thresholds
  • Balance quality vs cost tradeoffs
  • Track and optimize retry behavior

Prerequisites

  • Lab 06 completed (circuit breakers)
  • Understanding of tool calling (Lab 05)

The Problem: Fixed Thresholds

Simple retry logic uses fixed thresholds:

# Fixed threshold - too rigid
if confidence >= 0.9:
    accept()
else:
    retry()  # Even at 0.89? Even after 5 attempts?

Problems with this approach:

  • 0.89 vs 0.90: Arbitrary cutoff rejects good results
  • Diminishing returns: 5th retry rarely beats 4th
  • Cost blindness: Retries cost money regardless of improvement chance
  • No context: Same threshold for easy and hard tasks

The Solution: Adaptive Thresholds

# Adaptive - smarter decisions
if confidence >= 0.95:
    accept()  # High confidence = accept immediately
elif confidence >= 0.80 and attempts >= 2:
    accept()  # Good enough after retries
elif confidence >= 0.70 and attempts >= 3:
    accept()  # Acceptable after many retries
elif attempts >= max_attempts:
    accept_or_fail()  # Decide based on minimum threshold
else:
    retry()  # Try to improve
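
To make the contrast concrete, the following is a minimal, self-contained sketch that runs both policies over the same series of attempts. The function names, the made-up confidence scores, and the `max_attempts` cap on the fixed policy are assumptions for illustration; the 0.50 minimum mirrors the `minimum_confidence` default defined in Step 1.

```python
from typing import Callable, List, Tuple


def fixed_policy(confidence: float, attempts: int, max_attempts: int = 5) -> str:
    """Fixed 0.9 cutoff. The attempt cap is an assumption so the loop terminates."""
    if confidence >= 0.9:
        return "accept"
    return "fail" if attempts >= max_attempts else "retry"


def adaptive_policy(confidence: float, attempts: int,
                    max_attempts: int = 5, minimum: float = 0.50) -> str:
    """The adaptive rules above, with accept_or_fail() spelled out."""
    if confidence >= 0.95:
        return "accept"
    if confidence >= 0.80 and attempts >= 2:
        return "accept"
    if confidence >= 0.70 and attempts >= 3:
        return "accept"
    if attempts >= max_attempts:
        return "accept" if confidence >= minimum else "fail"
    return "retry"


def run(policy: Callable[[float, int], str], confidences: List[float]) -> Tuple[str, int]:
    """Feed successive attempt confidences to a policy until it stops retrying."""
    for attempts, confidence in enumerate(confidences, start=1):
        outcome = policy(confidence, attempts)
        if outcome != "retry":
            return outcome, attempts
    return "retry", len(confidences)


# Hypothetical scores that hover just under the fixed cutoff.
scores = [0.89, 0.89, 0.88, 0.89, 0.89]

print(run(fixed_policy, scores))     # ('fail', 5)
print(run(adaptive_policy, scores))  # ('accept', 2)
```

With these scores the fixed policy pays for five attempts and still fails, while the adaptive policy accepts the second attempt.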

Step 1: Create the Confidence Manager

Create confidence_manager.py:

"""
Confidence Manager - Lab 07

Adaptive confidence-based retry decisions.
"""

from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple, List
from enum import Enum


class Decision(Enum):
    """Possible decisions for a result."""
    ACCEPT = "accept"
    RETRY = "retry"
    FAIL = "fail"
    ACCEPT_WITH_WARNING = "accept_with_warning"


@dataclass
class ConfidenceConfig:
    """Configuration for confidence-based decisions."""
    # Immediate accept threshold
    high_confidence: float = 0.95

    # Accept after retries thresholds
    medium_confidence: float = 0.80
    medium_confidence_min_attempts: int = 2

    low_confidence: float = 0.70
    low_confidence_min_attempts: int = 3

    # Minimum acceptable (below = fail)
    minimum_confidence: float = 0.50

    # Maximum attempts before forced decision
    max_attempts: int = 5

    # Cost-aware settings
    cost_per_attempt: float = 0.01  # Estimated $ per retry
    max_cost_per_task: float = 0.10


@dataclass
class AttemptRecord:
    """Record of a single attempt."""
    attempt_number: int
    confidence: float
    result_preview: str
    cost: float = 0.0


@dataclass
class TaskConfidenceState:
    """Tracks confidence history for a task."""
    task_id: str
    attempts: List[AttemptRecord] = field(default_factory=list)
    total_cost: float = 0.0
    final_decision: Optional[Decision] = None
    final_confidence: Optional[float] = None

    @property
    def attempt_count(self) -> int:
        return len(self.attempts)

    @property
    def best_confidence(self) -> float:
        if not self.attempts:
            return 0.0
        return max(a.confidence for a in self.attempts)

    @property
    def latest_confidence(self) -> float:
        if not self.attempts:
            return 0.0
        return self.attempts[-1].confidence

    @property
    def is_improving(self) -> bool:
        """Check if confidence is trending upward."""
        if len(self.attempts) < 2:
            return True  # Assume can improve
        return self.attempts[-1].confidence > self.attempts[-2].confidence


class ConfidenceManager:
    """
    Manages confidence-based retry decisions.

    Usage:
        manager = ConfidenceManager()

        while True:
            result, confidence = execute_task(task)
            # evaluate() returns a (Decision, reason) tuple
            decision, reason = manager.evaluate(task_id, confidence, result)

            if decision in (Decision.ACCEPT, Decision.ACCEPT_WITH_WARNING):
                complete_task(result)
                break
            elif decision == Decision.RETRY:
                continue
            else:  # FAIL
                fail_task()
                break
    """

    def __init__(self, config: Optional[ConfidenceConfig] = None):
        self.config = config or ConfidenceConfig()
        self.task_states: Dict[str, TaskConfidenceState] = {}

    def evaluate(
        self,
        task_id: str,
        confidence: float,
        result_preview: str = "",
        cost: float = 0.0
    ) -> Tuple[Decision, str]:
        """
        Evaluate whether to accept, retry, or fail.

        Args:
            task_id: Unique task identifier
            confidence: AI's confidence in the result (0-1)
            result_preview: First ~50 chars of result (for logging)
            cost: Cost of this attempt in dollars

        Returns:
            Tuple of (Decision, reason_string)
        """
        # Get or create task state
        if task_id not in self.task_states:
            self.task_states[task_id] = TaskConfidenceState(task_id=task_id)

        state = self.task_states[task_id]

        # Record this attempt
        state.attempts.append(AttemptRecord(
            attempt_number=state.attempt_count + 1,
            confidence=confidence,
            result_preview=result_preview[:50],
            cost=cost
        ))
        state.total_cost += cost

        # Make decision
        decision, reason = self._make_decision(state, confidence)

        # Record final decision if terminal
        if decision in (Decision.ACCEPT, Decision.ACCEPT_WITH_WARNING, Decision.FAIL):
            state.final_decision = decision
            state.final_confidence = confidence

        return decision, reason

    def _make_decision(
        self,
        state: TaskConfidenceState,
        confidence: float
    ) -> Tuple[Decision, str]:
        """Core decision logic."""
        attempts = state.attempt_count
        config = self.config

        # Check 1: High confidence = immediate accept
        if confidence >= config.high_confidence:
            return Decision.ACCEPT, f"High confidence ({confidence:.0%})"

        # Check 2: Medium confidence after some attempts
        if confidence >= config.medium_confidence and attempts >= config.medium_confidence_min_attempts:
            return Decision.ACCEPT, f"Good confidence ({confidence:.0%}) after {attempts} attempts"

        # Check 3: Low confidence after many attempts
        if confidence >= config.low_confidence and attempts >= config.low_confidence_min_attempts:
            return Decision.ACCEPT_WITH_WARNING, f"Acceptable ({confidence:.0%}) after {attempts} attempts"

        # Check 4: Max attempts reached
        if attempts >= config.max_attempts:
            if confidence >= config.minimum_confidence:
                return Decision.ACCEPT_WITH_WARNING, f"Max attempts reached, accepting ({confidence:.0%})"
            else:
                return Decision.FAIL, f"Max attempts reached, confidence too low ({confidence:.0%})"

        # Check 5: Cost limit reached
        if state.total_cost >= config.max_cost_per_task:
            if confidence >= config.minimum_confidence:
                return Decision.ACCEPT_WITH_WARNING, f"Cost limit reached, accepting ({confidence:.0%})"
            else:
                return Decision.FAIL, f"Cost limit reached, confidence too low ({confidence:.0%})"

        # Check 6: Not improving after multiple attempts
        if attempts >= 3 and not state.is_improving:
            if confidence >= config.low_confidence:
                return Decision.ACCEPT_WITH_WARNING, f"Not improving, accepting latest ({confidence:.0%})"
            # Continue trying if still below acceptable

        # Default: retry
        return Decision.RETRY, f"Confidence ({confidence:.0%}) below threshold, retrying"

    def get_state(self, task_id: str) -> Optional[TaskConfidenceState]:
        """Get the confidence state for a task."""
        return self.task_states.get(task_id)

    def get_stats(self) -> dict:
        """Get overall statistics."""
        if not self.task_states:
            return {"tasks": 0}

        total_attempts = sum(s.attempt_count for s in self.task_states.values())
        total_cost = sum(s.total_cost for s in self.task_states.values())

        decisions = [s.final_decision for s in self.task_states.values() if s.final_decision]
        accepts = sum(1 for d in decisions if d in (Decision.ACCEPT, Decision.ACCEPT_WITH_WARNING))
        fails = sum(1 for d in decisions if d == Decision.FAIL)

        return {
            "tasks": len(self.task_states),
            "total_attempts": total_attempts,
            "avg_attempts": round(total_attempts / len(self.task_states), 2),
            "total_cost": round(total_cost, 4),
            "accepts": accepts,
            "fails": fails,
            "acceptance_rate": round(accepts / len(decisions) * 100, 1) if decisions else 0
        }

    def reset(self):
        """Reset all state."""
        self.task_states.clear()

Step 2: Create Confidence Strategies

Add different strategies to confidence_manager.py:

class QualityFirstStrategy(ConfidenceConfig):
    """Prioritize quality over speed/cost."""

    def __init__(self):
        super().__init__(
            high_confidence=0.98,
            medium_confidence=0.90,
            medium_confidence_min_attempts=3,
            low_confidence=0.85,
            low_confidence_min_attempts=4,
            minimum_confidence=0.75,
            max_attempts=7,
            max_cost_per_task=0.25
        )


class CostFirstStrategy(ConfidenceConfig):
    """Prioritize cost over perfect quality."""

    def __init__(self):
        super().__init__(
            high_confidence=0.85,
            medium_confidence=0.70,
            medium_confidence_min_attempts=1,
            low_confidence=0.60,
            low_confidence_min_attempts=2,
            minimum_confidence=0.50,
            max_attempts=3,
            max_cost_per_task=0.05
        )


class BalancedStrategy(ConfidenceConfig):
    """Balance quality and cost (default)."""

    def __init__(self):
        super().__init__(
            high_confidence=0.95,
            medium_confidence=0.80,
            medium_confidence_min_attempts=2,
            low_confidence=0.70,
            low_confidence_min_attempts=3,
            minimum_confidence=0.50,
            max_attempts=5,
            max_cost_per_task=0.10
        )


class TaskTypeStrategy:
    """Select strategy based on task type."""

    STRATEGIES = {
        "code": QualityFirstStrategy(),      # Code needs high accuracy
        "creative": CostFirstStrategy(),     # Creative is subjective
        "factual": QualityFirstStrategy(),   # Facts must be correct
        "summary": BalancedStrategy(),       # Summaries can vary
        "default": BalancedStrategy()
    }

    @classmethod
    def get_config(cls, task_type: str) -> ConfidenceConfig:
        return cls.STRATEGIES.get(task_type, cls.STRATEGIES["default"])
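
One way to see how the strategies differ is to compute, for each attempt number, the lowest confidence that would end the retry loop with an accept (including accept-with-warning). The sketch below is standalone and only approximates the manager: `Thresholds`, `acceptance_bar`, and `schedule` are hypothetical helpers, the numbers are copied from the three strategy classes above, and the cost-limit and not-improving checks are deliberately ignored.

```python
from typing import Dict, List, NamedTuple


class Thresholds(NamedTuple):
    high: float
    medium: float
    medium_min_attempts: int
    low: float
    low_min_attempts: int
    minimum: float
    max_attempts: int


# Values copied from QualityFirstStrategy, BalancedStrategy, and CostFirstStrategy.
STRATEGIES: Dict[str, Thresholds] = {
    "quality_first": Thresholds(0.98, 0.90, 3, 0.85, 4, 0.75, 7),
    "balanced": Thresholds(0.95, 0.80, 2, 0.70, 3, 0.50, 5),
    "cost_first": Thresholds(0.85, 0.70, 1, 0.60, 2, 0.50, 3),
}


def acceptance_bar(t: Thresholds, attempt: int) -> float:
    """Lowest confidence accepted (possibly with a warning) on this attempt."""
    bar = t.high
    if attempt >= t.medium_min_attempts:
        bar = min(bar, t.medium)
    if attempt >= t.low_min_attempts:
        bar = min(bar, t.low)
    if attempt >= t.max_attempts:
        bar = min(bar, t.minimum)
    return bar


def schedule(t: Thresholds) -> List[float]:
    """Acceptance bar for attempts 1..max_attempts."""
    return [acceptance_bar(t, n) for n in range(1, t.max_attempts + 1)]


for name, thresholds in STRATEGIES.items():
    print(name, schedule(thresholds))
# quality_first [0.98, 0.98, 0.9, 0.85, 0.85, 0.85, 0.75]
# balanced [0.95, 0.8, 0.7, 0.7, 0.5]
# cost_first [0.7, 0.6, 0.5]
```

Note that under `CostFirstStrategy` the 0.85 high-confidence threshold never sets the bar, because the medium tier (0.70) already applies on the first attempt.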

Step 3: Integrate with the Loop

Create loop_with_confidence.py:

"""
Loop with Confidence Retry - Lab 07

Demonstrates adaptive confidence-based retry decisions.
"""

from task_manager import TaskManager
from executor import execute_task
from circuit_breaker import CircuitBreaker, CircuitBreakerConfig
from confidence_manager import (
    ConfidenceManager,
    Decision,
    TaskTypeStrategy
)


def get_task_type(task) -> str:
    """Determine task type from criteria or description."""
    criteria_type = task.criteria.get("type", "") if task.criteria else ""

    if criteria_type in ["code", "function", "script"]:
        return "code"
    elif criteria_type in ["haiku", "poem", "story", "creative"]:
        return "creative"
    elif criteria_type in ["fact", "factual", "definition"]:
        return "factual"
    elif criteria_type in ["summary", "overview"]:
        return "summary"

    # Infer from description
    desc_lower = task.description.lower()
    if any(kw in desc_lower for kw in ["write code", "function", "implement"]):
        return "code"
    elif any(kw in desc_lower for kw in ["haiku", "poem", "story", "creative"]):
        return "creative"

    return "default"


def process_task(
    manager: TaskManager,
    task,
    confidence_mgr: ConfidenceManager,
    breaker: CircuitBreaker
) -> bool:
    """Process a task with confidence-based retry."""
    task_id = task.id

    print(f"\n[{task_id}] {task.description[:50]}...")

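    # Determine the strategy from the task type and apply it to the manager.
    # Tasks are processed one at a time, so swapping the config per task is safe here.
    task_type = get_task_type(task)
    confidence_mgr.config = TaskTypeStrategy.get_config(task_type)
    print(f"  Task type: {task_type}")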

    while True:
        # Check circuit breaker
        if not breaker.allow_continue():
            print(f"  ⚠️ Circuit breaker tripped: {breaker.trip_reason}")
            manager.fail(task_id, f"Circuit breaker: {breaker.trip_reason}")
            return False

        manager.start(task_id)

        # Execute task
        result = execute_task(task.to_dict())

        if result["status"] != "completed":
            print(f"  ✗ Execution failed: {result.get('reason', 'Unknown')}")
            breaker.record_failure(result.get("reason", ""))
            manager.fail(task_id, result.get("reason", "Execution failed"))
            return False

        confidence = result["confidence"]
        output = result["result"]

        # Estimate cost
        cost = 0.01  # Simplified estimate

        # Evaluate with confidence manager
        decision, reason = confidence_mgr.evaluate(
            task_id=task_id,
            confidence=confidence,
            result_preview=output[:50],
            cost=cost
        )

        state = confidence_mgr.get_state(task_id)
        print(f"  Attempt {state.attempt_count}: {confidence:.0%} confidence")
        print(f"  Decision: {decision.value} - {reason}")

        if decision == Decision.ACCEPT:
            print(f"  ✓ Accepted!")
            breaker.record_success(output)
            manager.complete(task_id, output)
            return True

        elif decision == Decision.ACCEPT_WITH_WARNING:
            print(f"  ⚠️ Accepted with warning")
            breaker.record_success(output)
            manager.complete(task_id, output)
            return True

        elif decision == Decision.FAIL:
            print(f"  ✗ Failed")
            breaker.record_failure("Confidence too low")
            manager.fail(task_id, reason)
            return False

        else:  # RETRY
            print(f"  ↻ Retrying...")
            breaker.record_failure("Low confidence")
            manager.retry(task_id, reason)
            # Loop continues


def main():
    manager = TaskManager("tasks.json")

    # Circuit breaker
    breaker = CircuitBreaker(CircuitBreakerConfig(
        max_iterations=50,
        max_consecutive_failures=5
    ))

    # Confidence manager with balanced strategy
    confidence_mgr = ConfidenceManager()

    # Create sample tasks
    if not manager.tasks:
        manager.create(
            "Write a haiku about Python programming",
            criteria={"type": "creative"}
        )
        manager.create(
            "Write a Python function that calculates fibonacci numbers",
            criteria={"type": "code"}
        )
        manager.create(
            "What is the capital of France?",
            criteria={"type": "factual"}
        )
        manager.create(
            "Summarize the benefits of version control in 2 sentences",
            criteria={"type": "summary"}
        )

    print("=" * 60)
    print("CONFIDENCE-BASED RETRY DEMO")
    print("=" * 60)

    while manager.has_pending() and breaker.allow_continue():
        task = manager.get_next()
        process_task(manager, task, confidence_mgr, breaker)

    # Final report
    print("\n" + "=" * 60)
    print("FINAL REPORT")
    print("=" * 60)

    conf_stats = confidence_mgr.get_stats()
    print(f"\nConfidence Stats:")
    print(f"  Tasks processed: {conf_stats['tasks']}")
    print(f"  Total attempts: {conf_stats['total_attempts']}")
    print(f"  Avg attempts/task: {conf_stats['avg_attempts']}")
    print(f"  Acceptance rate: {conf_stats['acceptance_rate']}%")
    print(f"  Estimated cost: ${conf_stats['total_cost']:.4f}")

    print(f"\nTask Results:")
    for task in manager.get_all():
        icon = {"completed": "✓", "failed": "✗"}.get(task.status, "?")
        state = confidence_mgr.get_state(task.id)
        attempts = state.attempt_count if state else 0
        conf = f"{state.final_confidence:.0%}" if state and state.final_confidence is not None else "N/A"
        print(f"  {icon} {task.description[:40]}... ({attempts} attempts, {conf})")


if __name__ == "__main__":
    main()

Understanding Adaptive Confidence

The Decision Tree

                    ┌─────────────────┐
                    │  New Result     │
                    │  confidence=X   │
                    └────────┬────────┘
                             │
              ┌──────────────┼──────────────┐
              │              │              │
              ▼              ▼              ▼
         X >= 0.95      X >= 0.80      X >= 0.70
              │         attempts≥2     attempts≥3
              │              │              │
              ▼              ▼              ▼
           ACCEPT         ACCEPT        ACCEPT*
                                       (* warning)
              │              │              │
              └──────────────┴──────┬───────┘
                                    │
                                    ▼
                            attempts >= max?
                              /         \
                             /           \
                            ▼             ▼
                     X >= minimum?     RETRY
                       /      \
                      ▼        ▼
                   ACCEPT*   FAIL

Why Adaptive Works

Fixed Threshold                 | Adaptive Threshold
--------------------------------|------------------------------------
Rejects 89% confidence          | Accepts 89% after 2 attempts
Same threshold for all tasks    | Different thresholds per task type
Ignores attempt count           | Lowers bar after retries
Ignores cost                    | Stops when cost limit reached
Binary accept/reject            | Graduated accept/warn/retry/fail

Cost-Quality Tradeoff

Quality ▲
        │     ┌─────────────────────┐
   100% │     │  Quality First      │  $$$
        │     │  (code, facts)      │
        │     └─────────────────────┘
        │           ┌───────────────┐
    85% │           │  Balanced     │  $$
        │           │  (default)    │
        │           └───────────────┘
        │                 ┌─────────┐
    70% │                 │  Cost   │  $
        │                 │  First  │
        └─────────────────┴─────────┴──────▶ Cost
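
The cost side of the tradeoff can be checked with a little arithmetic. The sketch below estimates the worst-case spend per task (a task that is never accepted early) for each strategy, working in whole cents to avoid floating-point rounding. `worst_case` is a hypothetical helper; the attempt limits and cost caps are copied from the strategy classes, and the 1-cent default matches the flat $0.01 estimate used in loop_with_confidence.py.

```python
import math
from typing import Dict, Tuple

# (max_attempts, max_cost_per_task in cents), copied from the strategy classes.
LIMITS: Dict[str, Tuple[int, int]] = {
    "quality_first": (7, 25),
    "balanced": (5, 10),
    "cost_first": (3, 5),
}


def worst_case(max_attempts: int, cap_cents: int,
               per_attempt_cents: int = 1) -> Tuple[int, int, str]:
    """Return (attempts, spend_cents, binding_limit) for a task never accepted early.

    The manager checks the attempt limit before the cost limit, and it checks
    cost only after an attempt has been paid for, so spend can overshoot the cap.
    """
    attempts_to_cap = math.ceil(cap_cents / per_attempt_cents)
    attempts = min(max_attempts, attempts_to_cap)
    binding = "attempts" if max_attempts <= attempts_to_cap else "cost"
    return attempts, attempts * per_attempt_cents, binding


for name, (max_attempts, cap_cents) in LIMITS.items():
    print(name, worst_case(max_attempts, cap_cents))
# quality_first (7, 7, 'attempts')
# balanced (5, 5, 'attempts')
# cost_first (3, 3, 'attempts')

# A pricier model (assumed 4 cents per attempt) makes the cost cap bind first.
print(worst_case(5, 10, per_attempt_cents=4))  # (3, 12, 'cost')
```

At a flat $0.01 per attempt, the attempt limit is reached before the cost cap in all three strategies, so the cap only starts to matter once real per-attempt costs are passed to `evaluate()`.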

Exercises

Exercise 1: Confidence Decay

Implement confidence that decays with time since generation:

def adjust_for_staleness(confidence: float, age_seconds: float) -> float:
    """Reduce confidence for old results."""
    decay_rate = 0.01  # per hour
    # Implement decay
    pass

Exercise 2: Ensemble Confidence

Use multiple models and aggregate confidence:

from typing import List, Tuple


def ensemble_confidence(results: List[Tuple[str, float]]) -> float:
    """Combine multiple model results into aggregate confidence."""
    # If models agree, higher confidence
    # If models disagree, lower confidence
    pass

Exercise 3: Confidence Calibration

Track actual success rate vs reported confidence to calibrate:

class CalibratedConfidenceManager:
    """Adjust AI confidence based on historical accuracy."""

    def calibrate(self, reported: float, actual_success: bool):
        """Update calibration based on outcome."""
        pass

    def adjusted_confidence(self, reported: float) -> float:
        """Get calibrated confidence."""
        pass

Checkpoint

Before moving on, verify:

  • Different strategies produce different accept thresholds
  • Adaptive thresholds lower with more attempts
  • Cost tracking works correctly
  • Task type affects strategy selection
  • You understand the quality-cost tradeoff

Key Takeaway

Confidence scores enable smart retry decisions.

With adaptive confidence:

  • Accept early when confidence is high
  • Accept eventually when good enough after retries
  • Fail fast when confidence never reaches minimum
  • Control costs by limiting retry spend
  • Match strategy to task (strict for code, lenient for creative)

Get the Code

Full implementation: 8me/src/tier1-ralph-loop/


