Lab 11: Self-Play Oscillation
Use internal debate to refine outputs through proposer/critic cycles.
Objectives
By the end of this lab, you will:
- Understand self-play and DIALECTIC patterns
- Implement proposer/critic feedback loops
- Detect and handle convergence
- Prevent infinite oscillation
Prerequisites
- Lab 10 completed (gating)
- Understanding of multi-agent patterns
What is Self-Play?
Self-play uses internal debate to improve outputs:
    Single Pass:              Self-Play:
    Task → AI → Output        Task → Proposer ──┐
                                                │
                                    ┌───────────┘
                                    ▼
                                  Critic ◄──────┐
                                    │           │
                                    ▼           │
                                  Good? ──No────┘
                                    │
                                   Yes
                                    ▼
                                  Output
Benefits:
- Catches errors the first pass missed
- Improves quality through iteration
- Provides reasoning about why choices were made
The DIALECTIC Pattern
    ┌──────────────────────────────────────────┐
    │                                          │
    │    ┌──────────┐      ┌──────────┐        │
    │    │ Proposer │      │  Critic  │        │
    │    └────┬─────┘      └────┬─────┘        │
    │         │                 │              │
    │         ▼                 │              │
    │    "Here's my             │              │
    │     proposal"             │              │
    │         │                 │              │
    │         └────────────────►│              │
    │                           │              │
    │                           ▼              │
    │                    "Issues found:        │
    │                     - Problem A          │
    │                     - Problem B"         │
    │                           │              │
    │          ◄────────────────┘              │
    │         │                                │
    │         ▼                                │
    │    "Refined based                        │
    │     on feedback"                         │
    │         │                                │
    └─────────┴────────────────────────────────┘
              │
              ▼
         Convergence?
           /      \
          No      Yes
          │        │
          ▼        ▼
      Continue  Output
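The exchange in the diagram can be sketched without any model calls. In the minimal sketch below, `dialectic_loop`, `toy_propose`, and `toy_critique` are hypothetical names that are not part of the lab code. The toy critic demands two items and the toy proposer fixes one issue per round, an assumption made only so the loop terminates predictably.

```python
from typing import Callable, List, Optional, Tuple


def dialectic_loop(
    propose: Callable[[str, Optional[str], List[str]], str],
    critique: Callable[[str], List[str]],
    task: str,
    max_rounds: int = 5,
) -> Tuple[str, int, bool]:
    """Alternate proposer and critic until the critic finds no issues.

    Returns (final_proposal, rounds_used, converged).
    """
    proposal: Optional[str] = None
    issues: List[str] = []
    for round_num in range(1, max_rounds + 1):
        proposal = propose(task, proposal, issues)  # propose or refine
        issues = critique(proposal)                 # critic responds
        if not issues:                              # convergence check
            return proposal, round_num, True
    return proposal or "", max_rounds, False


# Toy stand-ins (hypothetical): the proposer fixes one issue per round.
def toy_propose(task: str, previous: Optional[str], issues: List[str]) -> str:
    if previous is None:
        return "draft"
    return previous + "+" + issues[0]


def toy_critique(proposal: str) -> List[str]:
    required = ["tests", "docs"]
    return [item for item in required if item not in proposal]


final, rounds, converged = dialectic_loop(toy_propose, toy_critique, "demo task")
print(final, rounds, converged)  # draft+tests+docs 3 True
```

Passing the two roles in as plain callables keeps the control flow testable on its own; Step 1 replaces them with model calls.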
Step 1: Create the Self-Play Engine
Create self_play.py:
"""
Self-Play Engine - Lab 11
Implements proposer/critic feedback loops with convergence detection.
"""

from typing import Dict, Optional, List
from dataclasses import dataclass, field
from enum import Enum

import anthropic


class ConvergenceState(Enum):
    """State of the self-play loop."""
    RUNNING = "running"
    CONVERGED = "converged"
    OSCILLATING = "oscillating"
    MAX_ROUNDS = "max_rounds"
    STALLED = "stalled"


@dataclass
class Critique:
    """Critique from the critic."""
    approved: bool
    score: float  # 0-1
    feedback: str
    issues: List[str] = field(default_factory=list)
    suggestions: List[str] = field(default_factory=list)


@dataclass
class SelfPlayRound:
    """Record of a single round."""
    round_number: int
    proposal: str
    critique: Critique
    proposal_hash: str  # For oscillation detection


@dataclass
class SelfPlayResult:
    """Result of self-play loop."""
    final_output: str
    convergence_state: ConvergenceState
    rounds: List[SelfPlayRound]
    final_score: float

    @property
    def converged(self) -> bool:
        return self.convergence_state == ConvergenceState.CONVERGED

    @property
    def round_count(self) -> int:
        return len(self.rounds)


@dataclass
class SelfPlayConfig:
    """Configuration for self-play."""
    max_rounds: int = 5
    approval_threshold: float = 0.85
    improvement_threshold: float = 0.05  # Min improvement to continue
    oscillation_window: int = 3  # Rounds to check for oscillation
    model: str = "claude-sonnet-4-20250514"


class SelfPlayEngine:
    """
    Runs self-play loops with proposer and critic roles.

    Usage:
        engine = SelfPlayEngine()
        result = engine.run("Write a sorting algorithm")
        if result.converged:
            print(result.final_output)
        else:
            print(f"Did not converge: {result.convergence_state}")
    """

    def __init__(self, config: Optional[SelfPlayConfig] = None):
        self.config = config or SelfPlayConfig()
        # With no arguments, the client reads ANTHROPIC_API_KEY from the environment
        self.client = anthropic.Anthropic()
        self.rounds: List[SelfPlayRound] = []

    def run(self, task: str, context: str = "") -> SelfPlayResult:
        """
        Run self-play loop until convergence or max rounds.

        Args:
            task: The task to complete
            context: Optional additional context

        Returns:
            SelfPlayResult with final output and convergence state
        """
        self.rounds = []
        proposal = None
        critique = None
        scores: List[float] = []

        for round_num in range(1, self.config.max_rounds + 1):
            # Generate proposal
            proposal = self._generate_proposal(task, context, proposal, critique)
            proposal_hash = self._hash_proposal(proposal)

            # Get critique
            critique = self._get_critique(task, proposal)
            scores.append(critique.score)

            # Record round
            self.rounds.append(SelfPlayRound(
                round_number=round_num,
                proposal=proposal,
                critique=critique,
                proposal_hash=proposal_hash
            ))

            # Check for convergence
            if critique.approved and critique.score >= self.config.approval_threshold:
                return SelfPlayResult(
                    final_output=proposal,
                    convergence_state=ConvergenceState.CONVERGED,
                    rounds=self.rounds,
                    final_score=critique.score
                )

            # Check for oscillation
            if self._is_oscillating():
                return SelfPlayResult(
                    final_output=proposal,
                    convergence_state=ConvergenceState.OSCILLATING,
                    rounds=self.rounds,
                    final_score=critique.score
                )

            # Check for stalling: the score is still below the approval
            # threshold and each of the last two rounds gained less than
            # improvement_threshold (one weak round alone is tolerated)
            if len(scores) >= 3 and scores[-1] < self.config.approval_threshold:
                recent_gains = (scores[-2] - scores[-3], scores[-1] - scores[-2])
                if all(gain < self.config.improvement_threshold for gain in recent_gains):
                    return SelfPlayResult(
                        final_output=proposal,
                        convergence_state=ConvergenceState.STALLED,
                        rounds=self.rounds,
                        final_score=critique.score
                    )

        # Max rounds reached (proposal is None only if max_rounds < 1)
        return SelfPlayResult(
            final_output=proposal or "",
            convergence_state=ConvergenceState.MAX_ROUNDS,
            rounds=self.rounds,
            final_score=scores[-1] if scores else 0.0
        )

    def _generate_proposal(
        self,
        task: str,
        context: str,
        previous_proposal: Optional[str],
        previous_critique: Optional[Critique]
    ) -> str:
        """Generate or refine a proposal."""
        if previous_proposal is None:
            # Initial proposal
            prompt = f"""You are a proposal generator. Create a high-quality response for this task.

TASK: {task}
{f"CONTEXT: {context}" if context else ""}

Provide your best proposal. Focus on completeness and correctness."""
        else:
            # Refined proposal
            prompt = f"""You are a proposal generator. Refine your previous proposal based on feedback.

TASK: {task}
{f"CONTEXT: {context}" if context else ""}

YOUR PREVIOUS PROPOSAL:
{previous_proposal}

CRITIQUE RECEIVED:
Score: {previous_critique.score:.0%}
Feedback: {previous_critique.feedback}
Issues: {', '.join(previous_critique.issues) if previous_critique.issues else 'None'}
Suggestions: {', '.join(previous_critique.suggestions) if previous_critique.suggestions else 'None'}

Provide an improved proposal that addresses ALL the issues raised. Keep what works, fix what doesn't."""

        response = self.client.messages.create(
            model=self.config.model,
            max_tokens=2000,
            messages=[{"role": "user", "content": prompt}]
        )
        return response.content[0].text

    def _get_critique(self, task: str, proposal: str) -> Critique:
        """Get critique for a proposal."""
        prompt = f"""You are a strict but fair critic. Evaluate this proposal thoroughly.

ORIGINAL TASK: {task}

PROPOSAL TO EVALUATE:
{proposal}

Evaluate:
1. Does it fully address the task?
2. Is it correct and accurate?
3. Is it well-structured?
4. Are there any issues or errors?
5. How could it be improved?

Respond in this exact format:
APPROVED: yes or no (yes = ready to submit, no = needs improvement)
SCORE: 0.0 to 1.0 (overall quality score)
FEEDBACK: (2-3 sentence overall assessment)
ISSUES: (comma-separated list of problems, or "None")
SUGGESTIONS: (comma-separated list of improvements, or "None")

Be constructive but thorough. Don't approve unless quality is genuinely high."""

        response = self.client.messages.create(
            model=self.config.model,
            max_tokens=500,
            messages=[{"role": "user", "content": prompt}]
        )
        return self._parse_critique(response.content[0].text)

    def _parse_critique(self, text: str) -> Critique:
        """Parse critique response."""
        # Defaults apply when a field is missing or malformed
        approved = False
        score = 0.5
        feedback = ""
        issues = []
        suggestions = []

        for line in text.split("\n"):
            line = line.strip()
            upper = line.upper()

            if upper.startswith("APPROVED:"):
                approved = "yes" in line.lower()
            elif upper.startswith("SCORE:"):
                try:
                    score = float(line.split(":")[1].strip().split()[0])
                    score = max(0.0, min(1.0, score))  # Clamp to 0-1
                except (ValueError, IndexError):
                    # Keep the default score if the value is missing or not a number
                    pass
            elif upper.startswith("FEEDBACK:"):
                feedback = line.split(":", 1)[1].strip()
            elif upper.startswith("ISSUES:"):
                issues_str = line.split(":", 1)[1].strip()
                if issues_str.lower() != "none":
                    issues = [i.strip() for i in issues_str.split(",") if i.strip()]
            elif upper.startswith("SUGGESTIONS:"):
                sugg_str = line.split(":", 1)[1].strip()
                if sugg_str.lower() != "none":
                    suggestions = [s.strip() for s in sugg_str.split(",") if s.strip()]

        return Critique(
            approved=approved,
            score=score,
            feedback=feedback,
            issues=issues,
            suggestions=suggestions
        )

    def _hash_proposal(self, proposal: str) -> str:
        """Create hash for oscillation detection."""
        import hashlib

        # Normalize case and whitespace so formatting-only changes hash the same.
        # MD5 serves only as a fast fingerprint here, not as a security measure.
        normalized = " ".join(proposal.lower().split())
        return hashlib.md5(normalized.encode()).hexdigest()[:16]

    def _is_oscillating(self) -> bool:
        """Check if proposals are oscillating between states."""
        if len(self.rounds) < self.config.oscillation_window:
            return False

        recent_hashes = [r.proposal_hash for r in self.rounds[-self.config.oscillation_window:]]

        # Oscillating if any hash appears more than once in the window
        return len(set(recent_hashes)) < len(recent_hashes)
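Two of the engine's stopping checks are easier to reason about in isolation. The sketch below pulls them out as standalone functions; `proposal_hash` and `is_stalled` are hypothetical helper names, and the default thresholds are copied from `SelfPlayConfig`. It is meant as a way to probe the rules by hand, not as a replacement for the engine code.

```python
import hashlib
from typing import List


def proposal_hash(proposal: str) -> str:
    """Same normalization as _hash_proposal: lowercase, collapse whitespace."""
    normalized = " ".join(proposal.lower().split())
    return hashlib.md5(normalized.encode()).hexdigest()[:16]


def is_stalled(
    scores: List[float],
    improvement_threshold: float = 0.05,
    approval_threshold: float = 0.85,
) -> bool:
    """Mirror of the stall rule in run(): the last two rounds each gained
    less than improvement_threshold while the score is still below approval."""
    if len(scores) < 3 or scores[-1] >= approval_threshold:
        return False
    gains = (scores[-2] - scores[-3], scores[-1] - scores[-2])
    return all(gain < improvement_threshold for gain in gains)


# Formatting-only differences hash identically; any wording change does not.
same = proposal_hash("Use  Quicksort\n") == proposal_hash("use quicksort")
different = proposal_hash("use quicksort") != proposal_hash("use mergesort")
print(same, different)  # True True

print(is_stalled([0.60, 0.70, 0.80]))  # False: steady gains
print(is_stalled([0.70, 0.72, 0.73]))  # True: two small gains in a row
print(is_stalled([0.70, 0.80, 0.78]))  # False: one good gain, then a dip
```

Because the hash only matches proposals that are identical after normalization, the oscillation check fires on verbatim repeats and can miss a proposal that returns to an earlier idea in different words.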
Step 2: Add Oscillation Prevention
Add to self_play.py:
class OscillationPreventer:
    """
    Prevents and handles oscillation in self-play loops.

    Strategies:
    1. Temperature increase: Make proposals more varied
    2. Constraint injection: Add explicit constraints
    3. Best-so-far: Return the best proposal seen
    """

    def __init__(self):
        self.seen_proposals: Dict[str, float] = {}  # hash -> score
        self.oscillation_count: int = 0

    def record_proposal(self, proposal_hash: str, score: float):
        """Record a proposal and its score."""
        if proposal_hash in self.seen_proposals:
            self.oscillation_count += 1
        # Keep the highest score seen for this proposal
        self.seen_proposals[proposal_hash] = max(
            self.seen_proposals.get(proposal_hash, 0),
            score
        )

    def get_best_proposal(self, rounds: List[SelfPlayRound]) -> SelfPlayRound:
        """Get the best proposal seen so far."""
        return max(rounds, key=lambda r: r.critique.score)

    def suggest_escape_strategy(self) -> str:
        """Suggest a strategy to escape oscillation."""
        if self.oscillation_count <= 1:
            return "increase_temperature"
        elif self.oscillation_count <= 2:
            return "add_constraints"
        else:
            return "return_best"

    def get_constraint_injection(self, issues: List[str]) -> str:
        """Generate explicit constraints from issues."""
        if not issues:
            return ""
        constraints = ["You MUST address these specific issues:"]
        for i, issue in enumerate(issues, 1):
            constraints.append(f"{i}. {issue}")
        constraints.append("\nDo NOT repeat previous approaches that had these issues.")
        return "\n".join(constraints)


class AdaptiveSelfPlayEngine(SelfPlayEngine):
    """
    Self-play engine with adaptive strategies to prevent oscillation.
    """

    def __init__(self, config: Optional[SelfPlayConfig] = None):
        super().__init__(config)
        self.preventer = OscillationPreventer()
        self.base_temperature = 0.7

    def run(self, task: str, context: str = "") -> SelfPlayResult:
        """Run with adaptive oscillation prevention."""
        self.rounds = []
        self.preventer = OscillationPreventer()  # Reset
        proposal = None
        critique = None
        temperature = self.base_temperature
        extra_constraints = ""

        for round_num in range(1, self.config.max_rounds + 1):
            # Generate proposal with current settings
            proposal = self._generate_proposal_adaptive(
                task, context, proposal, critique,
                temperature=temperature,
                extra_constraints=extra_constraints
            )
            proposal_hash = self._hash_proposal(proposal)

            # Get critique
            critique = self._get_critique(task, proposal)

            # Record the proposal with the score it actually received
            # (recording before the critique would store the previous round's score)
            self.preventer.record_proposal(proposal_hash, critique.score)

            # Record round
            self.rounds.append(SelfPlayRound(
                round_number=round_num,
                proposal=proposal,
                critique=critique,
                proposal_hash=proposal_hash
            ))

            # Check for convergence
            if critique.approved and critique.score >= self.config.approval_threshold:
                return SelfPlayResult(
                    final_output=proposal,
                    convergence_state=ConvergenceState.CONVERGED,
                    rounds=self.rounds,
                    final_score=critique.score
                )

            # Handle oscillation adaptively
            if self._is_oscillating():
                strategy = self.preventer.suggest_escape_strategy()
                if strategy == "return_best":
                    best = self.preventer.get_best_proposal(self.rounds)
                    return SelfPlayResult(
                        final_output=best.proposal,
                        convergence_state=ConvergenceState.OSCILLATING,
                        rounds=self.rounds,
                        final_score=best.critique.score
                    )
                elif strategy == "increase_temperature":
                    temperature = min(1.0, temperature + 0.1)
                elif strategy == "add_constraints":
                    extra_constraints = self.preventer.get_constraint_injection(
                        critique.issues
                    )

        # Max rounds - return best
        best = self.preventer.get_best_proposal(self.rounds)
        return SelfPlayResult(
            final_output=best.proposal,
            convergence_state=ConvergenceState.MAX_ROUNDS,
            rounds=self.rounds,
            final_score=best.critique.score
        )

    def _generate_proposal_adaptive(
        self,
        task: str,
        context: str,
        previous_proposal: Optional[str],
        previous_critique: Optional[Critique],
        temperature: float = 0.7,
        extra_constraints: str = ""
    ) -> str:
        """Generate proposal with adaptive parameters."""
        # Build prompt (similar to base class)
        if previous_proposal is None:
            prompt = f"""You are a proposal generator. Create a high-quality response.

TASK: {task}
{f"CONTEXT: {context}" if context else ""}
{f"CONSTRAINTS: {extra_constraints}" if extra_constraints else ""}

Provide your best proposal."""
        else:
            prompt = f"""Refine your previous proposal based on feedback.

TASK: {task}
{f"CONTEXT: {context}" if context else ""}
{f"CONSTRAINTS: {extra_constraints}" if extra_constraints else ""}

PREVIOUS PROPOSAL:
{previous_proposal}

CRITIQUE:
Score: {previous_critique.score:.0%}
Issues: {', '.join(previous_critique.issues) if previous_critique.issues else 'None'}

Address ALL issues. Try a DIFFERENT approach if the same approach keeps failing."""

        response = self.client.messages.create(
            model=self.config.model,
            max_tokens=2000,
            temperature=temperature,
            messages=[{"role": "user", "content": prompt}]
        )
        return response.content[0].text
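The escalation order (raise temperature, then inject constraints, then give up and return the best proposal) can be traced without running the engine. The sketch below is a simplification under stated assumptions: `escape_strategy` copies the thresholds from `suggest_escape_strategy`, `simulate` is a hypothetical helper, and each loop step stands for one round in which a repeated proposal was detected.

```python
from typing import List


def escape_strategy(oscillation_count: int) -> str:
    """Same thresholds as OscillationPreventer.suggest_escape_strategy."""
    if oscillation_count <= 1:
        return "increase_temperature"
    if oscillation_count <= 2:
        return "add_constraints"
    return "return_best"


def simulate(repeat_events: int, base_temperature: float = 0.7) -> List[str]:
    """Replay the engine's reaction to successive repeated proposals."""
    temperature = base_temperature
    log: List[str] = []
    for count in range(1, repeat_events + 1):
        strategy = escape_strategy(count)
        if strategy == "increase_temperature":
            # Capped at 1.0, as in AdaptiveSelfPlayEngine.run
            temperature = min(1.0, temperature + 0.1)
            log.append(f"{strategy} -> {temperature:.1f}")
        else:
            log.append(strategy)
        if strategy == "return_best":
            break  # The engine returns at this point
    return log


print(simulate(3))
# ['increase_temperature -> 0.8', 'add_constraints', 'return_best']
```

With these thresholds the temperature is raised at most once per run, so the constraint injection does most of the work on the second repeat.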
Step 3: Complete Example
Create self_play_demo.py:
"""
Self-Play Demo - Lab 11
Demonstrates self-play with convergence and oscillation handling.
"""

from self_play import (
    SelfPlayEngine, AdaptiveSelfPlayEngine, SelfPlayConfig,
    ConvergenceState
)


def demo_basic_self_play():
    """Demonstrate basic self-play."""
    print("=" * 60)
    print("BASIC SELF-PLAY DEMO")
    print("=" * 60)

    config = SelfPlayConfig(
        max_rounds=5,
        approval_threshold=0.85
    )
    engine = SelfPlayEngine(config)

    task = "Write a Python function that validates email addresses. Include edge cases."
    print(f"\nTask: {task}\n")
    print("Running self-play loop...\n")

    result = engine.run(task)

    print(f"Convergence: {result.convergence_state.value}")
    print(f"Rounds: {result.round_count}")
    print(f"Final Score: {result.final_score:.0%}")

    print("\n--- Round History ---")
    for round_data in result.rounds:
        print(f"\nRound {round_data.round_number}:")
        print(f"  Score: {round_data.critique.score:.0%}")
        print(f"  Approved: {round_data.critique.approved}")
        print(f"  Feedback: {round_data.critique.feedback[:100]}...")
        if round_data.critique.issues:
            print(f"  Issues: {', '.join(round_data.critique.issues[:3])}")

    print("\n--- Final Output ---")
    # Truncate long outputs to the first 500 characters
    print((result.final_output[:500] + "...") if len(result.final_output) > 500 else result.final_output)


def demo_adaptive_self_play():
    """Demonstrate adaptive self-play with oscillation prevention."""
    print("\n" + "=" * 60)
    print("ADAPTIVE SELF-PLAY DEMO")
    print("=" * 60)

    config = SelfPlayConfig(
        max_rounds=7,
        approval_threshold=0.80,
        oscillation_window=3
    )
    engine = AdaptiveSelfPlayEngine(config)

    # Intentionally tricky task that might cause oscillation
    task = """Write a function that finds the optimal solution.
It should be efficient but also readable.
Balance performance with maintainability.
Consider edge cases but don't over-engineer."""
    print(f"\nTask: {task}\n")
    print("Running adaptive self-play loop...\n")

    result = engine.run(task)

    print(f"Convergence: {result.convergence_state.value}")
    print(f"Rounds: {result.round_count}")
    print(f"Final Score: {result.final_score:.0%}")

    # Show score progression
    scores = [r.critique.score for r in result.rounds]
    print(f"\nScore progression: {' → '.join(f'{s:.0%}' for s in scores)}")

    if result.convergence_state == ConvergenceState.OSCILLATING:
        print("\n⚠️ Oscillation detected - returned best proposal")
    elif result.convergence_state == ConvergenceState.CONVERGED:
        print("\n✓ Converged successfully")
    else:
        print(f"\n⚡ Stopped: {result.convergence_state.value}")


def main():
    demo_basic_self_play()
    demo_adaptive_self_play()


if __name__ == "__main__":
    main()
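The demo above makes real API calls. For dry runs, one option is a scripted stand-in for the client. The sketch below is an assumption-laden stub, not part of the lab code: `FakeClient` and `FakeMessages` are hypothetical names, and they imitate only the one call shape this lab relies on, `client.messages.create(...)` returning an object whose text is read as `response.content[0].text`.

```python
from types import SimpleNamespace
from typing import Any, Dict, List


class FakeMessages:
    """Returns scripted replies in order, shaped like response.content[0].text."""

    def __init__(self, replies: List[str]):
        self._replies = list(replies)
        self.calls: List[Dict[str, Any]] = []  # Keyword arguments of each call

    def create(self, **kwargs: Any) -> SimpleNamespace:
        self.calls.append(kwargs)
        text = self._replies.pop(0)  # Raises IndexError if the script runs out
        return SimpleNamespace(content=[SimpleNamespace(text=text)])


class FakeClient:
    """Hypothetical stand-in exposing only the .messages.create call used here."""

    def __init__(self, replies: List[str]):
        self.messages = FakeMessages(replies)


# One scripted round: a proposer reply followed by a critic reply
client = FakeClient([
    "def is_valid(email): ...",
    "APPROVED: yes\nSCORE: 0.9\nFEEDBACK: Solid.\nISSUES: None\nSUGGESTIONS: None",
])
first = client.messages.create(model="any-model", max_tokens=10, messages=[])
second = client.messages.create(model="any-model", max_tokens=10, messages=[])
print(first.content[0].text)
print(second.content[0].text.splitlines()[0])
```

To try it with the engine, assign `engine.client = FakeClient([...])` after constructing the engine, supplying two replies per round. Constructing the engine still requires the `anthropic` package to be installed.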
Understanding Self-Play
When It Works Well
| Good For | Why |
|---|---|
| Code writing | Bugs are catchable |
| Writing/editing | Style is improvable |
| Problem solving | Solutions can be verified |
| Data validation | Errors are detectable |
When It Struggles
| Challenging | Why |
|---|---|
| Subjective tasks | No clear “better” |
| Creative tasks | Different ≠ better |
| Speed-critical | Multiple rounds = slow |
| Simple tasks | Overkill |
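The cost behind the "Multiple rounds = slow" row can be bounded with simple arithmetic: each round in the Step 1 engine makes one proposer call and one critic call. The helper names below are hypothetical; the token limits are the `max_tokens` values used in Step 1.

```python
def worst_case_calls(max_rounds: int) -> int:
    """Each round issues one proposer call and one critic call."""
    return 2 * max_rounds


def worst_case_output_tokens(
    max_rounds: int,
    proposer_max_tokens: int = 2000,  # max_tokens used for proposals in Step 1
    critic_max_tokens: int = 500,     # max_tokens used for critiques in Step 1
) -> int:
    """Upper bound on generated tokens across all rounds."""
    return max_rounds * (proposer_max_tokens + critic_max_tokens)


for rounds in (1, 5, 7):
    print(rounds, worst_case_calls(rounds), worst_case_output_tokens(rounds))
# 1 2 2500
# 5 10 12500
# 7 14 17500
```

This bounds output tokens only. Input tokens also grow from round to round, because the refinement prompt includes the previous proposal and its critique.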
Oscillation Patterns
Type 1: Flip-Flop
Round 1: "Use approach A" → "Too complex"
Round 2: "Use approach B" → "Too simple"
Round 3: "Use approach A" → "Too complex"
...
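A flip-flop rarely reproduces an earlier proposal word for word, so an exact hash comparison can miss it. One possible alternative, sketched below, compares proposals by text similarity using the standard library's `difflib.SequenceMatcher`. The function name `is_flip_flop` and the 0.9 threshold are assumptions for illustration, not values from the lab.

```python
from difflib import SequenceMatcher
from typing import List


def is_flip_flop(proposals: List[str], threshold: float = 0.9) -> bool:
    """True if the latest proposal closely matches the one two rounds back
    (A, B, A) while differing from the immediately preceding one."""
    if len(proposals) < 3:
        return False
    latest, previous, two_back = proposals[-1], proposals[-2], proposals[-3]
    back_to_a = SequenceMatcher(None, latest, two_back).ratio() >= threshold
    moved_from_b = SequenceMatcher(None, latest, previous).ratio() < threshold
    return back_to_a and moved_from_b


history = [
    "Use approach A with a recursive parser",
    "Use approach B with a lookup table",
    "Use approach A with a recursive parser.",  # Same idea, trailing period
]
print(is_flip_flop(history))
print(is_flip_flop(["alpha", "beta", "gamma"]))
```

Character-level similarity is a crude proxy for "same approach", and `SequenceMatcher` can be slow on long texts, so treat the threshold as something to tune per task.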
Type 2: Incremental Reversal
Round 1: "Add feature X" → "Missing Y"
Round 2: "Add feature Y" → "X is now broken"
Round 3: "Fix X" → "Y is now broken"
...
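Incremental reversal shows up in the critic's issue lists rather than in the proposals: an issue is fixed, then comes back a round or two later. A minimal sketch of spotting that is below. `recurring_issues` is a hypothetical helper, and it assumes the critic phrases a returning issue identically (ignoring case), which real critiques may not do.

```python
from typing import List, Set


def recurring_issues(issue_history: List[List[str]]) -> Set[str]:
    """Issues that were raised, absent for at least one round, then raised again."""
    seen: Set[str] = set()
    previous: Set[str] = set()
    recurred: Set[str] = set()
    for issues in issue_history:
        current = {issue.strip().lower() for issue in issues}
        # Recurs if seen in some earlier round but not in the previous one
        recurred |= {i for i in current if i in seen and i not in previous}
        seen |= current
        previous = current
    return recurred


history = [
    ["Missing Y"],
    ["X is now broken"],
    ["Y is now broken"],
    ["X is now broken"],  # Returns after being fixed in round 3
]
print(recurring_issues(history))  # {'x is now broken'}
```

The per-round input would come from `[r.critique.issues for r in result.rounds]`. A non-empty result is a reasonable trigger for the constraint injection described in Step 2.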
Type 3: Perfection Loop
Round 1: Score 0.82 → "Improve error handling"
Round 2: Score 0.84 → "Simplify error handling"
Round 3: Score 0.83 → "Improve error handling"
...
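A perfection loop is visible in the scores alone: they hover in a narrow band just under the threshold without climbing. The sketch below flags that shape. `in_perfection_loop`, the window of 3, and the band of 0.03 are assumptions chosen to match the example above, not settings from the lab.

```python
from typing import List


def in_perfection_loop(scores: List[float], window: int = 3, band: float = 0.03) -> bool:
    """True if the last `window` scores sit within `band` of each other
    and are not strictly rising."""
    if len(scores) < window:
        return False
    recent = scores[-window:]
    within_band = max(recent) - min(recent) <= band
    strictly_rising = all(later > earlier for earlier, later in zip(recent, recent[1:]))
    return within_band and not strictly_rising


print(in_perfection_loop([0.82, 0.84, 0.83]))  # True: hovering, last step went down
print(in_perfection_loop([0.80, 0.81, 0.82]))  # False: slow but still rising
print(in_perfection_loop([0.50, 0.70, 0.90]))  # False: large gains
```

When the check fires, returning the best-scoring round, as `AdaptiveSelfPlayEngine` does at max rounds, avoids spending further calls on marginal changes.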
Exercises
Exercise 1: Weighted Critic
Implement a critic that weighs different aspects:
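@dataclass  # Assumes `from dataclasses import dataclass`, as in self_play.py
class WeightedCritique:
    correctness: float   # Weight: 0.4
    completeness: float  # Weight: 0.3
    style: float         # Weight: 0.2
    efficiency: float    # Weight: 0.1

    @property
    def weighted_score(self) -> float:
        pass  # TODO: combine the four scores using the weights above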
Exercise 2: Multi-Critic Ensemble
Use multiple critics and aggregate their feedback:
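class CriticEnsemble:
    # `Critic` is not defined in self_play.py; define your own critic
    # interface as part of this exercise
    def __init__(self, critics: List[Critic]):
        pass

    def evaluate(self, proposal: str) -> Critique:
        # Run all critics
        # Aggregate scores and feedback
        pass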
Exercise 3: Convergence Prediction
Predict if the loop will converge based on early rounds:
class ConvergencePredictor:
    def predict(self, rounds: List[SelfPlayRound]) -> float:
        """Predict probability of convergence."""
        # Analyze score trend
        # Check for oscillation patterns
        # Return probability
        pass
Checkpoint
Before moving on, verify:
- Self-play loop runs multiple rounds
- Convergence is detected correctly
- Oscillation is detected and handled
- Adaptive strategies improve outcomes
- You understand when to use self-play
Key Takeaway
Self-play catches issues that a single pass misses.
Self-play provides:
- Iterative improvement through feedback
- Quality assurance via internal critique
- Robustness by catching errors early
- Transparency through round-by-round history
But watch out for:
- Oscillation (going in circles)
- Over-refinement (diminishing returns)
- Cost (multiple API calls)
Get the Code
Related concepts: 8me/src/tier3.5-orchestration-concepts/02-patterns.md