See how tool scores evolve based on execution history
Learning boost range: 0.5x (unreliable) to 2.0x (highly reliable)
Add success or failure events to see how learning boosts change
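
The text above gives only the boost range, not the formula behind it. As a rough illustration, here is a minimal Python sketch of how a boost in that range could be derived from execution history. The `ToolHistory` class, the smoothed success rate, and the geometric interpolation are all assumptions made for this example, not the actual scoring logic.

```python
from dataclasses import dataclass

# Range endpoints taken from the text above.
MIN_BOOST = 0.5  # unreliable
MAX_BOOST = 2.0  # highly reliable


@dataclass
class ToolHistory:
    """Hypothetical record of a tool's execution outcomes."""
    successes: int = 0
    failures: int = 0

    def record(self, success: bool) -> None:
        """Add one success or failure event to the history."""
        if success:
            self.successes += 1
        else:
            self.failures += 1

    def learning_boost(self) -> float:
        """Map the history to a multiplier between MIN_BOOST and MAX_BOOST.

        Assumption: a Laplace-smoothed success rate, interpolated
        geometrically so that an empty history yields a neutral 1.0x.
        """
        rate = (self.successes + 1) / (self.successes + self.failures + 2)
        boost = MIN_BOOST * (MAX_BOOST / MIN_BOOST) ** rate
        # Clamp defensively so the result always stays inside the range.
        return max(MIN_BOOST, min(MAX_BOOST, boost))


def boosted_score(base_score: float, history: ToolHistory) -> float:
    """Apply the learning boost to a tool's base score."""
    return base_score * history.learning_boost()


history = ToolHistory()
print(f"no history:       {history.learning_boost():.2f}x")
for _ in range(3):
    history.record(success=True)
print(f"after 3 successes: {history.learning_boost():.2f}x")
for _ in range(6):
    history.record(success=False)
print(f"after 6 failures:  {history.learning_boost():.2f}x")
```

In this sketch each success nudges the boost toward 2.0x and each failure nudges it toward 0.5x. Smoothing keeps a single early event from swinging the score to either extreme.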