
    Mathematical Foundations of Curiosity in Reinforcement Learning

reinforcement learning · k-armed bandit · thompson sampling · bayesian inference · exploration vs exploitation
    May 6, 2026

    This document explores the mathematical foundation of curiosity in artificial intelligence using the K-armed bandit problem. The author argues that curiosity is not a separate exploration heuristic but rather the consequence of optimal decision-making under uncertainty.

Key Concepts:

    • The K-armed bandit problem involves balancing exploitation (choosing the arm currently believed best) with exploration (sampling uncertain arms).
    • Bayesian inference maintains a posterior distribution over each arm's expected mean; the distribution narrows as more data is collected.
    • The author defines curiosity as the behavior that arises when agents act optimally on uncertain beliefs.
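The posterior-narrowing idea above can be sketched concretely. The document does not specify the reward model, so this assumes Bernoulli (0/1) rewards with a conjugate Beta prior, where the update is a simple parameter increment and the posterior's spread shrinks as pulls accumulate:

```python
import numpy as np

# Conjugate Beta-Bernoulli update for a single arm: the posterior
# Beta(alpha, beta) over the arm's mean narrows as 0/1 rewards arrive.
def update_posterior(alpha, beta, reward):
    """Return updated Beta parameters after observing a 0/1 reward."""
    return alpha + reward, beta + (1 - reward)

def posterior_std(alpha, beta):
    """Standard deviation of Beta(alpha, beta): shrinks with more data."""
    n = alpha + beta
    return np.sqrt(alpha * beta / (n ** 2 * (n + 1)))

rng = np.random.default_rng(0)
alpha, beta = 1.0, 1.0              # uniform prior over the arm's mean
wide = posterior_std(alpha, beta)   # spread before any data
for _ in range(100):                # 100 pulls of an arm with true p = 0.7
    alpha, beta = update_posterior(alpha, beta, rng.binomial(1, 0.7))
narrow = posterior_std(alpha, beta) # spread after 100 observations
```

After 100 pulls, `narrow` is far smaller than `wide`: the belief about the arm's mean has tightened, which is exactly the narrowing the summary describes.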

    Algorithms and Frameworks:

    1. Thompson Sampling: A decision rule where the agent draws one sample from each arm's posterior and plays the arm whose sampled value is highest.
    2. Bayesian Bandit: Iteratively updates posterior beliefs from observed rewards, so arms with wider posteriors (greater uncertainty) are sampled more often.
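The two ideas above combine into one loop. This is a minimal sketch, not the author's implementation: it assumes Bernoulli arms with Beta(1, 1) priors, samples each posterior, plays the argmax, and updates the chosen arm's belief:

```python
import numpy as np

def thompson_sampling(true_means, n_steps=2000, seed=0):
    """Thompson Sampling on a Bernoulli K-armed bandit with Beta(1,1) priors.
    Returns how often each arm was pulled."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    alpha = np.ones(k)              # prior successes + 1, per arm
    beta = np.ones(k)               # prior failures + 1, per arm
    pulls = np.zeros(k, dtype=int)
    for _ in range(n_steps):
        # Draw one sample from each arm's posterior; play the argmax.
        samples = rng.beta(alpha, beta)
        arm = int(np.argmax(samples))
        reward = rng.binomial(1, true_means[arm])
        # Conjugate update of the played arm's posterior.
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

pulls = thompson_sampling([0.2, 0.5, 0.8])
```

Because uncertain arms occasionally produce large posterior samples, they still get played; as beliefs narrow, play concentrates on the genuinely best arm, with no explicit exploration parameter needed.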

    Comparison: The author benchmarks the Bayesian approach against standard methods from Sutton & Barto (2018), including UCB (Upper Confidence Bound), Optimistic Greedy, Gradient Bandit, and Epsilon-Greedy. UCB performed best at 1.5490, with the Bayesian method following closely at 1.5372.
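For contrast with the sampling approach, the UCB baseline can be sketched as well. This follows the Sutton & Barto (2018) form of the selection rule, Q(a) + c·sqrt(ln t / N(a)); the constant c = 2 and the example numbers are illustrative choices, not taken from the author's benchmark:

```python
import numpy as np

def ucb_select(q_values, counts, t, c=2.0):
    """UCB action selection: argmax of Q(a) + c * sqrt(ln t / N(a)).
    Arms never tried are selected first (infinite bonus)."""
    q_values = np.asarray(q_values, dtype=float)
    counts = np.asarray(counts, dtype=float)
    untried = np.where(counts == 0)[0]
    if untried.size > 0:
        return int(untried[0])
    bonus = c * np.sqrt(np.log(t) / counts)  # uncertainty bonus per arm
    return int(np.argmax(q_values + bonus))

# An arm with a slightly lower estimate but far fewer pulls can win,
# because its uncertainty bonus outweighs the small gap in estimates:
arm = ucb_select(q_values=[0.60, 0.55], counts=[100, 2], t=102)
# → 1
```

Where Thompson Sampling explores via random draws from the posterior, UCB explores via a deterministic bonus for under-sampled arms; the benchmark numbers above show the two behave very similarly on this problem.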

Entities and Tools:

    • Author: Francesco Sacco.
    • Advisor/Reviewer: Matteo Peluso.
    • Referenced Works: W.R. Thompson (1933), Sutton & Barto (Reinforcement Learning textbook), and Russo et al. (tutorial on Thompson Sampling).
    • Technologies: Python, NumPy, GitHub, and interactive web-based data visualization.

    Core Conclusion: Curiosity is what optimal action looks like when beliefs are uncertain. By framing curiosity as probability matching, the author suggests this approach extends beyond bandits to complex environments like chess engines, where curiosity determines which branches of a game tree require further analysis.
