diff --git a/gym/scoreboard/__init__.py b/gym/scoreboard/__init__.py
index 93873b162..bee543057 100644
--- a/gym/scoreboard/__init__.py
+++ b/gym/scoreboard/__init__.py
@@ -1223,9 +1223,12 @@ The goal of the agent is to maximize the true reward function given just the noi
 
 Prior work has explored learning algorithms for human training scenarios of this flavor [Lopes11]_.
 
+Additionally, Baird and others have noted the relationship between update noise, timestep size, and convergence rate for Q-learners [Baird94]_.
+
 Robustness to noisy rewards may aid scalable oversight in settings where evaluating the true reward signal is expensive or impossible but a noisy approximation is available [Amodei16]_, [Christiano15]_.
 
+.. [Baird94] Baird, Leemon C. "Reinforcement learning in continuous time: Advantage updating." Neural Networks, 1994. IEEE World Congress on Computational Intelligence., 1994 IEEE International Conference on. Vol. 4. IEEE, 1994.
 .. [Amodei16] Amodei, Olah, et al. `"Concrete Problems in AI safety" Arxiv. 2016. `_
 .. [Lopes11] Lopes, Manuel, Thomas Cederbourg, and Pierre-Yves Oudeyer. "Simultaneous acquisition of task and feedback models." Development and Learning (ICDL), 2011 IEEE International Conference on. Vol. 2. IEEE, 2011.
 .. [Christiano15] `AI Control `_
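
The task the docstring describes — an agent that must maximize the true reward while observing only noisy realizations of it — can be sketched as a thin wrapper around an environment's `step` return. This is a minimal illustration, not code from the PR: `ConstantRewardEnv` and `NoisyRewardWrapper` are hypothetical names, and the Gaussian noise model is an assumption (the actual scoreboard environments may corrupt rewards differently).

```python
import random

class ConstantRewardEnv:
    """Hypothetical stand-in for a Gym env: true reward is always 1.0."""
    def reset(self):
        return 0

    def step(self, action):
        # (observation, true_reward, done, info) in the classic Gym API
        return 0, 1.0, False, {}

class NoisyRewardWrapper:
    """Expose only a noisy realization of the true reward to the learner.

    Assumes additive zero-mean Gaussian noise; the underlying true reward
    is what the agent is ultimately scored against.
    """
    def __init__(self, env, noise_std=0.5, seed=0):
        self.env = env
        self.noise_std = noise_std
        self.rng = random.Random(seed)

    def reset(self):
        return self.env.reset()

    def step(self, action):
        obs, true_reward, done, info = self.env.step(action)
        noisy_reward = true_reward + self.rng.gauss(0.0, self.noise_std)
        return obs, noisy_reward, done, info

# Individual noisy rewards vary, but their sample mean still estimates
# the true reward, which is why learning remains possible in principle.
env = NoisyRewardWrapper(ConstantRewardEnv(), noise_std=0.5, seed=0)
env.reset()
rewards = [env.step(0)[1] for _ in range(1000)]
mean_reward = sum(rewards) / len(rewards)
```

Averaging over many steps recovers the true reward signal — the same intuition behind the cited observation that update noise and timestep size jointly govern a Q-learner's convergence rate.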