Small-Loss Bounds for Online Learning with Partial Information

Abstract
We consider the problem of adversarial (nonstochastic) online learning with partial-information feedback, in which, at each round, a decision maker selects an action from a finite set of alternatives. We develop a black-box approach for such problems in which the learner observes as feedback only the losses of a subset of the actions that includes the selected action. When the losses of actions are nonnegative, under the graph-based feedback model introduced by Mannor and Shamir, we offer algorithms that attain the so-called “small-loss” õ(√(αL⋆)) regret bounds with high probability, where α is the independence number of the feedback graph and L⋆ is the loss of the best action. Prior to our work, there was no data-dependent guarantee for general feedback graphs even for pseudo-regret (i.e., no guarantee that avoids dependence on the number of actions by exploiting the richer feedback). Taking advantage of the black-box nature of our technique, we extend our results to many other applications, such as combinatorial semi-bandits (including routing in networks), contextual bandits (even with an infinite comparator class), and learning with slowly changing (shifting) comparators. In the special case of multi-armed bandit and combinatorial semi-bandit problems, we provide optimal small-loss, high-probability regret guarantees of Õ(√(dL⋆)), where d is the number of actions, answering open questions of Neu. Previous bounds for multi-armed bandits and semi-bandits were known only for pseudo-regret and only in expectation. We also offer an optimal Õ(√(κL⋆)) regret guarantee for fixed feedback graphs with clique-partition number at most κ.
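To make the abstract's quantities concrete, the following displays use standard notation for this setting; this is our gloss on the definitions, not text from the paper.

```latex
% Standard notation for adversarial online learning with d actions over T
% rounds (a sketch of the setting the abstract describes, not the paper's text).
% The learner plays a_t while the adversary fixes losses \ell_t \in [0,1]^d.

% Loss of the best fixed action in hindsight:
\[
  L_\star \;=\; \min_{a \in [d]} \sum_{t=1}^{T} \ell_t(a).
\]

% Regret of the learner's action sequence a_1, ..., a_T:
\[
  \mathrm{Regret}_T \;=\; \sum_{t=1}^{T} \ell_t(a_t) \;-\; L_\star.
\]

% A "small-loss" (first-order) bound replaces the worst-case \sqrt{T}
% dependence with \sqrt{L_\star}, e.g., with high probability
\[
  \mathrm{Regret}_T \;=\; \tilde{O}\bigl(\sqrt{\alpha\, L_\star}\bigr),
\]
% where \alpha is the independence number of the feedback graph G: after
% playing a_t, the learner also observes \ell_t(b) for every neighbor b
% of a_t in G. Since each clique of a clique partition contains at most
% one vertex of any independent set,
\[
  \alpha(G) \;\le\; \kappa(G),
\]
% where \kappa(G) is the clique-partition number.
```

Two sanity checks on this notation: full-information feedback corresponds to the complete graph (α = 1), and plain bandit feedback to a graph with self-loops only (α = d), matching the Õ(√(dL⋆)) multi-armed bandit rate quoted in the abstract.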
References

- [1] (2017) Open problem: First-order regret bounds for contextual bandits. Proc. 2017 Conf. Learn. Theory, vol. 65, 4–7.
- [2] (2006) Hannan consistency in on-line learning in case of unbounded losses under partial monitoring. Proc. 17th Internat. Conf. Algorithmic Learn. Theory, 229–243.
- [3] (2018) Make the minority great again: First-order regret bound for contextual bandits. Proc. 35th Internat. Conf. Machine Learn., 186–194.
- [4] (2015) Online learning with feedback graphs: Beyond bandits. Proc. 28th Conf. Learn. Theory, 23–35.
- [5] (2013) From bandits to experts: A tale of domination and independence. Proc. 26th Internat. Conf. Neural Inform. Processing Systems, 1610–1618.
- [6] (2017) Nonstochastic multi-armed bandits with graph-structured feedback. SIAM J. Comput. 46(6):1785–1826.
- [7] (2010) Regret bounds and minimax policies under partial monitoring. J. Machine Learn. Res. 11(94):2785–2836.
- [8] (2014) Regret in online combinatorial optimization. Math. Oper. Res. 39(1):31–45.
- [9] (2002) Adaptive and self-confident on-line learning algorithms. J. Comput. System Sci. 64(1):48–75.
- [10] (2003) The nonstochastic multi-armed bandit problem. SIAM J. Comput. 32(1):48–77.
- [11] (2004) Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches. Proc. 36th Annual ACM Sympos. Theory Comput., 45–53.
- [12] (2011) Contextual bandit algorithms with supervised learning guarantees. Proc. 14th Internat. Conf. Artificial Intelligence Statist. (PMLR), 19–26.
- [13] (2005) Near-optimal online auctions. Proc. 16th Annual ACM-SIAM Sympos. Discrete Algorithms, 1156–1163.
- [14] (2010) Routing without regret: On convergence to Nash equilibria of regret-minimizing algorithms in routing games. Theory Comput. 6(1):179–199.
- [15] (2008) Regret minimization and the price of total anarchy. Proc. 40th Annual ACM Sympos. Theory Comput., 373–382.
- [16] (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations Trends Machine Learning 5(1):1–122. https://www.nowpublishers.com/article/Details/MAL-024.
- [17] (2006) Prediction, Learning, and Games (Cambridge University Press).
- [18] (2013) Regret minimization for reserve prices in second-price auctions. Proc. 24th Annual ACM-SIAM Sympos. Discrete Algorithms, 1190–1204.
- [19] (2005) Minimizing regret with label efficient prediction. IEEE Trans. Inform. Theory 51(6):2152–2162.
- [20] (2016) Online learning with feedback graphs without the graphs. Proc. 33rd Internat. Conf. Machine Learn., 811–819.
- [21] (1991) Universal portfolios. Math. Finance 1(1):1–29.
- [22] (2015) Strongly adaptive online learning. Proc. 32nd Internat. Conf. Machine Learn., 1405–1411.
- [23] (2016) Learning in games: Robustness of fast convergence. Annual Conf. Neural Inform. Processing Systems, 4727–4735.
- [24] (1997) A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. System Sci. 55(1):119–139.
- [25] (1957) Approximation to Bayes risk in repeated play. Contributions to the Theory of Games, vol. 3 (Princeton University Press), 97–139.
- [26] (2007) Logarithmic regret algorithms for online convex optimization. Machine Learn. 69(2–3):169–192.
- [27] (1998) Tracking the best expert. Machine Learn. 32(2):151–178.
- [28] (2005) Efficient algorithms for online decision problems. J. Comput. System Sci. 71(3):291–307.
- [29] (2016) Online learning with noisy side observations. Proc. 19th Internat. Conf. Artificial Intelligence Statist., 1186–1194.
- [30] (2014) Efficient learning by implicit exploration in bandit problems with side observations. Adv. Neural Inform. Processing Systems, 613–621.
- [31] (1985) Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6(1):4–22.
- [32] (2007) The epoch-greedy algorithm for contextual multi-armed bandits. Proc. 20th Internat. Conf. Neural Inform. Processing Systems, 817–824.
- [33] (1994) The weighted majority algorithm. Inform. Comput. 108(2):212–261.
- [34] (2018) Personal communication via email.
- [35] (2015) Achieving all with no parameters: AdaNormalHedge. Proc. 28th Conf. Learn. Theory, 1286–1304.
- [36] (2016) Learning and efficiency in games with dynamic population. Proc. 27th Annual ACM-SIAM Sympos. Discrete Algorithms, 120–129.
- [37] (2011) From bandits to experts: On the value of side-observations. Proc. 24th Internat. Conf. Neural Inform. Processing Systems, 684–692.
- [38] (2015) Explore no more: Improved high-probability regret bounds for non-stochastic bandits. Annual Conf. Neural Inform. Processing Systems, 3168–3176.
- [39] (2015) First-order regret bounds for combinatorial semi-bandits. Proc. 28th Conf. Learn. Theory, 1360–1375.
- [40] (2016) Importance weighting without importance weights: An efficient algorithm for combinatorial semi-bandits. J. Machine Learn. Res. 17(1):5355–5375.
- [41] (2013) Online learning with predictable sequences. Proc. 26th Annual Conf. Learn. Theory, 993–1019.
- [42] (2014) Online non-parametric regression. Proc. 27th Conf. Learn. Theory, 1232–1264.
- [43] (2016) BISTRO: An efficient relaxation-based method for contextual bandits. Proc. 33rd Internat. Conf. Machine Learn., vol. 48, 1977–1985.
- [44] (2017) On equivalence of martingale tail bounds and deterministic regret inequalities. Proc. 30th Conf. Learn. Theory, 1704–1722.
- [45] (2010) Online learning: Random averages, combinatorial parameters, and learnability. Preprint, submitted June 6, https://arxiv.org/abs/1006.1138.
- [46] (2015) Intrinsic robustness of the price of anarchy. J. ACM 62(5):1–42.
- [47] (2016) Minimizing regret with multiple reserves. Proc. 2016 ACM Conf. Econom. Comput., 601–616.
- [48] (2016a) Efficient algorithms for adversarial contextual learning. Proc. 33rd Internat. Conf. Machine Learn., 2159–2168.
- [49] (2016b) Improved regret bounds for oracle-based adversarial contextual bandits. Proc. 30th Internat. Conf. Neural Inform. Processing Systems, 3143–3151.
- [50] (2017) Thompson sampling for stochastic bandits with graph feedback. Proc. 31st AAAI Conf. Artificial Intelligence.