Schedule

Time

9:00 am - 12:00 pm, August 14, 2022, @KDD2022, Washington DC Convention Center, Room 204C.

Materials

Abstract

Counterfactual estimators enable the use of existing log data to estimate how some new target policy would have performed, if it had been used instead of the policy that logged the data. We say that those estimators work “off-policy”, since the policy that logged the data is different from the target policy. In this way, counterfactual estimators enable Off-policy Evaluation (OPE) akin to an unbiased offline A/B test, as well as learning new decision-making policies through Off-policy Learning (OPL). The goal of this tutorial is to summarize Foundations, Implementations, and Recent Advances of OPE and OPL (OPE/OPL), with applications in recommendation, search, and an ever-growing range of interactive systems. Specifically, we will introduce the fundamentals of OPE/OPL and provide theoretical and empirical comparisons of conventional methods. Then, we will cover emerging practical challenges such as how to handle large action spaces, distributional shift, and hyper-parameter tuning. We will then present Open Bandit Pipeline, an open-source Python software for OPE/OPL to better enable new research and applications. We will conclude the tutorial with future directions and an interactive QA session.

The learning outcomes of this tutorial are to enable participants (such as applied researchers, practitioners, and students):

  • to know fundamental concepts and conventional methods of OPE/OPL
  • to be familiar with recent advances to address practical challenges such as large action spaces and hyper-parameter tuning
  • to understand how to implement OPE/OPL in their research and applications
  • to be aware of remaining challenges and opportunities in the relevant field

Note that all materials, including slides and demo code, will be available during and after the tutorial on this tutorial website.

Target Audience and Prerequisites

This tutorial is aimed at an audience with intermediate experience in machine learning, data mining, or recommender systems who are interested in using OPE/OPL methods in their research and applications. Participants are expected to have basic knowledge of machine learning, probability theory, and statistics. Basic knowledge of causal inference may help in understanding the content but is not required.

Outline

Section | Presenter | Duration
1: Introduction to OPE/OPL | Thorsten Joachims | 30min
2: Bias-Variance Control | Yuta Saito | 35min
3: Recent Advances in OPE | Yuta Saito | 45min
Break | | 10min
4: Off-Policy Learning | Thorsten Joachims | 40min
5: Implementations | Yuta Saito | 15min
6: Conclusions | Both presenters | 5min

Section Abstracts

1. Introduction to OPE/OPL (Thorsten Joachims; 30min)

We will introduce the conventional formulation of OPE and how it helps improve interactive systems quickly and safely. We will also introduce basic OPE estimators, including the Direct Method (DM) and Inverse Propensity Score (IPS) weighting, with empirical illustrations that highlight their bias-variance trade-off.
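
As a concrete reference point, here is a minimal NumPy sketch of DM and IPS on a toy, context-free logged bandit dataset; the logging policy, target policy, and reward probabilities are illustrative assumptions and not part of the tutorial materials.

import numpy as np

rng = np.random.default_rng(0)
n, n_actions = 10_000, 5

# Toy logged bandit data: context-free for simplicity.
pi_0 = np.array([0.5, 0.2, 0.1, 0.1, 0.1])         # logging policy
true_reward = np.array([0.1, 0.2, 0.3, 0.4, 0.5])  # expected reward per action
actions = rng.choice(n_actions, size=n, p=pi_0)    # logged actions
rewards = rng.binomial(1, true_reward[actions])    # logged binary rewards
pscore = pi_0[actions]                             # propensity of each logged action

# Target policy pi_e whose value we want to estimate offline.
pi_e = np.array([0.1, 0.1, 0.1, 0.2, 0.5])

# Direct Method (DM): plug a reward model into the target policy; here the
# empirical mean reward per action stands in for a learned regression model.
r_hat = np.array([rewards[actions == a].mean() for a in range(n_actions)])
V_dm = float(pi_e @ r_hat)

# Inverse Propensity Score (IPS) weighting: re-weight logged rewards by the
# ratio of target to logging propensities.
iw = pi_e[actions] / pscore
V_ips = float(np.mean(iw * rewards))

print(f"true value: {pi_e @ true_reward:.3f}, DM: {V_dm:.3f}, IPS: {V_ips:.3f}")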

2. Bias-Variance Control (Yuta Saito; 35min)

This section summarizes a wide range of existing OPE estimators, including Self-Normalized IPS, Doubly Robust, Switch, and Doubly Robust with Shrinkage. These estimators aim to achieve a better bias-variance trade-off than DM and IPS. We will provide comprehensive theoretical and empirical comparisons of these estimators.
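
To make the bias-variance discussion concrete, the short sketch below adds Self-Normalized IPS and Doubly Robust, reusing the variables (iw, rewards, pi_e, r_hat, actions) from the toy Section 1 sketch above; the per-action reward model r_hat is again a simple stand-in for a learned regression model.

# Self-Normalized IPS: divide by the realized sum of importance weights
# instead of n, trading a small (vanishing) bias for lower variance.
V_snips = float(np.sum(iw * rewards) / np.sum(iw))

# Doubly Robust: use the reward model as a baseline and correct it with an
# importance-weighted residual; unbiased if either the reward model or the
# propensities are accurate, and typically lower variance than IPS.
V_dr = float(np.mean(pi_e @ r_hat + iw * (rewards - r_hat[actions])))

print(f"SNIPS: {V_snips:.3f}, DR: {V_dr:.3f}")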

3. Recent Advances in OPE (Yuta Saito; 45min)

This section will cover recent methods for handling emerging practical challenges such as OPE of ranking policies, large-scale applications, deficient support, multiple loggers, and hyper-parameter tuning for OPE. These challenges arise in real-world applications such as recommender and retrieval systems, where estimators have to deal with a large number of actions and non-stationary dynamics.

4. Off-Policy Learning (Thorsten Joachims; 40min)

This section will cover the fundamental methods for OPL, where the goal is to train a new decision-making policy using only logged bandit data.
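
As a rough, self-contained sketch of this idea (using the same toy setup as in Section 1 and plain gradient ascent on the IPS objective, omitting the variance regularizer of counterfactual risk minimization for brevity), a softmax policy can be learned directly from the logged data:

import numpy as np

rng = np.random.default_rng(0)
n, n_actions = 10_000, 5

# Logged data from a known logging policy (same toy setup as in Section 1).
pi_0 = np.array([0.5, 0.2, 0.1, 0.1, 0.1])
true_reward = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
actions = rng.choice(n_actions, size=n, p=pi_0)
rewards = rng.binomial(1, true_reward[actions])
pscore = pi_0[actions]

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Gradient ascent on the IPS estimate of the policy value.
theta = np.zeros(n_actions)
one_hot = np.eye(n_actions)[actions]
for _ in range(500):
    pi_theta = softmax(theta)
    # d pi_theta(a_i) / d theta = pi_theta(a_i) * (one_hot(a_i) - pi_theta)
    grad = np.mean(
        (rewards / pscore * pi_theta[actions])[:, None] * (one_hot - pi_theta),
        axis=0,
    )
    theta += 1.0 * grad

# The learned policy should concentrate on the actions with high logged reward.
print("learned policy:", np.round(softmax(theta), 2))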

5. Implementations (Yuta Saito; 15min)

This section will introduce Open Bandit Pipeline, an open-source Python package for OPE/OPL, and demonstrate how it helps us implement OPE/OPL for both research and practical purposes.
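
For orientation ahead of the demo, the snippet below loosely follows the quickstart example in the Open Bandit Pipeline documentation; class and argument names may differ across package versions, so treat it as a sketch rather than the actual demo code.

from sklearn.linear_model import LogisticRegression
from obp.dataset import SyntheticBanditDataset
from obp.policy import IPWLearner
from obp.ope import OffPolicyEvaluation, InverseProbabilityWeighting as IPW

# (1) Generate synthetic logged bandit feedback.
dataset = SyntheticBanditDataset(n_actions=10, dim_context=5, random_state=12345)
bandit_feedback_train = dataset.obtain_batch_bandit_feedback(n_rounds=10000)
bandit_feedback_test = dataset.obtain_batch_bandit_feedback(n_rounds=10000)

# (2) Off-policy learning: train a new policy with the IPW objective.
new_policy = IPWLearner(n_actions=dataset.n_actions, base_classifier=LogisticRegression())
new_policy.fit(
    context=bandit_feedback_train["context"],
    action=bandit_feedback_train["action"],
    reward=bandit_feedback_train["reward"],
    pscore=bandit_feedback_train["pscore"],
)
action_dist = new_policy.predict(context=bandit_feedback_test["context"])

# (3) Off-policy evaluation: estimate the value of the new policy from logs.
ope = OffPolicyEvaluation(bandit_feedback=bandit_feedback_test, ope_estimators=[IPW()])
print(ope.estimate_policy_values(action_dist=action_dist))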

6. Conclusions (Both Presenters; 5min)

This section will conclude the tutorial by summarizing the previous sections and presenting remaining research challenges in the area. There will also be a live QA session.

Presenters

Yuta Saito (ys552@cornell.edu)

He is a Ph.D. student in the Department of Computer Science at Cornell University, advised by Prof. Thorsten Joachims. His current research focuses on OPE of bandit algorithms, learning from human behavior data, and fairness in ranking. Some of his recent work has been published at top conferences, including ICML, NeurIPS, SIGIR, RecSys, and WSDM. He won the Best Paper Runner-Up Award at WSDM 2020 and co-presented a tutorial on counterfactual inference at RecSys 2021.

Thorsten Joachims (tj@cs.cornell.edu)

He is a Professor in the Department of Computer Science and in the Department of Information Science at Cornell University, and he is an Amazon Scholar. His research interests center on the synthesis of theory and system building in machine learning, with applications in information retrieval and recommendation. His past research focused on support vector machines, learning to rank, learning with preferences, learning from implicit feedback, text classification, and structured output prediction. Papers written with his students and collaborators have won 9 Best Paper Awards and 4 Test-of-Time Awards. He is also an ACM Fellow, an AAAI Fellow, a KDD Innovations Award recipient, and a member of the SIGIR Academy.

References

  1. Aman Agarwal, Soumya Basu, Tobias Schnabel, and Thorsten Joachims. 2017. Effective Evaluation using Logged Bandit Feedback from Multiple Loggers. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 687–696.
  2. Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. 2014. Doubly Robust Policy Evaluation and Optimization. Statist. Sci. 29, 4 (2014), 485–511.
  3. Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. 2018. More Robust Doubly Robust Off-policy Evaluation. In Proceedings of the 35th International Conference on Machine Learning, Vol. 80. PMLR, 1447–1456.
  4. Nan Jiang and Lihong Li. 2016. Doubly Robust Off-Policy Value Evaluation for Reinforcement Learning. In Proceedings of the 33rd International Conference on Machine Learning, Vol. 48. PMLR, 652–661.
  5. Thorsten Joachims and Adith Swaminathan. 2016. Counterfactual Evaluation and Learning for Search, Recommendation and Ad Placement. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, 1199–1201.
  6. Nathan Kallus, Yuta Saito, and Masatoshi Uehara. 2021. Optimal Off-Policy Evaluation from Multiple Logging Policies. In Proceedings of the 38th International Conference on Machine Learning, Vol. 139. PMLR, 5247–5256.
  7. Nathan Kallus and Angela Zhou. 2018. Policy Evaluation and Optimization with Continuous Treatments. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics. PMLR, 1243–1251.
  8. Masahiro Kato, Shota Yasui, and Masatoshi Uehara. 2020. Off-Policy Evaluation and Learning for External Validity under a Covariate Shift. In Advances in Neural Information Processing Systems, Vol. 33. 49–61.
  9. Shuai Li, Yasin Abbasi-Yadkori, Branislav Kveton, S Muthukrishnan, Vishwa Vinay, and Zheng Wen. 2018. Offline Evaluation of Ranking Policies with Click Models. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1685–1694.
  10. James McInerney, Brian Brost, Praveen Chandar, Rishabh Mehrotra, and Benjamin Carterette. 2020. Counterfactual Evaluation of Slate Recommendations with Sequential Reward Interactions. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1779–1788.
  11. Noveen Sachdeva, Yi Su, and Thorsten Joachims. 2020. Off-policy Bandits with Deficient Support. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 965–975.
  12. Yuta Saito, Shunsuke Aihara, Megumi Matsutani, and Yusuke Narita. 2020. Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation. arXiv preprint arXiv:2008.07146 (2020).
  13. Ashudeep Singh and Thorsten Joachims. 2018. Fairness of Exposure in Rankings. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2219–2228.
  14. Ashudeep Singh and Thorsten Joachims. 2019. Policy Learning for Fairness in Ranking. In Advances in Neural Information Processing Systems, Vol. 32.
  15. Yi Su, Maria Dimakopoulou, Akshay Krishnamurthy, and Miroslav Dudík. 2020. Doubly Robust Off-Policy Evaluation with Shrinkage. In Proceedings of the 37th International Conference on Machine Learning, Vol. 119. PMLR, 9167–9176.
  16. Yi Su, Lequn Wang, Michele Santacatterina, and Thorsten Joachims. 2019. Cab: Continuous Adaptive Blending for Policy Evaluation and Learning. In Proceedings of the 36th International Conference on Machine Learning, Vol. 97. PMLR, 6005–6014.
  17. Adith Swaminathan and Thorsten Joachims. 2015. Batch Learning from Logged Bandit Feedback through Counterfactual Risk Minimization. The Journal of Machine Learning Research 16, 1 (2015), 1731–1755.
  18. Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudík. 2017. Optimal and Adaptive Off-Policy Evaluation in Contextual Bandits. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70. PMLR, 3589–3597.
  19. Yusuke Narita, Shota Yasui, and Kohei Yata. 2021. Debiased Off-Policy Evaluation for Recommendation Systems. In Fifteenth ACM Conference on Recommender Systems. 372–379.
  20. Nathan Kallus and Masatoshi Uehara. 2020. Double Reinforcement Learning for Efficient and Robust Off-Policy Evaluation. In Proceedings of the 37th International Conference on Machine Learning, Vol. 119. PMLR, 5078–5088.
  21. Adith Swaminathan and Thorsten Joachims. 2015. The Self-Normalized Estimator for Counterfactual Learning. In Advances in Neural Information Processing Systems, Vol. 28.
  22. Minmin Chen, Alex Beutel, Paul Covington, Sagar Jain, Francois Belletti, and Ed Chi. 2019. Top-K Off-Policy Correction for a REINFORCE Recommender System. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, 456–464.
  23. Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. 2020. Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems. arXiv preprint arXiv:2005.01643 (2020).
  24. Yuta Saito and Thorsten Joachims. 2022. Off-Policy Evaluation for Large Action Spaces via Embeddings. In Proceedings of the 39th International Conference on Machine Learning. PMLR, 19089–19122.
  25. Yuta Saito, Takuma Udagawa, Haruka Kiyohara, Kazuki Mogi, Yusuke Narita, and Kei Tateno. 2021. Evaluating the Robustness of Off-Policy Evaluation. In Fifteenth ACM Conference on Recommender Systems. 114–123.
  26. Haruka Kiyohara, Yuta Saito, Tatsuya Matsuhiro, Yusuke Narita, Nobuyuki Shimizu, and Yasuo Yamamoto. 2022. Doubly Robust Off-Policy Evaluation for Ranking Policies under the Cascade Behavior Model. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining. 487–497.
  27. Yi Su, Pavithra Srinath, and Akshay Krishnamurthy. 2020. Adaptive Estimator Selection for Off-Policy Evaluation. In Proceedings of the 37th International Conference on Machine Learning, Vol. 119. PMLR, 9196–9205.
  28. Alberto Maria Metelli, Alessio Russo, and Marcello Restelli. 2021. Subgaussian and Differentiable Importance Sampling for Off-Policy Evaluation and Learning. In Advances in Neural Information Processing Systems, Vol. 34.