Updates
- Implement `obp.policy.QLearner` (https://github.com/st-tech/zr-obp/pull/144). A usage sketch follows the list below.
- Implement the Balanced IPW estimator as `obp.ope.BalancedInverseProbabilityWeighting`. See Sondhi et al. (2020) for details (https://github.com/st-tech/zr-obp/pull/146).
- Implement the Cascade Doubly Robust estimator for OPE with combinatorial actions as `obp.ope.CascadeDR`. See Kiyohara et al. (2022) for details (https://github.com/st-tech/zr-obp/pull/142).
- Implement SLOPE, a data-driven hyperparameter tuning method for OPE proposed by Su et al. (2020) and Tucker et al. (2021) (https://github.com/st-tech/zr-obp/pull/148). A tuning sketch follows the list below.
- Implement new estimators for standard OPE based on a power-mean transformation of importance weights, proposed by Metelli et al. (2021) (https://github.com/st-tech/zr-obp/pull/149). The transformation is sketched after the list below.
- Implement a dataset class for generating synthetic logged bandit data with multiple loggers; the corresponding estimators will be added in the next update (https://github.com/st-tech/zr-obp/pull/150).
- Implement an argument to control the number of deficient actions in `obp.dataset.SyntheticBanditDataset` and `obp.dataset.MultiClassToBanditReduction`. See Sachdeva et al. (2020) for details (https://github.com/st-tech/zr-obp/pull/150). A workflow sketch follows the list below.
- Implement flexible functions for synthesizing reward functions and behavior policies (https://github.com/st-tech/zr-obp/pull/145).
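For reference, here is a minimal end-to-end sketch of how the new deficient-actions argument plugs into the usual synthetic-data OPE workflow. The argument name `n_deficient_actions` is an assumption based on the PR description; everything else uses the existing OBP API.

```python
# Minimal end-to-end sketch (assumption: the new deficient-support
# argument is named `n_deficient_actions`; see PR #150 for the actual name).
import numpy as np
from obp.dataset import SyntheticBanditDataset, logistic_reward_function
from obp.ope import OffPolicyEvaluation, InverseProbabilityWeighting

dataset = SyntheticBanditDataset(
    n_actions=10,
    dim_context=5,
    reward_function=logistic_reward_function,
    n_deficient_actions=3,  # 3 of the 10 actions get zero support (assumed name)
    random_state=12345,
)
bandit_feedback = dataset.obtain_batch_bandit_feedback(n_rounds=10000)

# a uniformly random evaluation policy, just for illustration
action_dist = np.full(
    (bandit_feedback["n_rounds"], dataset.n_actions, 1), 1.0 / dataset.n_actions
)

# with deficient support, vanilla IPW is biased (Sachdeva et al. 2020),
# which is exactly the setting this argument lets you simulate
ope = OffPolicyEvaluation(
    bandit_feedback=bandit_feedback,
    ope_estimators=[InverseProbabilityWeighting()],
)
print(ope.estimate_policy_values(action_dist=action_dist))
```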
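A sketch of training the new `QLearner` on the logged data from the sketch above, assuming it mirrors the `fit`/`predict` interface of the existing `obp.policy.IPWLearner` and takes a scikit-learn model as `base_model` (both assumptions; see PR #144 for the actual signature).

```python
# Hypothetical QLearner usage; the constructor arguments and the
# fit/predict interface are assumed to mirror obp.policy.IPWLearner
# (see PR #144 for the actual API).
from sklearn.ensemble import RandomForestRegressor
from obp.policy import QLearner

q_learner = QLearner(
    n_actions=dataset.n_actions,
    base_model=RandomForestRegressor(random_state=12345),  # assumed argument
)
q_learner.fit(
    context=bandit_feedback["context"],
    action=bandit_feedback["action"],
    reward=bandit_feedback["reward"],
)
# action choices for each context, shape (n_rounds, n_actions, len_list)
action_dist_q = q_learner.predict(context=bandit_feedback["context"])
```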
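A sketch of SLOPE-based tuning, again reusing `bandit_feedback` and `action_dist` from the first sketch. The `tuning_method="slope"` flag on the existing `obp.ope.InverseProbabilityWeightingTuning` wrapper is an assumption inferred from the PR description.

```python
# Hypothetical SLOPE usage; `tuning_method="slope"` is an assumption
# based on PR #148 (MSE-based tuning would be the alternative).
from obp.ope import OffPolicyEvaluation, InverseProbabilityWeightingTuning

ipw_tuned = InverseProbabilityWeightingTuning(
    lambdas=[10.0, 100.0, 1000.0],  # candidate weight-clipping thresholds
    tuning_method="slope",  # assumed flag selecting SLOPE over MSE-based tuning
)
ope = OffPolicyEvaluation(
    bandit_feedback=bandit_feedback,  # from the first sketch above
    ope_estimators=[ipw_tuned],
)
print(ope.estimate_policy_values(action_dist=action_dist))
```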
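Finally, the power-mean family of Metelli et al. (2021) shrinks importance weights toward a constant. In the harmonic special case, a weight `w` is mapped to `w / ((1 - λ) + λ * w)` for `λ ∈ [0, 1]`: `λ = 0` recovers the vanilla weights and `λ = 1` collapses every weight to 1. Below is a self-contained sketch of the transform itself; the estimator class names added to `obp.ope` are in PR #149.

```python
import numpy as np

def harmonic_power_mean_weights(iw: np.ndarray, lambda_: float) -> np.ndarray:
    """Shrink importance weights via the harmonic power-mean transform
    w -> w / ((1 - lambda_) + lambda_ * w) from Metelli et al. (2021).

    lambda_ = 0 leaves the weights untouched (vanilla IPW);
    lambda_ = 1 collapses every weight to 1 (no correction).
    """
    assert 0.0 <= lambda_ <= 1.0
    return iw / ((1.0 - lambda_) + lambda_ * iw)

# example: the largest weights are shrunk the most
iw = np.array([0.5, 1.0, 10.0, 100.0])
print(harmonic_power_mean_weights(iw, lambda_=0.1))
```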
Minors
- Make the package compatible with `sklearn>=1.0.0`.
- Fix/update error messages, docstrings, and examples.
References
- Haruka Kiyohara, Yuta Saito, Tatsuya Matsuhiro, Yusuke Narita, Nobuyuki Shimizu, and Yasuo Yamamoto. Doubly Robust Off-Policy Evaluation for Ranking Policies under the Cascade Behavior Model. WSDM2022.
- Arjun Sondhi, David Arbour, and Drew Dimmery. Balanced Off-Policy Evaluation in General Action Spaces. AISTATS2020.
- Yi Su, Pavithra Srinath, and Akshay Krishnamurthy. Adaptive Estimator Selection for Off-Policy Evaluation. ICML2020.
- George Tucker and Jonathan Lee. Improved Estimator Selection for Off-Policy Evaluation. 2021.
- Alberto Maria Metelli, Alessio Russo, and Marcello Restelli. Subgaussian and Differentiable Importance Sampling for Off-Policy Evaluation and Learning. NeurIPS2021.
- Noveen Sachdeva, Yi Su, and Thorsten Joachims. Off-policy Bandits with Deficient Support. KDD2020.
- Aman Agarwal, Soumya Basu, Tobias Schnabel, and Thorsten Joachims. Effective Evaluation using Logged Bandit Feedback from Multiple Loggers. KDD2018.
- Nathan Kallus, Yuta Saito, and Masatoshi Uehara. Optimal Off-Policy Evaluation from Multiple Logging Policies. ICML2021.