Bandit based Optimization of Multiple Objectives on a Music Streaming Platform

KDD 2020.

Keywords:
user-centric objective, objective optimization, user behavioral, multi-objective ε-greedy, online multi-objective (14+ more)

Abstract:

Recommender systems powering online multi-stakeholder platforms often face the challenge of jointly optimizing multiple objectives, in an attempt to efficiently match suppliers and consumers. Examples of such objectives include user behavioral metrics (e.g. clicks, streams, dwell time, etc.), supplier exposure objectives (e.g. diversity), …

Introduction
  • Platform ecosystems have witnessed an explosive growth by facilitating efficient interactions between multiple stakeholders, including e.g. buyers and retailers (Amazon), guests and hosts (AirBnb), riders and drivers (Uber), and listeners and artists (Spotify).
  • A recommender system powering a multi-stakeholder platform (e.g. Amazon, Uber) would aim at serving recommendations maximising user satisfaction, and would optimise for fair exposure to retailers and suppliers, alongside maximising revenue-related objectives [25]. Even in the case of user-centric recommendation systems, predicting user interest alone is not enough to ensure user satisfaction, and notions such as engagement, novelty and diversity greatly enhance user experiences [8] and should be promoted as well.
  • Unlike past approaches, which dealt with multi-objective optimization (MOO) in a traditional supervised learning setup, the focus of this work is on a more interactive, online, adaptive learning setting based on user feedback
Highlights
  • Platform ecosystems have witnessed an explosive growth by facilitating efficient interactions between multiple stakeholders, including e.g. buyers and retailers (Amazon), guests and hosts (AirBnb), riders and drivers (Uber), and listeners and artists (Spotify)
  • Recent advancements in understanding, interpreting and leveraging user behavioral signals have enabled such recommender systems to be optimized for many different user-centric objectives, including clicks [14], dwell time [43], session length [13], streaming time, and conversion [32], among others
  • We propose a multi-objective contextual bandit model based on the Generalized Gini Index (GGI) [5], introducing a model that assimilates contextual information and optimizes for a number of different, and possibly competing, objectives
  • We present correlation analysis across these different user- and supplier-centric objectives to motivate the need for multi-objective modeling
  • Based on correlation analysis across the different stakeholder objectives, we highlighted that trade-offs exist between objectives, and motivated the need for multi-objective optimization of recommender systems
  • We observe that all strategies that leverage contextual information significantly improve upon random selection, which indicates that the constructed features possess predictive power for all objectives and that the online ridge regression is effective
  • We present MO-LinCB, a multi-objective linear contextual bandit model that leverages the Gini aggregation function to scalarize multiple objectives, and propose a scalable gradient-ascent-based approach to learn the recommendation policy
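  • A minimal sketch of the Gini aggregation used for scalarization, in Python (the weight values, the objective names, and the assumption that objectives are normalised to a common scale are illustrative, not taken from the paper):

    import numpy as np

    def ggi(rewards, weights):
        """Generalized Gini Index aggregation of a per-objective reward vector.

        rewards: 1-D array with one entry per objective.
        weights: non-negative, non-increasing weights; the largest weight is
                 applied to the worst-performing objective.
        """
        rewards = np.asarray(rewards, dtype=float)
        weights = np.asarray(weights, dtype=float)
        # Sort objectives from worst to best so the largest weight multiplies
        # the smallest reward; this is what makes GGI fairness-inducing.
        return float(np.dot(np.sort(rewards), weights))

    # Example: three objectives (say clicks, stream time, promotion exposure),
    # each normalised to [0, 1], with decreasing weights 1, 1/2, 1/4.
    print(ggi([0.9, 0.2, 0.6], [1.0, 0.5, 0.25]))  # 0.725, pulled down by the weak objective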
Methods
  • The authors compare the proposed approach (MO-LinCB) with several established methods, including both state-of-the-art bandit recommendation techniques, as well as recent multi-objective bandit algorithms.

    (1) ε-greedy algorithm (ε-g (C), ε-g (S), ε-g (L)) [34]: a simple bandit approach that, instead of always picking the best available option, randomly explores other options with some probability ε (an arm-selection sketch for this and LinUCB appears after this list).
  • (2) LinUCB [19]: a widely used and deployed bandit approach based on the principle of optimism in the face of uncertainty
  • It chooses actions by their mean payoffs and uncertainty estimates.
  • (5) Mixed Strategy: A baseline model extending the pure strategy baseline that, at each round, proposes a probability distribution from which an arm is drawn
  • This differs from the pure strategy baseline mentioned above, wherein only a single arm is selected at each round based on the highest mean reward.
  • The authors consider variants of the proposed approach (MO-LinCB), which is a multi-objective contextual bandit based on GGI, trained with different user satisfaction and supplier diversity objectives.
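  • As referenced in item (1), a minimal arm-selection sketch for the ε-greedy and LinUCB baselines (the disjoint-arm ridge formulation and the ε and α values are illustrative assumptions, not the exact configurations used in the paper):

    import numpy as np

    rng = np.random.default_rng(0)

    def eps_greedy_choice(mean_rewards, eps=0.1):
        # With probability eps explore a random arm, otherwise exploit the arm
        # with the highest estimated mean reward.
        if rng.random() < eps:
            return int(rng.integers(len(mean_rewards)))
        return int(np.argmax(mean_rewards))

    def linucb_choice(x, A_per_arm, b_per_arm, alpha=1.0):
        # LinUCB-style selection: a ridge-regression mean estimate per arm plus
        # an exploration bonus proportional to the estimate's uncertainty.
        scores = []
        for A, b in zip(A_per_arm, b_per_arm):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                      # per-arm ridge coefficients
            scores.append(theta @ x + alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))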
Results
  • Each of the competing algorithms, i.e. MO-LinCB, LinUCB, ε-greedy, MO-OGDE and random selection, is repeated 100 times.
  • The step size is set to 5 and the number of iterations is set to 5 for MO-LinCB (a sketch of the corresponding update appears after this list).
  • The authors use this dataset in Sections 5.3 and 5.7 to compare the approaches on obtained reward and on a time-complexity-based scalability test, respectively
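  • A sketch of how the gradient step referenced above could be realised as projected subgradient ascent on the GGI of the expected per-objective rewards of a mixed strategy (the simplex parameterisation and projection are assumptions for illustration; only the step size and iteration count mirror the values reported above):

    import numpy as np

    def project_to_simplex(v):
        # Euclidean projection onto the probability simplex (sort-based method).
        u = np.sort(v)[::-1]
        css = np.cumsum(u)
        rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1.0))[0][-1]
        theta = (css[rho] - 1.0) / (rho + 1)
        return np.maximum(v - theta, 0.0)

    def ggi_subgradient_ascent(R, w, step=5.0, iters=5):
        """Learn a mixed strategy p over arms that maximises GGI(p @ R).

        R: (num_arms, num_objectives) matrix of estimated per-arm rewards.
        w: non-increasing GGI weights over objectives.
        """
        n_arms, n_obj = R.shape
        p = np.full(n_arms, 1.0 / n_arms)
        for _ in range(iters):
            y = p @ R                       # expected reward per objective
            order = np.argsort(y)           # worst objective first
            w_perm = np.empty(n_obj)
            w_perm[order] = w               # largest weight -> worst objective
            p = project_to_simplex(p + step * (R @ w_perm))
        return p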
Conclusion
  • Recommender systems powering multi-stakeholder platforms often need to consider different stakeholders and their objectives when serving recommendations.
  • Based on correlation analysis across the different stakeholder objectives, the authors highlighted that trade-offs exist between objectives, and motivated the need for multi-objective optimization of recommender systems.
  • To address this problem, the authors presented MO-LinCB, a multi-objective linear contextual bandit model, which leverages the Gini aggregation function to scalarize multiple objectives, and proposed a scalable gradient-ascent-based approach to learn the recommendation policy.
  • The proposed approach was able to obtain gains in a competing promotional objective, without hurting user satisfaction metrics
Objectives & Stakeholders
  • Machine learning systems powering modern recommender systems are optimized for and evaluated upon an increasing number of objectives and metrics.
  • User-centric recommender systems such as e-commerce portals optimize for different proxies of user satisfaction, including clicks, dwell time and conversion; whereas multi-stakeholder platforms optimize metrics for their various stakeholders, including guests/hosts (Airbnb), buyers/sellers (Amazon) and listeners/artists (Spotify).
  • The authors discuss scenarios wherein the system needs to jointly optimize for multiple metrics and propose a bandit model as a solution.
  • The authors begin by motivating an industrial use case of multi-objective modelling and present a data-driven analysis that motivates the need for multi-objective modelling for recommender systems (a sketch of such an analysis appears after this list).
  • Different sets have varying degrees of relevance to a user's interests, and users could be satisfied with the recommended set to varying extents
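  • A minimal sketch of such a data-driven analysis, computing pairwise correlations between per-session objective metrics (the column names and synthetic values are hypothetical placeholders, not the platform's actual logging schema):

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)

    # Hypothetical per-session log with one column per stakeholder objective.
    sessions = pd.DataFrame({
        "clicks":             rng.poisson(2, 1000),
        "stream_time_sec":    rng.exponential(600, 1000),
        "promotion_exposure": rng.uniform(0, 1, 1000),
    })

    # Pairwise Spearman correlations show which objectives move together and
    # which trade off against each other (negative entries).
    print(sessions.corr(method="spearman"))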
Tables
  • Table 1: Running times of different approaches
  • Table 2: P-values of 'clicks' for all pairs of algorithms
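  • One way a table like Table 2 could be populated, sketched with synthetic per-run 'clicks' rewards and pairwise Welch t-tests (the actual significance test used in the paper is not stated here, so the test choice and all numbers are assumptions):

    import itertools
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    # Hypothetical per-run 'clicks' reward for each algorithm; in the paper
    # every algorithm is repeated 100 times.
    runs = {
        "MO-LinCB":   rng.normal(0.62, 0.02, 100),
        "LinUCB":     rng.normal(0.60, 0.02, 100),
        "eps-greedy": rng.normal(0.58, 0.03, 100),
    }

    # Two-sided Welch t-tests for all pairs of algorithms.
    for a, b in itertools.combinations(runs, 2):
        p = stats.ttest_ind(runs[a], runs[b], equal_var=False).pvalue
        print(f"{a} vs {b}: p = {p:.3g}")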
Reference
  • Gediminas Adomavicius and YoungOk Kwon. 2007. New Recommendation Techniques for Multicriteria Rating Systems. IEEE Intelligent Systems (2007).
  • Deepak Agarwal, Bee-Chung Chen, Pradheep Elango, and Xuanhui Wang. 2011. Click shaping to optimize multiple objectives. In Proceedings of KDD 2011.
  • Deepak Agarwal, Bee-Chung Chen, Pradheep Elango, and Xuanhui Wang. 2012. Personalized click shaping through lagrangian duality for online recommendation. In Proceedings of SIGIR 2012.
  • Leon Barrett and Srini Narayanan. 2008. Learning All Optimal Policies with Multiple Criteria. In ICML. 41–47.
  • Róbert Busa-Fekete, Balázs Szörényi, Paul Weng, and Shie Mannor. 2017. Multi-objective Bandits: Optimizing the Generalized Gini Index. In ICML.
  • Konstantina Christakopoulou, Jaya Kawale, and Arindam Banerjee. 2017. Recommendation with capacity constraints. In Proceedings of CIKM 2017.
  • Wei Chu and Seung-Taek Park. 2009. Personalized Recommendation on Dynamic Content Using Predictive Bilinear Models. In WWW.
  • Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. 2010. Performance of Recommender Algorithms on Top-n Recommendation Tasks. In RecSys.
  • M. M. Drugan and A. Nowe. 2013. Designing multi-objective multi-armed bandits algorithms: A study. In International Joint Conference on Neural Networks (IJCNN).
  • Zoltán Gábor, Zsolt Kalmár, and Csaba Szepesvári. 1998. Multi-criteria Reinforcement Learning. In ICML. 197–205.
  • Rupesh Gupta, Guanfeng Liang, Ravi Kiran Tseng, Xiaoyu Chen, and Romer Rosales. 2016. Email volume optimization at LinkedIn. In KDD 2016.
  • Tamas Jambor and Jun Wang. 2010. Optimizing multiple objectives in collaborative filtering. In Proceedings of RecSys 2010.
  • Dietmar Jannach and Malte Ludewig. 2017. When recurrent neural networks meet the neighborhood for session-based recommendation. In Proceedings of the Eleventh ACM Conference on Recommender Systems. ACM, 306–310.
  • Thorsten Joachims. 2002. Optimizing search engines using clickthrough data. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 133–142.
  • I. Y. Kim and O. L. de Weck. 2006. Adaptive weighted sum method for multiobjective optimization: a new method for Pareto front generation. Structural and Multidisciplinary Optimization 31, 2 (2006), 105–116.
  • Anisio Lacerda. 2015. Contextual Bandits for Multi-objective Recommender Systems. In Proceedings of the 2015 Brazilian Conference on Intelligent Systems (BRACIS). 68–73.
  • Kleanthi Lakiotaki, Nikolaos F Matsatsinis, and Alexis Tsoukias. 2011. Multicriteria user modeling in recommender systems. IEEE Intelligent Systems (2011).
  • John Langford and Tong Zhang. 2008. The Epoch-Greedy Algorithm for Multiarmed Bandits with Side Information. In NIPS.
  • Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. 2010. A Contextual-bandit Approach to Personalized News Article Recommendation. In WWW.
  • Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. 2011. Unbiased Offline Evaluation of Contextual-bandit-based News Article Recommendation Algorithms. In WSDM.
  • C. Liu, X. Xu, and D. Hu. 2015. Multiobjective Reinforcement Learning: A Comprehensive Overview. IEEE Transactions on Systems, Man, and Cybernetics (2015).
  • Donald W. Marquardt and Ronald D. Snee. 1975. Ridge Regression in Practice. The American Statistician 29, 1 (1975), 3–20.
  • James McInerney, Benjamin Lacker, Samantha Hansen, Karl Higley, Hugues Bouchard, Alois Gruson, and Rishabh Mehrotra. 2018. Explore, Exploit, and Explain: Personalizing Explainable Recommendations with Bandits. In Proceedings of RecSys 2018.
  • Rishabh Mehrotra and Benjamin Carterette. 2019. Recommendations in a marketplace. In Proceedings of the 13th ACM Conference on Recommender Systems. 580–581.
  • Rishabh Mehrotra, James McInerney, Hugues Bouchard, Mounia Lalmas, and Fernando Diaz. 2018. Towards a Fair Marketplace: Counterfactual Evaluation of the trade-off between Relevance, Fairness & Satisfaction in Recommendation Systems. In CIKM.
  • K. Van Moffaert, K. Van Vaerenbergh, P. Vrancx, and A. Nowe. [n.d.].
  • Eric Moulines and Francis R. Bach. 2011. Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning. In NIPS.
  • Włodzimierz Ogryczak and Tomasz Sliwinski. 2003. On solving linear programs with the ordered weighted averaging objective. European Journal of Operational Research 148, 1 (2003), 80–91.
  • Saba Q. Yahyaa, Madalina M. Drugan, and Bernard Manderick. 2014. Knowledge Gradient for Multi-objective Multi-armed Bandit Algorithms. In Proceedings of the 6th International Conference on Agents and Artificial Intelligence - Volume 1.
  • Marco Tulio Ribeiro, Anisio Lacerda, Adriano Veloso, and Nivio Ziviani. 2012. Pareto-efficient hybridization for multi-objective recommender systems. In Proceedings of RecSys 2012.
  • Herbert Robbins and Sutton Monro. 1951. A stochastic approximation method. The Annals of Mathematical Statistics 22, 3 (1951), 400–407.
  • Lili Shan, Lei Lin, and Chengjie Sun. 2018. Combined Regression and Tripletwise Learning for Conversion Rate Prediction in Real-Time Bidding Advertising. In SIGIR 2018.
  • Aleksandrs Slivkins. 2014. Contextual Bandits with Similarity Information. J. Mach. Learn. Res. 15, 1 (2014), 2533–2568.
  • Richard S. Sutton and Andrew G. Barto. 1998. Reinforcement Learning: An Introduction. MIT Press.
  • Krysta M Svore, Maksims N Volkovs, and Christopher JC Burges. 2011. Learning to rank with multiple objective functions. In Proceedings of WWW 2011.
  • Cem Tekin and Eralp Turğay. 2018. Multi-objective contextual multi-armed bandit with a dominant objective. IEEE Transactions on Signal Processing (2018).
  • Eralp Turğay, Doruk Öner, and Cem Tekin. 2018. Multi-objective contextual bandit problem with similarity information. arXiv preprint arXiv:1803.04015 (2018).
  • Umair ul Hassan and Edward Curry. 2016. Efficient task assignment for spatial crowdsourcing: A combinatorial fractional optimization approach with semi-bandit learning. Expert Systems with Applications 58 (2016), 36–56.
  • John A. Weymark. 1981. Generalized Gini inequality indices. Mathematical Social Sciences 1 (1981), 409–430.
  • Lin Xiao, Min Zhang, Zhaoquan Gu, Yiqun Liu, and Shaoping Ma. [n.d.]. Fairness-aware group recommendation with pareto-efficiency. In RecSys 2018.
  • Saba Yahyaa, Madalina Drugan, and Bernard Manderick. 2015. Thompson Sampling in the Adaptive Linear Scalarized Multi Objective Multi Armed Bandit. In Proceedings of the 7th International Conference on Agents and Artificial Intelligence.
  • Xing Yi, Liangjie Hong, Erheng Zhong, Nanthan Nan Liu, and Suju Rajan. 2014. Beyond clicks: dwell time for personalization. In Proceedings of RecSys 2014.
  • Chongjie Zhang and Julie A Shah. 2014. Fairness in Multi-Agent Sequential Decision-Making. In NIPS. 2636–2644.