Incorporating intent propensities in personalized next best action recommendation

Proceedings of the 13th ACM Conference on Recommender Systems (2019)

Next best action (NBA) is a technique that is widely considered a best practice in modern personalized marketing. It takes users' unique characteristics into consideration and recommends next actions that help users progress towards business goals as quickly and smoothly as possible. Many NBA engines are built with rules handcrafted by marketers based on experience or gut feeling, which is not effective. In this proposal, we present our machine-learning-based approach to such a real-time recommendation engine, detail our design choices, and discuss evaluation techniques. In practice, there are several key challenges to consider. (a) The engine must deal with historical feedback that is typically incomplete and skewed towards a small set of actions. (b) Actions are typically dynamic: they can be added or removed at any time due to seasonal changes or shifts in business strategy. (c) The optimization objective is typically complex: it usually consists of reaching a set of target events or moving users to more preferred stages. The engine needs to account for all of these aspects. Standard classification or regression models are not suitable, because only bandit feedback is available and the sampling bias present in historical data cannot be handled properly. A conventional multi-armed bandit (MAB) model can address some of these challenges, but it lacks the ability to model multiple objectives. We present a propensity variant hybrid contextual multi-armed bandit model (PV-MAB) that can address all three challenges. PV-MAB consists of two components: an intent propensity model (I-Prop) and a hybrid contextual MAB (H-Bandit). H-Bandit can be considered a multi-policy contextual MAB, in which we model different aspects of user engagement separately and tailor each policy to the characteristics of that aspect. I-Prop leverages user intent signals to target different users toward the specific goals that are most relevant to them.
I-Prop acts as a policy selector, informing H-Bandit which strategy to choose for different users at different points in the journey. I-Prop is trained separately with features extracted from user profile affinities and past behaviors. To illustrate this design, we focus our discussion on how to incorporate two common, distinct objectives in H-Bandit. The first is to target and drive users to reach a small set of high-value goals (e.g., making a purchase or becoming a superfan); we call this the goal-oriented policy. The second is to promote progression into more advanced stages in a consumer journey (e.g., from login to completing a profile); we call this the stage-advancement policy. In the goal-oriented policy, we reward reaching the goals accordingly and use a classification predictor as the kernel function to predict the probabilities of achieving those goals. In the stage-advancement policy, we use the progression of stages as the reward. Customers can move forward in their journey, skip a few stages, or go back to previous stages to do more research or re-evaluation. The reward strategy gives higher rewards for larger positive stage progressions and no reward for zero or negative stage progression. Both policies incorporate Thompson Sampling with a Gaussian kernel for better exploration. One big difference between our hybrid model and a regular contextual bandit model is that, besides context information, we also mix user profile affinities into the model. These affinities tell us the user's intent and interests and what their typical journey path looks like. With these special features, our model is able to recommend different actions for users who show different interests (e.g., football ticket purchase vs. jersey purchase). Similarly, for fast shoppers who usually skip a few stages, our model recommends actions that quickly trigger goal achievement, while for research-oriented users, the model offers actions that move them gradually towards the next stages.
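The stage-advancement reward and the Thompson Sampling step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the linear reward in `stage_reward`, the standard normal prior, the unit noise variance, and all names (`GaussianThompsonArm`, `choose_action`) are assumptions introduced here, and the sketch omits the contextual features and profile affinities that the full model uses.

```python
import random


def stage_reward(prev_stage: int, next_stage: int) -> float:
    """Reward forward movement only; bigger jumps earn more.

    Assumption: reward grows linearly with the number of stages gained.
    Zero or negative progression earns nothing.
    """
    return float(max(next_stage - prev_stage, 0))


class GaussianThompsonArm:
    """Gaussian posterior over one action's mean reward.

    Assumptions: prior N(0, 1) on the mean and unit observation noise,
    which gives the posterior N(sum / (n + 1), 1 / (n + 1)).
    """

    def __init__(self) -> None:
        self.n = 0
        self.reward_sum = 0.0

    def sample(self, rng: random.Random) -> float:
        precision = self.n + 1
        mean = self.reward_sum / precision
        return rng.gauss(mean, (1.0 / precision) ** 0.5)

    def update(self, reward: float) -> None:
        self.n += 1
        self.reward_sum += reward


def choose_action(arms: dict, rng: random.Random) -> str:
    """Draw one posterior sample per action and pick the largest draw."""
    return max(arms, key=lambda name: arms[name].sample(rng))
```

Because the arms live in a plain dictionary, an action can be added or dropped between calls to `choose_action`; a newly added arm starts from the wide prior and is therefore explored more often at first.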
This hybrid strategy gives us a better understanding of user intent and behaviors, which in turn allows more personalized recommendations. We designed a time-sensitive rolling evaluation mechanism for offline evaluation of the system, with various hyperparameters that simulate behaviors in practice. Despite the lack of online evaluation, our strategy allows researchers and prospects to gain confidence through bounded expected performance. Evaluated on real-world data, we observed a reward gain of about 120%, with an overall confidence of around 0.95. A large portion of the improvement is contributed by the goal-oriented policy, which clearly demonstrates the discovery functionality of the intent propensity model.
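The abstract does not spell out how the time-sensitive rolling evaluation is constructed. One common way to set up such an offline evaluation is a rolling split over the interaction log, sketched below under that assumption. The function name `rolling_splits`, the `(timestamp, payload)` event format, and the window parameters are all hypothetical.

```python
def rolling_splits(events, train_span, test_span, step):
    """Yield (train, test) lists from (timestamp, payload) events.

    Each test window starts where its training window ends, so a policy
    is only scored on interactions later than those it was fitted on.
    Windows that would leave either side empty are skipped.
    """
    events = sorted(events, key=lambda e: e[0])
    if not events:
        return
    start, last = events[0][0], events[-1][0]
    while start + train_span <= last:
        train_end = start + train_span
        test_end = train_end + test_span
        train = [e for e in events if start <= e[0] < train_end]
        test = [e for e in events if train_end <= e[0] < test_end]
        if train and test:
            yield train, test
        start += step  # slide the whole window forward in time
```

Varying `train_span`, `test_span`, and `step` is one way to probe how sensitive the measured reward is to the amount and recency of history available to the model.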
multi-armed bandit, next best action, reinforcement learning