RedDust - a Large Reusable Dataset of Reddit User Traits.

LREC (2020)

Abstract
Social media is a rich source of assertions about personal traits, such as "I am a doctor" or "my hobby is playing tennis". Precisely identifying explicit assertions is difficult, though, because of the users' highly varied vocabulary and language expressions. Identifying personal traits from implicit assertions like "I've been at work treating patients all day" is even more challenging. This paper presents RedDust, a large-scale annotated resource for user profiling for over 300k Reddit users across five attributes: profession, hobby, family status, age, and gender. We construct RedDust using a diverse set of high-precision patterns and demonstrate its use as a resource for developing learning models to deal with implicit assertions. RedDust consists of users' personal traits, which are (attribute, value) pairs, along with users' post IDs, which may be used to retrieve the posts from a publicly available crawl or from the Reddit API. We discuss the construction of the resource and show interesting statistics and insights into the data. We also compare different classifiers, which can be learned from RedDust. To the best of our knowledge, RedDust is the first annotated language resource about Reddit users at large scale. We envision further use cases of RedDust for providing background knowledge about user traits, to enhance personalized search and recommendation as well as conversational agents.
Keywords
personal knowledge, user profiling, conversational text, online forums
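
To make the idea of high-precision patterns over explicit assertions concrete, here is a minimal sketch in Python. The patterns, the `PATTERNS` table, and the `extract_traits` function are hypothetical and chosen only for illustration; they are not the patterns or code used to build RedDust. The sketch maps a post to (attribute, value) pairs tagged with a post ID, mirroring the resource layout described in the abstract.

```python
import re

# Hypothetical patterns for illustration only; NOT the patterns used for RedDust.
# Each entry pairs an attribute name with a regex whose first group is the value.
PATTERNS = [
    ("profession", re.compile(r"\bI am an? (doctor|nurse|teacher)\b", re.IGNORECASE)),
    ("hobby", re.compile(r"\bmy hobby is (?:playing )?(\w+)\b", re.IGNORECASE)),
]


def extract_traits(post_id, text):
    """Return (post_id, attribute, value) triples for explicit assertions in text."""
    traits = []
    for attribute, pattern in PATTERNS:
        for match in pattern.finditer(text):
            traits.append((post_id, attribute, match.group(1).lower()))
    return traits


explicit = extract_traits("p1", "I am a doctor and my hobby is playing tennis.")
implicit = extract_traits("p2", "I've been at work treating patients all day.")
```

Under these assumed patterns, the explicit post yields a profession and a hobby pair, while the implicit post yields nothing, which is the gap the abstract proposes to close with models learned from the resource.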