Structural Clustering of Machine-Generated Mail

ACM International Conference on Information and Knowledge Management(2016)

引用 15|浏览374
暂无评分
摘要
Several recent studies have presented different approaches for clustering and classifying machine-generated mail based on email headers. We propose to expand these approaches by considering email message bodies. We argue that our approach can help increase coverage and precision in several tasks, and is especially critical for mail extraction. We remind that mail extraction supports a variety of mail mining applications such as ad re-targeting, mail search, and mail summarization. We introduce new structural clustering methods that leverage the HTML structure that is common to messages generated by a same mass-sender script. We discuss how such structural clustering can be conducted at different levels of granularity, using either strict or flexible matching constraints, depending on the use cases. We present large scale experiments carried over real Yahoo mail traffic. For our first use case of automatic mail extraction, we describe novel flexible-matching clustering methods that meet the key requirements of high intra-cluster similarity, adequate clusters size, and relatively small overall number of clusters. We identify the precise level of flexibility that is needed in order to achieve extremely high extraction precision (close to 100%), while producing relatively small number of clusters. For our second use case, namely, mail classification, we show that strict structural matching is more adequate, achieving precision and recall rates between 85%-90%, while converging to a stable classification after a short learning cycle. This represents an increase of 10%-20% compared to the sender-based method described in previous work, when run over the same period length. Our work has been deployed in production in Yahoo mail backend.
更多
查看译文
关键词
mail mining,mail clustering,mail structural clustering,machine generated mail,mail classification,mail extraction
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要