Extracting Product Information from Email Receipts Using Markov Logic

msra(2009)

Abstract
Email receipts (e-receipts) frequently record e-commerce transactions between users and online retailers, and contain a wealth of product information. Such information could be used in a variety of applications if it could be reliably extracted. However, extracting product information from e-receipts poses several challenges. For example, the high labor cost of annotating e-receipts makes traditional supervised approaches infeasible. E-receipts may also be generated from a variety of templates, and are usually encoded in plain text rather than HTML, making it difficult to discover the regularity of how product information is presented. In this paper, we present an approach that addresses all these challenges. Our approach is based on Markov logic (22), a language that combines probability and logic. From a corpus of unlabeled e-receipts, we identify all possible templates by jointly clustering the e-receipts and the lines in them. From the non-template portions of e-receipts, we learn patterns describing how product information is laid out, and use them to extract the product information. Experiments on a corpus of real-world e-receipts demonstrate that our approach performs well. Furthermore, the extracted information can be reliably used as labeled data to bootstrap a supervised statistical model, and our experiments show that such a model is able to extract even more product information.
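To make the template versus non-template distinction concrete, here is a minimal sketch in Python. It is not the paper's method: the paper clusters e-receipts and their lines jointly with Markov logic, whereas this sketch swaps in a plain line-frequency heuristic. The function name `split_template_lines`, the parameter `min_share`, and the sample receipts are all hypothetical and serve only as illustration.

```python
from collections import Counter


def split_template_lines(receipts, min_share=0.6):
    """Split each receipt into (template lines, non-template lines).

    Assumption (not from the paper): a line counts as template text when
    it appears in at least `min_share` of the receipts. The paper instead
    uses joint clustering with Markov logic.
    """
    # Count each distinct non-empty line at most once per receipt.
    doc_freq = Counter()
    for text in receipts:
        doc_freq.update({ln.strip() for ln in text.splitlines() if ln.strip()})

    threshold = min_share * len(receipts)
    result = []
    for text in receipts:
        lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        template = [ln for ln in lines if doc_freq[ln] >= threshold]
        variable = [ln for ln in lines if doc_freq[ln] < threshold]
        result.append((template, variable))
    return result


# Hypothetical receipts that share one retailer template.
receipts = [
    "Thank you for your order\nUSB cable  $4.99\nTotal: $4.99",
    "Thank you for your order\nDesk lamp  $19.50\nTotal: $19.50",
    "Thank you for your order\nNotebook  $2.25\nTotal: $2.25",
]
parts = split_template_lines(receipts)
```

In this toy corpus the greeting line is shared by every receipt and lands in the template part, while the product and total lines vary and are left as the non-template portion, which is where layout patterns for product information would then be learned. Exact string matching is a deliberate simplification; real e-receipts would likely need fuzzier line comparison.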
Keywords
statistical model, e-commerce