De-duping URLs with Sequence-to-Sequence Neural Networks
SIGIR (2017)
Abstract
Many URLs on the Internet point to identical content, which increases the burden on web crawlers. Techniques that detect such URLs (known as URL de-duping) can save crawlers substantial resources such as bandwidth and storage. Traditional de-duping methods are usually limited to heavily engineered rule-matching strategies. In this work, we propose a novel URL de-duping framework based on sequence-to-sequence (Seq2Seq) neural networks. A single concise translation model can take the place of thousands of explicit rules. Experiments indicate that a vanilla Seq2Seq architecture yields robust and accurate results in detecting duplicate URLs. Furthermore, we demonstrate the efficiency of this framework in a real large-scale web environment.
Keywords
Web Crawling, URL De-duplication, Sequence-to-Sequence Neural Network