Capturing Web Dynamics by Regular Approximation
Lecture Notes in Computer Science (2004)
Abstract
Software systems such as Web crawlers, Web archives, and Web caches depend on, or can be improved by, knowledge of the update times of remote sources. Assuming that the time intervals between updates follow an exponential distribution, the literature presents diverse statistical methods for finding optimal reload times of remote sources. In this article we first present the observation that the temporal behavior of a fraction of Web data can be described more precisely by regular or quasi-regular grammars. Second, we present an approach for estimating the parameters of such grammars automatically. By comparing a reload policy based on regular approximation with previous methods based on the exponential distribution, we show that the quality of local copies of remote sources, in terms of 'freshness' and the amount of lost data, can be improved significantly.
Keywords
web crawler,software systems,exponential distribution