Discussion Report: Hypersearching the web

semanticscholar(2007)

Abstract
With the WWW growing hour by hour, searching for and obtaining high-quality, relevant results from a collection of unstructured, unorganized information (which the WWW truly is) has become a complicated task. In addition, web sites are written in multiple languages, styles and dialects, and contain truth, falsehood, wisdom, propaganda or sheer nonsense. Distinguishing the most relevant pages from the thousands of others that contain the exact same keywords but have a completely different context and aim is both challenging and important. Search engines use heuristics, also known as ranking functions, to prioritize web sites and hence determine their relevance to the search term. In the past, search engines implemented simple heuristics, such as favoring pages by the number of times they contain the query term or by the location and size of the keywords. Simple heuristics, however, did more harm than good to the search results. Because these heuristics are easy to manipulate, many commercial web sites exploited their weaknesses with techniques like spamming, which made it very difficult to maintain an effective search engine. For example, a site could insert phrases many times over in colors and fonts that are invisible to the human eye. A search engine with simple heuristics would nevertheless count all of these words as valid and give the web page a favorable ranking. Moreover, human language, rich with synonymy and polysemy, makes search even more complex. For example, consider a word like “business”. It can have multiple meanings, such as (a) a purposeful activity, (b) a role or function, (c) an affair or matter, (d) a personal concern, etc., and it can be expressed or substituted by many different words, such as commerce, trade, industry, work, etc. So whenever a search is performed, it is not sufficient to return only the pages that contain the keyword. In fact, many of the more relevant web pages might not contain the search keywords at all. For example, in a search for “automobiles”, many pages might lack the word “automobile” but instead contain “car”, and such results of course cannot be excluded.
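
The weakness of a purely count-based heuristic can be illustrated with a small sketch. The code below is not taken from the report; it is a minimal, assumed model of a "simple heuristic" that scores a page by how often the query term occurs. The page names, page texts, and the helper functions `tokenize`, `term_frequency_score`, and `rank` are all hypothetical and exist only for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_frequency_score(page_text, query):
    """Score a page by the total number of occurrences of the query terms.

    This is an assumed stand-in for the simple heuristics described above,
    not the ranking function of any real search engine.
    """
    counts = Counter(tokenize(page_text))
    return sum(counts[term] for term in tokenize(query))


def rank(pages, query):
    """Return page names ordered from highest to lowest score."""
    return sorted(
        pages,
        key=lambda name: term_frequency_score(pages[name], query),
        reverse=True,
    )


# Hypothetical pages: a relevant page that uses a synonym, a page stuffed
# with the keyword (as with invisible text), and an ordinary page.
pages = {
    "honest_review": "An independent review of a family car: reliability, fuel use and price.",
    "stuffed_ad": "Buy now! " + "automobile " * 50,
    "dealer_page": "Our automobile dealership sells and services every automobile we stock.",
}

ranking = rank(pages, "automobile")
print(ranking)
```

Under these assumptions the keyword-stuffed page is ranked first, while the review that only uses the synonym “car” receives a score of zero and is ranked last. This mirrors both problems raised in the abstract: count-based heuristics reward spamming, and exact keyword matching misses relevant pages that use different words.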