Extracting Names from Websites Containing Lists of People

mag(2006)

Abstract
In this paper we describe an automated system for extracting people’s names from websites containing lists of people. The contents of these websites describe attributes common to the people listed. This public information has strategic value, for example in showing who tends to appear at similar events. Unlike traditional named entity recognition (NER), we extract names embedded in HTML without natural language context. We use a hidden Markov model (HMM) to segment the document’s HTML source in order to extract entire names. Engineering features for this classifier led us to several general types of features useful for segmenting text in structured documents. Rosters may order first and last names in many ways, so a first/last classifier determines the ordering used by each document, using dictionaries to provide partial knowledge of the distribution of names across token positions. To abstract away the HTML, the first/last classifier uses the two-dimensional coordinates of the text as it would appear when rendered by a browser. Averaged over a corpus of 37 documents containing approximately 10,000 names, the HMM segmenter achieved 95% precision and 91% recall, while the first/last classifier achieved 84% precision and 82% recall.
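As a rough illustration of HMM-based segmentation, the sketch below labels each token of a tokenized HTML fragment as NAME or OTHER with Viterbi decoding, then joins consecutive NAME tokens into full names. It is not the system described in the abstract: the two-state layout, the coarse `feature` function, and every probability in `START`, `TRANS`, and `EMIT` are hypothetical values chosen by hand for the example, whereas a real segmenter would estimate its parameters from labeled documents and use richer features.

```python
import math

STATES = ("NAME", "OTHER")

# Hypothetical toy parameters, chosen by hand for illustration only.
START = {"NAME": 0.3, "OTHER": 0.7}
TRANS = {
    "NAME": {"NAME": 0.6, "OTHER": 0.4},
    "OTHER": {"NAME": 0.3, "OTHER": 0.7},
}
# Emission probabilities over three coarse token features.
EMIT = {
    "NAME": {"capitalized": 0.8, "tag": 0.05, "other": 0.15},
    "OTHER": {"capitalized": 0.1, "tag": 0.6, "other": 0.3},
}


def feature(token):
    """Map a token to a coarse observation symbol (an assumed feature set)."""
    if token.startswith("<") and token.endswith(">"):
        return "tag"
    if token[:1].isupper() and token[1:].islower():
        return "capitalized"
    return "other"


def viterbi(tokens):
    """Return the most probable state sequence for the tokens (log space)."""
    if not tokens:
        return []
    obs = [feature(t) for t in tokens]
    score = [{s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}]
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for o in obs[1:]:
        prev = score[-1]
        cur, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            cur[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][o])
            ptr[s] = best
        score.append(cur)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


def extract_names(tokens):
    """Join each maximal run of NAME-labeled tokens into one name string."""
    names, current = [], []
    for token, label in zip(tokens, viterbi(tokens)):
        if label == "NAME":
            current.append(token)
        elif current:
            names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names
```

With these toy parameters, `extract_names(["<li>", "Ada", "Lovelace", "</li>", "<li>", "Alan", "Turing", "</li>"])` returns `["Ada Lovelace", "Alan Turing"]`. Working in log space avoids numerical underflow on long documents, which matters when a single roster page contains thousands of tokens.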
Keywords
information retrieval, machine learning, hidden Markov models