Researcher Profile Extraction Spec

First draft by Jie Tang

May 29, 2007


The data set and related documents are used for researcher profile extraction, also called researcher profiling (Tang et al., 2007; Tang et al., 2008).


The related data are as follows.

898_data: dataset used in paper, without url information (898 files)

Notice: we also annotate block information. See the specification below.

898_data_url: dataset annotated with url information (898 files)

all_data: all data annotated, with url and image information.

Img_898_data: images of 898_data

Img_all_data: images of all_data

id2name.xml: a list of id and person_name pairs. The id is the filename in the dataset.


Representative publications:

Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. ArnetMiner: Extraction and Mining of Academic Social Networks. In Proceedings of the Fourteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD’2008).

Jie Tang, Duo Zhang, and Limin Yao. Social Network Extraction of Academic Researchers. In Proceedings of 2007 IEEE International Conference on Data Mining (ICDM’2007). pp. 292-301.


Below, we give the specification for the researcher profiling task.

Spec on Researcher Profile Extraction


We are developing extraction tools in ArnetMiner, a researcher social network system. The tools will be used to extract researcher profiles from Web pages and output the extracted information into a researcher database.

This document describes the specs of the researcher profile that needs to be extracted while building the researcher network. All the specs here focus on English.

General Principle

Generally speaking, a researcher profile is defined as shown in Table 1.

Table 1. Researcher profile annotation subtasks

Annotation Subtasks

Basic Information
    Person Photo

Research Interest

Contact Information
    Email Address
    Postal Address

Educational History
    University, Major, Date, and Advisor for Ph.D. Degree
    University, Major, Date, and Advisor for M.S. Degree
    University, Major, Date, and Advisor for B.S. Degree

Authored Papers and Technical Reports

Each component in the right column of Table 1 is defined as a property of a researcher and may consist of words (standard words like “professor” and non-standard words like URLs and email addresses).

Although nesting is possible, we do not annotate one tag inside another. For example, we prefer

“ [affiliation]Department of Computer Science

University of Victoria[/affiliation]

[address]Victoria, B.C.

CANADA V8W 3P6[/address]”

instead of 

“[address] [affiliation]Department of Computer Science

University of Victoria[/affiliation]

Victoria, B.C.

CANADA V8W 3P6[/address]”


* About tokenization

Annotation tags must not break the structure of the HTML. For example, keep <b></b> as it is; likewise for <font size=3></font> and <a href=""></a>. In particular, font and bold tags can be nested, and the annotation should preserve that nesting.

    - Each word is separated by word breakers or sentence breakers.

    - Breakers are usually not included in a word when they act as separators. E.g., “(student)” → “student”, “A#B” → “A”, “B”, “time/space” → “time”, “space”. However, some breakers are treated as separate words, e.g., “(”, “,”, “.”, “*”, and a line break.

    - The HTML tag “<Image>” is tokenized as a single word, e.g., “<IMAGE src="defaul3.jpg" alt=""/>”.

    - Words can be connected by hyphen and underscore symbols: “pre-condition”, “necessary_condition”.

    - Non-standard words are defined as below.

        - email_body, like “”.

        - email_pre, like “Email:”.

        - Fax_body, like “010-62789831”.

        - Fax_pre, like “fax number:”.

        - Phone_pre, like “phone number:”.

        - Position_body, like “Assistant Professor”.

        - URL, like “”.

        - Words can be acronyms, like I.B.M., although they contain the breaker “.”.

        - “,” can be considered part of a word when the word is a number (e.g., 12,000).

        - Some special symbols can be parts of proper nouns, e.g., “C#”, “Yahoo!”, “P&G”, and “P2P” are words.

        - Words can be machine-generated strings containing breakers (e.g., “4#790174ajaj”).
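The tokenization rules above can be approximated with a single regular expression. The following is only a minimal sketch, not the tokenizer used in ArnetMiner. The names `TOKEN_RE` and `tokenize` are hypothetical. The spec does not say exactly when a breaker is kept as a word, so this sketch assumes that “,”, “.”, “*” and the line break are always kept and that every other breaker (including parentheses, “/” and “#”) is dropped. Non-standard words such as email addresses, URLs, and proper nouns like “C#” are not handled.

```python
import re

# Alternatives are tried left to right at each position; characters that match
# none of them (spaces, "/", "#", parentheses, ...) are dropped as separators.
# Assumption: only ",", ".", "*" and the line break are kept as words.
TOKEN_RE = re.compile(
    r"""
      <[^>]+>                               # an HTML tag is a single word
    | \n                                    # a line break is a word
    | (?:[A-Za-z]\.){2,}                    # acronyms such as I.B.M.
    | \d{1,3}(?:,\d{3})+                    # numbers such as 12,000
    | [A-Za-z0-9]+(?:[-_][A-Za-z0-9]+)*     # words, possibly joined by - or _
    | [,.*]                                 # breakers kept as separate words
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split text into words following the simplified rules above."""
    return TOKEN_RE.findall(text)


print(tokenize("time/space"))      # ['time', 'space']
print(tokenize("pre-condition"))   # ['pre-condition']
```

The order of the alternatives matters: acronyms and comma-grouped numbers are tried before plain words so that “I.B.M.” and “12,000” are not split at their breakers.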

* Person Photo

    - A person photo is a picture denoted by an “<Image>” tag in HTML. When extracting information from the Web, we remove all other tags except the “<Image>” tag. We also download the picture from the Web so that we can define content features for the picture by analyzing its colors and size.

    - A person photo should at least contain the face of the current researcher.

    - A person photo may also contain the faces of several people, including the current researcher.

    - A person photo can be black and white.

* Position

    - Position represents the current position of the current researcher, not past ones. For example, in “He was a professor”, “was” indicates a former position, so it should not be annotated as position.

    - Position is not the degree title of the current researcher, for example “Dr.” or “Ph.D.”.

    - A position can be “Assistant Professor” or “Co-chair”.

    - A position should contain only the title of a researcher, without his or her research area or department. In the case of “Professor of Computer Science at Texas A&M University.”, we annotate only “Professor” as position and annotate “Computer Science at Texas A&M University.” as affiliation. Another example: “and is now [position]Professor[/position] of Astronomy and [position]Director[/position] of Graduate Studies at [affiliation]Cornell University[/affiliation]”. “Editor in chief” or “Editor” can be annotated as position. In the case of “He is a member of IEEE Society Community”, only “member” should be annotated as position, not “member of IEEE Society Community”. A leading article (“a” or “the”) should not be included in the position. (pay attention to this)

    - A researcher may have more than one position. E.g., one can be the head of a company research group and also an adjunct professor at a university.

* Affiliation

    - Like position, affiliation represents the current affiliation, not past ones.

    - An affiliation inside an address should not be annotated. E.g., in “[address]CSE Building Room E301 University of Florida - P.O. Box 116120 Gainesville, FL  32611-6120 USA[/address]”, the text “University of Florida” should not be annotated as affiliation.

    - Text preceded by “Mail:”, “Address:”, “Office:”, or “Contact information:” should be annotated as address even though it may look like an affiliation, e.g., “Department of Computer Science, Tsinghua University, Beijing.” (pay attention to this spec)

    - However, when an organization co-occurs with an address, as in “[affiliation]IBM Almaden Research Center[/affiliation] [address]Dept. 8CC/B1 650 Harry Road San Jose, California 95120 USA[/address]”, annotate them separately. (pay attention to this)

    - A researcher may have more than one affiliation.

    - In a case like “I am a [position]Phd student[/position] at the [affiliation]Computer Science Dept. of Tel Aviv University[/affiliation]”, we annotate the information as affiliation instead of phdmajor and phduniv.

    - Annotate “a [position]member[/position] of [affiliation]IEEE, ACM…[/affiliation]”, for example:

I am currently serving on the [position]program committees[/position] of [affiliation]AAMAS 2007, AMEC 2007, ACM EC 2007, and AAAI 2007[/affiliation].

* Postal Address

    - An address should be a physical mailing address, containing for example a room number, building, and road.

    - Sometimes a researcher may have an office address in addition to his/her contact address.

Special case: [address]R. Dr. Xavier Sigaud 150, Urca</font>


<font size=2> 22290-180 Rio de Janeiro - RJ</font>


<font size=2> Brazil[/address]</font>


<font size=2>Room : [address]3rd Floor, CAT[/address] 39084.txt

* Research Interest (do not annotate this tag or block)

    - The names of the research topics that the researcher is interested in. They may appear within the introduction part, in the form of natural-language sentences.

For example, the simple format:

[interests]Machine learning and pattern recognition[/interests]

[interests]Computer vision[/interests], [interests]speech recognition[/interests]

[interests]Programming language system development[/interests], Lush (look it up in the dictionary; it means thriving)

In the examples above, we treat “,” as a separator, so each comma-separated phrase gets its own tag.


    - The subtitle of a research project should be annotated as interests.

Research Projects


  * <b>[interests]Schema Mapping Generation[/interests] (Clio).</b>


    - Interests may also be contained in natural-language sentences. Terms that are likely to be topics or subtopics of an area should be annotated as interests. (pay attention to this spec)

I'm an [position]associate professor[/position] at [affiliation]the University of Colorado, in the Department of Computer Science[/affiliation]. I work in the area of [interests]computer systems[/interests], broadly defined. This includes [interests]Computer Architecture[/interests], [interests]Operating systems[/interests], [interests]mobile & wireless[/interests] and whatever else I feel like (ahh... the joys of tenure).



* Homepage

    - Not available

    - We do not intend to annotate the homepage URL in the web page, because we assume that the web page we find on the Web is itself the homepage of the current researcher.

* Email Address (pay attention to this, all formats of emails)

    - One can have more than one email address.

    - An email address can be represented in diverse forms. Some examples are given below.

        - hangli at microsoft dot com

        - , junyang

        - erafalin(at)

    - Some email addresses may be represented as a picture. For example, “my email address: <Image src=’email.jpg’/>”. It should be annotated.

    - Email like this example should also be annotated: “e-mail: [email]Natalio.Krasnogor -replace all this by at simbol-[/email]”

    - Email like this: “wmt then the at-sign then uci dot edu” should be annotated.
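Once such a span has been annotated as email, a downstream step may want to turn the obfuscated form back into a conventional address. The sketch below is one possible way to do this and is not part of the spec. The function name `normalize_email` and the list of rewrite patterns are assumptions that cover only the textual forms quoted above. It should be applied only to text already tagged as email, since “ at ” is an ordinary English word elsewhere. Image-based addresses and free-form hints such as “-replace all this by at simbol-” are not handled.

```python
import re

# Each pair is (pattern, replacement); patterns are applied in order.
# Assumption: these patterns cover only the obfuscation styles quoted above.
_REWRITES = [
    (r"\s*then the at-sign then\s*", "@"),
    (r"\s*[\(\[]\s*at\s*[\)\]]\s*", "@"),
    (r"\s+at\s+", "@"),
    (r"\s*[\(\[]\s*dot\s*[\)\]]\s*", "."),
    (r"\s+dot\s+", "."),
]


def normalize_email(span):
    """Rewrite a textually obfuscated email span into user@host form."""
    text = span.strip()
    for pattern, replacement in _REWRITES:
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text


print(normalize_email("hangli at microsoft dot com"))            # hangli@microsoft.com
print(normalize_email("wmt then the at-sign then uci dot edu"))  # wmt@uci.edu
```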

* Phone

    - A phone number can be a cell number, office phone number, home phone number, or even the phone number of the current researcher's secretary.

    - A researcher can have more than one phone number.

    - A phone number can be a long one (including country and area codes and an extension code) or a short one (including only part of the phone number, e.g., “88788-20”).

* Fax

    - The format of a fax number is exactly the same as that of a phone number.

    - One can use the preceding text to disambiguate the fax number from the phone number.
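The idea of using the preceding text to tell fax from phone can be sketched as follows. This is an illustrative heuristic under stated assumptions, not the method used in the papers: `NUMBER_RE` is a deliberately rough pattern for phone-like strings (it will also match things like ZIP+4 codes), and the rule "the text since the previous number contains the word fax" is an assumption. The names `NUMBER_RE`, `FAX_CUE`, and `classify_numbers` are hypothetical.

```python
import re

# Rough pattern for phone-like strings: optional "+", optional "(", then at
# least seven characters made of digits, spaces, parentheses, dots or hyphens,
# starting and ending with a digit. This is an assumption, not a spec rule.
NUMBER_RE = re.compile(r"\+?\(?\d[\d\s().\-]{5,}\d")
FAX_CUE = re.compile(r"fax", re.IGNORECASE)


def classify_numbers(line):
    """Return (label, number) pairs for the phone-like strings in one line."""
    results = []
    prev_end = 0
    for m in NUMBER_RE.finditer(line):
        # Only the text between the previous number and this one is inspected.
        prefix = line[prev_end:m.start()]
        label = "fax" if FAX_CUE.search(prefix) else "phone"
        results.append((label, m.group()))
        prev_end = m.end()
    return results


print(classify_numbers("Tel: 919-660-6587 Fax: 919-660-6519"))
# [('phone', '919-660-6587'), ('fax', '919-660-6519')]
```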

* Educational History

    - (PhD/MS/BS) University

        - In the case of “I received my MS and PhD in [phdmajor]Computer Science[/phdmajor] from the [phduniv]University of Pennsylvania[/phduniv]”, we annotate the text “University of Pennsylvania” only as “phduniv”. Similarly, when BS and MS occur together, we annotate only “msuniv”.

        - In the case of “Zhejiang University, China”, we annotate only “Zhejiang University”, without the place or country.

        - 40923: information of one individual, two different affiliations.

        - BS, Dept. of [bsmajor]Civil Eng.[/bsmajor], [bsuniv]National Taiwan University[/bsuniv], [bsdate]2003[/bsdate]: in this case, “Civil Eng.” should be annotated as bsmajor.

    - (PhD/MS/BS) Date

        - The date is when the researcher obtained the corresponding degree.

        - In general, the date is represented only as a year. However, in some cases it can include the month, e.g., “September 2000”.

        - In the case of “From xx to [date]xx[/date]”, we annotate only the latter “xx”. The same holds for 2002.7-[date]2006.7[/date].

        - If the date is marked “expected”, do not annotate it.

    - (PhD/MS/BS) Major: “robotics” or subareas of a large area can be annotated as PhD/MS/BS Major.

    - When a subarea co-occurs with a larger area, we annotate both. For example:

PhD (supervisor: Professor K. Glover), [phddate]July 1998[/phddate], [phdmajor]Control[/phdmajor] Group, Department of [phdmajor]Engineering[/phdmajor]

    - (PhD/MS/BS) Advisor: we can add a tag in the data_tag tool (not available).
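Two of the date rules above (take the later date of a range, and skip dates marked “expected”) lend themselves to a small sketch. The following is an illustration under stated assumptions rather than a complete implementation: the function `degree_date` is hypothetical, it expects the text of a single degree entry, it recognizes only years of the form 19xx/20xx with an optional “.month” suffix, and it does not capture month names such as “September”.

```python
import re

# Assumption: a date is a four-digit year starting with 19 or 20, optionally
# followed by ".M" or ".MM" as in "2006.7". Month names are not captured.
DATE_RE = re.compile(r"\b(?:19|20)\d{2}(?:\.\d{1,2})?\b")
EXPECTED_RE = re.compile(r"\bexpected\b", re.IGNORECASE)


def degree_date(text):
    """Return the date to annotate for one degree entry, or None."""
    if EXPECTED_RE.search(text):
        return None                      # "expected" dates are not annotated
    dates = DATE_RE.findall(text)
    return dates[-1] if dates else None  # in a range, keep the later date


print(degree_date("2002.7-2006.7"))        # 2006.7
print(degree_date("Ph.D. expected 2009"))  # None
```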

* Publications

    - Not available

    - So far, we do not consider annotating the authored papers and technical reports from the researchers’ web pages.

* Relationships

    - So far, we do not consider annotating relationships from one’s web pages either.


Line breaker

One can conduct experiments by treating each line in the web page as a unit. The line breaker is then simply the line break between two lines.

Word breaker

An HTML tag is a single word.

A line return is a word.

For the others, the breaker is defined as:





Person Photo


<IMAGE src="lucian.jpg" alt="Recent photo of Lucian"/>

<IMAGE src="index_files/image002.jpg" alt=""/>

<IMAGE src="myself.JPG" alt=""/>

<IMAGE src="/presspass/images/gallery/execs/thumbnails/lee-2.jpg" alt=""/>

<IMAGE src="/presspass/images/exec/bio_lee.jpg" alt=""/>







corporate vice president



Lead Researcher and Research Manager


Associate Professor

Director of Research


Assistant Professor

Project Leader






hangli at microsoft dot com, junyang



<Image src=“email.jpg”>



mshong [at]




[position]Professor[/position] of Applied Mathematics and Computer Science

[affiliation]Department of Computer Science

Yale University[/affiliation]

[address]PO Box 208285

New Haven, CT 06520-8285[/address]

[affiliation]New York University

Courant Institute[/affiliation]

[address]251 Mercer Street

New York, NY 10012[/address]

As [position]corporate vice president[/position] of the [affiliation]Natural Interactive Services Division (NISD) at Microsoft Corp.[/affiliation]



[affiliation]New York University

Courant Institute[/affiliation]

[address]251 Mercer Street

New York, NY 10012[/address]

[address]Microsoft Research Asia


4F, Sigma Center


No. 49 Zhichun Road, Haidian District


Beijing, China 100080[/address]

Richard Segal [affiliation]IBM Thomas J. Watson Research Center[/affiliation]

[address]PO Box 704, Room 4S-B46

Yorktown Heights, NY 10598[/address]






(+41 22) 379 58 85


Tel: [phone](86-10)58963177[/phone]


Telephone: [phone]203.432.6432[/phone]








425) 882-8080

(425) 882-8080

x.4870, ext.4870

425 882 8080

1 (425) 882-8080

882.8080, 425.882.8080


+1 (425) 882-8080







Fax: [fax](86-10)88097306[/fax]


Fax. [fax]+82-62-970-2004[/fax]









I obtained a [bsdegree]B.S.[/bsdegree] in [bsmajor]Electrical Engineering[/bsmajor] from [bsuniv]Kyoto University[/bsuniv] in [bsdate]1988[/bsdate] and a [msdegree]M.S.[/msdegree] in [msmajor]Computer Science[/msmajor] from [msuniv]Kyoto University[/msuniv] in [msdate]1990[/msdate]. I earned my [phddegree]Ph.D.[/phddegree] in [phdmajor]Computer Science[/phdmajor] from [phduniv]the University of Tokyo[/phduniv] in [phddate]1998[/phddate].

I received my [msdegree]M.S.[/msdegree], [phddegree]Ph.D.[/phddegree] from [phduniv]Stanford University[/phduniv] and [bsdegree]B.A.[/bsdegree] from [bsuniv]University of California, Berkeley[/bsuniv]. Here is my curriculum vitae.

<b>Educational Background:</b>

[phddegree]Ph.D.[/phddegree] [phdmajor]Physics[/phdmajor], [phduniv]USC[/phduniv], Los Angeles, U.S.A.



Annotation Tags

To construct an evaluation data set that conforms to the spec above and can easily adapt to potential spec changes, we define the following tags. (These are the beginning tags; the corresponding end tags are [/tagname], e.g., [/address]. We use brackets rather than “<>” to differentiate them from the HTML tags in the web pages.)

1.       [address]

2.       [affiliation]

3.       [bsdate]

4.       [bsmajor]

5.       [bsuniv]

6.       [email]

7.       [fax]

8.       [msdate]

9.       [msmajor]

10.   [msuniv]

11.   [phddate]

12.   [phdmajor]

13.   [phduniv]

14.   [phone]

15.   [position]
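To read annotated files that use these bracket tags, one can pair each beginning tag with its end tag. The sketch below is a minimal reader under stated assumptions, not the tool used to build the dataset: the names `SPEC_TAGS` and `extract_annotations` are hypothetical, only the fifteen tag names listed above are recognized by default (so a bracketed word such as “[at]” in an email is left alone), and any mismatched or unclosed tag raises `ValueError`.

```python
import re

# The fifteen tag names listed above; pass a larger set if more tags are used.
SPEC_TAGS = frozenset(
    "address affiliation bsdate bsmajor bsuniv email fax msdate msmajor "
    "msuniv phddate phdmajor phduniv phone position".split()
)


def extract_annotations(text, tags=SPEC_TAGS):
    """Return (tag, enclosed_text) pairs in the order the tags are closed.

    Raises ValueError when an end tag does not match the most recent
    beginning tag, or when a tag is never closed.
    """
    names = "|".join(map(re.escape, sorted(tags)))
    tag_re = re.compile(r"\[(/?)(" + names + r")\]")
    open_tags = []   # stack of (name, offset where the enclosed text starts)
    found = []
    for m in tag_re.finditer(text):
        is_close, name = m.group(1) == "/", m.group(2)
        if not is_close:
            open_tags.append((name, m.end()))
            continue
        if not open_tags or open_tags[-1][0] != name:
            raise ValueError(f"unexpected [/{name}] at offset {m.start()}")
        _, start = open_tags.pop()
        # Strip any recognized tags that appear inside the enclosed text.
        found.append((name, tag_re.sub("", text[start:m.start()])))
    if open_tags:
        raise ValueError(f"unclosed [{open_tags[-1][0]}]")
    return found


print(extract_annotations("Tel: [phone]919-660-6587[/phone] Fax: [fax]919-660-6519[/fax]"))
# [('phone', '919-660-6587'), ('fax', '919-660-6519')]
```

Raising an error on a mismatch is a design choice that makes annotation slips, such as an end tag typed where a beginning tag was intended, visible early.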



For the annotation, we may add more tags, for example to annotate professional history.


Blocks are a larger unit than the tags for the basic information of a researcher. A block should contain some useful information about a researcher; in other words, it groups a subset of the tags defined in the previous section. For example, the “introduction” block can include tags such as “position”, “affiliation”, “phdmajor”, and so on. Our purpose is to build a hierarchical structure of all the labels: the tags in a block can be considered children of the block, and the block is the parent of those tags. Note that a passage which contains only one label should not be annotated as a block.
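The parent/child structure described above can be recovered from an annotated passage with a short scan. This is a sketch under stated assumptions: the block names in `BLOCKS` are taken from the bracketed block tags used in this spec's examples, blocks are assumed not to nest inside each other, and every other bracketed lowercase word is treated as a tag. The function name `blocks_with_children` is hypothetical.

```python
import re

# Block names as they appear in the bracketed examples of this spec.
BLOCKS = {"contactinfo", "introduction", "education", "publication", "resinterests"}
TAG_RE = re.compile(r"\[(/?)([a-z]+)\]")


def blocks_with_children(text):
    """Return a list of (block_name, [child tag names]) in document order."""
    result = []
    children = None                      # None while no block is open
    for m in TAG_RE.finditer(text):
        is_close, name = m.group(1) == "/", m.group(2)
        if name in BLOCKS:
            if not is_close:
                children = []            # a block opens
            elif children is not None:
                result.append((name, children))
                children = None          # the block closes
        elif not is_close and children is not None:
            children.append(name)        # a beginning tag inside the block
    return result


passage = (
    "[contactinfo][address]D327 Levine Science Research Center[/address]\n"
    "Tel: [phone]919-660-6587[/phone]\n"
    "Fax: [fax]919-660-6519[/fax][/contactinfo]"
)
print(blocks_with_children(passage))
# [('contactinfo', ['address', 'phone', 'fax'])]
```

A block whose child list has exactly one entry would conflict with the note that a passage containing only one label should not be annotated as a block, so the returned lists can also serve as a simple consistency check.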

* Contact info

This block contains labels of the following kinds: address, position, affiliation, fax, phone, email. A semi-structured passage that includes a subset of these labels should be annotated as a “contactinfo” block.



      [contactinfo][address]D327 Levine Science Research Center

      Box 90129

      Duke University

      Durham, North Carolina 27708-0129[/address]

      Tel: [phone]919-660-6587[/phone]

      Fax: [fax]919-660-6519[/fax][/contactinfo]

* Introduction

This block may contain different types of information, such as position, affiliation, educational history, work experience, and so on. Annotate the passages that contain the most useful information (as defined in the previous section) as “introduction”; passages containing only work experience should not be annotated as “introduction”. If a block contains both employment and education information, we annotate it as introduction.

[introduction][position]Assistant Professor[/position]

[affiliation]Computer Science Department

Duke University[/affiliation][/introduction]


Home Publications Students Teaching Personal (this should be annotated as others, not as any block)


[introduction]I am an [position]Assistant Professor[/position] of [affiliation]Computer Science at Duke University[/affiliation]. My primary research interest lies in the area of database and information systems. I received my [msdegree]M.S.[/msdegree], [phddegree]Ph.D.[/phddegree] from [phduniv]Stanford University[/phduniv] and [bsdegree]B.A.[/bsdegree] from [bsuniv]University of California, Berkeley[/bsuniv]. Here is my curriculum vitae.


I co-direct the [affiliation]Duke Database Research Group[/affiliation], which is part of the Duke Systems and Architecture Group. We also participate in the larger Carolina Database Research Group.[/introduction]

* Education

This block should enclose the educational information, i.e., labels such as phdmajor, phduniv, phddate, and so on. Only a block that contains this type of information alone should be annotated as “education”. If the block also contains other information, such as employment experience, it should be annotated as “introduction” instead of “education”.



  [education]Ph.D., [phdmajor]Computer Science[/phdmajor], [phddate]1979[/phddate], [phduniv]Stanford University[/phduniv]

  M.S., [msmajor]Computer Science[/msmajor], [msdate]1975[/msdate], [msuniv]Stanford University[/msuniv]

  B.S., [bsmajor]Mathematics[/bsmajor], [bsdate]1973[/bsdate], [bsuniv]Massachusetts Institute of Technology[/bsuniv][/education]


* Publication

This block contains papers, lectures, talks, patents, and any other reading material related to the researcher's research.



  [publication]R. Akers, I. Bica, E. Kant, C. Randall, and R. Young, "SciFinance: A Program Synthesis Tool for Financial Modeling." Proceedings of the Twelfth Innovative Applications of Artificial Intelligence Conference (IAAI-2000), Austin, Texas, July 30-August 3, 2000. American Association for Artificial Intelligence. (Updated version appears in AI Magazine, Vol. 22, No. 2, Summer 2001.)


  J. Gatheral, Y. Epelbaum, J. Han, K. Laud, O. Lubovitsky, E. Kant, and C. Randall, "Implementing Option-Pricing Models Using Software Synthesis," Computing in Science & Engineering, November/December 1999, pp. 54-64.


  C. Randall and E. Kant, and A. Chhabra, "Using Program Synthesis to Price Derivatives," Journal of Computational Finance, Vol. 1, No. 2, 1998, pp 97-128.[/publication]





  [publication]"Program Synthesis for Mathematical Modeling Applications", Seventh International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, Austin, Texas, 1994


  "Knowledge-Based Support for Scientific Programming," The Seventh Knowledge-Based Software Engineering Conference, McLean, VA, 1992


  "Understanding and Automating Algorithm Design," IJCAI-85, Los Angeles, CA, 1985[/publication]


* Research interests (do not annotate this)

This block usually contains several sentences introducing the research areas of the researcher. It may also be titled “research topics” or “research areas”. If the sentences describe the area, in other words the topic that the researcher is working on, they should be annotated as research interests. However, when the sentences describe research experience, for example that the researcher joined a company in 1998, the block should not be annotated as research interests. Sentences that do not contain the special words indicating “research interests” should not be annotated. The areas that the lab works on should not be annotated.



  [resinterests]Automation of mathematical modeling, aids to scientific problem solving, program synthesis, automated algorithm design, object-oriented/rule-based programming, knowledge representation[/resinterests].


The following should not be annotated as research interests. It can be annotated as research activities.

Research Experiences


  * Summer 06: Microsoft Research Database Group. Research Intern. Mentor: Dr. Roger Barga


      * CEDR Event Processing Project


  * Summer 05: Microsoft Research Database Group. Research Intern. Mentor: Dr. David Lomet


      * Immortal DB Project


<b>Current Research:</b>


[resinterests]Distributed access control systems, distributed theorem proving in access control logics, security for mobile and pervasive computing[/resinterests]



* Research activities/Academic activities

Activities include positions in conferences, such as program chair, and research projects (passages introducing the programs that the researcher joined). Other experience, or introductions of projects the researcher joined, should not be annotated.