Prof. Dr. Felix Naumann


Entity-centric search aims to leverage semantic information in documents to improve document search. The Text REtrieval Conference (TREC - see http://trec.nist.gov/) is one of the best-known venues where research and industrial organizations can compare their Information Extraction and Retrieval systems in the form of a competition. The goal of this master seminar (5-10 participants) is to develop an Information Retrieval system for the Entity Track of TREC (see http://trec.nist.gov/call2010.html).

The aim of the Entity Track is to perform entity-centric search on web data. Building such a system involves several challenges. First, entities and the relations among them (i.e., facts) need to be recognized within text. Second, entities have to be disambiguated, e.g., to recognize whether “Apple” stands for a fruit or a company. Third, to achieve high precision in document search, page types need to be classified (e.g., to determine the homepage of an entity). Disambiguated entities, their relations, page types, and the entities’ contexts then need to be stored and indexed in an appropriate way. Once a user types a query, the query needs to be interpreted, related entities have to be selected, and entities as well as homepages need to be ranked.
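To make the disambiguation challenge concrete, here is a minimal sketch in plain Python. It assumes a tiny hand-written sense inventory (the `SENSES` dictionary and its cue words are hypothetical and purely illustrative) and picks the sense whose cue words overlap most with the surrounding text. It is not the method used in the seminar system, only an illustration of the idea.

```python
import re

# Hypothetical sense inventory (an assumption for illustration): each sense
# of an ambiguous surface form is described by a few typical context words.
SENSES = {
    "apple": {
        "Apple Inc. (company)": {"iphone", "mac", "company", "shares", "ceo"},
        "apple (fruit)": {"fruit", "tree", "juice", "eat", "pie"},
    }
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def disambiguate(mention, text):
    """Return the sense whose cue words overlap most with the text.

    Returns None if the mention is unknown or no sense has any overlap.
    """
    senses = SENSES.get(mention.lower())
    if not senses:
        return None
    context = set(tokenize(text))
    best_sense, best_overlap = None, 0
    for sense, cues in senses.items():
        overlap = len(cues & context)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


company_text = "Apple presented a new iPhone and its shares rose."
fruit_text = "She baked an apple pie with fruit from the tree."
print(disambiguate("Apple", company_text))
print(disambiguate("Apple", fruit_text))
```

A realistic system would replace the hand-written cue words with background knowledge derived from a larger resource and would have to handle ties and unseen senses.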

The course consists of two parts. The first part is a workshop that introduces basic concepts of Information Retrieval and Information Extraction. In the second part, students will be divided into teams. Each team will implement one component of the Information Retrieval system that operates on the text corpus provided for the TREC Entity Track.

IMPORTANT: The introductory seminar workshop will take place BEFORE the official beginning of the semester (April 14th - 16th, 2010)!

An example from TREC 2009 (http://ilps.science.uva.nl/trec-entity/)


  <entity_name>Bridgestone Corporation</entity_name>
  <narrative>Motorsport series that Bridgestone officially supports with tires.</narrative>


  • Formula 1 - www.formula1.com, en.wikipedia.org/wiki/Formula_One
  • MotoGP - www.motogp.com, en.wikipedia.org/wiki/MotoGP


Please send an informal e-mail to falk.brauer(at)hpi.uni-potsdam.de so that we can plan capacity.

Topics and Time Schedule

The seminar workshop (3 days) is divided into 6 modules, each consisting of a 2-hour lecture and a 1-hour exercise:


  1. Basics of Information Retrieval. In this module we introduce the basics of Information Retrieval techniques and methods for evaluating Information Retrieval systems. In the practical part we use the Apache Lucene framework (http://lucene.apache.org) to build a solution for a simple example scenario.
  2. Introduction to the TREC Challenge. We will discuss challenges based on examples from the TREC text corpus. Moreover, we outline the overall architecture and the components of the system that will be developed. In the second part of the module, we extend the program from module 1 to handle example documents from TREC.
  3. Entity and Relation Extraction. We will introduce and compare different techniques used in entity and fact extraction, such as pattern matching and statistical entity recognition. In the practical part we will use existing frameworks to perform entity recognition (LingPipe, SystemT).
  4. Entity storage and search. This module builds on module 1. We will show how to extend classical Information Retrieval techniques to leverage knowledge about semantics within documents. The aim of the second part of the module is to extend the programs from previous modules to resolve entity-specific queries, e.g., “near to” queries for geographic locations.
  5. Entity de-duplication and Information Quality in Information Extraction. Information Extraction on web data is error-prone. In module 5 we investigate techniques that tackle quality issues and explore the use of additional derived background knowledge (e.g., Google Distance, DBpedia).
  6. Document classification. To answer a user query successfully, we need to return not only semantically correct documents but also documents that match the user’s intentions. In the last module we will learn how to recognize basic types of documents, such as homepages, using regular expressions, dictionaries, and machine learning (LingPipe).
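The retrieval and evaluation basics from module 1 can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the Apache Lucene API and not the seminar's reference solution: the three sample documents, the function names, and the simple tf * idf scoring are all hypothetical choices made for brevity.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical toy corpus (an assumption for illustration only).
DOCS = {
    "d1": "Bridgestone supports Formula One racing",
    "d2": "MotoGP is a motorcycle racing series",
    "d3": "Apple pie recipe",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(index, num_docs, query):
    """Rank documents by the sum of tf * idf over the query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur in the corpus
        idf = math.log(num_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties are broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def precision_recall(retrieved, relevant):
    """Compute precision and recall of a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


index = build_index(DOCS)
ranking = [doc_id for doc_id, _ in search(index, len(DOCS), "racing series")]
print(ranking)
print(precision_recall(ranking, ["d2"]))
```

A library such as Lucene adds what this sketch leaves out, for example text analysis, length normalization, and persistent index storage.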

Important Dates

General schedule:

14.04.2010 - 16.04.2010: Seminar workshop as described above

Presentation of preliminary results

28.06.2010: Final presentation of evaluation results and integrated prototype

Workshop schedule (14.04.2010):

14:00-16:30, A-2.12, Practical work

Workshop schedule (15.04.2010 - 16.04.2010):

13:30-16:30, A-2.12, Practical work


Prerequisites

  • Java development skills
  • Basic knowledge about entity extraction and information retrieval is helpful

Grading process

  • Attendance at the seminar workshop is mandatory
  • Code and documentation for one component per team
  • Two talks (one at the midpoint and one at the end)
  • Attendance at all meetings
  • Contribution to the TREC conference paper (e.g., one page A4)

Type of Lecture

Project seminar (5 to 10 students), 3 points

Links and Literature



Introduction to Information Extraction and Retrieval:   

  • Sunita Sarawagi: Information Extraction. Foundations and Trends in Databases, 2008.
  • Ricardo A. Baeza-Yates, Berthier A. Ribeiro-Neto: Modern Information Retrieval. ACM Press / Addison-Wesley 1999, ISBN 0-201-39829-X
  • Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze: Introduction to Information Retrieval. Cambridge University Press, 2008 (http://nlp.stanford.edu/IR-book/information-retrieval-book.html)

Entity and Fact Extraction:

  • Oren Etzioni, Michael J. Cafarella, Doug Downey, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, Alexander Yates: Unsupervised named-entity extraction from the Web: An experimental study. Artif. Intell. 165(1): 91-134 (2005)
  • Eirinaios Michelakis, Rajasekar Krishnamurthy, Peter J. Haas, Shivakumar Vaithyanathan: Uncertainty management in rule-based information extraction systems. SIGMOD 2009: 101-114

Disambiguation and Duplicate Detection:

  • Xianpei Han, Jun Zhao: Named entity disambiguation by leveraging wikipedia semantic knowledge. CIKM 2009: 215-224
  • Risto Gligorov, Warner ten Kate, Zharko Aleksovski, Frank van Harmelen: Using Google distance to weight approximate ontology matches. WWW 2007: 767-776

Document Classification:

  • Eser Kandogan, Rajasekar Krishnamurthy, Sriram Raghavan, Shivakumar Vaithyanathan, Huaiyu Zhu: Avatar semantic search: a database approach to information retrieval. SIGMOD Conference 2006: 790-792
  • Pável Calado, Marco Cristo, Edleno Silva de Moura, Nivio Ziviani, Berthier A. Ribeiro-Neto, Marcos André Gonçalves: Combining link-based and content-based methods for web document classification. CIKM 2003: 394-401

Storage and Indexing:

  • Gerhard Weikum, Gjergji Kasneci, Maya Ramanath, Fabian M. Suchanek: Database and information-retrieval methods for knowledge discovery. Commun. ACM 52(4): 56-64 (2009)
  • Atanas Kiryakov, Borislav Popov, Ivan Terziev, Dimitar Manov, Damyan Ognyanoff: Semantic annotation, indexing, and retrieval. J. Web Sem. 2(1): 49-79 (2004)

Ranking Entities and Homepages:

  • Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang: EntityRank: Searching Entities Directly and Holistically. VLDB 2007: 387-398
  • Panagiotis G. Ipeirotis, Eugene Agichtein, Pranay Jain, Luis Gravano: To search or to crawl?: towards a query optimizer for text-centric tasks. SIGMOD Conference 2006: 265-276