Entity-centric Information Retrieval
Description
Entity-centric search aims to leverage semantic information in documents to improve document search. The Text REtrieval Conference (TREC - see trec.nist.gov) is one of the best-known conferences at which research and industrial organizations can compare their Information Extraction and Retrieval systems in the form of a competition. The goal of this master seminar (5-10 participants) is to develop an Information Retrieval system for the Entity Track of TREC (see trec.nist.gov/call2010.html).
The aim of the Entity Track is to perform entity-centric search on web data. Building such a system involves several challenges. First, entities and the relations among them (i.e., facts) need to be recognized within text. Next, entities have to be disambiguated, e.g., by recognizing whether “Apple” stands for a fruit or a company. Furthermore, to achieve high precision in document search, page types need to be classified (e.g., to determine the homepage of an entity). Disambiguated entities, their relations, page types, and the entities' contexts need to be stored and indexed in an appropriate way. Once a user types a query, the query needs to be interpreted, related entities have to be selected, and entities as well as homepages need to be ranked.
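To make the disambiguation step more concrete, the following is a minimal sketch of one simple approach: choosing the sense whose cue words overlap most with the words surrounding a mention. The class name `ContextDisambiguator`, the sense labels, and the cue-word lists are illustrative assumptions, not part of the seminar material or of any TREC system.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: picks a sense for an ambiguous mention by cue-word overlap.
class ContextDisambiguator {
    // Assumed sense inventory: each candidate sense is described by a few cue words.
    private final Map<String, Set<String>> senses = new LinkedHashMap<>();

    void addSense(String senseId, String... cueWords) {
        senses.put(senseId, new HashSet<>(Arrays.asList(cueWords)));
    }

    // Returns the sense sharing the most cue words with the context,
    // or null if no cue word occurs in the context at all.
    String disambiguate(String context) {
        String[] words = context.toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
        Set<String> tokens = new HashSet<>(Arrays.asList(words));
        String best = null;
        int bestOverlap = 0;
        for (Map.Entry<String, Set<String>> sense : senses.entrySet()) {
            Set<String> shared = new HashSet<>(sense.getValue());
            shared.retainAll(tokens); // keep only cue words present in the context
            if (shared.size() > bestOverlap) {
                bestOverlap = shared.size();
                best = sense.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        ContextDisambiguator d = new ContextDisambiguator();
        d.addSense("Apple (company)", "iphone", "mac", "shares", "ceo", "software");
        d.addSense("apple (fruit)", "tree", "juice", "pie", "orchard", "eat");
        System.out.println(d.disambiguate("Apple shares rose after the new iPhone launch."));
        System.out.println(d.disambiguate("She baked an apple pie with fruit from the orchard."));
    }
}
```

A realistic system would replace the hand-written cue words with context statistics derived from a corpus or a knowledge base; the overlap idea stays the same.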
The course consists of two parts. The first part is a workshop that introduces basic concepts of Information Retrieval and Information Extraction. In the second part, students will be divided into teams. Each team will implement one component of the Information Retrieval system, which operates on the text corpus provided for the TREC Entity Track.
IMPORTANT: The introductory seminar workshop will take place BEFORE the official beginning of the semester (April 14th - 16th, 2010)!!!
An example from TREC 2009 (http://ilps.science.uva.nl/trec-entity/)
Query:
<query>
<entity_name>Bridgestone Corporation</entity_name>
<entity_URL>http://bridgestone.com</entity_URL>
<target_entity>organization</target_entity>
<narrative>Motorsport series that Bridgestone officially supports with types. </narrative>
</query>
Answer:
- Formula 1 - www.formula1.com, en.wikipedia.org/wiki/Formula_One
- MotoGP - www.motogp.com, en.wikipedia.org/wiki/MotoGP
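As a first step toward query interpretation, a query in the format shown above could be read into a small Java object using the XML parser from the standard library. This is only a sketch: the class `EntityQuery` and its field names are assumptions made for illustration, and the element names are taken from the example query above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical value class holding the four fields of one Entity Track query.
class EntityQuery {
    final String entityName;
    final String entityUrl;
    final String targetEntity;
    final String narrative;

    EntityQuery(String entityName, String entityUrl, String targetEntity, String narrative) {
        this.entityName = entityName;
        this.entityUrl = entityUrl;
        this.targetEntity = targetEntity;
        this.narrative = narrative;
    }

    // Parses a single <query> element given as a string.
    static EntityQuery parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Element root = doc.getDocumentElement();
            return new EntityQuery(
                    text(root, "entity_name"),
                    text(root, "entity_URL"),
                    text(root, "target_entity"),
                    text(root, "narrative"));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse query XML", e);
        }
    }

    // Returns the trimmed text of the first child element with the given tag, or "" if absent.
    private static String text(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        String xml = "<query>"
                + "<entity_name>Bridgestone Corporation</entity_name>"
                + "<entity_URL>http://bridgestone.com</entity_URL>"
                + "<target_entity>organization</target_entity>"
                + "<narrative>Motorsport series that Bridgestone officially supports with types. </narrative>"
                + "</query>";
        EntityQuery q = EntityQuery.parse(xml);
        System.out.println(q.entityName + " -> looking for: " + q.targetEntity);
        System.out.println(q.narrative);
    }
}
```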
Registration
Please send an informal e-mail to falk.brauer@hpi.uni-potsdam.de for capacity planning.
Topics and Time Schedule
The seminar workshop (3 days) will be divided into 6 modules. Each module consists of a 2-hour lecture and a 1-hour exercise:
- Basics of Information Retrieval. In this module we will introduce basic Information Retrieval techniques and methods for evaluating Information Retrieval systems. In the practical part we will use the Apache Lucene framework (http://lucene.apache.org) to build a solution for a simple example scenario.
- Introduction to TREC Challenge. We will discuss challenges based on examples from the TREC text corpus. Moreover, we will outline the overall architecture and the components of the system that will be developed. In the second part of the module, we will extend the program from module 1 to handle example documents from TREC.
- Entity and Relation Extraction. We will introduce and compare different techniques used in entity and fact extraction, such as pattern matching and statistical entity recognition. In the practical part we will use existing frameworks to perform entity recognition (LingPipe, SystemT).
- Entity storage and search. This module will extend module 1. We will show how to extend classical Information Retrieval techniques to leverage knowledge about semantics within documents. The aim of the second part of the module will be to extend the programs from previous modules to resolve entity-specific queries, e.g., "near to" queries for geographic locations.
- Entity de-duplication and Information Quality in Information Extraction. Information Extraction on web data is error-prone. In module 5 we will investigate techniques that tackle quality issues and explore the capabilities of further derived background knowledge (e.g., Google Distance, DBpedia).
- Document classification. In order to successfully answer a user query, we need to return not only semantically correct documents but also documents that match the user’s intentions. In the last module we will learn how to recognize basic types of documents, such as homepages, using regular expressions, dictionaries and machine learning (LingPipe).
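The document classification module mentions regular expressions and dictionaries for recognizing homepages. A minimal rule-based sketch of that idea is shown below. The class `HomepageClassifier`, the URL pattern, the cue-word list, and the score threshold are all assumptions chosen for illustration; they are not the rules used in the seminar.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical rule-based classifier combining a URL pattern with a title dictionary.
class HomepageClassifier {
    // Assumed rule: a homepage URL has no path beyond "/" or an index file.
    private static final Pattern ROOT_URL = Pattern.compile(
            "^(https?://)?[^/]+/?(index\\.(html?|php))?$", Pattern.CASE_INSENSITIVE);

    // Assumed dictionary of title words that hint at a homepage.
    private static final List<String> TITLE_CUES = Arrays.asList("official", "home", "welcome");

    // Adds 2 points for a root-like URL and 1 point per cue word found in the title.
    static int score(String url, String title) {
        int score = 0;
        if (ROOT_URL.matcher(url).matches()) {
            score += 2;
        }
        String lowerTitle = title.toLowerCase(Locale.ROOT);
        for (String cue : TITLE_CUES) {
            if (lowerTitle.contains(cue)) {
                score += 1;
            }
        }
        return score;
    }

    // Assumed threshold: a score of 2 or more counts as a homepage.
    static boolean isHomepage(String url, String title) {
        return score(url, title) >= 2;
    }

    public static void main(String[] args) {
        System.out.println(isHomepage("www.formula1.com", "Formula 1"));
        System.out.println(isHomepage("en.wikipedia.org/wiki/Formula_One", "Formula One - Wikipedia"));
    }
}
```

Hand-written rules like these are brittle (for example, the substring check also fires on unrelated words containing "home"), which is one reason to combine them with a learned classifier.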
Important Dates
General schedule:
| 14.04.2010 - 16.04.2010: | Seminar workshop as described above |
| 17.05.2010: | Presentation of preliminary results |
| 28.06.2010: | Final presentation of evaluation results and integrated prototype |
Workshop schedule (14.04.2010):
| 10:30-13:00 | A-2.2 | Lecture |
| 13:00-14:00 | Mensa | Lunch |
| 14:00-16:30 | A-2.12 | Practical work |
Workshop schedule (15.04.2010 - 16.04.2010):
| 9:00-12:30 | A-2.2 | Lecture |
| 12:30-13:30 | Mensa | Lunch |
| 13:30-16:30 | A-2.12 | Practical work |
Requirements
- Java development skills
- Basic knowledge about entity extraction and information retrieval is helpful
Grading process
- Attendance at the seminar workshop is mandatory
- Code and documentation for one component per team
- 2 talks (at the halfway point and at the end)
- Attendance at all meetings
- Contribution to the TREC conference paper (e.g., one page A4)
Type of Lecture
Project seminar (5 to 10 students), 3 points
Links and Literature
Websites:
Tools:
Introduction to Information Extraction and Retrieval:
Entity and Fact Extraction:
Disambiguation and Duplicate Detection:
Document Classification:
Ranking Entities and Homepages: