Entity-centric search aims to leverage semantic information in documents to improve document search. The Text REtrieval Conference (TREC, see trec.nist.gov) is one of the best-known venues where research and industrial organizations compare their Information Extraction and Retrieval systems in the form of a competition. The goal of this master seminar (5-10 participants) is to develop an Information Retrieval system for the Entity Track of TREC (see trec.nist.gov/call2010.html).
The aim of the Entity Track is to perform entity-centric search on web data. Building such a system involves several challenges. First, entities and the relations among them (i.e., facts) need to be recognized within text. Second, entities have to be disambiguated, e.g., by recognizing whether “Apple” stands for a fruit or a company. Third, to achieve high precision in document search, page types need to be classified (e.g., to determine the homepage of an entity). Disambiguated entities, their relations, page types and the entities' contexts then need to be stored and indexed in an appropriate way. Once a user types a query, the query needs to be interpreted, related entities have to be selected, and entities as well as homepages need to be ranked.
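The steps above can be illustrated with a deliberately small Python sketch. It is not the seminar's system or any TREC reference implementation: the candidate dictionary, the cue words, the URL-based homepage heuristic and all function names are assumptions made up for illustration, and relation extraction is left out entirely.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical toy knowledge base: surface form -> candidate entities,
# each with a set of context cue words (all invented for this sketch).
CANDIDATES = {
    "apple": {
        "Apple_(company)": {"iphone", "mac", "company", "ceo"},
        "Apple_(fruit)": {"fruit", "tree", "eat", "juice"},
    },
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def disambiguate(surface, context):
    """Pick the candidate whose cue words overlap most with the context."""
    cues = CANDIDATES[surface]
    # sorted() makes ties deterministic (first candidate in name order wins).
    return max(sorted(cues), key=lambda entity: len(cues[entity] & context))

def extract_entities(text):
    """Dictionary-based recognition followed by context-based disambiguation."""
    tokens = tokenize(text)
    context = set(tokens)
    return {disambiguate(tok, context) for tok in tokens if tok in CANDIDATES}

def classify_page(url):
    """Assumed heuristic: a URL without a path is treated as a homepage."""
    path = urlparse(url).path.strip("/")
    return "homepage" if not path else "other"

def build_index(docs):
    """Map each disambiguated entity to the pages mentioning it."""
    index = defaultdict(list)
    for url, text in docs.items():
        for entity in extract_entities(text):
            index[entity].append((url, classify_page(url)))
    return index

def search(index, query):
    """Interpret the query as entities and rank homepages before other pages."""
    results = []
    for entity in sorted(extract_entities(query)):
        pages = sorted(index.get(entity, []),
                       key=lambda page: (page[1] != "homepage", page[0]))
        results.extend((entity, url, page_type) for url, page_type in pages)
    return results

# Invented example documents, keyed by URL.
DOCS = {
    "http://www.apple.com/": "Apple is a company that sells the iPhone and the Mac.",
    "http://example.org/recipes/pie": "Peel each apple, a fruit from the tree, before you eat it.",
    "http://example.org/news/apple-ceo": "The Apple CEO presented a new Mac.",
}
INDEX = build_index(DOCS)

for hit in search(INDEX, "apple iphone"):
    print(hit)
```

A real system would replace each function with a proper component (a trained recognizer, a learned page classifier, a full-text index), which is roughly how the work could be split among teams.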
The course consists of two parts. The first part is a workshop that introduces basic concepts of Information Retrieval and Information Extraction. In the second part, students will be divided into teams. Each team will implement one component of the Information Retrieval system, which operates on the text corpus provided for TREC’s Entity Track.
IMPORTANT: The introductory workshop will take place BEFORE the official beginning of the semester (April 14th - 16th, 2010)!
An example from TREC 2009 (http://ilps.science.uva.nl/trec-entity/):
<narrative>Motorsport series that Bridgestone officially supports with tyres.</narrative>
- Formula 1 - www.formula1.com, en.wikipedia.org/wiki/Formula_One
- MotoGP - www.motogp.com, en.wikipedia.org/wiki/MotoGP