Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Neue Entwicklungen im Bereich Informationssysteme

In this research seminar, staff members and students present their research in this area. Students and guests are cordially invited.

General

When: Mondays, 13:30 - 15:00

Where: Room A-2.2, HPI

Topics and Dates

  • 05.10.2009, from 13:30: Graph-Based Ontology Construction from Heterogeneous Evidences (Christoph Böhm)
  • 19.11.2009, from 14:00, Room A-1.1: Encapsulating Multi-Stepped Web Forms as Web Services (Tobias Vogel)
  • 30.11.2009, from 13:30: iPopulator: Learning to Extract Structured Information from Wikipedia Articles to Populate Infoboxes (Dustin Lange)
  • 11.01.2010, from 13:30: Data Quality–Werkzeuge im Einsatz: Data Profiling mit Oracle Software (Negib Marhoul, BI/DWH systems consultant, Oracle Deutschland GmbH)
  • 18.01.2010, from 13:30: Web Query Interface Integration (Thomas Kabisch)
  • 15.02.2010, from 13:00: NTII practice talks (Felix Naumann, Christoph Böhm)

Felix Naumann & Christoph Böhm - NTII

Two papers from our group were accepted for the ICDE Workshop on New Trends in Information Integration:

  • Complement Union for Data Integration: Jens Bleiholder, Sascha Szott, Melanie Herschel, Felix Naumann
  • Profiling Linked Open Data with ProLOD: Christoph Böhm, Felix Naumann, Ziawasch Abedjan, Dandy Fenz, Toni Grütze, Daniel Hefenbrock, Matthias Pohl, David Sonnabend

www.cse.iitb.ac.in/~grajeev/ntii10/index.htm

Thomas Kabisch - Web Query Interface Integration

The Web has evolved into a data-rich repository containing significant structured content. This content resides mainly in Web databases that are also referred to as the Deep Web. In order to obtain the contents of Web databases, a user has to pose structured queries. These queries are formulated by filling in Web query interfaces with valid input values. Common examples are job portals or the search for cheap airline tickets.
 
With each application domain hosting a large and increasing number of sources, it is unrealistic to expect the user to probe each source individually. Consequently, significant research effort has been devoted to enabling uniform access to the large amount of data guarded by query interfaces. The success of these applications relies on a good understanding of Web query interfaces, because a query interface provides a glimpse into the schema of the underlying database and is the main means to retrieve data from the database.
 
The work focuses on tools and methods for facilitating programmatic access to Web databases. Its goal is to support developers of integration systems by providing a framework to access and match Deep Web interfaces.
 
We provide novel solutions for three central steps of Deep Web integration:

  • Extraction of query interfaces
  • Matching of query interface elements
  • Domain classification of unknown interfaces

We give insights into each of the three steps by presenting concepts and algorithms. We show the experimental results and discuss advantages and limits of the developed methods.
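
To illustrate how the extraction and matching steps can fit together, the following Python sketch (not Kabisch's actual method; the forms, labels, and similarity threshold are invented for illustration) pulls labelled input fields out of two HTML query interfaces and aligns them greedily by label similarity.

# Minimal sketch: extract labelled input fields from a Web query interface
# and match the fields of two interfaces by simple label similarity.
# Field names, labels, and the threshold are assumptions for illustration.
from difflib import SequenceMatcher
from html.parser import HTMLParser


class FormFieldExtractor(HTMLParser):
    """Collects (label text, field name) pairs from <label>/<input>/<select> tags."""

    def __init__(self):
        super().__init__()
        self.fields = []          # extracted (label, name) pairs
        self._current_label = ""  # text of the most recent <label>
        self._in_label = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "label":
            self._in_label = True
            self._current_label = ""
        elif tag in ("input", "select") and attrs.get("type") != "submit":
            name = attrs.get("name", "")
            if name:
                self.fields.append((self._current_label.strip(), name))

    def handle_endtag(self, tag):
        if tag == "label":
            self._in_label = False

    def handle_data(self, data):
        if self._in_label:
            self._current_label += data


def extract_fields(html):
    parser = FormFieldExtractor()
    parser.feed(html)
    return parser.fields


def match_interfaces(fields_a, fields_b, threshold=0.6):
    """Greedy 1:1 matching of fields by label string similarity."""
    matches = []
    used_b = set()
    for label_a, name_a in fields_a:
        best = None
        for label_b, name_b in fields_b:
            if name_b in used_b:
                continue
            score = SequenceMatcher(None, label_a.lower(), label_b.lower()).ratio()
            if score >= threshold and (best is None or score > best[0]):
                best = (score, name_b)
        if best:
            used_b.add(best[1])
            matches.append((name_a, best[1], round(best[0], 2)))
    return matches


if __name__ == "__main__":
    # Two hypothetical flight-search interfaces.
    form_a = ('<form><label>Departure city</label><input name="from">'
              '<label>Arrival city</label><input name="to"></form>')
    form_b = ('<form><label>Departure</label><input name="origin">'
              '<label>Arrival</label><input name="destination"></form>')
    print(match_interfaces(extract_fields(form_a), extract_fields(form_b)))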

Negib Marhoul - Data Quality–Werkzeuge im Einsatz: Data Profiling mit Oracle Software

Data quality is a basic prerequisite for steering a company based on sound data. However, assessing and reporting the state of data quality is usually a very laborious task. One reason is that today's corporate infrastructures comprise highly heterogeneous system landscapes and many different data sources whose volumes double every year. Oracle provides companies with both infrastructure and tools that support data quality assurance. IT and business departments thus receive a technical aid to bring their knowledge into data analysis and quality assurance efficiently and at minimal cost. The talk gives a brief overview of the current market and Oracle's position in it. Oracle's software is demonstrated using example projects, which also serve to discuss general practical aspects of data quality analysis.

Preliminary agenda:

  • Software vendors and the software market in the data quality area
    • Areas of responsibility within a company
    • Examples of possible error sources within a company (see the sketch after this agenda)
      • Orphaned records
      • Duplicates
      • Business codes
  • Oracle's product portfolio
    • Oracle Warehouse Builder / Data Quality Option
    • Oracle Data Integrator / Data Quality for ODI (Trillium)
  • Use in Oracle projects: strengths and weaknesses of the products
    • Quality assurance for address data
      • Reducing project time
    • Time series analysis of fleet management data
      • Notifications on data conflicts
  • Discussion
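
To make the notion of data profiling more tangible, here is a small, vendor-neutral Python sketch of typical checks such tools automate: null rates and dominant value patterns per column, duplicate detection, and orphaned records. It is not tied to the Oracle products discussed in the talk, and all table and column names are invented.

# Minimal, vendor-neutral sketch of checks a data profiling tool automates.
# Table contents and column names are made up for illustration.
from collections import Counter


def column_profile(rows, column):
    """Null rate, distinct count, and dominant value patterns of one column."""
    values = [r.get(column) for r in rows]
    non_null = [v for v in values if v not in (None, "")]
    patterns = Counter(
        "".join("9" if c.isdigit() else ("A" if c.isalpha() else c) for c in str(v))
        for v in non_null
    )
    return {
        "null_rate": 1 - len(non_null) / len(values) if values else 0.0,
        "distinct": len(set(non_null)),
        "patterns": patterns.most_common(3),  # dominant formats, e.g. of business codes
    }


def find_duplicates(rows, key_columns):
    """Key values that occur in more than one row."""
    seen = Counter(tuple(r.get(c) for c in key_columns) for r in rows)
    return [key for key, count in seen.items() if count > 1]


def find_orphans(child_rows, fk_column, parent_rows, pk_column):
    """Child rows whose foreign key has no matching parent (orphaned records)."""
    parent_keys = {r.get(pk_column) for r in parent_rows}
    return [r for r in child_rows if r.get(fk_column) not in parent_keys]


if __name__ == "__main__":
    customers = [{"id": 1, "zip": "14482"}, {"id": 2, "zip": None}, {"id": 3, "zip": "D-14482"}]
    orders = [{"order_id": 10, "customer_id": 1}, {"order_id": 11, "customer_id": 99}]
    print(column_profile(customers, "zip"))
    print(find_duplicates(customers, ["zip"]))
    print(find_orphans(orders, "customer_id", customers, "id"))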

Dustin Lange - iPopulator: Learning to Extract Structured Information from Wikipedia Articles to Populate Infoboxes

In this talk, we present iPopulator, a system that automatically populates infoboxes of Wikipedia articles by analyzing article texts. Wikipedia infoboxes provide a brief overview of the most important facts about an article's subject. Readers profit from well-populated infoboxes, since they can instantly gather interesting facts. Additionally, external applications that process infoboxes, such as DBpedia, benefit from complete infoboxes, as they create a large knowledge base from them.
 
The problem addressed by iPopulator is the population of existing infoboxes with as many correct attribute values as possible. We assume that an article already contains a potentially incomplete infobox. To approach the infobox population problem, we implemented a system that exploits attribute value structures and applies machine learning techniques. We have developed an algorithm that determines the structure shared by the majority of an attribute's values. Using this structure, an attribute value can be divided into parts, enabling the creation of a well-labeled training data set. Conditional Random Fields (CRFs) are applied to the training articles to learn extractors for all attributes in all infobox templates. These extractors are applied to the test articles to extract attribute value parts, which are then merged into a result value according to the learned attribute value structure.
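
As a rough illustration of the attribute value structure idea (a simplified sketch, not iPopulator's implementation; the attribute values and labels below are invented), the following Python code derives the dominant token-class pattern of an attribute's values and uses it to split a value into labeled parts, as one might do to prepare CRF training data.

# Minimal sketch of the value-structure idea: derive the dominant token-class
# pattern of an attribute's values and use it to label value parts.
import re
from collections import Counter


def tokenize(value):
    return re.findall(r"\d+(?:\.\d+)?|[^\W\d_]+|\S", value)


def token_class(token):
    if re.fullmatch(r"\d+(?:\.\d+)?", token):
        return "NUMBER"
    if token.isalpha():
        return "WORD"
    return "SYMBOL"


def dominant_structure(values):
    """Most frequent sequence of token classes among the attribute's values."""
    patterns = Counter(tuple(token_class(t) for t in tokenize(v)) for v in values)
    return patterns.most_common(1)[0][0]


def label_value(value, structure):
    """Assign each token a positional label if the value follows the structure."""
    tokens = tokenize(value)
    classes = tuple(token_class(t) for t in tokens)
    if classes != structure:
        return None  # value does not follow the dominant structure
    return [(tok, f"{cls}_{i}") for i, (tok, cls) in enumerate(zip(tokens, classes))]


if __name__ == "__main__":
    # Hypothetical values of an infobox attribute such as "engine power".
    values = ["745 kW", "120 kW", "88 kW", "approx. 90 kW"]
    structure = dominant_structure(values)    # ('NUMBER', 'WORD')
    print(structure)
    print(label_value("1500 kW", structure))  # [('1500', 'NUMBER_0'), ('kW', 'WORD_1')]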
 
To evaluate iPopulator, we use Wikipedia articles that already have an infobox. Our system achieves its highest extraction results for attributes that frequently occur in article texts. In about two out of three cases, the attribute value structure can successfully be reconstructed. By extracting infobox attribute values, the knowledge summarized in infoboxes is extended. Additionally, exploiting attribute value structures for the construction of attribute values results in better data quality.

Tobias Vogel - Encapsulating Multi-Stepped Web Forms as Web Services

HTML forms are the predominant interface between users and web applications. Many of these applications display a sequence of multiple forms on separate pages, for instance to book a flight or order a DVD. We introduce a method to wrap these multi-stepped forms and offer their individual functionality as a single consolidated Web Service. This Web Service in turn maps input data to the individual forms in the correct order. Such consolidation better enables operation of the forms by applications and provides a simpler interface for human users. To this end we analyze the HTML code and sample user interaction of each page and infer the internal model of the application. A particular challenge is to map semantically equivalent fields across multiple forms and choose meaningful labels for them. Web Service output is parsed from the resulting HTML page. Experiments on different multi-stepped web forms show the feasibility and usefulness of our approach.
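
The following Python sketch illustrates the consolidation idea in its simplest form (it is not Vogel's implementation; the step definitions, field names, and submit callback are hypothetical): a wrapper exposes a sequence of forms as one operation and maps a single input dictionary onto the individual steps in order.

# Minimal sketch: expose a sequence of web forms as one callable operation.
# Step definitions, field names, and the submit callback are hypothetical.


class MultiStepFormWrapper:
    def __init__(self, steps, submit):
        """
        steps  -- list of dicts mapping consolidated parameter names to the
                  field names of the form shown in that step
        submit -- callable(step_index, form_data) -> dict of values carried
                  over to the next step (e.g. hidden session fields)
        """
        self.steps = steps
        self.submit = submit

    def invoke(self, params):
        """Fill and submit all forms in order, like a single Web Service call."""
        carried = {}  # state (e.g. session tokens) propagated between steps
        for index, field_map in enumerate(self.steps):
            form_data = dict(carried)
            for param, field in field_map.items():
                if param in params:
                    form_data[field] = params[param]
            carried = self.submit(index, form_data)
        return carried  # parsed output of the final result page


if __name__ == "__main__":
    # Hypothetical two-step flight booking: search form, then passenger form.
    def fake_submit(index, data):
        print(f"step {index}: submitting {data}")
        return {"session": "abc123"} if index == 0 else {"status": "booked"}

    wrapper = MultiStepFormWrapper(
        steps=[
            {"origin": "from_airport", "destination": "to_airport"},
            {"passenger_name": "name"},
        ],
        submit=fake_submit,
    )
    print(wrapper.invoke({"origin": "TXL", "destination": "BOM", "passenger_name": "Ada"}))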

Böhm, Groth, Leser - Graph-Based Ontology Construction from Heterogeneous Evidences

Ontologies are tools for describing and structuring knowledge, with many applications in searching and analyzing complex knowledge bases. Since building them manually is a costly process, there are various approaches for bootstrapping ontologies automatically through the analysis of appropriate documents. Such an analysis needs to find the concepts and the relationships that should form the ontology. However, the initial set of relationships is usually inconsistent and rather imbalanced, a problem that has largely been ignored so far. In the paper, we define the problem of extracting a consistent and properly structured ontology from a set of inconsistent and heterogeneous relationships. Moreover, we propose three graph-based methods for solving the ontology extraction problem and evaluate them on a large data set of more than 325,000 documents against a gold standard ontology comprising more than 12,000 relationships. Our study shows that an algorithm based on a modified formulation of the dominating set problem outperforms greedy methods.


The talk presents the original paper, which will be presented at ISWC 2009.
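
To give a flavour of the extraction problem, here is a minimal Python sketch of a greedy baseline: from weighted, possibly contradictory is-a evidences it keeps the strongest edges that neither contradict an already chosen parent nor introduce a cycle. The paper's best-performing method is instead based on a modified formulation of the dominating set problem, which is not reproduced here; the concepts and confidence weights below are invented.

# Greedy baseline sketch for ontology extraction from inconsistent evidences
# (not the dominating-set algorithm that performs best in the paper).


def greedy_ontology(evidences):
    """evidences: list of (child, parent, weight) tuples, possibly inconsistent."""
    parent_of = {}  # chosen parent per concept -> a forest, i.e. no multiple parents

    def creates_cycle(child, parent):
        node = parent
        while node is not None:
            if node == child:
                return True
            node = parent_of.get(node)
        return False

    for child, parent, _ in sorted(evidences, key=lambda e: -e[2]):
        if child in parent_of:            # contradicts an already chosen parent
            continue
        if creates_cycle(child, parent):  # would make the hierarchy inconsistent
            continue
        parent_of[child] = parent
    return parent_of


if __name__ == "__main__":
    evidences = [
        ("dog", "mammal", 0.9),
        ("dog", "animal", 0.4),  # weaker, contradicting parent
        ("mammal", "animal", 0.8),
        ("animal", "dog", 0.2),  # would introduce a cycle
    ]
    print(greedy_ontology(evidences))  # {'dog': 'mammal', 'mammal': 'animal'}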