Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
  
 

Description

Tables on the web are a significant source of structured information. In a large-scale crawling effort in 2008, Cafarella et al. extracted 14.1 billion tables from billions of HTML webpages. While many webtables are used for layout purposes only, a large number still contain high-quality, structured information. Cafarella et al. estimate that 154 million of the 14.1 billion tables contain relational data, i.e., they resemble database tables. Even the English version of Wikipedia alone contained more than 1 million tables as of 11/2017. However, making use of webtables automatically is challenging: the tables usually contain few records and are designed to be read by humans, not machines. The main use cases of webtables include knowledge base augmentation, searching or querying large table corpora, and finding the set of tables joinable with a query table.

These use cases demand solutions for many different tasks, which include, but are not limited to:

  • Detection of genuine (relational) web tables.

  • Header (row(s)/column(s)) detection.

  • Schema Normalization.
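
To give a feel for the first two tasks, the following is a minimal sketch of two naive heuristics. It assumes each table has already been extracted into a list of rows of cell strings. The function names, thresholds, and rules are illustrative assumptions only; they are not the methods proposed in the papers listed under Literature, which rely on richer features and trained classifiers.

```python
def is_numeric(cell):
    """Return True if the cell text parses as a number (commas ignored)."""
    try:
        float(cell.replace(",", ""))
        return True
    except ValueError:
        return False


def looks_relational(rows, min_rows=2, min_cols=2, max_empty_ratio=0.5):
    """Naive filter: keep tables that are rectangular and mostly filled.

    The thresholds are arbitrary assumptions chosen for illustration.
    """
    if len(rows) < min_rows:
        return False
    widths = {len(row) for row in rows}
    if len(widths) != 1:
        # Ragged rows often indicate layout tables or merged cells.
        return False
    if widths.pop() < min_cols:
        return False
    cells = [cell.strip() for row in rows for cell in row]
    empty = sum(1 for cell in cells if not cell)
    return empty / len(cells) <= max_empty_ratio


def first_row_is_header(rows):
    """Naive header test: no numbers in the first row, some numbers below it."""
    if len(rows) < 2:
        return False
    header, body = rows[0], rows[1:]
    if any(is_numeric(cell) for cell in header):
        return False
    return any(is_numeric(cell) for row in body for cell in row)


# Hypothetical example tables.
data_table = [
    ["City", "Population"],
    ["Berlin", "3644826"],
    ["Potsdam", "178089"],
]
layout_table = [["", "Welcome", ""], ["", "", ""]]
ragged_table = [["a", "b"], ["c"]]
```

Hand-written rules like these break quickly on real web data (for example, tables with textual columns only, or headers in the first column), which is why the seminar tasks start from published baselines.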

In this seminar, we will introduce you to the research area of webtables. Each team, ideally consisting of two people, will implement a solution for one of the above-mentioned tasks (or for any other relevant problem in the research area of webtables that they find interesting). We will provide you with state-of-the-art papers that suggest solutions to these problems. You will implement these solutions and try to improve upon them with your own ideas in a scalable way.

Time Table

When: Thursday 13:30 at Campus II, Building F, Room F-2-11.

 

Date

Topic

Slides

11.4.2019
(H-E.51)

Introduction

Preparatory material: Cafarella, M., Halevy, A., Lee, H., Madhavan, J., Yu, C., Wang, D. Z., & Wu, E. (2018). Ten years of webtables. Proceedings of the VLDB Endowment, 11(12), 2140-2149.

PDF

18.4.2019

Group allocation + mini talks about research topics in our group

PDF

25.4.2019

Basics of literature search and giving technical talks

PDF

2.5.2019

Task presentations (summary and planned contributions)

 

9.5.2019

Progress report

 

16.5.2019

Technical talk about a specific paper (baseline solution)

 

23.5.2019

Research process: practical hints and clues (date changed: 29.5.2019 at 3 PM, Campus II, Building F, Room F-2-11)

PDF

30.5.2019

Ascension Day (Christi Himmelfahrt)

 

6.6.2019

Mid-term presentation (implementation)

 

13.6.2019-27.6.2019

Progress report

 

4.7.2019

Present improvements

 

11.7.2019-18.7.2019

Progress report (date changed: 12.7.2019 at 2 PM, Campus II, Building F, Room F-2-11, instead of 11.7.2019)

 

25.7.2019

End-term presentation

 
End of September

Final paper-style submission

 

Literature

You can find the following papers on DBLP or Google Scholar:

  • Cafarella, M. J., Halevy, A. Y., Zhang, Y., Wang, D. Z., & Wu, E. (2008, June). Uncovering the Relational Web. In WebDB.
  • Cafarella, M. J., Halevy, A., Wang, D. Z., Wu, E., & Zhang, Y. (2008). Webtables: exploring the power of tables on the web. Proceedings of the VLDB Endowment, 1(1), 538-549.
  • Balakrishnan, S., Halevy, A., Harb, B., Lee, H., Madhavan, J., Rostamizadeh, A., ... & Yu, C. (2015). Applying webtables in practice.
  • Cafarella, M., Halevy, A., Lee, H., Madhavan, J., Yu, C., Wang, D. Z., & Wu, E. (2018). Ten years of webtables. Proceedings of the VLDB Endowment, 11(12), 2140-2149.
  • Pinto, D., McCallum, A., Wei, X., & Croft, W. B. (2003, July). Table extraction using conditional random fields. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 235-242). ACM.
  • Eberius, J., Braunschweig, K., Hentsch, M., Thiele, M., Ahmadov, A., & Lehner, W. (2015). Building the Dresden Web Table Corpus: A classification approach. In Proceedings of the 2nd International Symposium on Big Data Computing (BDC).
  • Crestan, E., & Pantel, P. (2011, February). Web-scale table census and classification. In Proceedings of the fourth ACM international conference on Web search and data mining (pp. 545-554). ACM.
  • Wang, J., Wang, H., Wang, Z., & Zhu, K. Q. (2012, October). Understanding tables on the web. In International Conference on Conceptual Modeling (pp. 141-155). Springer, Berlin, Heidelberg.
  • Braunschweig, K., Thiele, M., & Lehner, W. (2015, October). From web tables to concepts: A semantic normalization approach. In International Conference on Conceptual Modeling (pp. 247-260). Springer, Cham.
  • Lehmberg, O., & Bizer, C. (2017). Stitching web tables for improving matching quality. Proceedings of the VLDB Endowment, 10(11), 1502-1513.
  • Wang, D. Z., Dong, X. L., Sarma, A. D., Franklin, M. J., & Halevy, A. Y. (2009, June). Functional Dependency Generation and Applications in Pay-As-You-Go Data Integration Systems. In WebDB.

By Topic:

Table Header Detection and Relational Webtable Detection

  • Cafarella, M. J., Halevy, A. Y., Zhang, Y., Wang, D. Z., & Wu, E. (2008, June). Uncovering the Relational Web. In WebDB.
  • Crestan, E., & Pantel, P. (2011, February). Web-scale table census and classification. In Proceedings of the fourth ACM international conference on Web search and data mining (pp. 545-554). ACM.

 

Table Normalization

  • Braunschweig, K., Thiele, M., & Lehner, W. (2015, October). From web tables to concepts: A semantic normalization approach. In International Conference on Conceptual Modeling (pp. 247-260). Springer, Cham.
  • Wang, D. Z., Dong, X. L., Sarma, A. D., Franklin, M. J., & Halevy, A. Y. (2009, June). Functional Dependency Generation and Applications in Pay-As-You-Go Data Integration Systems. In WebDB.

Datasets

Organization

  • Project seminar for master's students
  • Language of instruction: English
  • Maximum number of participants: 6

Students form teams of two members. Each team is assigned a task and the corresponding publications. After studying this (and further) literature, the teams present a summary of state-of-the-art solutions and, in parallel, implement their baseline. To present the baseline and the results of the first phase to the whole group, all teams give a mid-term presentation.

In the second half of the seminar, each team tries to improve on the baseline or find a better solution for their task. The team members then report on their improvements in a final presentation. To conclude the seminar, each team prepares a paper-style submission of their solution.

Grading

The final grade is weighted with 6 LP (credit points) and is composed of the following:

  • (10%) Active participation in meetings and discussions
  • (15%) Technical presentation of a scientific paper (the chosen baseline)
  • (20%) End-term presentation
  • (25%) Quality of implementation and coding style
  • (30%) Final paper-style submission