Processing Web Tables

Advisors

Prof. Dr. Felix Naumann, Hazar Harmouch and Leon Bornemann

Description

Tables on the web are a significant source of structured information. In a large-scale crawling effort in 2008, Cafarella et al. extracted 14.1 billion tables from billions of HTML webpages. While many webtables are used for layout purposes only, a large number still contain high-quality, structured information. Cafarella et al. estimate that 154 million of the 14.1 billion tables contain relational data, i.e., database-like tables. The English version of Wikipedia alone contained more than 1 million tables as of 11/2017. However, making use of webtables automatically is challenging: the tables usually contain few records and are designed to be read by humans, not machines. The main use cases of webtables include knowledge base augmentation, searching or querying large table corpora, and finding the set of tables joinable with a query table.

The above use cases demand solutions for many different tasks, which include, but are not limited to:

Detection of genuine (relational) web tables.
Header (Row(s)/Column(s)) Detection.
Schema Normalization.
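
To give a feel for the first two tasks, the following is a minimal heuristic sketch in Python. It is not the method of any of the papers listed below: the table representation (a list of rows of cell strings), the function names, and all thresholds are assumptions chosen for illustration only. Actual baselines would typically replace such hand-set rules with features and a trained classifier.

```python
from typing import List

# Hypothetical representation: a table is a list of rows, each a list of cell strings.
Table = List[List[str]]


def is_numeric(cell: str) -> bool:
    """Return True if the cell parses as a number (thousands commas and a trailing % tolerated)."""
    text = cell.strip().replace(",", "").rstrip("%")
    try:
        float(text)
        return True
    except ValueError:
        return False


def looks_relational(table: Table, min_rows: int = 2, min_cols: int = 2,
                     max_empty_ratio: float = 0.3) -> bool:
    """Crude filter for layout tables; the thresholds are illustrative assumptions."""
    if len(table) < min_rows:
        return False
    widths = {len(row) for row in table}
    # Require a rectangular table with enough columns.
    if len(widths) != 1 or widths.pop() < min_cols:
        return False
    cells = [cell for row in table for cell in row]
    empty = sum(1 for cell in cells if not cell.strip())
    return empty / len(cells) <= max_empty_ratio


def first_row_is_header(table: Table) -> bool:
    """Guess that row 0 is a header if it has no numeric or empty cells
    and at least one column below it is entirely numeric."""
    if len(table) < 2:
        return False
    header, body = table[0], table[1:]
    if any(is_numeric(cell) or not cell.strip() for cell in header):
        return False
    return any(
        all(i < len(row) and is_numeric(row[i]) for row in body)
        for i in range(len(header))
    )


cities = [["City", "Population"], ["Berlin", "3,645,000"], ["Potsdam", "178,000"]]
layout = [["", "menu", ""], ["", "", ""]]
print(looks_relational(cities), first_row_is_header(cities))
print(looks_relational(layout))
```

Rules like these are cheap to run over a large corpus, but they will misclassify many tables (for example, purely textual relational tables have no numeric column), which is one reason the literature treats both tasks as classification problems.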

In this seminar, we will introduce you to the research area of webtables. Each team, ideally consisting of two people, will implement a solution for one of the above-mentioned tasks (or any other relevant problem in the research area of webtables that they find interesting). We will provide you with state-of-the-art papers that suggest solutions to the above problems, which you will implement and try to improve upon with your own ideas in a scalable way.

Time Table

When: Thursday 13:30 at Campus II, Building F, Room F-2-11.

Date	Topic	Slides
11.4.2019 (H-E.51)	Introduction. Preparatory material: Cafarella, M., Halevy, A., Lee, H., Madhavan, J., Yu, C., Wang, D. Z., & Wu, E. (2018). Ten years of webtables. Proceedings of the VLDB Endowment, 11(12), 2140-2149.	PDF
18.4.2019	Group allocation + mini talks about research topics in our group	PDF
25.4.2019	Basics of literature search and giving technical talks	PDF
2.5.2019	Task presentations (summary and planned contributions)
9.5.2019	Progress report
16.5.2019	Technical talk about a specific paper (baseline solution)
23.5.2019	Research Process – practical hints and clues (UPDATE: moved to 29.5.2019 at 3 PM, Campus II, Building F, Room F-2-11)	PDF
30.5.2019	Ascension Day (Christi Himmelfahrt, public holiday)
6.6.2019	Mid-term presentation (implementation)
13.6.2019-27.6.2019	Progress report
4.7.2019	Present improvements
11.7.2019-18.7.2019	Progress report (UPDATE: 12.7.2019 at 2 PM, Campus II, Building F, Room F-2-11, instead of 11.7.2019)
25.7.2019	End-term presentation
End of September	Final paper-style submission

Literature

You can find the following papers on dblp or Google Scholar:

Cafarella, M. J., Halevy, A. Y., Zhang, Y., Wang, D. Z., & Wu, E. (2008, June). Uncovering the Relational Web. In WebDB.
Cafarella, M. J., Halevy, A., Wang, D. Z., Wu, E., & Zhang, Y. (2008). Webtables: exploring the power of tables on the web. Proceedings of the VLDB Endowment, 1(1), 538-549.
Balakrishnan, S., Halevy, A., Harb, B., Lee, H., Madhavan, J., Rostamizadeh, A., ... & Yu, C. (2015). Applying webtables in practice.
Cafarella, M., Halevy, A., Lee, H., Madhavan, J., Yu, C., Wang, D. Z., & Wu, E. (2018). Ten years of webtables. Proceedings of the VLDB Endowment, 11(12), 2140-2149.
Pinto, D., McCallum, A., Wei, X., & Croft, W. B. (2003, July). Table extraction using conditional random fields. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 235-242). ACM.
Eberius, J., Braunschweig, K., Hentsch, M., Thiele, M., Ahmadov, A., & Lehner, W. (2015). Building the Dresden Web Table Corpus: A classification approach. In Proceedings of the 2nd International Symposium on Big Data Computing (BDC).
Crestan, E., & Pantel, P. (2011, February). Web-scale table census and classification. In Proceedings of the fourth ACM international conference on Web search and data mining (pp. 545-554). ACM.
Wang, J., Wang, H., Wang, Z., & Zhu, K. Q. (2012, October). Understanding tables on the web. In International Conference on Conceptual Modeling (pp. 141-155). Springer, Berlin, Heidelberg.
Braunschweig, K., Thiele, M., & Lehner, W. (2015, October). From web tables to concepts: A semantic normalization approach. In International Conference on Conceptual Modeling (pp. 247-260). Springer, Cham.
Lehmberg, O., & Bizer, C. (2017). Stitching web tables for improving matching quality. Proceedings of the VLDB Endowment, 10(11), 1502-1513.
Wang, D. Z., Dong, X. L., Sarma, A. D., Franklin, M. J., & Halevy, A. Y. (2009, June). Functional Dependency Generation and Applications in Pay-As-You-Go Data Integration Systems. In WebDB.

By Topic:

Table Header Detection and Relational Webtable Detection

Cafarella, M. J., Halevy, A. Y., Zhang, Y., Wang, D. Z., & Wu, E. (2008, June). Uncovering the Relational Web. In WebDB.
Crestan, E., & Pantel, P. (2011, February). Web-scale table census and classification. In Proceedings of the fourth ACM international conference on Web search and data mining (pp. 545-554). ACM.

Table Normalization

Braunschweig, K., Thiele, M., & Lehner, W. (2015, October). From web tables to concepts: A semantic normalization approach. In International Conference on Conceptual Modeling (pp. 247-260). Springer, Cham.
Wang, D. Z., Dong, X. L., Sarma, A. D., Franklin, M. J., & Halevy, A. Y. (2009, June). Functional Dependency Generation and Applications in Pay-As-You-Go Data Integration Systems. In WebDB.

Datasets

Web Data Commons - Web Table Corpora
Wikipedia tables in JSON format

Organization

Project seminar for master students
Language of instruction: English
Maximum number of participants: 6

Students form teams of two members. Each team is assigned a task and the corresponding publications. After studying this (and further) literature, the teams should present a summary of state-of-the-art solutions and, in parallel, implement their baseline. To present the baseline and the results of the first phase to the whole group, all teams will give a mid-term presentation.

In the second half of the seminar, each team tries to improve on the baseline or find a better solution for their task. The team members then report on their improvements in a final presentation. To conclude the seminar, each team prepares a paper-style submission describing their solution.

Grading

The final grade is weighted by 6 LP and considers the following:

(10%) Active participation in meetings and discussions
(15%) Technical presentation of a scientific paper (the chosen baseline)
(20%) End-term presentation
(25%) Quality of implementation and coding style
(30%) Final paper-style submission