Processing Web Tables
Advisors
Prof. Dr. Felix Naumann, Hazar Harmouch and Leon Bornemann
Description
Tables on the web are a significant source of structured information. In a large-scale crawling effort in 2008, Cafarella et al. extracted 14.1 billion tables from billions of HTML webpages. While many webtables are used for layout purposes only, there are still many tables that contain high-quality, structured information. Cafarella et al. estimate that 154 million of the 14.1 billion tables contain relational data, i.e., database-like tables. Even the English version of Wikipedia alone contained more than 1 million tables as of November 2017. However, making use of webtables automatically is challenging: the tables usually contain few records and are designed to be read by humans, not machines. The main use cases of webtables include knowledge base augmentation, searching or querying large table corpora, and finding the set of tables joinable with a query table.
These use cases demand solutions for many different tasks, which include, but are not limited to:
- Detection of genuine (relational) web tables.
- Header (row(s)/column(s)) detection.
- Schema normalization.
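To give a feeling for the first two tasks, the following is a minimal sketch, not one of the methods from the literature below. It extracts tables from raw HTML with Python's standard `html.parser` and computes a few simple structural features. The names `TableExtractor`, `extract_tables`, `table_features` and `looks_relational`, as well as all thresholds, are assumptions chosen purely for illustration; the published approaches use richer features and trained classifiers.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collects each <table> as a list of rows of (text, is_header) cells.

    Simplification: nested tables are collected as separate tables, and the
    text of a cell that wraps a nested table is dropped.
    """

    def __init__(self):
        super().__init__()
        self.tables = []   # finished tables
        self._stack = []   # tables that are currently open
        self._cell = None  # text fragments of the cell being read
        self._is_th = False

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._stack.append([])
        elif tag == "tr" and self._stack:
            self._stack[-1].append([])
        elif tag in ("td", "th") and self._stack and self._stack[-1]:
            self._cell = []
            self._is_th = tag == "th"

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._stack and self._stack[-1]:
            text = " ".join("".join(self._cell).split())  # normalize whitespace
            self._stack[-1][-1].append((text, self._is_th))
            self._cell = None
        elif tag == "table" and self._stack:
            self.tables.append(self._stack.pop())


def extract_tables(html):
    parser = TableExtractor()
    parser.feed(html)
    return parser.tables


def table_features(table):
    """Simple structural features of one extracted table."""
    rows = [r for r in table if r]  # ignore rows without cells
    widths = [len(r) for r in rows]
    n_cols = max(widths, default=0)
    cells = [text for r in rows for text, _ in r]
    return {
        "rows": len(rows),
        "cols": n_cols,
        # share of rows that have the maximum number of cells
        "consistent_rows": sum(w == n_cols for w in widths) / len(rows) if rows else 0.0,
        "non_empty": sum(bool(c) for c in cells) / len(cells) if cells else 0.0,
        "avg_cell_len": sum(len(c) for c in cells) / len(cells) if cells else 0.0,
        # <th> cells are only a weak hint for a header row or column
        "has_th": any(is_th for r in rows for _, is_th in r),
    }


def looks_relational(f):
    """Hand-picked illustrative thresholds, not taken from any paper."""
    return (f["rows"] >= 2 and f["cols"] >= 2
            and f["consistent_rows"] >= 0.8
            and f["non_empty"] >= 0.5
            and f["avg_cell_len"] <= 50)


if __name__ == "__main__":
    page = """
    <table><tr><td><p>Site navigation and other layout content</p></td></tr></table>
    <table>
      <tr><th>City</th><th>Country</th></tr>
      <tr><td>Potsdam</td><td>Germany</td></tr>
      <tr><td>Vienna</td><td>Austria</td></tr>
    </table>
    """
    for table in extract_tables(page):
        features = table_features(table)
        print(features["rows"], features["cols"], looks_relational(features))
```

On the sample page, the single-cell layout table is rejected and the city/country table is accepted. A fixed rule like this will misclassify many real webtables, which is exactly why the literature treats detection as a learning problem.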
In this seminar, we will introduce you to the research area of webtables. Each team, ideally consisting of two people, will implement a solution for one of the above-mentioned tasks (or for any other relevant problem in the research area of webtables that the team finds interesting). We will provide you with state-of-the-art papers that suggest solutions to these problems. You will implement these solutions and try to improve upon them with your own ideas in a scalable way.
Time Table
When: Thursday 13:30 at Campus II, Building F, Room F-2-11.
Date | Topic | Slides |
11.4.2019 | Introduction. Preparatory material: Cafarella, M., Halevy, A., Lee, H., Madhavan, J., Yu, C., Wang, D. Z., & Wu, E. (2018). Ten years of webtables. Proceedings of the VLDB Endowment, 11(12), 2140-2149. | |
18.4.2019 | Group allocation + mini talks about research topics in our group | |
25.4.2019 | Basics of literature search and giving technical talks | |
2.5.2019 | Task presentations (summary and planned contributions) | |
9.5.2019 | Progress report | |
16.5.2019 | Technical talk about a specific paper (baseline solution) | |
23.5.2019 | Research process: practical hints and clues (UPDATE: moved to 29.5.2019 at 3 PM, Campus II, Building F, Room F-2-11) | |
30.5.2019 | Christi Himmelfahrt (Ascension Day) | |
6.6.2019 | Mid-term presentation (implementation) | |
13.6.2019-27.6.2019 | Progress report | |
4.7.2019 | Present improvements | |
11.7.2019-18.7.2019 | Progress report (UPDATE: 12.7.2019 at 2 PM, Campus II, Building F, Room F-2-11, instead of 11.7.2019) | |
25.7.2019 | End-term presentation | |
End of September | Final paper-style submission | |
Literature
You can find the following papers on dblp or Google Scholar:
- Cafarella, M. J., Halevy, A. Y., Zhang, Y., Wang, D. Z., & Wu, E. (2008, June). Uncovering the Relational Web. In WebDB.
- Cafarella, M. J., Halevy, A., Wang, D. Z., Wu, E., & Zhang, Y. (2008). Webtables: exploring the power of tables on the web. Proceedings of the VLDB Endowment, 1(1), 538-549.
- Balakrishnan, S., Halevy, A., Harb, B., Lee, H., Madhavan, J., Rostamizadeh, A., ... & Yu, C. (2015). Applying webtables in practice.
- Cafarella, M., Halevy, A., Lee, H., Madhavan, J., Yu, C., Wang, D. Z., & Wu, E. (2018). Ten years of webtables. Proceedings of the VLDB Endowment, 11(12), 2140-2149.
- Pinto, D., McCallum, A., Wei, X., & Croft, W. B. (2003, July). Table extraction using conditional random fields. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 235-242). ACM.
- Eberius, J., Braunschweig, K., Hentsch, M., Thiele, M., Ahmadov, A., & Lehner, W. (2015). Building the Dresden Web Table Corpus: A classification approach. In Proceedings of the 2nd International Symposium on Big Data Computing (BDC).
- Crestan, E., & Pantel, P. (2011, February). Web-scale table census and classification. In Proceedings of the fourth ACM international conference on Web search and data mining (pp. 545-554). ACM.
- Wang, J., Wang, H., Wang, Z., & Zhu, K. Q. (2012, October). Understanding tables on the web. In International Conference on Conceptual Modeling (pp. 141-155). Springer, Berlin, Heidelberg.
- Braunschweig, K., Thiele, M., & Lehner, W. (2015, October). From web tables to concepts: A semantic normalization approach. In International Conference on Conceptual Modeling (pp. 247-260). Springer, Cham.
- Lehmberg, O., & Bizer, C. (2017). Stitching web tables for improving matching quality. Proceedings of the VLDB Endowment, 10(11), 1502-1513.
- Wang, D. Z., Dong, X. L., Sarma, A. D., Franklin, M. J., & Halevy, A. Y. (2009, June). Functional Dependency Generation and Applications in Pay-As-You-Go Data Integration Systems. In WebDB.
By Topic:
Table Header Detection and Relational Webtable Detection
- Cafarella, M. J., Halevy, A. Y., Zhang, Y., Wang, D. Z., & Wu, E. (2008, June). Uncovering the Relational Web. In WebDB.
- Crestan, E., & Pantel, P. (2011, February). Web-scale table census and classification. In Proceedings of the fourth ACM international conference on Web search and data mining (pp. 545-554). ACM.
Table Normalization
- Braunschweig, K., Thiele, M., & Lehner, W. (2015, October). From web tables to concepts: A semantic normalization approach. In International Conference on Conceptual Modeling (pp. 247-260). Springer, Cham.
- Wang, D. Z., Dong, X. L., Sarma, A. D., Franklin, M. J., & Halevy, A. Y. (2009, June). Functional Dependency Generation and Applications in Pay-As-You-Go Data Integration Systems. In WebDB.
Datasets
- Web Data Commons - Web Table Corpora
- Wikipedia tables in JSON format
Organization
- Project seminar for master's students
- Language of instruction: English
- Maximum number of participants: 6
Students form teams of two members. Each team is assigned a task and the corresponding publications. After studying this (and further) literature, each team presents a summary of state-of-the-art solutions and, in parallel, implements its baseline. To present the baseline and the results of the first phase to the whole group, each team gives a mid-term presentation.
In the second half of the seminar, each team tries to improve on the baseline or to find a better solution for its task. The team members then report on their improvements in a final presentation. To conclude the seminar, each team prepares a paper-style submission of its solution.
Grading
The seminar counts for 6 LP (credit points). The final grade is composed of the following:
- (10%) Active participation in meetings and discussions
- (15%) Technical presentation of a scientific paper (the chosen baseline)
- (20%) End-term presentation
- (25%) Quality of implementation and coding style
- (30%) Final paper-style submission