For bachelor students, we offer German-language lectures on database systems, complemented by paper- or project-oriented seminars. In a one-year bachelor project, students complete their studies in cooperation with external partners. For master students, we offer courses on information integration, data profiling, search engines, and information retrieval, complemented by specialized seminars, master projects, and supervised master theses.
The Web Science group focuses on various topics related to the Web, such as Information Retrieval, Natural Language Processing, Data Mining, Knowledge Discovery, Social Network Analysis, Entity Linking, and Recommender Systems. The group is particularly interested in Text Mining to deal with the vast amount of unstructured and semi-structured information available on the Web.
Most of our research is conducted in the context of larger research projects, in collaboration across students, across groups, and across universities. We strive to make available most of our data sets and source code.
In this seminar, we want to implement large-scale data processing tasks on two MapReduce platforms. Each team of two students will study a large-scale data analysis problem, implement it in both Hadoop and Stratosphere, and compare the efficiency and scale-out properties of the two solutions.
The maximum number of students is 8, resulting in 4 teams.
To participate, please join us at the first meeting on Oct 18 in H-2.58.
October 18: Topic introduction
October 22: Submission of topic wishlist
October 24: Notification
November 15: Paper presentation and implementation ideas (15+5 min)
December 20: Intermediate presentation (15+5 min)
February 10, 9.15: Final presentation (30+10 min)
March 25: Final report (6-8 pages)
Weekly consultations: a few are mandatory (before each presentation and before submission of the report); the rest are optional.
Compute PageRank efficiently on a cluster (Chapter 5.2) and implement one extension that counters link spam: either TrustRank (5.4.4) or SpamMass (5.4.5).
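To make the PageRank task more concrete, the following is a minimal single-machine sketch in Python, not a Hadoop or Stratosphere implementation. It mimics one map step (each page emits rank shares to its successors) and one reduce step (shares are summed per page) per iteration. The function name `pagerank`, the damping factor `beta`, the optional `teleport` set, and the dead-end handling are illustrative assumptions. Restricting `teleport` to a set of trusted pages gives the basic idea behind TrustRank.

```python
from collections import defaultdict


def pagerank(links, beta=0.85, iterations=50, teleport=None):
    """Iterative PageRank on an adjacency dict {page: [successor, ...]}.

    `teleport` is an optional set of pages that random jumps land on.
    With teleport=None, jumps are uniform over all pages (plain PageRank).
    All parameter names here are illustrative assumptions.
    """
    nodes = set(links) | {v for outs in links.values() for v in outs}
    targets = set(teleport) if teleport else nodes
    jump = {u: (1.0 / len(targets) if u in targets else 0.0) for u in nodes}
    rank = dict(jump)

    for _ in range(iterations):
        # "Map": every page emits an equal share of its rank to each successor.
        contrib = defaultdict(float)
        for u in nodes:
            outs = links.get(u, [])
            for v in outs:
                contrib[v] += rank[u] / len(outs)

        # "Reduce": sum the shares per page and add the teleport term.
        new_rank = {u: beta * contrib[u] + (1.0 - beta) * jump[u] for u in nodes}

        # Assumption: rank lost at dead ends is returned via the jump distribution.
        leaked = 1.0 - sum(new_rank.values())
        rank = {u: new_rank[u] + leaked * jump[u] for u in nodes}

    return rank
```

On a cluster, the per-page rank vector no longer fits in one dictionary, so the map and reduce steps above would have to be expressed as distributed jobs over partitions of the link graph.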
Clustering in Non-Euclidean Space
Clustering groups similar items according to a distance measure. Chapter 7.5 introduces clustering in non-Euclidean spaces, and Chapter 7.6.6 briefly outlines a parallel implementation.
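In a non-Euclidean space there is no meaningful average of points, so a cluster is usually represented by one of its own members, often called the clustroid. The sketch below is only an illustration of that building block, not the full algorithm from the chapter: it uses Jaccard distance on sets as an assumed distance measure and picks the member with the smallest sum of squared distances to the others. The function names are hypothetical.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty sets are considered identical
    return 1.0 - len(a & b) / len(union)


def clustroid(points, distance=jaccard_distance):
    """Return the member that minimizes the sum of squared distances
    to all members of the cluster (one common clustroid definition)."""
    return min(points, key=lambda p: sum(distance(p, q) ** 2 for q in points))


cluster = [{1, 2, 3}, {1, 2, 3, 4}, {1, 2}, {7, 8, 9}]
representative = clustroid(cluster)
```

Any other distance function with the same two-argument signature, such as an edit distance on strings, could be passed in place of `jaccard_distance`.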
Frequent itemsets (Chapter 6) are sets of items that frequently co-occur in a large data set, e.g., books that are regularly bought together at Amazon. The SON algorithm parallelizes well with MapReduce, as described in Chapter 6.4.4.
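The two passes of SON can be sketched on a single machine as follows. This is a simplified illustration under stated assumptions, not a MapReduce job: the helper `local_frequent` enumerates itemsets by brute force up to an assumed `max_size`, and the chunking scheme and all names are hypothetical. Pass 1 mines each chunk with a proportionally lowered threshold to collect candidates; pass 2 counts the candidates over all baskets and keeps those that meet the global support.

```python
from collections import Counter
from itertools import combinations


def local_frequent(baskets, threshold, max_size=2):
    """Brute-force miner: itemsets (as sorted tuples) with count >= threshold."""
    counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))
        for k in range(1, max_size + 1):
            for combo in combinations(items, k):
                counts[combo] += 1
    return {itemset for itemset, c in counts.items() if c >= threshold}


def son(baskets, support, num_chunks=2, max_size=2):
    """Two-pass SON: local candidates first, then exact global counts."""
    chunks = [baskets[i::num_chunks] for i in range(num_chunks)]

    # Pass 1 (one map task per chunk): scale the threshold to the chunk size.
    candidates = set()
    for chunk in chunks:
        local_threshold = support * len(chunk) / len(baskets)
        candidates |= local_frequent(chunk, local_threshold, max_size)

    # Pass 2: count every candidate over all chunks and filter globally.
    totals = Counter()
    for chunk in chunks:
        for basket in chunk:
            items = set(basket)
            for cand in candidates:
                if items.issuperset(cand):
                    totals[cand] += 1
    return {cand: n for cand, n in totals.items() if n >= support}


baskets = [["a", "b"], ["a", "b", "c"], ["a", "c"], ["b", "c"], ["a", "b"]]
frequent = son(baskets, support=3)
```

Because an itemset that is frequent overall must be frequent in at least one chunk at the scaled threshold, pass 1 produces no false negatives; pass 2 removes the false positives.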
Collaborative filtering is a technique that recommends items to users based on a large knowledge base of previous user-item relations, e.g., purchases or ratings. Chapter 9 covers recommendation systems in general; one approach suited to parallel implementation is parallel stochastic gradient descent.
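As a starting point for the collaborative filtering task, here is a minimal sequential sketch of matrix factorization trained with stochastic gradient descent. It is not the parallel variant; it only shows the per-rating update that a parallel version would distribute. The function names, the hyperparameters (`num_factors`, `lr`, `reg`, `epochs`), and the toy ratings are all illustrative assumptions.

```python
import random


def factorize(ratings, num_factors=2, lr=0.05, reg=0.02, epochs=200, seed=0):
    """Learn user and item factor vectors from (user, item, rating) triples."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    P = {u: [rng.uniform(0.1, 0.5) for _ in range(num_factors)] for u in users}
    Q = {i: [rng.uniform(0.1, 0.5) for _ in range(num_factors)] for i in items}

    samples = list(ratings)
    for _ in range(epochs):
        rng.shuffle(samples)
        for u, i, r in samples:
            pu, qi = P[u], Q[i]
            err = r - sum(a * b for a, b in zip(pu, qi))
            for f in range(num_factors):
                pu_f, qi_f = pu[f], qi[f]
                # Gradient step on squared error with L2 regularization.
                pu[f] += lr * (err * qi_f - reg * pu_f)
                qi[f] += lr * (err * pu_f - reg * qi_f)
    return P, Q


def predict(P, Q, user, item):
    """Predicted rating: dot product of the user and item factor vectors."""
    return sum(a * b for a, b in zip(P[user], Q[item]))


ratings = [("u1", "i1", 5), ("u1", "i2", 3), ("u2", "i1", 4),
           ("u2", "i3", 1), ("u3", "i2", 2), ("u3", "i3", 1)]
P, Q = factorize(ratings)
estimate = predict(P, Q, "u1", "i3")  # a rating that was never observed
```

Each update touches only one user vector and one item vector, which is what makes it possible to process ratings for disjoint users and items at the same time.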