Hasso-Plattner-Institut
Prof. Dr. Emmanuel Müller
 

Open bachelor or master thesis topics

We are currently looking for passionate students who want to write a bachelor or master thesis within the Knowledge Discovery and Data Mining group. Candidates should have good programming skills as well as a general interest in modern data mining techniques. Although any thesis proposal is welcome, we propose theses in our current research areas, which include correlation analysis, (un-)supervised feature selection, cluster and outlier detection, graph mining, user-centric and interactive graph exploration, graph evolution, ensemble methods, descriptions for unexpected patterns, semi-automated data exploration, and schema extraction.

The following projects are examples of open bachelor or master thesis projects. Please note that each of the topics presented may result in more than one thesis project.

Event impacts on time series

Due to the ubiquitous collection of data over time, important events like natural disasters, financial crises, or mass displacements of people often leave traces in environmental, economic, or other time series. Anomalies, extreme values, and significant changes in these time series can indicate the occurrence of such events. Hence, a lot of research focuses on detecting unusual patterns in time series to improve our understanding of the underlying systems and enable early warning systems. However, there is little agreement on how to precisely quantify the effect of discrete events on continuous time series in an interpretable way. In this project, we aim to fill this gap by exploring novel methods to correlate time series with event series. The project merges ideas from pattern mining, correlation analysis, and event coincidence analysis, so candidates should have a strong interest in math and algorithms.
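To make the idea of relating an event series to a time series more concrete, the following is a minimal sketch of a coincidence count in the spirit of event coincidence analysis. It is only an illustration under simplifying assumptions, not the method to be developed in the project: anomalies are defined by a fixed threshold, time is measured in integer steps, and no significance test is performed. The function names, the threshold, and the window length are hypothetical.

```python
def threshold_exceedances(series, threshold):
    """Return the time indices at which the series exceeds the threshold."""
    return [t for t, value in enumerate(series) if value > threshold]


def trigger_coincidence_rate(event_times, anomaly_times, window):
    """Fraction of events followed by at least one anomaly within `window` steps.

    An event at time e counts as a coincidence if some anomaly occurs at a
    time a with 0 <= a - e <= window.
    """
    if not event_times:
        return 0.0
    hits = sum(
        any(0 <= a - e <= window for a in anomaly_times)
        for e in event_times
    )
    return hits / len(event_times)


# Toy data (assumed for illustration): two spikes in the series, three events.
series = [1, 1, 9, 1, 1, 1, 8, 1, 1, 1]
events = [1, 5, 8]

anomalies = threshold_exceedances(series, threshold=5)
rate = trigger_coincidence_rate(events, anomalies, window=2)
print(f"anomalies at {anomalies}, coincidence rate {rate:.2f}")
```

A rate close to 1 suggests that most events are followed by an anomaly shortly afterwards. Whether such a rate is higher than expected by chance is a separate question that this sketch does not address.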

Contact: Erik Scharwächter

Monitoring correlations in massive time series collections

The observation of urban, societal, and environmental variables like air pollution level, mobile phone activity, or tropical vegetation cover through various sensors makes it possible to detect interesting, unusual events in near real-time. For example, unusual road congestion leads to high air pollution levels in some districts of a city, deforestation in a tropical forest can be perceived as a sharp decrease in vegetation cover, and an earthquake leads to a sudden increase in mobile phone calls in the affected regions. In this project, novel approaches to capture and analyze complex dynamics within massive collections of time series of such sensor readings are explored to reveal anomalous events and interesting patterns. A special focus is put on dynamics in the correlation structure of time series as well as the different spatial scales of events, ranging from small events with localized impacts to massive events that lead to global anomalies. The project combines methods from anomaly detection, clustering, and correlation tracking, so candidates should have a strong interest in math and algorithms.
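As a simple illustration of what tracking the correlation structure over time can mean, the sketch below computes the Pearson correlation of two series over a sliding window. This is a naive baseline under stated assumptions (two series of equal length, a fixed window, plain Python lists); it does not scale to massive collections and is not the approach to be developed in the project. The function names and the toy data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long sequences (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rolling_correlation(x, y, window):
    """Correlation of x and y in every window of the given length."""
    return [
        pearson(x[i:i + window], y[i:i + window])
        for i in range(len(x) - window + 1)
    ]


# Toy data (assumed): y follows x at first and then moves in the opposite direction.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 4, 6, 8, 8, 6, 4, 2]

for start, r in enumerate(rolling_correlation(x, y, window=4)):
    print(f"window starting at {start}: r = {r:+.2f}")
```

A change in sign or magnitude of the windowed correlation, as in the toy data, is one possible indicator that the relationship between two sensors has changed.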

Contact: Erik Scharwächter

Missing value invariant data mining

Although data mining algorithms have been improved extensively over the last two decades, incomplete data is still hard to process with traditional anomaly detection techniques. The most straightforward approach to handling missing values is listwise deletion, that is, leaving out a record entirely if a single value is missing. However, as the dimensionality of the data increases, the probability that a row is complete decreases exponentially (assuming values are missing independently of each other). Another approach is to impute missing values before applying the algorithm. This, however, can be costly and can introduce bias.
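The effect of dimensionality on listwise deletion can be illustrated with a short calculation. The sketch below assumes that each value is missing independently with the same probability p, so that a row with d features is complete with probability (1 - p)^d. The missing rate of 5% and the function names are assumptions chosen for illustration only.

```python
def complete_row_probability(p_missing, n_features):
    """Probability that a row has no missing value, assuming independent missingness."""
    return (1.0 - p_missing) ** n_features


def listwise_delete(rows):
    """Keep only rows without any missing value (None marks a missing entry)."""
    return [row for row in rows if all(value is not None for value in row)]


# Share of rows that survive listwise deletion for a 5% missing rate per value.
for d in (1, 10, 50, 100):
    share = complete_row_probability(0.05, d)
    print(f"{d:>3} features: {share:.4f} of rows expected to be complete")

# Listwise deletion on a tiny example table.
table = [[1.0, 2.0], [None, 3.0], [4.0, 5.0]]
print(listwise_delete(table))
```

Even with a small per-value missing rate, only a small fraction of rows is expected to remain once the number of features reaches the hundreds, which is why listwise deletion quickly becomes impractical.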

The aim of this project is to develop novel data mining algorithms or adapt existing ones (clustering, outlier detection, classification, regression, feature selection) so that they overcome the problem of missing data. This requires some knowledge of statistics, machine learning, and programming, so candidates should feel comfortable in these areas. For the thesis, candidates can either bring their own idea or we will provide one.

Contact: Thomas Goerttler 

Interactive Graph Exploration

Graphs are widely used for the representation of social networks, gene and protein interactions, communication networks, or product co-purchases in web stores. Each object is represented by its relationships to other objects (edge structure) and by additional properties (attributes). For instance, social networks store friendship relations as edges and age, income, and other properties as attributes. Exploring the structure of the graph while the user is searching for information reveals important relationships between the graph and the user. We want to support graph exploration that starts from a user query and clusters the results in a convenient way. The project will start with the implementation of an existing technique for graph exploration and will then move in new directions to present unexpected results or more insights into the structures to the user.

User-driven graph mining

Social networks contain a lot of information in the form of user profiles. They comprise large and complex graph structures and a lot of additional heterogeneous information (e.g. from user profiles, behaviors, opinions, etc.). On the one hand, automatic graph mining technology makes it possible to analyze such graphs, while on the other hand, human users have certain (user-driven) interests in the graph. If a user wants to see only a portion of the social network based on their (user-driven) preferences, we need to focus on a specific (user-driven) sub-graph and take several heterogeneous information sources into account for this selection. Starting from the implementation of traditional frequent pattern mining techniques, the project will explore ways to exclude parts of the graph that are not interesting to the user and to combine the information into small graph summaries.

Reusability through large scale graph compatibility

3D printers started a revolution in object fabrication, since users can now construct objects without expensive materials and complicated procedures. As a result, many different repositories of 3D objects ready to be printed have recently appeared. One such example is thingiverse.com, which allows any user to search for a model and construct it. This is, however, very limited, since users should also be able to search for parts (or mechanisms) and understand the best way to reuse a mechanism. 3D objects can be conveniently modelled as graphs, where each node is a joint and each edge represents a connection between different parts. In this thesis, in collaboration with the HCI group, we intend to mine the most probable objects attached to a given input mechanism to help the user in the construction phase.

Predictive analytics for enterprise performance management

The importance of predicting trends in data has been shown in many applications, including financial data, weather forecasting, and sensor readings. However, non-expert users need guidance in using forecasting models and in selecting the right descriptors (features) for any sheet of data. Another important aspect is run time, since the tool needs to be fast, scalable, and usable. The thesis aims at creating such a framework to allow the selection of the right model, the analysis of the data, and the forecasting of feature trends. This project will be supervised in collaboration with Valsight, and the candidate will work with real data on a real customer scenario. Valsight is an HPI startup that builds software for risk-based simulations to quantify the impact of events and measures on KPIs for decision support in enterprise performance management.

Feature selection for multivariate sensor readings

In today's world, the number of measurements one can collect from complex systems is huge. Examples include temperature, heart rate, traffic, and so on. This is particularly important for multivariate sensor readings, where multiple sensors acquire different data from the environment. Such data represents a valuable source of information as well as a big challenge for data analysts, who have to face the scalability problems of existing techniques. Moreover, not all measurements are relevant for predicting a given phenomenon, and there might be redundant or useless information. Feature selection is used in these cases. This thesis will explore feature selection on sensors to predict future readings or other variables. Current methods exploit a single selection criterion and mostly work with data of limited size, while the project aims at overcoming these limitations to select the best subset of input variables. This project will be supervised in collaboration with Bosch and GFZ, and the candidate will work with real sensor data from complex systems.

Database query answering in time series

The development of sensor technologies in a wide range of domains (e.g. earth observation, predictive maintenance, health monitoring, etc.) has led to a big data explosion in monitoring activities, which produce very large amounts of time series. In order to efficiently process and analyze large volumes of time series, we can build index structures that help us answer similarity queries on massive collections of time series quickly. Nevertheless, query answering in this domain remains a challenging problem, and a deeper understanding of database query processing and raw data characteristics is needed. The goal of this master thesis project is to explore the reasons that make query answering in time series data challenging and to develop new methods that characterize a given query w.r.t. both intrinsic properties of the data and the distribution of answers in the search space. To this end, together with our collaboration partners from Paris Descartes University, we will develop a measure that characterizes the difficulty of answering a query and encompasses both of these factors at the same time.