Prof. Dr. Emmanuel Müller

Open bachelor and master thesis topics

We are currently looking for passionate students who want to write a bachelor or master thesis within the Knowledge Discovery and Data Mining group. Candidates should have good programming skills and a general interest in modern data mining techniques. Although any thesis proposal is welcome, we suggest topics from our current research areas, which include correlation analysis, (un-)supervised feature selection, cluster and outlier detection, graph mining, user-centric and interactive graph exploration, graph evolution, ensemble methods, descriptions for unexpected patterns, semi-automated data exploration, and schema extraction.

The following projects are examples of open bachelor or master thesis projects. Please note that each of the topics presented may lead to more than one thesis project.

Event detection in sensor collections

Monitoring urban, societal, and environmental variables such as air pollution, mobile phone usage, or tropical vegetation cover through various sensors makes it possible to detect interesting, unusual events in near real time. For example, unusual road congestion leads to high air pollution levels in some districts of a city, deforestation in a tropical forest can be perceived as a sharp decrease in vegetation cover, and an earthquake leads to a sudden increase in call volume in the affected regions. In this project, a novel approach for event detection in sensor collections will be explored that explicitly considers the different spatial scales of events, ranging from small events with localized impacts to massive events that lead to global anomalies. The project combines methods from anomaly detection and combinatorial optimization, so candidates should have an interest in math and algorithms.

Interactive graph exploration

Graphs are widely used to represent social networks, gene and protein interactions, communication networks, or product co-purchases in web stores. Each object is represented by its relationships to other objects (edge structure) and by additional properties (attributes). For instance, social networks store friendship relations as edges and age, income, and other properties as attributes. Examining the structure of the graph while a user searches for information reveals important relationships between the graph and the user. We want to support graph exploration that starts from a user query and clusters the results in a convenient way. The project will start with the implementation of an existing technique for graph exploration and will then move in new directions, presenting the user with unexpected results or further insights into the structures.
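
The attributed graph model described above (edges for relationships, attributes for properties) can be sketched in a few lines of Python. This is only an illustration and not the data structure used in the project; the class name AttributedGraph, its methods, and the example users and values are assumptions made up for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class AttributedGraph:
    """Hypothetical minimal attributed graph: undirected edges plus node attributes."""
    attributes: dict = field(default_factory=dict)  # node -> {attribute name: value}
    edges: dict = field(default_factory=dict)       # node -> set of neighbouring nodes

    def add_node(self, node, **attrs):
        # Store the node's properties and make sure it has an adjacency set.
        self.attributes[node] = dict(attrs)
        self.edges.setdefault(node, set())

    def add_edge(self, u, v):
        # A friendship is symmetric, so record it in both directions.
        self.edges[u].add(v)
        self.edges[v].add(u)

    def neighbors(self, node):
        # Sorted for a deterministic result.
        return sorted(self.edges[node])


def build_example():
    # Invented example data: three users with age and income attributes.
    g = AttributedGraph()
    g.add_node("alice", age=30, income=52000)
    g.add_node("bob", age=25, income=41000)
    g.add_node("carol", age=41, income=67000)
    g.add_edge("alice", "bob")
    g.add_edge("alice", "carol")
    return g
```

A query such as build_example().neighbors("alice") returns the friends of one user, and the attributes dictionary gives access to the properties that an exploration technique could use to group the results.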

User-driven graph mining

Social networks contain a great deal of information in the form of user profiles. They comprise large and complex graph structures and much additional heterogeneous information (e.g. from user profiles, behaviors, opinions, etc.). On the one hand, automatic graph mining technology makes it possible to analyze such graphs; on the other hand, human users have certain (user-driven) interests in the graph. If users want to see only a portion of the social network based on their (user-driven) preferences, we need to focus on a specific (user-driven) sub-graph and take several heterogeneous information sources into account for this selection. Starting from the implementation of traditional frequent pattern mining techniques, the project will explore ways to exclude parts of the graph that are not interesting for the user and to combine the information into small graph summaries.

Reusability through large scale graph compatibility

3D printers started a revolution in object fabrication, since users can now construct objects without expensive materials and complicated procedures. As a result, many different repositories of 3D objects ready to be printed have recently appeared. One such example is thingiverse.com, which allows any user to search for a model and construct it. This is, however, very limited: users should also be able to search for parts (or mechanisms) and understand how best to reuse them. 3D objects can be conveniently modelled as graphs, where each node is a joint and each edge represents a connection between different parts. In this thesis, in collaboration with the HCI group, we intend to mine the most probable objects attached to a given input mechanism to help the user in the construction phase.

Predictive analytics for enterprise performance management

The importance of predicting trends in data has been shown in many applications, including financial data, weather forecasting, and sensor readings. However, a non-expert user needs guidance in using forecasting models and in selecting the right descriptors (features) for any sheet of data. Another important aspect is time, since there is a need for a fast, scalable, and usable tool. The thesis aims at creating such a framework to allow the selection of the right model, the analysis of the data, and the forecasting of feature trends. This project will be supervised in collaboration with Valsight, and the candidate will work with real data in a real customer scenario. Valsight is an HPI startup that builds software for risk-based simulations to quantify the impact of events and measures on KPIs for decision support in enterprise performance management.

Feature selection for multivariate sensor readings

In today's world, the number of measurements one can collect from complex systems is huge. Examples include temperature, heart rate, traffic, and so on. This is particularly true for multivariate sensor readings, where multiple sensors acquire different data from the environment. Such data represents a valuable source of information as well as a big challenge for data analysts, who have to face the scalability problems of existing techniques. Moreover, not all measurements are relevant for predicting a given phenomenon, and there might be redundant or useless information. Feature selection is used in these cases. This thesis will explore feature selection on sensors to predict future readings or other variables. Current methods exploit a single selection criterion and mostly work with data of limited size, while the project aims at overcoming these limitations to select the best subset of input variables. This project will be supervised in collaboration with Bosch and GFZ, and the candidate will work with real sensor data from complex systems.
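
To make the idea of a single selection criterion concrete, the following sketch ranks sensor features by their absolute Pearson correlation with a target variable. It is a simple filter baseline chosen for illustration, not the method to be developed in the thesis; the function names pearson and rank_features and the example sensor values are assumptions invented for this sketch.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant signal carries no linear information
    return cov / math.sqrt(var_x * var_y)


def rank_features(features, target, k):
    """Return the names of the k features most correlated (in absolute value) with target."""
    ranked = sorted(
        features,
        key=lambda name: abs(pearson(features[name], target)),
        reverse=True,
    )
    return ranked[:k]


# Invented example readings: one informative, one noisy, and one useless sensor.
example_features = {
    "temperature": [1, 2, 3, 4, 5],
    "noise": [3, 1, 4, 1, 5],
    "constant": [2, 2, 2, 2, 2],
}
example_target = [2, 4, 6, 8, 10]
selected = rank_features(example_features, example_target, k=2)
```

Such a filter scores every feature in isolation, so it cannot detect redundancy between sensors. That is one of the limitations of single-criterion selection mentioned above.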

Database query answering in time series

The development of sensor technologies in a wide range of domains (e.g. earth observation, predictive maintenance, health monitoring, etc.) has led to a big data explosion in monitoring activities, which produce very large numbers of time series. In order to process and analyze large volumes of time series efficiently, we can build index structures that help us answer similarity queries quickly on massive collections of time series. Nevertheless, query answering in this domain remains a challenging problem, and a deeper understanding of database query processing and raw data characteristics is needed. The goal of this master thesis project is to explore the reasons that make query answering in time series data challenging and to develop new methods that characterize a given query w.r.t. both intrinsic properties of the data and the distribution of answers in the search space. To this end, together with our collaboration partners from Paris Descartes University, we will develop a measure that characterizes the difficulty of answering a query and encompasses both of these factors at the same time.
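
For readers unfamiliar with similarity queries, the sketch below shows the brute-force baseline that index structures are meant to speed up: a nearest-neighbor search that compares the query against every series in the collection. Euclidean distance is assumed here as the similarity measure, and the function names and example data are invented for illustration; neither is prescribed by the project.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two time series of equal length."""
    if len(a) != len(b):
        raise ValueError("time series must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_series(query, collection):
    """Linear scan: return (index, distance) of the series closest to the query."""
    best_index = min(
        range(len(collection)),
        key=lambda i: euclidean(query, collection[i]),
    )
    return best_index, euclidean(query, collection[best_index])


# Invented toy collection of three short time series.
example_collection = [
    [0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0],
    [5.0, 5.0, 5.0],
]
example_result = nearest_series([1.0, 1.0, 2.0], example_collection)
```

The scan touches every series once, so its cost grows linearly with the size of the collection. An index tries to avoid most of these comparisons, and how well it succeeds depends on the query.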