The goal of representation learning is to transform complex, usually high-dimensional data (e.g., images) into more compact feature representations. These should capture characteristics of the data samples at higher levels of abstraction. However, such representations are usually not easy to interpret. This is especially true in unsupervised settings, where no labels are available.
In a disentangled representation, each of the learned features should independently correspond to a separate factor of variation of the data. Ideally, one would want to uncover the set of ground-truth causal factors. Such representations are believed to be easier to analyse and draw conclusions from, and to be more useful in downstream tasks.
The leading approach in deep learning is to use the Variational Autoencoder (VAE) framework, imposing prior assumptions on the posterior distribution of the model. Recent work has shown that these methods are sensitive to hyperparameter choices and fail to yield consistent results.
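As a minimal sketch of what "imposing prior assumptions on the posterior" looks like in practice, consider the β-VAE objective, a well-known instance of this family: the KL term pulling the approximate posterior toward a standard normal prior is scaled by a factor β > 1, which strengthens the pressure toward independent latent factors. The function names below are illustrative, not taken from any particular library:

```python
import math

def kl_diag_gaussian(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) )
    for a diagonal-Gaussian approximate posterior."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv
        for m, lv in zip(mu, logvar)
    )

def beta_vae_loss(recon_error, mu, logvar, beta=4.0):
    """beta-VAE objective: reconstruction term plus a KL term
    up-weighted by beta; beta=1 recovers the standard VAE ELBO."""
    return recon_error + beta * kl_diag_gaussian(mu, logvar)
```

With mu = 0 and logvar = 0 the posterior matches the prior exactly, so the KL term vanishes and only the reconstruction error remains; increasing β trades reconstruction quality for a more factorized latent code, which is precisely the hyperparameter sensitivity noted above.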
In my research I want to investigate alternative approaches to this problem. The first direction I chose to follow is the use of a different family of models, namely Generative Adversarial Networks (GANs). While they have proven successful in a variety of data generation tasks, their applications in representation learning remain understudied.