Summer Semester 2016

13.04.2016 - Dr. Davide Mottin

The road towards graph exploration

The increasing interest in social networks, protein-interaction networks, and many other types of networks has raised the question of how users can explore such large and complex graph structures in an intuitive way. Nowadays, these networks contain billions of nodes and relationships and are used to study complex phenomena, such as social behaviors, marketing campaigns, and economic factors.

Given this complexity and size, interactive algorithms would naturally assist humans in finding interesting information in graphs. Interactive algorithms and exploratory methods have been studied for more traditional data, such as relational, semi-structured, and textual data, to allow intuitive data exploration.

In this talk, I will present my previous work on graph query reformulation as a first step towards interactive algorithms for graphs. I will also show our recently started projects and collaborations at HPI, as well as my intended agenda for the next months.

20.04.2016 - Hazar Harmouch

Counting in the Big Data Scale

Over the past years, datasets have kept growing rapidly and have become so large and complex that traditional data processing applications are inadequate. These ever-growing datasets are referred to as big data. Revolutionary steps forward from traditional data analysis are needed to address big data's three main components: variety, velocity, and volume. Despite many modern technologies, it is still hard to fetch, store, query, visualize, and integrate big data. Thus, data profiling can be an important preparatory task to reveal the unknown characteristics of big data and point out which data might be useful.

Data profiling methods include single-column and multi-column tasks that examine datasets and produce metadata. The basic form of data profiling is single-column profiling, which analyzes the individual columns of a given table and produces metadata comprising various counts, such as the number of unique values. The number of unique values in a column is one of the most important statistics; the DBMS optimizer uses it to estimate the selectivity of operators and to perform other optimization steps. A fresh look at current cardinality estimation methods is needed to determine their limitations and accuracy levels on big datasets.
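
The abstract does not name a specific estimator. As a rough illustration of the trade-off between exact and approximate counting of unique values, the following Python sketch compares an exact count with a K-Minimum-Values (KMV) estimate. The function names, the SHA-1 based hash, and the parameter k are assumptions made for this example and are not taken from the talk.

```python
import hashlib
import heapq


def _hash01(value):
    """Map a value to a pseudo-random float in [0, 1) using SHA-1 (assumed hash choice)."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def exact_distinct(values):
    """Exact count: memory grows with the number of distinct values."""
    return len(set(values))


def kmv_distinct(values, k=1024):
    """Approximate count: keeps only the k smallest distinct hash values."""
    heap = []        # negated hashes, so heap[0] is the largest hash kept
    kept = set()     # hashes currently in the heap, to skip repeats
    for value in values:
        h = _hash01(value)
        if h in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        # Fewer than k distinct hashes seen: the sketch is exact.
        return len(heap)
    kth_smallest = -heap[0]
    return int(round((k - 1) / kth_smallest))
```

The estimate uses memory bounded by k regardless of column size, at the price of a relative error that shrinks roughly with the square root of k.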

27.04.2016 - Dr. Alaaeddine Yousfi

Towards a Ubiquitous Business Process Management

Ubiquitous computing refers to the third generation of computing, which pervades physical space by taking place anywhere and everywhere. Its main goal is to bridge the gap between virtual systems and the physical environments (i.e., Internet of Things) in which they operate. To date, ubiquitous computing is still understudied in the discipline of business process management. Throughout the talk, we will share some of the recent advances that aim to upgrade the business process management life-cycle to ubiquity. We will focus primarily on the phases of discovery, modeling, and improvement.

04.05.2016 - Mina Razaei

Brain Abnormality Detection by Convolutional Neural Network

During the past years, deep learning has attracted considerable attention by showing promising results in state-of-the-art approaches to tasks such as speech recognition, handwritten character recognition, image classification, detection, and segmentation. Deep learning is expected to improve or create medical image analysis applications, such as computer-aided diagnosis, image registration and multimodal image analysis, and image segmentation and retrieval. Some medical applications already use deep learning, for example for cell tracking and organ cancer detection. Doctors use medical imaging to diagnose diseases. Medical applications and diagnosis tools make this faster and more accurate.

My goal is an advanced application based on deep learning methods for diagnosis, detection, and segmentation in brain magnetic resonance imaging (MRI). In my first talk at HPI and the research school, I will show my earlier results on brain MRI classification and lesion detection.

11.05.2016 - Stefan Lehmann

Reactive Object Queries - Consistent Views in Object-Oriented Languages

Maintaining consistency between data throughout a system using scattered, imperative code fragments is challenging. Some mechanisms address this challenge by making data dependencies explicit. Among these mechanisms are reactive collections, which define data dependencies for collections of objects, and object queries, which allow developers to query their program for a subset of objects.

However, on their own, both of these mechanisms are limited. Reactive collections require an initial collection to apply reactive operations to, and object queries do not update their results as the system changes. Using these two mechanisms in conjunction allows each to mitigate the disadvantage of the other. To do so, object queries need to respond to state changes of the system.

We propose a combination of both mechanisms, called reactive object queries. Reactive object queries allow the developer to declaratively select all objects in a program that match a particular predicate, creating a view. Those views can further be composed using reactive operations. All views automatically update when the underlying program state changes.
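
To make the idea concrete, here is a minimal Python sketch of a predicate-based view that updates when object state changes. It is an illustration only and not the implementation presented in the talk; the names Tracked, View, select, and filter are hypothetical, and change detection is approximated by intercepting attribute assignment.

```python
class Tracked:
    """Base class whose instances can be queried (hypothetical helper)."""
    _instances = []
    _views = []

    def __init__(self, **fields):
        Tracked._instances.append(self)
        for name, value in fields.items():
            setattr(self, name, value)

    def __setattr__(self, name, value):
        object.__setattr__(self, name, value)
        # Every state change re-evaluates this object against all views.
        for view in Tracked._views:
            view._refresh(self)


class View:
    """Live collection of all tracked objects matching a predicate."""

    def __init__(self, predicate):
        self.predicate = predicate
        self.items = []
        Tracked._views.append(self)
        for obj in Tracked._instances:
            self._refresh(obj)

    def _refresh(self, obj):
        try:
            matches = bool(self.predicate(obj))
        except AttributeError:
            # Object is only partially initialized; treat as non-matching.
            matches = False
        present = any(item is obj for item in self.items)
        if matches and not present:
            self.items.append(obj)
        elif present and not matches:
            self.items = [item for item in self.items if item is not obj]

    def filter(self, extra):
        """Compose a narrower view that is itself kept up to date."""
        return View(lambda obj: self.predicate(obj) and extra(obj))


def select(predicate):
    """Declaratively select all tracked objects matching the predicate."""
    return View(predicate)
```

With these definitions, a view created by select(lambda p: p.age >= 18) gains or loses an object as soon as its age attribute is reassigned, without the caller re-running the query.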

18.05.2016 - Ankit Chauhan

Deterministic model for Real-World Networks

It has been experimentally observed that the majority of real-world networks are scale-free and follow an almost power-law degree distribution. In my seminar, I will talk about deterministic conditions for checking whether a graph has a power-law degree distribution. Also, using these deterministic conditions, I will give a theoretical explanation of why some algorithms are more efficient on real-world networks.

25.05.2016 - Jan Renz (external)

120 courses in 3 years. Pleasures and Pains of running MOOCs.

Invited speaker from OpenHPI

The hype is over. However, MOOCs have become an important building block of the academic offerings of the HPI. Including the SAP-focused enterprise MOOC offering openSAP, the openHPI team has hosted more than 120 courses within the last three years. This talk will give insights into some of the challenges and lessons learned from the perspectives of both course authors and platform developers. In the first half of the talk, we are going to introduce the underlying platform and recap some of the MOOC basics and concepts. In the second half, we would like to move towards an open discussion about MOOCs.

01.06.2016 - Interactive Session

08.06.2016 - Panel Discussion

15.06.2016 - Konstantina Lazaridou

Identifying Political Bias in News Articles

The political leaning of individuals, such as journalists and politicians, often shapes public opinion on several issues. In the case of online journalism, due to the numerous ongoing events, newspapers have to choose which stories to cover, which to emphasize, and possibly which to express their opinion about. These choices depict their profile and could reveal a potential bias towards a certain perspective or political position. Likewise, politicians' choice of language and the issues they broach are an indication of their beliefs and political orientation. Given the amount of user-generated text content online, such as news articles, blog posts, and politicians' statements, automatically analyzing this information becomes increasingly interesting in order to understand what people stand for and how they influence the general public. In this PhD thesis, we analyze UK news corpora along with parliament speeches in order to identify potential political media bias. We currently examine mentions of politicians and their quotes in news articles, and how this referencing pattern of the newspapers evolves over time.

22.06.2016 - Luise Pufahl

Batch processing in business processes

Business process automation improves the efficiency with which organizations perform work. The individual executions of process models, called process instances, are usually executed independently in business process management systems (BPMS). In practice, we can observe examples in which the synchronized execution of groups of instances for certain activities, called batch processing, can lead to improved process performance. For example, online retailers can pack and ship several orders of the same customer together to save shipment costs. We developed the batch region concept to enable batch processing in business processes over a set of activities by grouping the instances based on their data characteristics. In this talk, batch regions are presented and their implementation in an open-source engine is shown.
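
The grouping step described above can be sketched in a few lines of Python. This is a simplified illustration and not the batch region implementation from the talk; the function name group_into_batches, the dictionary-based process instances, and the max_batch_size parameter are assumptions made for the example.

```python
from collections import defaultdict


def group_into_batches(instances, key, max_batch_size):
    """Group process instances by a data characteristic and cut each
    group into batches of at most max_batch_size instances."""
    groups = defaultdict(list)
    for instance in instances:
        groups[key(instance)].append(instance)

    batches = []
    for group_key, members in groups.items():
        for start in range(0, len(members), max_batch_size):
            batches.append((group_key, members[start:start + max_batch_size]))
    return batches


def ship_batches(batches):
    """Execute the 'pack and ship' activity once per batch (illustrative)."""
    return [
        f"one parcel for {customer}: orders {[o['id'] for o in orders]}"
        for customer, orders in batches
    ]
```

For the online retailer example, calling group_into_batches with key=lambda o: o["customer"] places orders of the same customer in the same batch, so the shipping activity runs once per batch instead of once per order.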

29.06.2016 - Thomas Beyhl (external)

A Framework for Incremental View Maintenance of Deductive Graph Databases

Nowadays, graph databases are employed when relationships between entities are in the scope of queries. These queries employ graph pattern matching, which is NP-complete in the case of subgraph isomorphism. Thus, the response time of queries can become large when the queries become complex or the number of entities and relationships stored in the graph database increases.

One possibility to increase the throughput of graph databases is to employ views on the stored graphs. These views store answers for graph queries. However, when the graphs change, these views must be updated as well.

In my talk, I describe how views on the graphs can be maintained to keep them up-to-date. I describe a modeling language that allows these views to be defined. Furthermore, I present an incremental maintenance algorithm for the views that can outperform existing approaches for incremental graph pattern matching in time and space.
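
As a toy illustration of incremental view maintenance (not the framework or algorithm presented in the talk), the following Python sketch maintains a view over one fixed pattern, all directed two-hop paths a -> b -> c. The class name TwoHopView and the function full_recompute are hypothetical. On each edge change, only matches that involve the changed edge are touched.

```python
from collections import defaultdict


def full_recompute(edges):
    """Baseline: rebuild all two-hop paths a -> b -> c from scratch."""
    return {(a, b, c) for (a, b) in edges for (b2, c) in edges if b == b2}


class TwoHopView:
    """Materialized view of two-hop paths, updated per edge change."""

    def __init__(self):
        self.succ = defaultdict(set)   # node -> successors
        self.pred = defaultdict(set)   # node -> predecessors
        self.matches = set()           # stored query answers

    def add_edge(self, u, v):
        if v in self.succ[u]:
            return
        self.succ[u].add(v)
        self.pred[v].add(u)
        # New edge used as first hop: u -> v -> w
        for w in self.succ[v]:
            self.matches.add((u, v, w))
        # New edge used as second hop: t -> u -> v
        for t in self.pred[u]:
            self.matches.add((t, u, v))

    def remove_edge(self, u, v):
        if v not in self.succ[u]:
            return
        # Drop exactly the matches that used the removed edge.
        for w in self.succ[v]:
            self.matches.discard((u, v, w))
        for t in self.pred[u]:
            self.matches.discard((t, u, v))
        self.succ[u].discard(v)
        self.pred[v].discard(u)
```

Each update costs time proportional to the neighborhood of the changed edge, whereas full_recompute scans all pairs of edges.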

06.07.2016 - Arian Treffer

Omniscient Debugging in Database Applications

Omniscient debuggers can greatly improve developer productivity. Not only do they allow for more efficient navigation in the execution of a program, they can be used as a foundation for dynamic analyses that further help the developer to identify relevant parts of code.

We present an approach for bringing omniscient debugging and advanced analysis algorithms to stored procedures. Our prototype allows omniscient debugging of SQLScript that handles large amounts of data, while creating only a small overhead by using an insert-only approach. Furthermore, we present an extension to SQL that allows the developer to express queries that cover a period of execution time.

13.07.2016 - Toni Mattis

Improving Development Tools using Semantic Code Models

Maintaining a large code base requires a thorough understanding of the high-level concepts which span multiple modules. Poor understanding usually leads to redundancy and faulty usage of existing facilities. Acquiring a better mental model requires a lot of effort, such as searching documentation and code for explanations or examples demonstrating the concepts behind the code.

Tools can identify the abstract concepts underlying the code base and provide programmers with context that helps them form the mental model behind a certain code passage. This context can consist of code using a certain concept, abstractions used by this concept, live example data, and related documentation.

In this talk, we explore latent semantic models, which map code to a concept structure that allows rapid navigation in the concept space. In contrast to the closely related models from natural language processing, we capture the recursive nature of abstractions and take advantage of repository meta-data, such as history.

20.07.2016 - Ahmad Samiei

Distributed duplicate detection

Duplicate detection is a time-intensive process that is periodically executed on a database. The sheer amount of streaming data, generated as a result of the widespread internet, sensors, etc., that is added to an already de-duplicated database quickly makes the database unreliable and unusable, and therefore imposes extra costs on industry. Incremental record de-duplication attempts to address this problem and keeps databases with many transactions up-to-date and clean at all times. That is, duplicate detection must happen on the fly, as the data arrives and enters the database.

The prevalence of distributed platforms for data processing has made it very attractive to investigate and utilize these platforms for efficient parallelization of such computationally intensive jobs. There is already some work focusing on batch deduplication approaches, mainly on Apache Hadoop, a MapReduce-based platform. In this work, we investigate Apache Flink, a new open-source framework for distributed data processing, for our task of incremental deduplication. This platform provides a new API for stream data processing, which promises to be suitable. We attempt to utilize this new feature and devise an incremental algorithm that efficiently parallelizes the task.
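
The following single-process Python sketch illustrates what on-the-fly duplicate detection can look like: each arriving record is compared only against earlier records in the same block. It does not use Apache Flink and is not the algorithm from the talk; the class name, the first-character blocking key, the string similarity measure, and the threshold are all assumptions made for this example.

```python
from collections import defaultdict
from difflib import SequenceMatcher


class IncrementalDeduplicator:
    """Checks each incoming record against previously seen records."""

    def __init__(self, threshold=0.85):
        self.threshold = threshold        # assumed similarity cut-off
        self.blocks = defaultdict(list)   # blocking key -> kept records

    @staticmethod
    def _normalize(record):
        return record.strip().lower()

    def _blocking_key(self, normalized):
        # Assumed blocking scheme: records sharing a first character
        # are the only candidates compared with each other.
        return normalized[:1]

    def add(self, record):
        """Return the matching earlier record, or None if the record is new."""
        normalized = self._normalize(record)
        block = self.blocks[self._blocking_key(normalized)]
        for existing in block:
            similarity = SequenceMatcher(None, normalized, existing).ratio()
            if similarity >= self.threshold:
                return existing
        block.append(normalized)
        return None
```

Because blocks are independent of each other, partitioning the stream by blocking key is one plausible way to spread such a computation across parallel workers.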