Smart Representations for Big Data Analytics (Summer Semester 2017)

Lecturers: Prof. Dr. Emmanuel Müller (Knowledge Discovery and Data Mining), Dr. Davide Mottin (Knowledge Discovery and Data Mining), Fabian Geier (Knowledge Discovery and Data Mining), Erik Scharwächter (Knowledge Discovery and Data Mining), Anton Tsitsulin (Knowledge Discovery and Data Mining)
Course website: https://hpi.de/mueller/lehre/aktuelle-vorlesung/ss-17/smart-representations-for-big-data-analytics.html


Smart representations (such as embeddings, graphical models, and discretizations) are models that abstract data within a well-defined mathematical formalism. The representations we aim at are conceptual abstractions of real-world phenomena (such as sensor readings, causal dependencies, and social interactions) into the world of statistics and discrete mathematics, so that the powerful tools developed in those areas become available for complex analyses in a simple and elegant manner.

Usually, data is transformed explicitly or implicitly from a raw data representation (as it was measured or collected) into a smart data representation (more useful for data analysis). One goal of such smart representations, e.g. those with a higher level of abstraction, is to enable the application of data mining techniques and theory developed in different areas. In many cases, smart data representations also reduce the original data mining problem to a more tractable or more compact problem formulation that can be solved by an algorithm with better properties (e.g. lower worst-case complexity, scalability to larger data sizes, or more robustness to data artefacts).

In this seminar we will focus on three smart data representations with the aim of understanding the analytical properties of different data mining tasks:

  • graph embeddings in low/high dimensional vector spaces [1][2][3]

  • representations of probabilistic models for causal inference [4][5][6]

  • data reduction techniques for anomaly detection in time series [7][8]

The main focus in each of these three areas will be on understanding and comparing smart representations and their explicit or implicit data transformation methods. By transforming the data, we will study the limitations and advantages of each technique and how the data representation changes the problem setup, reduces complexity, introduces robustness, or provides other properties valuable for big data analytics.
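To make the idea of a data transformation concrete, the following is a minimal Python sketch in the spirit of the symbolic time series representation described in [8]: the series is z-normalized, averaged over equal-length segments, and each segment mean is mapped to a letter. The function names (`znormalize`, `paa`, `sax`) and parameters are illustrative assumptions and not seminar material; the sketch also assumes that the series length is divisible by the number of segments.

```python
import bisect
import statistics


def znormalize(series):
    """Rescale a series to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)
    if std == 0:
        # A constant series carries no shape information; map it to all zeros.
        return [0.0 for _ in series]
    return [(x - mean) / std for x in series]


def paa(series, segments):
    """Piecewise aggregate approximation: the mean of each equal-length segment."""
    n = len(series)
    if n % segments != 0:
        raise ValueError("this sketch assumes len(series) is divisible by segments")
    size = n // segments
    return [statistics.fmean(series[i:i + size]) for i in range(0, n, size)]


def sax(series, segments, alphabet_size=3):
    """Map a numeric series to a short string over the letters a, b, c, ..."""
    # Breakpoints split the standard normal distribution into
    # alphabet_size regions of equal probability.
    normal = statistics.NormalDist()
    cuts = [normal.inv_cdf(i / alphabet_size) for i in range(1, alphabet_size)]
    reduced = paa(znormalize(series), segments)
    # The symbol index is the number of breakpoints at or below the value.
    return "".join(chr(ord("a") + bisect.bisect_right(cuts, v)) for v in reduced)


# Twelve raw measurements are reduced to a three-letter word.
print(sax([1, 1, 1, 1, 5, 5, 5, 5, 9, 9, 9, 9], segments=3, alphabet_size=3))  # prints "abc"
```

The transformed series is much shorter than the raw one and lives in a discrete alphabet, which is the kind of change in problem setup the seminar sets out to study.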


Participants should be familiar with data mining principles, basic notions of statistics (random variables, distributions, expected values, …), and linear algebra (matrices, vectors, inversion, diagonalization, eigenvalues and eigenvectors, decompositions).

These requirements are not mandatory but highly beneficial; students should at least show interest in these disciplines.

We highly recommend enrolling in this seminar if you are interested in writing a thesis at the Knowledge Discovery and Data Mining chair, since it provides the basic methodology and tools we usually require of students.


[1] Perozzi, Bryan, Rami Al-Rfou, and Steven Skiena. "DeepWalk: Online learning of social representations." Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2014.

[2] Cao, Shaosheng, Wei Lu, and Qiongkai Xu. "Deep neural networks for learning graph representations." Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence. AAAI Press, 2016.

[3] Tang, Jian, et al. "LINE: Large-scale information network embedding." Proceedings of the 24th International Conference on World Wide Web. ACM, 2015.

[4] Schölkopf et al. "On causal and anticausal learning." Proceedings of the 29th International Conference on Machine Learning, 2012.

[5] Bishop, C. M. Chapter 8 on Graphical Models in "Pattern Recognition and Machine Learning." Springer, 2006.

[6] Nazerfard, E., and D. Cook. "Using Bayesian networks for daily activity prediction." AAAI Workshop on Plan, Activity, and Intent Recognition, 2013.

[7] Smith and Goulding. "A novel symbolization technique for time-series outlier detection." Proceedings of the IEEE International Conference on Big Data, 2015.

[8] Lin, Keogh, et al. "A symbolic representation of time series, with implications for streaming algorithms." Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD), 2003.

Teaching and Learning Methods

The course will provide tools for problem solving and big data analytics, as well as an introduction to scientific writing.


The seminar participants will be divided into three groups, each responsible for a specific data representation. The work tasks for each group are (1) a set of prototype implementations, (2) a formal comparison of representations, and (3) empirical experiments on synthetic and real-world data. The overall goal of the seminar is to introduce participants to state-of-the-art research challenges and scientific writing. Under our supervision, each group aims to write and submit an experimental paper to a major database or data mining conference.

General Information

  • Weekly hours per semester: 4
  • ECTS: 6
  • Graded: Yes
  • Enrollment deadline: 28.04.2017
  • Program: IT-Systems Engineering MA
  • Teaching format: S (seminar)
  • Enrollment type: Compulsory elective
  • Maximum number of participants: 12


  • BPET-Concepts and Methods
  • BPET-Specialization
  • BPET-Techniques and Tools
  • OSIS-Concepts and Methods
  • OSIS-Specialization
  • OSIS-Techniques and Tools