# Description

Smart representations (such as embeddings, graphical models, and discretizations) are models that abstract data within a well-defined mathematical formalism. The representations we aim at are conceptual abstractions of real-world phenomena (such as sensor readings, causal dependencies, and social interactions) into the world of statistics and discrete mathematics, so that the powerful tools developed in those areas become available for complex analyses in a simple and elegant manner.

Data is usually transformed, explicitly or implicitly, from a raw representation (as it was measured or collected) into a smart representation that is more useful for analysis. One goal of such smart representations, e.g. those with a higher level of abstraction, is to enable the application of data mining techniques and theory developed in different areas. In many cases, smart data representations also reduce the original data mining problem to a more tractable or more compact formulation that can be solved by an algorithm with better properties (e.g. lower worst-case complexity, scalability to larger data sizes, or more robustness to data artifacts).
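As a minimal sketch of such a transformation, the following Python snippet turns a raw numeric series into a short symbolic string by averaging fixed-length segments and then thresholding the averages. The function names `paa` and `discretize`, the breakpoints, and the sample data are all hypothetical and chosen only for illustration; they are not taken from the seminar literature.

```python
from statistics import mean


def paa(series, n_segments):
    """Piecewise averaging: replace each equal-length segment by its mean."""
    if not series or n_segments <= 0 or len(series) % n_segments != 0:
        raise ValueError("series length must be a positive multiple of n_segments")
    width = len(series) // n_segments
    return [mean(series[i:i + width]) for i in range(0, len(series), width)]


def discretize(values, breakpoints, alphabet="abc"):
    """Map each value to a symbol; the number of breakpoints it reaches picks the letter."""
    if len(alphabet) != len(breakpoints) + 1:
        raise ValueError("alphabet needs exactly one more symbol than breakpoints")
    return "".join(alphabet[sum(v >= b for b in breakpoints)] for v in values)


# Hypothetical raw sensor readings (assumed data, three visible regimes).
raw = [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.8, 5.0, 9.9, 10.1, 10.0, 10.0]

compact = paa(raw, n_segments=3)                       # 12 values -> 3 segment means
symbols = discretize(compact, breakpoints=[3.0, 7.0])  # 3 means -> 3 symbols
print(compact, symbols)
```

The twelve raw values shrink to three segment means and then to a three-letter string, on which string-based tools (counting, matching, indexing) can operate. The price is lost detail inside each segment, which is the kind of trade-off the seminar examines.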

In this seminar we will focus on four smart data representations with the aim of understanding the analytical properties of different data mining tasks:

- representations and similarities of graphs for classification [1][2][3]
- representations of natural time series baselines for outlier interpretation [4][5]
- representing complexity in large time series collections [6][7]
- representations of missing values in incomplete datasets [8][9][10]

The main focus in each of these four areas will be understanding and comparing smart representations and their explicit or implicit data transformation methods. By transforming the data, we will study the limitations and advantages of each technique and how the data representation changes the problem setup, reduces complexity, introduces robustness, or provides other valuable properties for big data analytics.

## Objectives

- Introduction to the concepts of smart data representations
- Understanding the limitations and advantages of state-of-the-art techniques
- Implementing techniques in research prototypes
- Designing experiments to demonstrate the effectiveness of each technique on a set of traditional tasks in which the representation is used
- Running the experiments on real and synthetic datasets
- Writing and submitting a scientific publication (more information below)
- Presenting scientific results during the seminar and potentially at international conferences

# Literature

[1] Verma, Saurabh, and Zhi-Li Zhang. "Hunt For The Unique, Stable, Sparse And Fast Feature Learning On Graphs." NIPS, 2017.

[2] Tsitsulin, Anton, Davide Mottin, Panagiotis Karras, and Emmanuel Müller. "VERSE: Versatile Graph Embeddings from Similarity Measures." WWW, 2018.

[3] Yanardag, Pinar, and S. V. N. Vishwanathan. "Deep graph kernels." KDD, 2015.

[4] Sundararajan et al.: "Axiomatic Attribution for Deep Networks." ICML, 2017.

[5] Ribeiro et al.: "Why Should I Trust You? Explaining the Predictions of Any Classifier." KDD, 2016.

[6] M. Wiedermann, A. Radebach, J. Donges, J. Kurths, and R. Donner: "A climate network-based index to discriminate different types of El Niño and La Niña." Geophysical Research Letters, 2016.

[7] S. Papadimitriou, J. Sun, and C. Faloutsos: "Streaming Pattern Discovery in Multiple Time-Series." VLDB, 2005.

[8] L. Farhangfar et al.: "Experimental analysis of methods for imputation of missing values in databases." SPIE 5421, April 2004.

[9] Silva, Luciana O., and Luis E. Zárate. "A brief review of the main approaches for treatment of missing data." Intelligent Data Analysis, 2014.

[10] Gondara, Lovedeep, and Ke Wang. "MIDA: Multiple Imputation using Denoising Autoencoders." PAKDD, 2018.