# Description

Smart representations (such as embeddings, graphical models, and discretizations) are models that abstract data within a well-defined mathematical formalism. The representations we aim at are conceptual abstractions of real-world phenomena (such as sensor readings, causal dependencies, and social interactions) into the world of statistics and discrete mathematics, so that the powerful tools developed in those areas become available for complex analyses in a simple and elegant manner.

Usually, data is transformed explicitly or implicitly from its raw representation (as it was measured or collected) into a smart representation that is more useful for data analysis. One goal of such smart representations, for example those with a higher level of abstraction, is to enable the application of data mining techniques and theory developed in different areas. In many cases, smart data representations also reduce the original data mining problem to a more tractable or more compact formulation that can be solved by a better-behaved algorithm (e.g. one with lower worst-case complexity, better scalability to large data sizes, or more robustness to data artefacts).
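
As one illustration of such a reduction, the symbolic representation of Lin, Keogh, et al. [8] (commonly called SAX) turns a numeric time series into a short string over a small alphabet. The following is a minimal sketch of that idea, not the reference implementation: the function names (`znormalize`, `paa`, `sax`), the alphabet size of 4, and the requirement that the series length be a multiple of the segment count are simplifying assumptions made here.

```python
from bisect import bisect_right
from statistics import NormalDist, fmean, pstdev

ALPHABET = "abcd"  # assumed alphabet size of 4
# Breakpoints splitting the standard normal into equiprobable regions.
BREAKPOINTS = [NormalDist().inv_cdf(k / len(ALPHABET))
               for k in range(1, len(ALPHABET))]


def znormalize(series):
    """Rescale the series to zero mean and unit standard deviation."""
    mean, std = fmean(series), pstdev(series)
    if std == 0:
        return [0.0 for _ in series]  # constant series: no shape to encode
    return [(x - mean) / std for x in series]


def paa(series, n_segments):
    """Piecewise aggregate approximation: one mean per equal-length segment."""
    if len(series) % n_segments != 0:
        raise ValueError("series length must be a multiple of n_segments")
    width = len(series) // n_segments
    return [fmean(series[i:i + width]) for i in range(0, len(series), width)]


def sax(series, n_segments):
    """Map a numeric series to a string of n_segments symbols."""
    reduced = paa(znormalize(series), n_segments)
    return "".join(ALPHABET[bisect_right(BREAKPOINTS, v)] for v in reduced)


print(sax([0, 0, 0, 0, 10, 10, 10, 10], 2))  # prints "ad"
```

Eight raw measurements shrink to two symbols here, and string-based techniques (hashing, edit distance, frequency counting) become applicable to what was originally numeric data.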

In this seminar we will focus on three smart data representations with the aim of understanding the analytical properties of different data mining tasks:

- graph embeddings in low/high dimensional vector spaces [1][2][3]
- representations of probabilistic models for causal inference [4][5][6]
- data reduction techniques for anomaly detection in time series [7][8]

The main focus in each of these three areas will be the understanding and comparison of smart representations and their explicit/implicit data transformation methods. By transforming the data, we will study the limitations and advantages of each technique and how the data representation changes the problem setup, reduces complexity, introduces robustness, or provides other properties valuable for big data analytics.
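
A change of problem setup can be sketched with the first step of DeepWalk [1]: truncated random walks turn a graph into a corpus of node sequences, so that sequence models from language modelling can then learn the embedding. The sketch below covers only this walk-generation step and omits the embedding training. The function name `random_walks`, its parameters, and the adjacency-dictionary input format are assumptions chosen for illustration.

```python
import random


def random_walks(adjacency, walks_per_node, walk_length, seed=0):
    """Generate truncated random walks from every node of a graph.

    adjacency maps each node to a list of its neighbours (assumed format).
    """
    rng = random.Random(seed)
    nodes = list(adjacency)
    walks = []
    for _ in range(walks_per_node):
        rng.shuffle(nodes)  # visit start nodes in a fresh random order
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbours = adjacency[walk[-1]]
                if not neighbours:
                    break  # dead end: keep the shorter walk
                walk.append(rng.choice(neighbours))
            walks.append(walk)
    return walks


# Hypothetical toy graph used only to show the call.
toy_graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
for walk in random_walks(toy_graph, walks_per_node=2, walk_length=4):
    print(" ".join(walk))
```

Each printed line plays the role of a sentence and each node the role of a word, which is what makes word-embedding techniques applicable to graph data.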

## Objectives

- Introduction to the concepts of smart data representations
- Understanding the limitations and advantages of state-of-the-art techniques
- Implementation of techniques in research prototypes
- Design of experiments to evaluate the quality of each technique on a set of traditional tasks where the representation is used
- Running the experiments on real and synthetic datasets
- Writing and submitting a scientific publication (more information below)
- Presentation of scientific results during the seminar and potentially at international conferences

A very good example is the VLDB call for research papers and for experiments and analysis papers:

http://dexl.lncc.br/vldb/call-for-research-track.html

# Literature

[1] Perozzi, Bryan, Rami Al-Rfou, and Steven Skiena. "DeepWalk: Online learning of social representations." Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2014.

[2] Cao, Shaosheng, Wei Lu, and Qiongkai Xu. "Deep neural networks for learning graph representations." Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence. AAAI Press, 2016.

[3] Tang, Jian, et al. "LINE: Large-scale information network embedding." Proceedings of the 24th International Conference on World Wide Web. ACM, 2015.

[4] Schölkopf et al. "On causal and anticausal learning." Proceedings of the 29th International Conference on Machine Learning, 2012.

[5] Bishop, C. M. Chapter 8, "Graphical Models," in Pattern Recognition and Machine Learning. Springer, 2006.

[6] Nazerfard, E., and D. Cook. "Using Bayesian networks for daily activity prediction." AAAI Workshop on Plan, Activity and Intent Recognition, 2013.

[7] Smith and Goulding. "A novel symbolization technique for time-series outlier detection." Proceedings of the IEEE International Conference on Big Data, 2015.

[8] Lin, Keogh, et al. "A symbolic representation of time series, with implications for streaming algorithms." Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD), 2003.