Predictions based on Graphs and Machine Learning (Winter Semester 2017/2018)

Lecturers: Prof. Dr. Holger Giese (Systemanalyse und Modellierung), Christian Medeiros Adriano (Systemanalyse und Modellierung), Thomas Brand (Systemanalyse und Modellierung)

General Information

Weekly hours per semester: 4
ECTS: 6
Graded: Yes
Enrollment deadline: 27.10.2017
Course format: Project seminar
Enrollment type: Compulsory elective module

Degree Programs, Module Groups & Modules

IT-Systems Engineering MA

IT-Systems Engineering
- HPI-ITSE-A Analysis
IT-Systems Engineering
- HPI-ITSE-E Design
IT-Systems Engineering
- HPI-ITSE-K Construction
IT-Systems Engineering
- HPI-ITSE-M Maintenance
BPET: Business Process & Enterprise Technologies
- HPI-BPET-K Concepts and Methods
BPET: Business Process & Enterprise Technologies
- HPI-BPET-T Techniques and Tools
OSIS: Operating Systems & Information Systems Technology
- HPI-OSIS-K Concepts and Methods
OSIS: Operating Systems & Information Systems Technology
- HPI-OSIS-T Techniques and Tools
SAMT: Software Architecture & Modeling Technology
- HPI-SAMT-K Concepts and Methods
SAMT: Software Architecture & Modeling Technology
- HPI-SAMT-T Techniques and Tools
SAMT: Software Architecture & Modeling Technology
- HPI-SAMT-S Specialization

Description

Predictions based on Graphs and Machine Learning

In recent decades, the world has experienced exponential growth in the size and complexity of highly connected data [1]. Genome mapping, social networks, and various recommender systems now generate large volumes of data represented as complex graphs. Companies and governments query these graphs to predict the spread of diseases (e.g., the risk of a new Ebola outbreak), the flow of information (e.g., the impact of fake news on elections), and consumer preferences (e.g., in video streaming).

Predicting new phenomena from these graphs is becoming increasingly difficult, both methodologically and technically. On the methodological side, learning from highly connected data has depended on expert knowledge captured in heuristics [1][2][3], such as querying rules and network metrics. Such heuristics generalize poorly across domains, in contrast to machine learning techniques [4][5]. On the technological side, powerful tools for storing and querying graphs have only recently become widely available [6][7].
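As an illustration of what a network-metric heuristic can look like, the following minimal sketch ranks unconnected node pairs by their number of common neighbors, a simple score often used for link prediction. The toy graph and the function names `common_neighbors` and `rank_missing_links` are hypothetical and are not part of the course material.

```python
from itertools import combinations

# Toy undirected graph as an adjacency map (hypothetical data).
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}


def common_neighbors(graph, u, v):
    """Return the number of neighbors shared by nodes u and v."""
    return len(graph[u] & graph[v])


def rank_missing_links(graph):
    """Score every non-adjacent node pair, highest score first."""
    candidates = [
        (u, v) for u, v in combinations(sorted(graph), 2) if v not in graph[u]
    ]
    scored = [(common_neighbors(graph, u, v), u, v) for u, v in candidates]
    # Sort by descending score, then alphabetically for a stable order.
    return sorted(scored, key=lambda item: (-item[0], item[1], item[2]))


ranking = rank_missing_links(graph)
for score, u, v in ranking:
    print(f"{u}-{v}: {score} common neighbor(s)")
```

The score is hand-crafted: it encodes the expert assumption that shared neighbors indicate a likely link, which may hold in a social network but not necessarily in other domains.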

Besides appropriate technology and methods, the success of predictions depends on the quality and quantity of the data. Poor data quality (e.g., outliers, redundant data) and sparse data (e.g., data captured at low frequency) hinder the training and testing of prediction models. Take, for instance, a sparse dataset with redundant items. Redundancy makes input data correlate, which can produce a model that fits the training data too closely, i.e., one that performs well during training but underperforms during testing (overfitting). Sparse data, in turn, can leave a model with too little information to capture the underlying pattern (underfitting) and can lead to high variance in the prediction outcome. Therefore, selecting a prediction model that performs well with the available data is not a straightforward decision. Several techniques exist and can be applied out-of-the-box [4][5], but they need to be carefully tuned for the problem at hand [8][9].
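The gap between training and testing performance can be made concrete with a small experiment. The sketch below is only an illustration under stated assumptions: the target function y = 2x, the noise level, and both models (a memorizing 1-nearest-neighbor predictor and a constant mean predictor) are hypothetical choices, not methods prescribed by the seminar.

```python
import random


def make_data(n, rng):
    """Draw n noisy samples of the hypothetical target y = 2x."""
    data = []
    for _ in range(n):
        x = rng.uniform(0.0, 1.0)
        data.append((x, 2.0 * x + rng.gauss(0.0, 0.3)))
    return data


def predict_1nn(train, x):
    """Memorizing model: return the label of the closest training point."""
    return min(train, key=lambda point: abs(point[0] - x))[1]


def predict_mean(train, x):
    """Very simple model: ignore x and return the mean training label."""
    return sum(y for _, y in train) / len(train)


def mse(model, train, data):
    """Mean squared error of a model (fitted on train) evaluated on data."""
    return sum((model(train, x) - y) ** 2 for x, y in data) / len(data)


rng = random.Random(0)  # fixed seed so the run is repeatable
train = make_data(30, rng)
test = make_data(30, rng)

train_mse = mse(predict_1nn, train, train)
test_mse = mse(predict_1nn, train, test)
mean_train_mse = mse(predict_mean, train, train)
mean_test_mse = mse(predict_mean, train, test)

print("1-NN  train/test MSE:", train_mse, test_mse)
print("mean  train/test MSE:", mean_train_mse, mean_test_mse)
```

The memorizing model reproduces every training label exactly, so its training error is zero while its test error is not, which is the overfitting pattern. The constant predictor cannot follow the trend at all and already shows a noticeable error on the training data, which is the underfitting pattern.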

Nonetheless, there has so far been limited work on learning from highly connected data with out-of-the-box machine learning approaches. The challenges range from querying the graph database for relevant training data to mapping this data onto the inputs that different machine learning techniques expect.
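One possible way to map graph data to a machine learning input is to turn node pairs into fixed-length feature vectors with a label. The sketch below assumes that a graph query has already returned a plain edge list; the edge list, the chosen features, and the function names are hypothetical and serve only to illustrate the mapping step.

```python
from itertools import combinations

# Hypothetical result of a graph query: a list of undirected edges.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]


def build_adjacency(edges):
    """Convert an edge list into a node -> set-of-neighbors map."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def pair_features(adjacency, u, v):
    """Feature vector for a node pair: both degrees and shared neighbors."""
    return [
        len(adjacency[u]),
        len(adjacency[v]),
        len(adjacency[u] & adjacency[v]),
    ]


def to_training_rows(edges):
    """Build one (features, label) row per node pair; label 1 means an edge exists."""
    adjacency = build_adjacency(edges)
    rows = []
    for u, v in combinations(sorted(adjacency), 2):
        label = 1 if v in adjacency[u] else 0
        rows.append((pair_features(adjacency, u, v), label))
    return rows


rows = to_training_rows(edges)
for features, label in rows:
    print(features, label)
```

Rows in this tabular form can be passed to most out-of-the-box classifiers. Note that the degrees here are computed on the full graph, including the edge being labeled; a real experiment would hide held-out edges before computing features to avoid leaking the label.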

In this project seminar, we want to address this research challenge. We will jointly study different machine learning methods and graph data models through hands-on experiments on real data.

Literature

[1] Easley, D., & Kleinberg, J. (2010). Networks, crowds, and markets: Reasoning about a highly connected world. London: Cambridge University Press.

[2] Blondel, V. D., Guillaume, J. L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10), P10008.

[3] Gigerenzer, G., & Brighton, H. (2009). Homo heuristicus: Why biased minds make better inferences. Topics in cognitive science, 1(1), pp. 107-143.

[4] Russell, S., & Norvig, P. (2016). Artificial Intelligence: A Modern Approach. Harlow: Pearson.

[5] Witten, I. H., Frank, E., Hall, M. A., & Pal, C. J. (2016). Data Mining: Practical machine learning tools and techniques. Cambridge: Morgan Kaufmann.

[6] Angles, R., & Gutierrez, C. (2008). Survey of graph database models. ACM Computing Surveys, 40(1), 39 pages.

[7] Sahu, S., Mhedhbi, A., Salihoglu, S., Lin, J., & Özsu, M. T. (2017). The Ubiquity of Large Graphs and Surprising Challenges of Graph Processing: A User Survey. arXiv preprint arXiv:1709.03188.

[8] Kuhn, M., & Johnson, K. (2013). Applied predictive modeling. New York: Springer.

[9] James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning. New York: Springer.

Learning and Teaching Methods

The course is a project seminar. It begins with an introductory phase consisting of a short lecture that presents the basic concepts of the topic, including the necessary foundations of graphs and machine learning. After that, students work in groups on jointly identified experiments, applying specific solutions to given data sets. Finally, each group prepares a presentation and writes a report on the findings of its experiments.

Assessment

We will grade each group's experiments (50%), report (40%), and presentation (10%). Active participation in the project seminar is also required, in the form of questions and feedback during meetings and other groups' presentations.

Schedule

After the introductory phase with an initial short lecture, we will identify the group topics. From then on, each group will have regular individual feedback meetings with its supervisors. In addition, the whole project seminar will meet regularly during the semester to discuss the progress of all groups and any open questions.

The first meeting will be on Tuesday the 17th of October at 13:30 in room A-2.2.
