Mishaps in Statistics and ML (Summer Semester 2022)
Lecturer: Prof. Dr. Bernhard Renard (Data Analytics and Computational Statistics), Dr. Katharina Baum (Data Analytics and Computational Statistics)
General Information
- Weekly Hours: 2
- Credits: 3
- Graded: yes
- Enrolment Deadline: 01.04.2022 - 30.04.2022
- Teaching Form: Seminar
- Enrolment Type: Compulsory Elective Module
- Course Language: English
- Maximum number of participants: 6
Programs, Module Groups & Modules
- OSIS: Operating Systems & Information Systems Technology
- HPI-OSIS-K Konzepte und Methoden
- OSIS: Operating Systems & Information Systems Technology
- HPI-OSIS-T Techniken und Werkzeuge
- OSIS: Operating Systems & Information Systems Technology
- HPI-OSIS-S Spezialisierung
- SAMT: Software Architecture & Modeling Technology
- HPI-SAMT-K Konzepte und Methoden
- SAMT: Software Architecture & Modeling Technology
- HPI-SAMT-T Techniken und Werkzeuge
- SAMT: Software Architecture & Modeling Technology
- HPI-SAMT-S Spezialisierung
- DATA: Data Analytics
- HPI-DATA-K Konzepte und Methoden
- DATA: Data Analytics
- HPI-DATA-T Techniken und Werkzeuge
- DATA: Data Analytics
- HPI-DATA-S Spezialisierung
- PREP: Data Preparation
- HPI-PREP-K Konzepte und Methoden
- PREP: Data Preparation
- HPI-PREP-T Techniken und Werkzeuge
- PREP: Data Preparation
- HPI-PREP-S Spezialisierung
- CYAD: Cyber Attack and Defense
- HPI-CYAD-K Konzepte und Methoden
- CYAD: Cyber Attack and Defense
- HPI-CYAD-T Techniken und Werkzeuge
- CYAD: Cyber Attack and Defense
- HPI-CYAD-S Spezialisierung
- SECA: Security Analytics
- HPI-SECA-K Konzepte und Methoden
- SECA: Security Analytics
- HPI-SECA-T Techniken und Werkzeuge
- SECA: Security Analytics
- HPI-SECA-S Spezialisierung
- SCAD: Scalable Computing and Algorithms for Digital Health
- HPI-SCAD-C Concepts and Methods
- SCAD: Scalable Computing and Algorithms for Digital Health
- HPI-SCAD-T Technologies and Tools
- SCAD: Scalable Computing and Algorithms for Digital Health
- HPI-SCAD-S Specialization
- APAD: Acquisition, Processing and Analysis of Health Data
- HPI-APAD-C Concepts and Methods
- APAD: Acquisition, Processing and Analysis of Health Data
- HPI-APAD-T Technologies and Tools
- APAD: Acquisition, Processing and Analysis of Health Data
- HPI-APAD-S Specialization
Description
While we usually discuss how to find ever better-performing methods for data analysis, in practice many data science projects fail because seemingly small steps are overlooked. Some of these fallacies are fairly obvious when clearly presented. When well hidden in large datasets, however, they easily go unnoticed and lead to incorrect conclusions, even though every step of the data analysis may have been performed correctly from a purely technical standpoint.
In this seminar, we aim to identify these mishaps and pitfalls and discuss strategies to overcome them, both through concrete examples and from a more general perspective.
Learning objectives
- You learn to identify common mishaps in data analysis
- You learn strategies to circumvent these mishaps
- You learn to identify open challenges in data analysis
- You can present a scientific manuscript in this field and lead a discussion
Requirements
You should have some mathematical background (at least Mathe 1+2 of the ITSE bachelor or comparable) and have taken at least one class in statistics. Good knowledge of English is required to understand and discuss current literature.
Literature
Altman Douglas G, Bland J Martin. Statistics notes: Absence of evidence is not evidence of absence. BMJ 1995; 311:485
Efron, Bradley & Morris, Carl. (1977). Stein's Paradox in Statistics. Scientific American - SCI AMER. 236. 119-127. 10.1038/scientificamerican0577-119.
Miguel A Hernán, David Clayton, Niels Keiding, The Simpson's paradox unraveled, International Journal of Epidemiology, Volume 40, Issue 3, June 2011, Pages 780–785
Pearl, J. (2014, 10 3). Lord’s paradox revisited – (Oh Lord! Kumbaya!). Technical Report.
Griffith, G.J., Morris, T.T., Tudball, M.J. et al. Collider bias undermines our understanding of COVID-19 disease risk and severity. Nat Commun 11, 5749 (2020)
Daniel Westreich, Noah Iliinsky, Epidemiology Visualized: The Prosecutor's Fallacy, American Journal of Epidemiology, Volume 179, Issue 9, 1 May 2014, Pages 1125–1127
Robert, C. (2014). On the Jeffreys-Lindley Paradox. Philosophy of Science, 81(2), 216-232. doi:10.1086/675729
Whalen, S., Schreiber, J., Noble, W. S., & Pollard, K. S. (2021). Navigating the pitfalls of applying machine learning in genomics. Nature Reviews Genetics, 1-13.
https://towardsdatascience.com/be-careful-when-interpreting-predictive-models-in-search-of-causal-insights-e68626e664b6
Learning
- Seminar for master students
- Language of instruction: English
- Maximum number of participants: 7
Topics will be presented in the first session (April 25, 2022). To be assigned a topic, participants must send an e-mail by May 2nd, 2022, stating their preferences for up to three of the presented topics. We will then assign the topics. If there are too many applicants, we will decide randomly. Since the first talks will be scheduled, May 16th will be the last date to de-register from the class.
The seminar will be conducted on site (with a hybrid option if needed). Please register in the Moodle of the course (https://moodle.hpi.de/course/view.php?id=293) for further information.
Examination
In the seminar, each participant will give a presentation about a predefined topic within the research area and write a short report. The final grade consists of the following two parts:
- Presentation and discussion (65%)
- Written report (35%)
Dates
First session: April 25, 2022.