Development of a Hierarchical Feature Selection Library (Summer Semester 2023)
Lecturers: Prof. Dr. Bernhard Renard (Data Analytics and Computational Statistics), Jan-Philipp Sachs (Digital Health - Personalized Medicine)
General Information
- Weekly semester hours (SWS): 4
- ECTS credits: 6
- Graded: Yes
- Enrollment period: 01.04.2023 - 07.05.2023
- Course format: Seminar
- Course type: Compulsory elective module
- Teaching language: English
- Maximum number of participants: 3
Degree Programs, Module Groups & Modules
- OSIS: Operating Systems & Information Systems Technology
  - HPI-OSIS-K Concepts and Methods
  - HPI-OSIS-T Techniques and Tools
  - HPI-OSIS-S Specialization
- SAMT: Software Architecture & Modeling Technology
  - HPI-SAMT-K Concepts and Methods
  - HPI-SAMT-T Techniques and Tools
  - HPI-SAMT-S Specialization
- DANA: Data Analytics
  - HPI-DANA-K Concepts and Methods
  - HPI-DANA-T Techniques and Tools
  - HPI-DANA-S Specialization
- CODS: Complex Data Systems
  - HPI-CODS-K Concepts and Methods
  - HPI-CODS-T Techniques and Tools
  - HPI-CODS-S Specialization
- MALA: Machine Learning and Analytics
  - HPI-MALA-C Concepts and Methods
  - HPI-MALA-T Technologies and Tools
  - HPI-MALA-S Specialization
Description
Powerful predictive models based on high-dimensional data are a cornerstone of modern machine learning applications. Yet, in real-world scenarios, a vast majority of the available features are often irrelevant or redundant with respect to others, rendering them superfluous for the actual prediction task. While some model classes (e.g., decision trees) intrinsically select the most relevant features, others (e.g., non-penalized regression models or neural networks) have difficulty identifying them, resulting in less desirable model properties such as overfitting, prediction uncertainty, or decreased feature set stability.
One way to remedy this shortcoming is feature selection, which aims to reduce a large number of input features to a small set of relevant and non-redundant ones, e.g., by scoring features according to data-intrinsic properties (filter methods) or by fitting separate models to subsets of the feature space (wrapper methods).
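As a minimal illustration of such a 'flat' filter method, the following sketch uses scikit-learn's SelectKBest on a synthetic dataset; the dataset and the choice of k are arbitrary, for demonstration only:

```python
# A minimal sketch of 'flat' filter-based feature selection with
# scikit-learn; the synthetic dataset and k=10 are illustrative choices.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic data: 100 features, of which only 5 carry signal.
X, y = make_classification(n_samples=200, n_features=100,
                           n_informative=5, random_state=0)

# Filter method: score each feature by its mutual information with
# the target, independently of any model, and keep the 10 best.
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)  # (200, 10)
```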
However, there are scenarios where the features themselves additionally show hierarchical relations amongst each other, i.e., one feature is a more specific instance of a more general feature. Two frequently mentioned examples in this context are:
- Datasets where the instances are genes, and the features are functional annotations; e.g., “VEGF-A complex” is a more specific description of the concept of “growth-factor activity”, though both might appear as features in the same dataset.
- Datasets where instances are texts or tweets, and the features are the words contained in these instances; e.g., “server” is a more specific instance of “computer”, and again both can appear in the same dataset.
In these settings, selecting both the more specific and the more general concept for modeling would introduce (hierarchical) redundancy and should thus be avoided by the feature selection process. Unfortunately, all of the well-known 'flat' feature selection methods have limited capabilities in this regard, opening up the field for specialized hierarchical feature selection (HFS) methods. These take as input both the original features and the information about the feature hierarchy, organized as a tree or a directed acyclic graph (DAG) – in the two above-mentioned examples, the Gene Ontology and WordNet, respectively.
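To make the notion of hierarchical redundancy concrete, here is a small sketch, assuming a toy hierarchy stored as a networkx DAG; the feature names and the naive rule of keeping only the most specific selected concept are illustrative, not one of the HFS methods from the literature:

```python
# A toy illustration of hierarchical redundancy; edges point from the
# more general to the more specific concept (cf. the "computer"/"server"
# example above). This naive pruning is NOT a published HFS method.
import networkx as nx

hierarchy = nx.DiGraph()
hierarchy.add_edges_from([
    ("device", "computer"),
    ("computer", "server"),
    ("computer", "laptop"),
])

selected = {"computer", "server"}  # hypothetical output of a 'flat' selector

# Drop every selected feature that is an ancestor (i.e., a more general
# form) of another selected feature, keeping only the most specific ones.
non_redundant = {
    f for f in selected
    if not any(f in nx.ancestors(hierarchy, g) for g in selected - {f})
}
print(non_redundant)  # {'server'}
```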
Around a dozen of these HFS methods have been described in the literature, but openly accessible implementations are available for only a few.
The goal of this seminar is thus to write Python implementations of all of these HFS methods under the umbrella of a package compatible with scikit-learn, arguably the most widely used machine learning package in Python.
For that purpose, you will first acquire the fundamentals of HFS and read a subset of the papers describing the available methods. Second, you will familiarize yourself with the design principles of a Python library in general and of the scikit-learn environment in particular (see the sketch below). Third, you will implement the methods and perform thorough testing and benchmarking, including the application to real-world datasets. Fourth, you will individually develop a small research question, e.g., about a specific performance aspect of these methods that has not yet been covered in the literature, and perform experiments to answer it. Finally, your work will be written up in self-contained documentation.
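To give a first idea of what scikit-learn compatibility entails, here is a minimal selector skeleton following scikit-learn's estimator conventions; the class name and the trivial selection logic are placeholders, not part of the package to be built:

```python
# A sketch of a scikit-learn compatible feature selector skeleton;
# the class name and the placeholder logic are illustrative only.
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.feature_selection import SelectorMixin
from sklearn.utils.validation import check_X_y, check_is_fitted

class HierarchicalSelector(SelectorMixin, BaseEstimator):
    def __init__(self, hierarchy=None):
        # scikit-learn convention: __init__ only stores parameters.
        self.hierarchy = hierarchy

    def fit(self, X, y):
        X, y = check_X_y(X, y)
        # A real HFS method would exploit self.hierarchy here;
        # this placeholder simply keeps every feature.
        self.support_mask_ = np.ones(X.shape[1], dtype=bool)
        return self

    def _get_support_mask(self):
        # SelectorMixin derives transform() from this mask.
        check_is_fitted(self)
        return self.support_mask_
```

Compatibility with the scikit-learn API can then be verified, e.g., with sklearn.utils.estimator_checks.check_estimator.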
By the end of the semester, you will have completed the following deliverables:
1. REQUIRED:
1.1. Write a Python library:
- Implement at least three different methods per participant.
- Ensure compatibility with scikit-learn.
- Fully document your code, providing both user and technical documentation.
1.2. Extract suitable evaluation datasets referenced in the literature.
1.3. Write a project report (this can be a joint report of all participants, with individual contributions clearly marked):
- Describe and compare the methods you have selected.
- Benchmark, e.g., the runtime or classification performance of the implemented methods on the extracted datasets (a minimal benchmark sketch follows this list).
- Design, carry out, and document a small experiment regarding some unexplored aspect of HFS.
2. OPTIONAL:
- Follow test-driven development throughout the entire project.
- Determine the computational complexity for those HFS methods for which it has not yet been reported in the literature.
- Explore optimization potential of existing methods (e.g., through parallelization, dynamic programming, etc.).
- Generate synthetic datasets with specific properties of the data or underlying hierarchy to derive recommendations on when to apply the respective HFS methods (i.e., by creating and testing new hypotheses on why some HFS methods over- or underperform in specific settings).
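As a rough idea of the runtime benchmark mentioned above, the following sketch times the fit step of several selectors; the listed methods are stand-ins for the HFS methods to be implemented, and the dataset is synthetic:

```python
# A rough sketch of a runtime benchmark; the listed selectors are
# stand-ins for the implemented HFS methods, the dataset is synthetic.
import time
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

X, y = make_classification(n_samples=500, n_features=200, random_state=0)

selectors = {  # replace with the implemented HFS methods
    "anova_f": SelectKBest(f_classif, k=20),
    "mutual_info": SelectKBest(mutual_info_classif, k=20),
}

for name, selector in selectors.items():
    start = time.perf_counter()
    selector.fit(X, y)
    print(f"{name}: {time.perf_counter() - start:.3f} s")
```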
Learning Objectives:
- Understand the rationale for choosing different feature selection techniques for hierarchically structured data, and how they differ from 'flat' feature selection methods.
- Ability to develop meaningful benchmarks for testing the practically relevant aspects of different methods.
- Ability to extract and distill important information from scientific literature.
- Ability to develop a research question based on existing evidence, to set up and run an experiment to address this question, and to critically assess the results.
- Gain experience in collaboratively working on a science-oriented software project geared towards a broader spectrum of users.
Prerequisites
The most important ingredient for this seminar to become a successful and joyful experience is your curiosity to dive into a new topic, to implement and test algorithms, and to (co-)create your own little applied computer science research project!
Additionally, you should ideally meet the following requirements:
- Intermediate programming skills in Python or advanced skills in another programming language.
- Basic understanding of the overall idea and workflow of machine learning, classification, and feature selection.
- Basic knowledge of good practices in collaborative software design.
- Basic experience in reading scientific papers, and in setting up your own research question(s).
- Sufficient knowledge of English to read scientific papers, write a technical document, and communicate your own ideas to like-minded peers.
Optionally, you are:
- Interested in basic computational complexity considerations.
- Curious about optimization strategies.
Literature
- For a better understanding of the rationale behind feature selection:
“An Introduction to Feature Selection” (Chapter 19) in: M. Kuhn and K. Johnson. Applied predictive modeling. New York: Springer, 2013.
https://doi.org/10.1007/978-1-4614-6849-3
- A comprehensive overview of the more traditional feature selection methods:
J. Tang, S. Alelyani, and H. Liu. “Feature Selection for Classification: A Review.” (Chapter 2) In: Data Classification - Algorithms and Applications. Charu C. Aggarwal (Editor). Chapman and Hall/CRC, New York, 1st Edition (2014). https://doi.org/10.1201/b17320
- For a short introduction to HFS, sections 1-3 of the following paper:
P. Ristoski and H. Paulheim. “Feature selection in hierarchical feature spaces.” In: Discovery Science: 17th International Conference, DS 2014, Bled, Slovenia, October 8-10, 2014, Proceedings (pp. 288-300). Springer International Publishing, 2014. https://doi.org/10.1007/978-3-319-11812-3_25
- Short introduction to developing a scikit-learn compatible software package:
L. Buitinck et al. “API design for machine learning software: Experiences from the scikit-learn project.” arXiv preprint (2013). https://doi.org/10.48550/arXiv.1309.0238
Further literature is available upon request.
Teaching and Learning Methods
The core of this seminar will be informal, weekly in-person meetings to discuss the progress of the project, jointly identify open questions, and exchange thoughts about all aspects of the considered methods.
Although each participant's contributions will be evaluated separately, the project itself will be collaborative with regard to writing both the code and the project report.
Guidance about the organization and structure of the given tasks will be available at any time via Mattermost, e-mail, or open office hours.
Assessment
The final grade will be determined based on:
- A project presentation (25%).
- The project report, including code and documentation (75%).
Dates
Regular meetings will be held every Thursday from 4 to 5 PM. Alternative times can be agreed upon with all participants.
The due dates for the project components will be agreed upon together with the participants.
There will also be the opportunity to give a non-graded presentation, including a Q&A session, in front of a scientific audience (the DACS group).
The assignment of the scientific papers with the methods to be implemented will take place during the seminar session in the second week of the semester (by April 30, 2023, at the latest).
Dropping the course is possible until two weeks later, by May 14, 2023.
Kickoff meeting: Thursday, April 20, 2023, 4:00-5:00 PM, room K-2.04
2nd meeting: Thursday, April 27, 2023, 4:00-5:00 PM, room K-2.03