Feature Selection and Classification for hierarchically structured feature spaces (Winter Term 2023/2024)

Lecturers: Prof. Dr. Bernhard Renard (Data Analytics and Computational Statistics), Jan-Philipp Sachs (Data Analytics and Computational Statistics)

General Information

  • Weekly hours: 4
  • ECTS: 6
  • Graded: Yes
  • Enrollment period: 01.10.2023 - 31.10.2023
  • Course type: Seminar
  • Enrollment type: Compulsory elective module
  • Language of instruction: English
  • Maximum number of participants: 3

Degree Programs, Module Groups & Modules

IT-Systems Engineering MA
  • OSIS: Operating Systems & Information Systems Technology
    • HPI-OSIS-K Concepts and Methods
    • HPI-OSIS-T Techniques and Tools
    • HPI-OSIS-S Specialization
  • SAMT: Software Architecture & Modeling Technology
    • HPI-SAMT-K Concepts and Methods
    • HPI-SAMT-T Techniques and Tools
    • HPI-SAMT-S Specialization
Data Engineering MA
Digital Health MA
Cybersecurity MA
Software Systems Engineering MA

Description

------------------------------------------------------------------------------------------------------------------------------------

If you want to join this seminar, please enroll in the Moodle course:

https://moodle.hpi.de/course/view.php?id=688

------------------------------------------------------------------------------------------------------------------------------------

Powerful predictive models based on high-dimensional data are a cornerstone of modern machine learning applications. Yet, in real-world scenarios, a vast majority of the available features are often irrelevant to the actual prediction task or redundant with other features, rendering them superfluous. While some model classes (e.g., decision trees) intrinsically select the most relevant features, others (e.g., non-penalized regression models or neural networks) have difficulty identifying them, resulting in undesirable model behavior such as overfitting, increased prediction uncertainty, or decreased feature set stability.

One way to successfully remedy this shortcoming is feature selection. Feature selection aims at reducing a large number of input features to a small set of relevant and non-redundant ones, e.g., by selecting features according to data-intrinsic properties (filter methods), by fitting separate models to subsets of the feature space (wrapper methods), or by designing models that perform feature selection intrinsically (embedded methods).
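To illustrate the three families with standard scikit-learn tools, here is a minimal sketch on synthetic data (the HFS library developed in this seminar is not used here):

    # Minimal sketch: the three classical ('flat') feature selection families,
    # illustrated with standard scikit-learn tools on synthetic data.
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import (
        SelectKBest, mutual_info_classif, RFE, SelectFromModel,
    )
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=200, n_features=50,
                               n_informative=5, random_state=0)

    # Filter: rank features by a data-intrinsic score (here: mutual information).
    filt = SelectKBest(mutual_info_classif, k=5).fit(X, y)

    # Wrapper: repeatedly fit a model on shrinking feature subsets
    # (recursive feature elimination).
    wrap = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)

    # Embedded: the model's own (L1) penalty drives the selection.
    emb = SelectFromModel(
        LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    ).fit(X, y)

    for name, sel in [("filter", filt), ("wrapper", wrap), ("embedded", emb)]:
        print(name, sel.get_support(indices=True))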

However, there are scenarios where the features themselves additionally exhibit hierarchical relations among each other, i.e., one feature is a more specific instance of another, more general feature. Two frequently mentioned examples in this context are:

  1. Datasets where the instances are genes, and the features are functional annotations; e.g., “VEGF-A complex” is a more specific description of the concept of “growth-factor activity”, though both might appear as features in the same dataset.
  2. Datasets where instances are texts or tweets, and the features are the words contained in these instances; e.g., “server” is a more specific instance of “computer”, and again both can appear in the same dataset.

In these settings, selecting both the more specific and the more general concept for modeling would lead to (hierarchical) redundancy and should thus be avoided by the feature selection process. Unfortunately, all of the well-known ‘flat’ feature selection methods have limited capabilities in this respect, opening up the field for specialized hierarchical feature selection (HFS) methods. These take as input the original features as well as information about the feature hierarchy, organized in a tree or a directed acyclic graph (DAG) – in the two above-mentioned examples, the Gene Ontology and WordNet.
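To make hierarchical redundancy concrete, the following sketch encodes the toy “computer” example as a DAG using networkx; the helper prune_redundant is purely illustrative and not part of the HFS library:

    # Toy illustration of hierarchical redundancy (the helper function is
    # hypothetical, not taken from the HFS library developed at HPI).
    import networkx as nx

    # Edges point from the more general to the more specific concept.
    hierarchy = nx.DiGraph([
        ("entity", "computer"),
        ("computer", "server"),
        ("computer", "laptop"),
    ])

    def prune_redundant(selected, hierarchy):
        """Keep only the most specific selected features: drop any feature
        that also has a selected, more specific descendant."""
        return {
            f for f in selected
            if not nx.descendants(hierarchy, f) & selected
        }

    # 'computer' is redundant because the more specific 'server' is selected too.
    print(prune_redundant({"computer", "server", "laptop"}, hierarchy))
    # -> {'server', 'laptop'}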

Around two dozen papers on HFS methods have been published over the last two decades, but openly accessible implementations are available for only a few. As a consequence, the available filter and wrapper HFS methods are currently being implemented in Python as a scikit-learn compatible open-source library here at HPI during the summer term of 2023.

The first goal of this master seminar will thus be to extend this package with embedded HFS methods as well as with specialized classifiers that are designed to work out-of-the-box with hierarchical feature spaces (i.e., without a separate feature selection step). The second goal is to improve the existing methods for hierarchically structured input data on both the conceptual and the computational level.

For that purpose, the project will be structured as follows: First, you will learn the fundamentals of HFS. Second, you will familiarize yourself with the design principles of Python libraries in general, and with the scikit-learn ecosystem and the currently developed HFS library in particular. Third, each participant of the seminar will implement a subset of these methods compatible with the existing library and perform thorough testing and benchmarking, including the application to real-world datasets. Fourth, you will develop ideas for improving the existing methods (or for designing new ones), implement them, and design a set of experiments to collect evidence convincingly showing the hypothesized improvements. Lastly, your implementation will be added to the technical documentation of the library, and your experiments and results will be written up in a self-contained scientific project report.
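To give an impression of what scikit-learn compatibility entails, a minimal selector might look as follows. This is only a sketch: the class name ToyVarianceSelector and its variance-based logic are invented for illustration, and the actual base classes used by the HFS library may differ.

    # Minimal sketch of a scikit-learn compatible feature selector.
    # The class and its variance-threshold logic are illustrative only.
    import numpy as np
    from sklearn.base import BaseEstimator
    from sklearn.feature_selection import SelectorMixin
    from sklearn.utils.validation import check_array, check_is_fitted

    class ToyVarianceSelector(SelectorMixin, BaseEstimator):
        def __init__(self, threshold=0.0):
            self.threshold = threshold

        def fit(self, X, y=None):
            X = check_array(X)
            self.variances_ = np.var(X, axis=0)
            self.n_features_in_ = X.shape[1]
            return self

        def _get_support_mask(self):
            check_is_fitted(self)
            return self.variances_ > self.threshold

Implementing fit plus _get_support_mask is enough for SelectorMixin to provide transform, get_support, and pipeline compatibility.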

By the end of the semester, you will have completed the following deliverables:

  1. Implement a pre-specified set of embedded HFS methods and specialized classifiers:
    1. Ensure compatibility with scikit-learn and the existing HFS library (see the check_estimator sketch after this list).
    2. Fully document your code as both user and technical documentation.
  2. Conceptualize improvements to the existing methods, or design a new one for hierarchically structured input features, and conduct experiments showing their advantages:
    1. Identify and explain shortcomings of the methods of your choice.
    2. Sketch out improvements (or a new method) and implement them as compatible code.
    3. Hypothesize which aspects will benefit from the improved (or new) method; design and carry out suitable experiments to investigate these hypotheses.
    4. Collect and visually present the results to support (or refute) them.
  3. Write a project report (this can be a joint report of all participants, clearly marking the individual contributions) covering:
    1. The methods you have implemented.
    2. Their runtime and classification performance on real-world datasets under different settings of their tunable parameters.
    3. The rationale, design, and results of the experiments on improving the existing methods.
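Regarding the compatibility requirement in deliverable 1.1, scikit-learn ships an automated test harness; a minimal sketch, applied to the hypothetical ToyVarianceSelector from above:

    # Sketch: run scikit-learn's built-in API compliance checks against the
    # hypothetical ToyVarianceSelector sketched earlier.
    from sklearn.utils.estimator_checks import check_estimator

    check_estimator(ToyVarianceSelector())  # raises if an API contract is violated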

 

Optional:

  1. Follow test-driven development throughout the entire project.
  2. Determine the computational complexity for those methods for which it has not yet been reported in the literature.
  3. Generate synthetic datasets with specific properties of the data or the underlying hierarchy to derive recommendations on when to apply the respective HFS methods (i.e., by creating and testing new hypotheses on why some HFS methods over- or underperform in specific settings); a possible starting point is sketched below.
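As a possible starting point for optional item 3, the following sketch generates binary features whose hierarchy obeys an upward-propagation rule (if a specific feature is active, all of its ancestors are, too), loosely mimicking how Gene Ontology annotations propagate. The whole generation scheme is an assumption for illustration, not part of the seminar material:

    # Hypothetical sketch for optional item 3: synthetic binary features with
    # an upward-propagation property along a tree-shaped hierarchy.
    import numpy as np
    import networkx as nx

    rng = np.random.default_rng(0)
    # Balanced binary tree with 15 nodes; bfs_tree directs edges general -> specific.
    hierarchy = nx.bfs_tree(nx.balanced_tree(2, 3), 0)
    leaves = [n for n in hierarchy if hierarchy.out_degree(n) == 0]

    n_samples = 100
    X = np.zeros((n_samples, hierarchy.number_of_nodes()), dtype=int)
    for i in range(n_samples):
        # Activate a few random leaves, then switch on all of their ancestors.
        for leaf in rng.choice(leaves, size=3, replace=False):
            X[i, leaf] = 1
            for ancestor in nx.ancestors(hierarchy, leaf):
                X[i, ancestor] = 1

    y = X[:, 1]  # toy label: activity of the (general) feature at node 1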

 

Learning Objectives:

  • Understand the rationale for choosing different feature selection techniques for hierarchically structured data, and how they differ from ‘flat’ feature selection methods.
  • Ability to develop meaningful benchmarks for testing the practically relevant aspects of different methods.
  • Ability to extract and distill important information from scientific literature.
  • Ability to develop a research question based on existing evidence, to set up and run an experiment to address this question, and to critically assess the results.
  • Gain experience in collaboratively working on a science-oriented software project geared towards a broader spectrum of users.

Prerequisites

The most important ingredient for this project to become a successful and joyful experience is your curiosity to dive into a new topic, to implement and test algorithms, and to (co-)create your own little applied computer science research project!

Additionally, it would be appreciated if the following requirements were fulfilled:

  • Intermediate programming skills in Python or advanced skills in another programming language.
  • Basic understanding of the overall idea and workflow of machine learning, classification, and feature selection.
  • Basic knowledge about good practices in collaborative software design.
  • Basic experience in reading scientific papers, and in setting up your own research question(s).
  • Sufficient knowledge of English to read scientific papers, write a technical document, and communicate your own ideas to like-minded peers.  

Optionally, you are:

  • Interested in basic computational complexity considerations.
  • Curious about optimization strategies.

Literature

1. For a better understanding of the rationale behind feature selection:

“An Introduction to Feature Selection” (Chapter 19) in: M. Kuhn and K. Johnson. Applied Predictive Modeling. New York: Springer, 2013.
https://doi.org/10.1007/978-1-4614-6849-3

 

2. For a short introduction to HFS, sections 1-3 of the following paper:

P. Ristoski and H. Paulheim (2014). “Feature selection in hierarchical feature spaces.” In: Discovery Science: 17th International Conference, DS 2014, Bled, Slovenia, October 8-10, 2014, Proceedings (pp. 288-300). Springer International Publishing.

https://doi.org/10.1007/978-3-319-11812-3_25

 

3. Short introduction to developing a scikit-learn compatible software package:

L. Buitinck et al. (2013). “API design for machine learning software: Experiences from the scikit-learn project.” https://doi.org/10.48550/arXiv.1309.0238

 

Further literature is available upon request.

Teaching and Learning Formats

The core of this project will be weekly in-person meetings to discuss the progress of the project, jointly identify open questions, and exchange thoughts about all aspects of the considered methods. Besides that, the participants will be given considerable degrees of freedom regarding self-organization.

Despite the separate evaluation of each participant’s contributions, the nature of this project will be collaborative, in particular with regard to the optimization and methodological improvements, the experiments, and the project report.

Assessment

The actual grade will be determined based on:

  1. Project presentation (25%).
  2. The project report, including code and documentation (75%).

Dates

------------------------------------------------------------------------------------------------------------------------------------

If you want to join the seminar, please enroll in the Moodle course:

https://moodle.hpi.de/course/view.php?id=688

All communication will take place via a separate Slack channel. You will receive an invitation to it as soon as you are registered in the Moodle course.

------------------------------------------------------------------------------------------------------------------------------------

The 1st meeting to get to know each other was on Monday, October 16, 2023, 3.15 - 4.45 PM, room K-2.03.

The 2nd meeting to kick off the actual seminar work was on Monday, October 30 (!!), 2023, 3.15 - 4.45 PM, room K-2.03.

------------------------------------------------------------------------------------------------------------------------------------

Regular meetings are planned to be held every Monday, 3.15 – 4.45 PM (except Monday, October 23). Other times can be agreed upon with all the participants.

The due dates for the project components will be agreed upon together with the participants.

There will also be the opportunity to give a non-graded presentation on the work in progress, including a Q&A session, in front of a scientific audience (the DACS group).
