Local Hierarchical Classification (Summer Semester 2021)
Lecturers: Prof. Dr. Bernhard Renard (Data Analytics and Computational Statistics), Fabio Malcher Miranda (Data Analytics and Computational Statistics)
General Information
- Weekly Hours: 4
- Credits: 6
- Graded: yes
- Enrolment Deadline: 18.03.2021 - 09.04.2021
- Teaching Form: Seminar
- Enrolment Type: Compulsory Elective Module
- Course Language: English
- Maximum number of participants: 8
Programs, Module Groups & Modules
- OSIS: Operating Systems & Information Systems Technology
  - HPI-OSIS-K Concepts and Methods
  - HPI-OSIS-T Techniques and Tools
  - HPI-OSIS-S Specialization
- ISAE: Internet, Security & Algorithm Engineering
  - HPI-ISAE-K Concepts and Methods
  - HPI-ISAE-T Techniques and Tools
  - HPI-ISAE-S Specialization
- DATA: Data Analytics
  - HPI-DATA-K Concepts and Methods
  - HPI-DATA-T Techniques and Tools
  - HPI-DATA-S Specialization
- CODS: Complex Data Systems
  - HPI-CODS-K Concepts and Methods
  - HPI-CODS-T Techniques and Tools
  - HPI-CODS-S Specialization
- CYAD: Cyber Attack and Defense
  - HPI-CYAD-K Concepts and Methods
  - HPI-CYAD-T Techniques and Tools
  - HPI-CYAD-S Specialization
- SECA: Security Analytics
  - HPI-SECA-K Concepts and Methods
  - HPI-SECA-T Techniques and Tools
  - HPI-SECA-S Specialization
- APAD: Acquisition, Processing and Analysis of Health Data
  - HPI-APAD-C Concepts and Methods
  - HPI-APAD-T Technologies and Tools
  - HPI-APAD-S Specialization
Description
Most research in machine learning focuses on flat classifiers, which solve binary or multiclass
problems while completely ignoring any hierarchy that could possibly exist between classes.
Nevertheless, many real-world problems have a hierarchical structure, where classes are
organized in the form of trees or directed acyclic graphs. Notable examples of major real-world
applications are text categorization, protein function prediction and music genre classification. In
such cases, exploiting the hierarchical information in the data might improve the quality of the
predictions.
Three distinct local approaches can be employed to solve hierarchical
problems:
- Local classifier per node - Creates a binary classifier for each class in a given hierarchy, except for the root node. Figure 1.a illustrates this approach.
- Local classifier per parent node - Creates numerous multiclass classifiers, one for each parent node in the hierarchical structure, which are employed to predict child classes. Figure 1.b illustrates this approach.
- Local classifier per level - Creates a multiclass classifier for each level of the hierarchy. Figure 1.c illustrates this approach.
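To make the difference between the three strategies concrete, the sketch below lists the classifiers each one would train on a small, made-up music genre tree. The hierarchy, the dictionary layout and the helper names (`per_node`, `per_parent_node`, `per_level`) are assumptions for illustration only. The code assumes a tree rather than a general directed acyclic graph, and no actual learning takes place.

```python
# Hypothetical toy hierarchy: each parent maps to its child classes.
HIERARCHY = {
    "root": ["rock", "jazz"],
    "rock": ["metal", "punk"],
    "jazz": ["bebop", "swing"],
}


def per_node(hierarchy, root="root"):
    """Local classifier per node: one binary classifier per class, root excluded."""
    nodes = {child for children in hierarchy.values() for child in children}
    return sorted(nodes - {root})


def per_parent_node(hierarchy):
    """Local classifier per parent node: one multiclass classifier per parent,
    choosing among that parent's children."""
    return {parent: list(children) for parent, children in hierarchy.items()}


def per_level(hierarchy, root="root"):
    """Local classifier per level: one multiclass classifier per depth,
    choosing among all classes found at that depth."""
    levels, frontier = [], list(hierarchy.get(root, []))
    while frontier:
        levels.append(frontier)
        frontier = [c for node in frontier for c in hierarchy.get(node, [])]
    return levels


print(len(per_node(HIERARCHY)))         # 6 binary classifiers
print(len(per_parent_node(HIERARCHY)))  # 3 multiclass classifiers
print(len(per_level(HIERARCHY)))        # 2 multiclass classifiers
```

On this toy tree the per-node strategy needs the most models, each solving the simplest problem, while the per-level strategy needs the fewest but can predict a child that does not belong to the parent predicted one level above.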
Although hierarchical approaches can be effective in many real-world scenarios, as far as we
know, there are no generic, user-friendly libraries to solve hierarchical classification tasks.
Hence, a solution needs to be coded from scratch for every new problem, which can be quite
time-consuming. For this reason, in this project we propose the development of a library that
will be compatible with a widely used framework, e.g., scikit-learn, in order to facilitate its
adoption by software developers and/or scientists.
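As a rough idea of what scikit-learn compatibility could look like, the following minimal sketch wraps the local classifier per parent node strategy in an estimator with the usual `fit`/`predict` interface. It is not the library proposed in this project: the class name, the `base_estimator` parameter and the label layout (one row per sample holding the full root-to-leaf path, all paths of equal depth) are assumptions made for this example.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.linear_model import LogisticRegression


class LocalClassifierPerParentNode(ClassifierMixin, BaseEstimator):
    """Hypothetical sketch: one multiclass classifier per parent node of a tree."""

    def __init__(self, base_estimator=None):
        self.base_estimator = base_estimator

    def fit(self, X, y):
        # Assumption: y has shape (n_samples, n_levels), one full path per row.
        X, y = np.asarray(X), np.asarray(y)
        base = LogisticRegression() if self.base_estimator is None else self.base_estimator
        self.n_levels_ = y.shape[1]
        self.classifiers_ = {}
        for level in range(self.n_levels_):
            # A parent is identified by the path prefix leading to it; () is the root.
            prefixes = [tuple(row[:level]) for row in y]
            for parent in set(prefixes):
                mask = np.array([p == parent for p in prefixes])
                children = y[mask, level]
                if len(set(children)) == 1:
                    # Single child: nothing to learn, store the label itself.
                    self.classifiers_[parent] = children[0]
                else:
                    self.classifiers_[parent] = clone(base).fit(X[mask], children)
        return self

    def predict(self, X):
        X = np.asarray(X)
        paths = []
        for x in X:
            path = ()
            # Walk top-down: the classifier of the current parent picks the next child.
            for _ in range(self.n_levels_):
                node = self.classifiers_[path]
                if hasattr(node, "predict"):
                    label = node.predict(x.reshape(1, -1))[0]
                else:
                    label = node
                path += (label,)
            paths.append(path)
        return np.array(paths)


# Made-up toy data: two features, two-level genre paths.
X_toy = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9],
                  [5.0, 5.0], [5.1, 5.2], [6.0, 4.0], [6.2, 4.1]])
y_toy = np.array([["rock", "metal"], ["rock", "metal"], ["rock", "punk"], ["rock", "punk"],
                  ["jazz", "bebop"], ["jazz", "bebop"], ["jazz", "swing"], ["jazz", "swing"]])
model = LocalClassifierPerParentNode().fit(X_toy, y_toy)
print(model.predict(X_toy[:2]))
```

Because prediction descends the tree one parent at a time, every predicted path is consistent with the hierarchy seen during training. Since `base_estimator` is an ordinary constructor parameter, any scikit-learn classifier can be swapped in for the logistic regression default.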
One of the main challenges of this project is to develop efficient and scalable algorithms that are
able to handle large amounts of data, and those algorithms also need to be generic to be
utilized in solutions across different application domains. By the end of the project, it is expected
that the final product will be a usable open-source library which complies with good software
engineering practices. In order to achieve this goal, code will be developed in a team effort, with
group discussions about new features, peer reviews, writing of user and technical
documentation as well as the use of automated unit tests. Depending on the progress of the
project, there is a possibility of a manuscript being written by the end of the semester and
submitted to a scientific journal.
Learning objectives:
- Develop scalable code, which is able to handle large volumes of data;
- Break down complex tasks into smaller sub-tasks that can be easily managed;
- Plan and implement the development of generic hierarchical machine learning models;
- Teamwork skills;
- Ability to work in open-source projects.
Requirements
- Good programming skills in Python, C or C++;
- Familiarity with machine learning methods and libraries;
- Motivation to learn and apply good software engineering practices;
- English proficiency;
- Fundamental knowledge about trees and directed acyclic graphs is beneficial but NOT required.
Literature
- Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V. and Vanderplas, J., 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, pp.2825-2830.
- Silla, C.N. and Freitas, A.A., 2011. A survey of hierarchical classification across different application domains. Data Mining and Knowledge Discovery, 22(1), pp.31-72.
Learning
Meetings will initially take place via Zoom due to the COVID-19 pandemic, but if the situation
improves, a return to the lecture hall is possible. Visual aids will be used when
appropriate to facilitate the discussion. Assignment details will be provided in advance
and handed out during lectures. Recorded meetings will be made available online.
All interested participants must enroll by April 14 via email to fabio.malchermiranda@hpi.de.
Examination
Grading is based on multiple factors concerning the project success, including: project design
draft after 4 weeks (10%), a presentation in the second half of the semester (30%) and a
final report (60%). We will provide close support and regular feedback.
Dates
Thursday 13:30 - 15:00, but can be changed according to students' needs.