Information Systems Group
The research goal of the Information Systems Group is the efficient and effective management of heterogeneous information in large, autonomous systems. This includes methods for data profiling, data cleansing, search, and metadata management. Please also see our blog.
The list below gives an overview of our research topics. Further details can be found on our project and our publications pages. In addition, we maintain a repeatability site to publish code and data.
- Data Profiling: When integrating heterogeneous sources, details of the schema, such as keys, functional dependencies, and foreign key dependencies, are often unknown. We are developing efficient and scalable data profiling methods to automatically detect these and other dependencies in very large databases. Our Metanome project collects various high-efficiency methods into a common framework.
Links: Metanome project, Vision paper, ProLOD++ tool
Completed projects: Spider algorithm, Aladin project
German: ProLOD seminar, Advanced data profiling seminar
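The algorithms bundled in Metanome are far more sophisticated and scale to very large databases; purely as an illustration of the two core checks mentioned above (unique column combinations as key candidates, and functional dependencies), a minimal sketch might look like this (table and column names are hypothetical):

```python
def is_key(rows, col):
    """A column is a (unary) key candidate if all its values are unique."""
    values = [r[col] for r in rows]
    return len(set(values)) == len(values)

def fd_holds(rows, lhs, rhs):
    """Check whether the functional dependency lhs -> rhs holds:
    equal lhs values must always imply equal rhs values."""
    mapping = {}
    for r in rows:
        if r[lhs] in mapping and mapping[r[lhs]] != r[rhs]:
            return False
        mapping[r[lhs]] = r[rhs]
    return True

# Hypothetical example table
rows = [
    {"id": 1, "zip": "14482", "city": "Potsdam"},
    {"id": 2, "zip": "14482", "city": "Potsdam"},
    {"id": 3, "zip": "10115", "city": "Berlin"},
]
print(is_key(rows, "id"))             # True
print(fd_holds(rows, "zip", "city"))  # True: zip determines city here
```

Real profiling algorithms avoid this naive enumeration by pruning the exponential lattice of column combinations.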
- Data quality / information quality: The quality of data is measured in many different dimensions. Quality values can be aggregated along data operations, for instance to calculate the quality of query results.
Links: ICIQ 2009
German: Schlagwort "Datenqualität" im Informatik Spektrum
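Completeness (the fraction of non-null values) is one such quality dimension. As an illustrative sketch, not a definitive model, the following shows how a quality value can be computed for a data set and re-computed for the result of an operation on it (all data is made up):

```python
def completeness(records, attr):
    """Completeness dimension: fraction of records with a non-null attr."""
    if not records:
        return 0.0
    return sum(1 for r in records if r.get(attr) is not None) / len(records)

customers = [
    {"name": "Ada", "email": "ada@example.com"},
    {"name": "Bob", "email": None},
    {"name": "Eve", "email": "eve@example.com"},
]

# Aggregating quality along an operation: a selection that keeps only
# records with an email yields a result of higher completeness.
result = [r for r in customers if r["email"] is not None]
print(completeness(customers, "email"))  # 0.666...
print(completeness(result, "email"))     # 1.0
```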
- Duplicate detection: Duplicates are multiple, different representations of the same real-world object, for instance, multiple records of a customer in a CRM database. In duplicate detection, we build systems that efficiently and effectively find such duplicates in large data sets.
Links: Synthesis lecture, repeatability, DuDe
German: Duplikaterkennung allgemeinverständlich
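At its core, duplicate detection compares record pairs with a similarity measure and declares pairs above a threshold to be duplicates. The sketch below uses Python's built-in difflib ratio as a stand-in for more elaborate measures, and a naive pairwise comparison; the names and threshold are illustrative only:

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a, b):
    """String similarity in [0, 1]; difflib's ratio stands in for
    more elaborate measures such as edit- or token-based similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def find_duplicates(records, threshold=0.8):
    """Naive pairwise comparison. Real systems prune the quadratic
    search space, e.g., with blocking or the sorted-neighborhood method."""
    return [
        (r1, r2)
        for r1, r2 in combinations(records, 2)
        if similarity(r1, r2) >= threshold
    ]

customers = ["Peter Miller", "Peter  Miler", "Anna Schmidt"]
duplicates = find_duplicates(customers)
print(duplicates)
```

The quadratic number of comparisons is precisely why efficiency, not just effectiveness, is a research challenge here.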
- Linked Open Data (LOD): More and more sources provide data in RDF form as linked open data. Such data serves as a use case in a variety of our projects.
Links: HPI's open data activities, ProLOD
- Similarity Search: Queries often do not exactly match the desired objects in the data store. To also find similar matches for a query, both a similarity measure and a similarity-aware index structure are necessary.
Links: Similarity search research project, Similarity Search Algorithms seminar (German)
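One way a similarity-aware index can avoid scanning all objects is to exploit the triangle inequality of a metric such as edit distance. The following sketch shows one such structure, a BK-tree; it is only an illustrative example, not an implementation from our projects:

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

class BKTree:
    """Metric-tree index: the triangle inequality prunes whole subtrees."""
    def __init__(self, dist):
        self.dist = dist
        self.root = None  # node = [word, {distance: child_node}]

    def add(self, word):
        if self.root is None:
            self.root = [word, {}]
            return
        node = self.root
        while True:
            d = self.dist(word, node[0])
            if d in node[1]:
                node = node[1][d]
            else:
                node[1][d] = [word, {}]
                return

    def query(self, word, radius):
        """Return all indexed words within the given distance radius."""
        results, stack = [], [self.root] if self.root else []
        while stack:
            w, children = stack.pop()
            d = self.dist(word, w)
            if d <= radius:
                results.append(w)
            for dd, child in children.items():
                if d - radius <= dd <= d + radius:  # triangle inequality
                    stack.append(child)
        return results

tree = BKTree(edit_distance)
for w in ["table", "cable", "chair", "fable"]:
    tree.add(w)
print(sorted(tree.query("gable", 1)))  # ['cable', 'fable', 'table']
```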
- Data Fusion: Data fusion is the process of fusing multiple records representing the same real-world object, i.e., duplicates, into a single, consistent, and clean representation. Challenges are scalability over large data volumes and conflict resolution of contradictory values.
Links: FuSem, HumMer, ACM Computing Surveys article, VLDB tutorial
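Conflict resolution can be expressed as per-attribute resolution functions applied to the contradictory values of a duplicate group. The sketch below illustrates this idea with two common strategies (majority vote, longest value); all names and data are hypothetical:

```python
from collections import Counter

def fuse(records, resolution):
    """Fuse duplicate records into one clean record; a per-attribute
    resolution function decides among contradictory values."""
    fused = {}
    for attr, resolve in resolution.items():
        values = [r[attr] for r in records if r.get(attr) is not None]
        fused[attr] = resolve(values) if values else None
    return fused

def vote(values):
    """Pick the most frequent value."""
    return Counter(values).most_common(1)[0][0]

def longest(values):
    """Prefer the longest (often most informative) string."""
    return max(values, key=len)

duplicates = [
    {"name": "P. Miller", "city": "Potsdam"},
    {"name": "Peter Miller", "city": "Potsdam"},
    {"name": "Peter Miller", "city": "Berlin"},
]
print(fuse(duplicates, {"name": longest, "city": vote}))
# {'name': 'Peter Miller', 'city': 'Potsdam'}
```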
- Bachelor: We offer regular German lectures on database systems, namely Datenbanksysteme I (DBS I) and Datenbanksysteme II (DBS II). In addition, we offer the regular seminar "Beauty is our Business" and many other project-oriented seminars.
One-year bachelor projects with six to eight students conclude the bachelor's studies at HPI. Our group offers one or two such projects per year in cooperation with external partners.
- Master: We frequently offer the courses "Information Integration", "Search Engines", and "Information Retrieval". In addition, we offer diverse specialized seminars, some theoretical, some project-oriented.