ptFD-Miner

ptFD-Miner: Mining Partial Temporal Functional Dependencies

Dr. Lisa Ehrlinger, Prof. Felix Naumann

Functional Dependency Mining

Functional dependencies (FDs) of the form X→Y (e.g., ZipCode→City), where X determines Y, can be discovered from data. However, not all such discovered FDs are equally significant. For example, in a library dataset, ISBN→BookTitle is a genuine FD, while BookTitle→ISBN might hold coincidentally in a small dataset but is semantically invalid. One goal of data profiling research is to classify valid, discovered FDs as genuine or spurious [3].

One such approach regards data as it changes over time: FDs that are valid across time or in many versions of a dataset are more likely to be genuine rather than random dependencies. Hence, such temporal FDs provide an indicator about the genuineness of a discovered dependency for the given relation. However, in evolving data, even genuine FDs may not hold perfectly at every point, especially in datasets without transactional semantics. Data changes through inserts, updates, and deletes. For example, Wikipedia undergoes constant modifications to both its data and structure, which can also introduce data errors. Recent work proposed an approach specific to another type of dependency — inclusion dependencies, which enables the discovery of temporal inclusion dependencies on evolving data [1]. Here, the authors relax both the validity itself and the time interval across which validity of the dependency is checked. While a number of approaches for mining partial FDs exist, there is no combined approach to discover partial temporal FDs.

Project Goal

This master's project aims to develop an algorithm to discover partial temporal functional dependencies (ptFDs) efficiently through the following steps:

Familiarize yourself with related work on FD discovery [2,4], mining of temporal inclusion dependencies [1], and the provided Wikipedia dataset.
Develop and implement an algorithm to detect ptFDs (ptFD-Miner).
Extend the experimental setup by generating large amounts of synthetic data that contains (labeled) partial and temporal FDs.
Evaluate ptFD-Miner according to its efficiency (experiment 1) and the usefulness of the detected FDs (experiment 2).
- Experiment 1: Beat the baseline (apply HyFD at each point in time).
- Experiment 2: Show that detected FDs are more likely to be genuine.
Refine ptFD-Miner and optimize for efficient discovery.
Prepare a submission to a top database conference.

[1] Bornemann, L., Bleifuß, T., Kalashnikov, D. V., Nargesian, F., Naumann, F., & Srivastava, D. (2024). Efficient Discovery of Temporal Inclusion Dependencies in Wikipedia Tables. In EDBT.

[2] Papenbrock, T., & Naumann, F. (2016). A hybrid approach to functional dependency discovery. In proceedings of the 2016 International Conference on Management of Data (pp. 821-833).

[3] Parciak, M., Weytjens, S., Hens, N., Neven, F., Peeters, L. M., & Vansummeren, S. (2025). Measuring approximate functional dependencies: A comparative study. The VLDB Journal, 34(4), 1-27.

[4] Schirmer, P., Papenbrock, T., Kruse, S., Naumann, F., Hempfing, D., Mayer, T., & Neuschäfer-Rube, D. (2019). DynFD: Functional Dependency Discovery in Dynamic Datasets. In EDBT (pp. 253-264).