Interview with Prof. Dr. Ziawasch Abedjan

From HPI to the world

Prof. Dr. Ziawasch Abedjan received his doctorate at the Hasso Plattner Institute in 2014, supervised by Prof. Dr. Felix Naumann. In his dissertation, he investigated how freely available and linkable data on the Internet, so-called Linked Open Data, can be put to better use. Today he heads the “BigDaMa” group at TU Berlin and recently became one of the four new GI Junior Fellows.

"Data cleaning is a tedious preprocessing step in the vast majority of data science applications."

Interview

Hello Prof. Dr. Abedjan, a lot has changed for you since you finished your PhD at the HPI Research School in 2014, and we would be happy to know more about your new environment, the development of your work, and what you are up to these days in general.

Can you give us an update on what has happened over the last few months? On 31 July 2019 you were announced as a GI Junior Fellow. How does it feel, and what are your aims?

Last month was quite busy. First, it was the end of the semester, so we had to prepare and conduct the final exams in our data science courses. In particular, the students of our new data science course for non-computer scientists took their final exam. This was quite tricky because the participants had very different backgrounds, from art history to electrical engineering. We had to make sure that the exam was fair while still conveying the relevant concepts of math and computer science. Furthermore, we presented our research at two different conferences, SIGMOD and SSDBM. Last but not least, I was invited for an interview with members of the GI, where I presented my concept for improving data science education at university level. I hope that with the help of the relevant working groups at the GI, we can come up with consistent guidelines for data science study programs at universities.

How do you like your work at the Technische Universität Berlin? What are you and “BigDaMa” working on these days?

As I mentioned, we just presented our recent work at two major database conferences, SIGMOD and SSDBM. Currently, we are working on approaches to reduce the human effort in data cleaning. Data cleaning is a tedious preprocessing step in the vast majority of data science applications. Our solutions make use of historical data and novel applications of machine learning techniques to reduce the user effort in this regard. Our approach “Raha” is able to outperform existing error detection techniques with only a handful of labeled data records. This is a significant improvement over existing work, where the user has to label some 1 to 5 percent of a dataset, which can turn into impractical numbers for large datasets.

What are the biggest challenges and opportunities of big data and data mining?

There are many open challenges. These include traditional ones such as the scalability of data analytics, the handling of fast data, and dealing with the heterogeneity of data. Other challenges, such as fairness and bias removal, are also becoming more prominent in the research community. Our group tackles the classical but still hard problem of data heterogeneity. So far, we have focused on data cleaning, which aims at identifying data inconsistencies and errors and correcting them. Most data scientists claim to spend 60-80% of their time on cleaning and transforming data, next to data discovery and extraction. Most data scientists also admit that this is the least enjoyable task in their pipeline.

Could you explain your concept behind “mining configurations”?

In my PhD thesis, mining configurations were defined in the context of mining open RDF data, where data is represented as RDF facts consisting of subject, predicate, and object. For example, a fact would be “Ziawasch Abedjan” (subject) “studiedAt” (predicate) “HPI” (object). Our concept of mining deals with co-occurrence analysis, and there are different ways to analyze a corpus of facts. One could analyze how often certain subjects share the same set of predicates, or the same set of objects, and the other way around. Each of these constellations was called a mining configuration.
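
To make the idea more concrete, here is a minimal illustrative sketch and not code from the thesis. It assumes facts are stored as plain (subject, predicate, object) tuples. The function name `pair_cooccurrence`, its parameters, and all facts involving “Alice” and “Bob” are hypothetical. One choice of grouping position and counted position stands in for one configuration.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Example facts as (subject, predicate, object) triples.
# The "Alice" and "Bob" facts are made up for illustration.
FACTS = [
    ("Ziawasch Abedjan", "studiedAt", "HPI"),
    ("Ziawasch Abedjan", "worksAt", "TU Berlin"),
    ("Alice", "studiedAt", "HPI"),
    ("Alice", "worksAt", "TU Berlin"),
    ("Bob", "studiedAt", "HPI"),
]

# Index of each part inside a triple.
POSITION = {"subject": 0, "predicate": 1, "object": 2}


def pair_cooccurrence(facts, group_by, count):
    """Group facts by one triple position and count how often pairs of
    values from another position occur together in the same group."""
    groups = defaultdict(set)
    for fact in facts:
        groups[fact[POSITION[group_by]]].add(fact[POSITION[count]])

    pairs = Counter()
    for values in groups.values():
        # Sort so that each pair is always recorded in the same order.
        for pair in combinations(sorted(values), 2):
            pairs[pair] += 1
    return pairs


# Configuration 1: which predicates are shared by the same subjects?
predicates_per_subject = pair_cooccurrence(FACTS, "subject", "predicate")

# Configuration 2: which subjects share the same objects?
subjects_per_object = pair_cooccurrence(FACTS, "object", "subject")
```

Switching the two position arguments is all it takes to move from one constellation to another, which is the sense in which the same co-occurrence analysis can be configured in different ways.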

Which opportunities do we gain by filtering and using big data?

The opportunities are manifold. Data-driven solutions have well-known applications in medicine, text analysis and translation, process optimization, and knowledge extraction.

Is the analysis of big data necessary to create a functioning A.I.?

It depends on your definition of what a functioning A.I. is. What I can say is that big data and machine learning play an integral role in current A.I. research.

What is your personal approach to sharing your personal data? Do you “contribute” to the big pool?

Personal data should always be protected to avoid misuse and identity theft. We only contribute to the large pool by publishing our research results.

What are your visions for your future research?

We will certainly work towards improving the data science pipeline by reducing the human effort in the preparation phase. In addition to automated cleaning and transformation techniques, this requires easy-to-use discovery modules that are able to find and extract datasets that are interesting to the scientist.

Thank you very much for your time. It has been a pleasure talking to you.

We are looking forward to staying in contact with you and are curious to follow your work and your upcoming discoveries.