Profiling with Column Stores
Real-life datasets often lack adequate structural information, either because it was lost or because it was never defined for lack of knowledge about the data. Such information includes constraints like primary and foreign keys, data types, and more, and it is often essential for successful data migration or data fusion. Life science datasets in particular tend to lack this schema information, especially when they are provided as raw CSV dumps. At the same time, such data contains additional valuable information that is hard to find. Consider a database of chemical compounds: dependencies may exist between several attributes of the dataset that could give domain scientists hints about the causes of cancer, among other things.
Good knowledge of the data is therefore the key to many data-centric tasks, and data profiling is the discipline that addresses this need.
ProCSIA is a joint project between IBM's research lab in Böblingen and the HPI. The goal is to evaluate how much column store technology can improve the performance of IBM's Information Analyzer (IA). The Information Analyzer is part of IBM's InfoSphere product family and a reliable tool for data profiling tasks. In its current setting, the IA works primarily with classical row-oriented DBMSs, both as the data source for profiling tasks and for internal data storage. IBM kindly provides the essential software as well as support, including its own research knowledge.
As mentioned above, the main intention is to evaluate column store technology for improving profiling tasks. Column stores have emerged in recent years and differ from conventional relational database systems mainly in how they physically store relations on disk. Instead of storing tuples row by row, they store the values of each column together, which brings useful side effects such as much better data compression. Typical data profiling tasks rely on queries that aggregate and join columns rather than on whole-tuple assembly and insert operations, which are inherently expensive on column stores, so the technology seems well suited to these tasks. In addition, the general improvement and extension of the Information Analyzer's functionality is another goal of the bachelor project. To that end, recent research results are studied, implemented, and tested, also with regard to how column store technology can improve them.
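The difference between the two layouts can be illustrated with a small sketch. It is not taken from ProCSIA or the Information Analyzer; the sample relation and the helper names (to_columns, run_length_encode, profile_column) are hypothetical and chosen only for illustration. The sketch pivots row-oriented tuples into per-column value lists, shows why a sorted column compresses well with a simple run-length encoding, and computes a few typical profiling statistics from one column without touching the others.

```python
from collections import Counter

# Hypothetical sample relation in row-oriented form (one dict per tuple).
rows = [
    {"id": 1, "compound": "benzene", "formula": "C6H6"},
    {"id": 2, "compound": "toluene", "formula": "C7H8"},
    {"id": 3, "compound": "benzene", "formula": "C6H6"},
    {"id": 4, "compound": "phenol", "formula": None},
]


def to_columns(rows):
    """Pivot row-oriented tuples into one value list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Compress runs of equal adjacent values into [value, count] pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


def profile_column(values):
    """Compute basic profiling statistics from a single column."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(counts),
        # A column is a key candidate if it has no nulls and no duplicates.
        "is_unique": len(counts) == len(values),
    }


columns = to_columns(rows)
id_profile = profile_column(columns["id"])
compound_profile = profile_column(columns["compound"])
formula_profile = profile_column(columns["formula"])
compound_runs = run_length_encode(sorted(columns["compound"]))
```

In this toy relation, only the id column qualifies as a key candidate, and the sorted compound column collapses into three runs. A real column store applies far more elaborate storage and compression schemes; the point here is only that each statistic reads exactly one column.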