Hasso-Plattner-Institut
Prof. Dr. Tilmann Rabl
  
 

Publications

We try to keep an up-to-date list of all our publications. If you are interested in a PDF that we have not uploaded yet, feel free to email us for a copy. Recent publications are listed below. For older ones, please select the appropriate year.

Publications of the years 2020, 2019, 2018, 2017, 2016, 2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007

Performance Evaluation and Optimization of Multi-Dimensional Indexes in Hive

Liu, Yue; Guo, Shuai; Hu, Songlin; Rabl, Tilmann; Jacobsen, Hans-Arno; Li, Jintao; Wang, Jiye. In IEEE Trans. Services Computing, 2018.

Apache Hive has been widely used for big data processing over large-scale clusters by many companies. It provides a declarative query language called HiveQL. The efficiency of filtering out query-irrelevant data from HDFS closely affects the performance of query processing. This is especially true for multi-dimensional, highly selective queries that involve few columns, which provide sufficient information to reduce the number of bytes read. Indexing (Compact Index, Aggregate Index, Bitmap Index, DGFIndex, and the index in ORC file) and columnar storage (RCFile, ORC file, and Parquet) are powerful techniques to achieve this. However, it is not trivial to choose a suitable index and columnar storage format based on data and query features. In this paper, we compare the data filtering performance of the above indexes with different columnar storage formats by conducting comprehensive experiments using uniform and skewed TPC-H data sets and various multi-dimensional queries, and suggest best practices for improving multi-dimensional queries in Hive under different conditions.
Tags: ServicesComputing