04.07.2018 - Arvind Kumar Shekar (external)
Multivariate Correlation Analysis for Supervised Feature Selection in High-Dimensional Data
Many application domains now collect a large number of process variables, also known as features. The resulting high-dimensional feature space is commonly used for analytical tasks such as regression and classification. However, it is necessary to select or extract from this space the information that is relevant to a defined analytical task. Multivariate correlation analysis is therefore of paramount importance for both feature selection and feature extraction.
This dissertation focuses on multivariate correlation analysis across different data types. We identify, analyze and fill research gaps in this topic. To this end, we develop multiple novel techniques that address the correlation of the features to a target, i.e., relevance, and the correlation between the features, i.e., redundancy, on continuous, categorical and time series data.
Multiple views of the feature space exhibit different interactions between features and the target. Harnessing these interactions when selecting relevant subsets may enrich the prediction model with novel information. Nevertheless, several existing feature selection algorithms focus on obtaining a single projection of the features and cannot exploit the multiple local interactions from different feature subsets. In such datasets, a feature may have only a small correlation with the target on its own, yet be strongly correlated with the target when combined with other features. Hence, it is necessary to evaluate the relevance of a feature based on its higher-order interactions in the dataset. Because they compute only pairwise correlations, several existing works fail to address higher-order interactions among more than two features. In addition to high dimensionality, time series datasets demand the extraction and evaluation of a large number of subsequences for feature extraction. This requires an efficient framework that simultaneously extracts relevant and novel multivariate subsequences and transforms them into features. However, traditional feature transformation approaches are often unsupervised or require additional post-processing techniques.
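The point about features that are weak individually but strong in combination can be illustrated with a small synthetic example. The sketch below is not one of the dissertation's algorithms; it uses a hand-built XOR dataset and plain Pearson correlation, and all names (`pearson`, `f1`, `f2`, `target`, `joint`) are chosen here purely for illustration.

```python
import itertools
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Illustrative dataset: every combination of two binary features, repeated.
rows = list(itertools.product([0, 1], repeat=2)) * 25
f1 = [a for a, _ in rows]
f2 = [b for _, b in rows]
target = [a ^ b for a, b in rows]  # the target is the XOR of both features

# Pairwise view: each feature on its own is uncorrelated with the target.
r1 = pearson(f1, target)
r2 = pearson(f2, target)

# Joint view: a hand-picked interaction of both features determines the target.
joint = [a ^ b for a, b in zip(f1, f2)]
r_joint = pearson(joint, target)

print(f"corr(f1, target)    = {r1:.2f}")
print(f"corr(f2, target)    = {r2:.2f}")
print(f"corr(joint, target) = {r_joint:.2f}")
```

A selection method that ranks features only by their individual correlation with the target would discard both features here, although together they predict the target perfectly. In practice the interaction is not known in advance, which is why relevance has to be evaluated over feature subsets.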
Addressing the aforementioned problems requires novel frameworks that perform a large number of statistical computations, which hinders the user's understanding of complex multivariate correlations. Consequently, the final problem we address is enhancing the transparency of multivariate correlation analysis. Hence, in addition to the algorithmic contributions, we present a novel software framework that aims to improve the user's understanding of multivariate correlations in a dataset.
First, we present the diverse subset selection strategy (DS3), an algorithm that identifies diverse and complementary views of the dataset. We extend the concept of multiple views to our relevance and redundancy (RaR) ranking framework for mixed datasets that exhibit higher-order interactions. By evaluating the co-occurrence of patterns in multiple dimensions, our ordinal feature extraction (ordex) algorithm evaluates higher-order interactions in time series applications. Finally, we provide FEXUM, a software framework for exploring and understanding multivariate correlations, which helps users understand and evaluate the multivariate correlations in the data.
In addition, this dissertation includes an extensive experimental and theoretical evaluation of the quality and scalability of our approaches with respect to existing works. Apart from a theoretical time complexity analysis, our evaluation is two-fold, i.e., we evaluate the proposed algorithms on both synthetic and real-world data. Overall, our findings show that the proposed contributions improve prediction accuracy and efficiency in comparison to several traditional approaches.