Autoencoder/Embedding vs PCA, tSNE & Co.
Unfortunately, our world has only three dimensions, and computer screens only two. To visualise high-dimensional data, we therefore have to reduce it first. This reduction comes at the cost of losing information, as all the data has to be compressed into a much smaller space. In the case of embeddings (e.g. word2vec) this is very helpful: Mikolov et al. used a very simple neural network to reduce a large vocabulary of words down to 50/100/300 dimensions and found that this lower-dimensional space has useful properties, for example that proximity between data points can be interpreted as semantic similarity. To visualise such a vocabulary, data scientists and researchers often use tSNE (van der Maaten & Hinton, 2008). While other dimensionality reduction approaches (e.g. PCA) may not preserve such relatedness properties well, tSNE minimises a cost function that keeps points which are clustered in the high-dimensional space close together in the low-dimensional space as well.
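For reference, this cost function is the Kullback-Leibler divergence between pairwise affinities in the high-dimensional space (Gaussian) and in the low-dimensional map (Student-t), as defined by van der Maaten & Hinton (2008):

```latex
% Gaussian affinities between the n high-dimensional points, symmetrised:
p_{j|i} = \frac{\exp\!\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}

% Student-t affinities (one degree of freedom) in the low-dimensional map:
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

% The cost to minimise: pairs with high p_{ij} must receive high q_{ij}.
C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

The heavy-tailed Student-t kernel in the map lets moderate high-dimensional distances become large distances in the map, which is what keeps distinct clusters from crowding together.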
In this master’s thesis, we want to compare the influence of different algorithms for the first stage of tSNE (i.e. the initial reduction of the raw data, commonly done with PCA, before the actual map is computed). Furthermore, we would like to transfer the tSNE cost function to some kind of auto-encoder, so that we can reduce all data in one go with a single (simple) model; a sketch of this idea follows below. This thesis can be very low-level and maths-heavy (developing a proper understanding of the mathematical model behind neural nets and deriving a cost function) or engineering-heavy (comparing many existing methods to find good combinations).
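To make the second idea concrete, here is a minimal sketch of an auto-encoder whose bottleneck is trained with the tSNE cost from above. It assumes PyTorch; the layer sizes, the single fixed sigma (instead of a per-point perplexity search), the reconstruction weight of 0.1, and the random stand-in data are all illustrative choices, not part of the proposal.

```python
import torch
import torch.nn as nn

def pairwise_sq_dists(z):
    # Squared Euclidean distances between all rows of z, shape (n, n).
    sq = (z ** 2).sum(dim=1)
    return sq[:, None] + sq[None, :] - 2.0 * z @ z.t()

def p_joint(x, sigma=1.0):
    # High-dimensional affinities p_ij: symmetrised Gaussian kernel.
    # Real tSNE tunes one sigma per point via perplexity; a single fixed
    # sigma keeps this sketch short.
    n = x.shape[0]
    logits = -pairwise_sq_dists(x) / (2.0 * sigma ** 2)
    logits.fill_diagonal_(float("-inf"))        # no self-similarity
    p_cond = torch.softmax(logits, dim=1)       # p_{j|i}
    return (p_cond + p_cond.t()) / (2.0 * n)    # symmetric joint p_{ij}

def q_joint(y):
    # Low-dimensional affinities q_ij: Student-t kernel, one degree of freedom.
    n = y.shape[0]
    num = 1.0 / (1.0 + pairwise_sq_dists(y))
    num = num * (1.0 - torch.eye(n, device=y.device))  # zero the diagonal
    return num / num.sum()

def kl_loss(p, q, eps=1e-12):
    # KL(P || Q), the tSNE cost from the equation above.
    return (p * torch.log((p + eps) / (q + eps))).sum()

class TsneAutoencoder(nn.Module):
    def __init__(self, dim_in=300, dim_latent=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(dim_in, 128), nn.ReLU(), nn.Linear(128, dim_latent))
        self.decoder = nn.Sequential(
            nn.Linear(dim_latent, 128), nn.ReLU(), nn.Linear(128, dim_in))

    def forward(self, x):
        y = self.encoder(x)
        return y, self.decoder(y)

model = TsneAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(256, 300)   # stand-in for a batch of word vectors
p = p_joint(x)              # affinities are constant w.r.t. the model
for step in range(500):
    y, x_hat = model(x)
    # tSNE cost on the bottleneck plus a (weighted) reconstruction term.
    loss = kl_loss(p, q_joint(y)) + 0.1 * ((x_hat - x) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# model.encoder now maps unseen vectors to 2D in a single forward pass.
```

Unlike plain tSNE, which optimises the map coordinates directly and has to be rerun for every new dataset, the trained encoder embeds unseen points with one forward pass, which is exactly the "reduce all data in one go" property described above.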