Novels are among the longest, and thus most complex, types of text. Many NLP tasks rely on document embeddings as machine-understandable semantic representations of documents. However, such document embeddings are optimized for short texts, such as sentences or paragraphs. When faced with longer texts, these models either truncate the text or split it sequentially into smaller chunks. We show that, when applied to a fictional novel, these traditional document embeddings fail to capture all of its facets. Complex information, such as time, place, atmosphere, style, and plot, is typically not represented adequately.
To address this, we propose lib2vec, which computes and combines multiple embedding vectors based on various facets. Instead of splitting the text sequentially, lib2vec splits it semantically based on domain-specific facets. We evaluate the semantic expressiveness using human-assessed book comparisons as well as content-based information retrieval tasks. The results show that our approach outperforms state-of-the-art document embeddings for long texts.
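The core idea of facet-based splitting and combination can be illustrated with a minimal sketch. Everything below is assumed for illustration only: the keyword-based `facet_of` classifier stands in for whatever domain-specific facet model the paper uses, and `embed` is a hash-based stub standing in for a real sentence-embedding model; `lib2vec_like` is a hypothetical name, not the authors' implementation.

```python
import hashlib
import math

FACETS = ["time", "place", "atmosphere", "style", "plot"]  # facets named in the abstract

def embed(text, dim=16):
    # Stub for a real sentence-embedding model (assumption): derives a
    # deterministic pseudo-embedding from a hash, then L2-normalizes it.
    h = hashlib.sha256(text.encode("utf-8")).digest()
    v = [b / 255.0 for b in h[:dim]]
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def facet_of(sentence):
    # Hypothetical facet classifier: a keyword heuristic stands in for
    # the domain-specific facet model described in the abstract.
    keywords = {
        "time": ["year", "morning", "century"],
        "place": ["city", "village", "room"],
        "atmosphere": ["dark", "gloomy", "cheerful"],
    }
    s = sentence.lower()
    for facet, words in keywords.items():
        if any(w in s for w in words):
            return facet
    return "plot"  # default facet for unmatched sentences

def lib2vec_like(sentences, dim=16):
    # Split semantically: group sentences by facet rather than by position.
    buckets = {f: [] for f in FACETS}
    for s in sentences:
        buckets[facet_of(s)].append(s)
    # Combine: embed each facet's text and concatenate one vector per facet.
    doc_vec = []
    for f in FACETS:
        text = " ".join(buckets[f]) or f  # placeholder text if a facet is empty
        doc_vec.extend(embed(text, dim))
    return doc_vec

vec = lib2vec_like(["The city was dark.", "In the year 1843.", "He ran."])
print(len(vec))  # 5 facets x 16 dims = 80
```

Concatenation keeps the facets separable in the final representation, so a downstream retrieval task can weight, for example, the plot dimensions differently from the style dimensions.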