Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
  
 

Mapping and Understanding the Evolution of Language

Natural language is constantly evolving. Not only do the words we use change over time, but so do their meanings and the contexts in which we use them. A single word can also have several senses: “apple”, for example, can refer to a fruit or a company. In Natural Language Processing (NLP) research, there are already models that analyze evolving language and models that automatically identify words with multiple senses. However, doing both at the same time, or successfully using such models in applications for automated text processing, is only now moving into the focus of current research.

A vision mockup of our language model visualization: a timeline in which similar words are close to one another and lines split or merge as multiple senses arise.

Master's Project WS 2019/20

In this master’s project, we will develop a novel approach that represents the usage of words over time as a graph. Carefully crafted restrictions on the two-dimensional layout of that graph will give us fine-grained control for understanding the resulting language models. Although this is mainly a computational problem, we will focus on visualizing subsets of the model to explore the feasibility and quality of such an approach.

The visualization will show the graph structure of selected words and their context as used in the underlying text corpus. In a large timeline, where each line represents a word, it will be possible to see how words appear, split into different meanings, or disappear again. The closeness of lines in the resulting chart conveys semantic similarity. The figure above shows a sketch of the interactive visualization that we aim to develop. It will help us investigate the feasibility and quality of this approach and provide a better means of qualitatively evaluating dynamic language models.
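As a small illustration of what "semantic similarity" can mean here, the following sketch ranks words by cosine similarity, a common measure for comparing word vectors. The project description does not name a specific similarity measure, so this choice is an assumption; the three-dimensional vectors and the helper names are made up for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors; real word embeddings have far more dimensions.
toy_vectors = {
    "apple": [0.9, 0.1, 0.3],
    "pear": [0.8, 0.2, 0.1],
    "laptop": [0.1, 0.9, 0.4],
}


def most_similar(word, vectors):
    """Rank all other words by their cosine similarity to `word`."""
    scored = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = most_similar("apple", toy_vectors)
```

In a chart built on such scores, the line for "pear" would run closer to "apple" than the line for "laptop" does.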

Students will train word embeddings on documents from different time slices. In the visualization, each time slice is a vertical axis on which words are placed based on their similarity in the embedding. After connecting the words across time slices, we use the weighted edges as force constraints. An iterative algorithm will then try to optimize the network layout by reordering nodes (i.e., words on the axes) or splitting them. For the visualization, we only draw selected words and their neighbors, and use additional metadata to make the timeline more informative, for example, topic zones, frequencies as sizes, or color for highlights.
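The reordering step could be sketched roughly as follows. This is not the project's algorithm: it swaps in the barycenter heuristic from layered graph drawing as a simple stand-in for the force-based optimization, and it leaves out node splitting entirely. The function names, the edge format, and the toy data are all assumptions made for illustration.

```python
def barycenter(word, t, prev_rank, edges, fallback):
    """Weighted mean rank of the neighbours of `word` in the previous slice."""
    total = 0.0
    weighted = 0.0
    for prev_word, rank in prev_rank.items():
        weight = edges.get((t - 1, prev_word, word), 0.0)
        total += weight
        weighted += weight * rank
    # Words without any edge to the previous slice keep their current rank.
    return weighted / total if total > 0 else fallback


def barycenter_sweep(slices, edges):
    """Reorder the words on each axis to follow the previous axis.

    slices: list of word lists, one per time slice (initial axis order).
    edges:  dict mapping (t, word_in_slice_t, word_in_slice_t_plus_1)
            to a positive weight (hypothetical format).
    The first slice is kept fixed as the reference ordering.
    """
    ordered = [list(slices[0])]
    for t in range(1, len(slices)):
        prev_rank = {word: i for i, word in enumerate(ordered[t - 1])}
        keyed = [
            (barycenter(word, t, prev_rank, edges, float(i)), i, word)
            for i, word in enumerate(slices[t])
        ]
        ordered.append([word for _, _, word in sorted(keyed)])
    return ordered


# Made-up example: two time slices with the same three words.
toy_slices = [
    ["apple", "pear", "laptop"],
    ["laptop", "apple", "pear"],
]
toy_edges = {
    (0, "apple", "apple"): 1.0,
    (0, "pear", "pear"): 1.0,
    (0, "laptop", "laptop"): 1.0,
    (0, "pear", "apple"): 0.5,
}

layout = barycenter_sweep(toy_slices, toy_edges)
```

A single left-to-right sweep already untangles the toy example, since the second axis ends up in the same order as the first. A force-based layout as described above would additionally move nodes continuously along each axis and repeat until the positions stabilize.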

Contact

This project is part of the Mímir project. Students will be supervised closely by Prof. Naumann and Tim Repke.

Related Work