Mapping and Understanding the Evolution of Language

Natural language is constantly evolving. Not only do the words we use change over time, but so do their meanings and the contexts in which we use them. A single word can even mean different things: for example, “apple” can refer to a fruit or to a company. In the research area of Natural Language Processing (NLP), there are already models that try to analyze evolving language or that automatically identify words with multiple senses. However, doing both at the same time, or successfully using such models in applications for automated text processing, is only now moving into the focus of current research.

A vision mockup of our language model visualization: a timeline in which similar words are close to one another and lines split or merge as multiple senses arise.

Master's Project Winter 2019/20

Team: Jan Ehmüller, Lasse Kohlmeyer, Holly McKee, Daniel Paeschke

As language evolves, a word can gain new senses. For instance, the word “cloud” was once only used in the newspaper section of weather forecasts. Nowadays, it is increasingly used in the context of “Cloud Computing”. In this master’s project, we develop a novel approach that represents the usage of words over time as a graph. We address the following research question: How can we identify words with new senses? Our approach ranks words by the likelihood that they gained a new sense over the last 200 years. We applied it to the “Corpus of Historical American English” (COHA).
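To make the idea of ranking words by sense change more concrete, here is a minimal Python sketch. It is not the graph-based method developed in the project. Instead, it swaps in a much simpler context-overlap score: for each word, it measures what share of its context occurrences in a later period involve neighbouring words that were never seen next to it in an earlier period. The function names, the window size, and the toy sentences are all assumptions made up for illustration; they are not taken from COHA or from our code.

```python
from collections import Counter, defaultdict


def context_counts(sentences, window=2):
    """Count, for each word, the words occurring within `window` tokens of it."""
    contexts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    contexts[word][tokens[j]] += 1
    return contexts


def novelty_score(early, late, word):
    """Share of late-period context occurrences whose neighbour is new for this word."""
    late_ctx = late.get(word, Counter())
    total = sum(late_ctx.values())
    if total == 0:
        return 0.0
    early_ctx = early.get(word, Counter())
    unseen = sum(count for neighbour, count in late_ctx.items()
                 if neighbour not in early_ctx)
    return unseen / total


def rank_candidates(early_sentences, late_sentences, window=2):
    """Rank words present in both periods by descending novelty score."""
    early = context_counts(early_sentences, window)
    late = context_counts(late_sentences, window)
    shared = set(early) & set(late)
    scores = {w: novelty_score(early, late, w) for w in shared}
    # Sort by score (highest first), breaking ties alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy corpora standing in for an early and a late time slice.
early_period = [
    "dark cloud in the sky",
    "rain from the cloud",
    "the sky is dark",
]
late_period = [
    "dark cloud in the sky",
    "store data in the cloud",
    "cloud computing stores data",
]

ranking = rank_candidates(early_period, late_period)
for word, score in ranking:
    print(f"{word}: {score:.2f}")
```

On this toy input, “cloud” receives the highest score because it appears next to “computing” and “stores” only in the later period, while “sky” keeps the same neighbours and scores zero. A real analysis would need far more data, frequency thresholds, and a way to separate a genuinely new sense from ordinary topical drift.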

To learn more, watch our 13-minute presentation video.

This project is a great foundation for future work towards the vision presented above. The analysis of the visualised sense graphs already provides valuable insights into the usage and context of terms over time. In collaboration with the Theodor-Fontane-Archiv and the Digital Humanities Network, we are working on applying our model to other historical corpora.

Data

As stated in our LWDA 2020 paper, we conducted a survey in which linguists annotated a list of words. They were asked to state whether or not a word has gained an additional sense since 1800. If you use the data, please cite our work, and feel free to contact us with any questions you might have.

Survey as PDF
Spreadsheet of results [full xlsx, selective tsv]
(first few rows are by us, last rows are by linguists)
Some Statistics (script)

Contact

This is part of the Mímir project. If you have any questions, please contact Tim Repke.