Mapping and Understanding the Evolution of Language

Natural language is constantly evolving. Not only do the words we use change over time, but so do their meanings and the contexts in which we use them. A single word can also have several senses: “apple”, for example, can refer to a fruit or to a company. In Natural Language Processing (NLP), there are already models that analyze evolving language and models that automatically identify words with multiple senses. However, doing both at the same time, or successfully using such models in applications for automated text processing, is only now moving into the focus of research.

A mockup of our vision for the language model visualization: a timeline in which similar words are close to one another and lines split or merge as multiple senses arise.

Master's Project Winter 2019/20

Team: Jan Ehmüller, Lasse Kohlmeyer, Holly McKee, Daniel Paeschke

As language evolves, a word can gain new senses. For instance, the word “cloud” was once only used in the weather forecast section of newspapers. Nowadays, it is increasingly used in the context of “Cloud Computing”. In this master’s project, we develop a novel approach that represents the usage of words over time as a graph. We address the following research question: How can we identify words with new senses? To answer it, we rank words by the likelihood that they gained a new sense over the last 200 years. We applied our approach to the “Corpus of Historical American English” (COHA).

To learn more, watch our 13-minute presentation video.

Data processing pipeline with raw text input, tagging, term selection, graph building, graph filtering, sense clustering and matching across time slices and scoring
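
The graph building, graph filtering and sense clustering stages named in the pipeline above can be illustrated with a minimal sketch. It is not the project's implementation: the function names, the `min_count` filter and the use of connected components as the clustering step are assumptions chosen for brevity, the tagging and term selection stages are skipped, and the sentences are made-up toy data rather than COHA text.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(sentences, min_count=1):
    """Build an undirected co-occurrence graph: words are nodes, and an edge
    links two words that appear together in at least `min_count` sentences."""
    counts = defaultdict(int)
    for tokens in sentences:
        for a, b in combinations(sorted(set(tokens)), 2):
            counts[(a, b)] += 1
    graph = defaultdict(set)
    for (a, b), n in counts.items():
        if n >= min_count:  # graph filtering: drop rare co-occurrences
            graph[a].add(b)
            graph[b].add(a)
    return graph


def ego_clusters(graph, ego):
    """Cluster the neighbours of `ego` by taking connected components of the
    ego network after the ego node itself has been removed (an assumed,
    deliberately simple stand-in for a real clustering algorithm)."""
    neighbours = set(graph.get(ego, ()))
    seen, clusters = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            # Only follow edges that stay inside the ego network.
            stack.extend((graph[node] & neighbours) - component)
        seen |= component
        clusters.append(frozenset(component))
    return clusters


# Toy, made-up sentences for a single time slice (not COHA data).
slice_sentences = [
    ["cloud", "rain", "sky"],
    ["cloud", "sky", "storm"],
    ["cloud", "server", "storage"],
    ["cloud", "storage", "computing"],
]
graph = build_cooccurrence_graph(slice_sentences)
clusters = ego_clusters(graph, "cloud")
```

In this toy slice, the neighbours of "cloud" fall into two groups, one around weather terms and one around computing terms, which is the kind of separation a sense cluster is meant to capture.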

Example sense tree over 170 years for the word "MOUSE". It shows the matched clusters of the ego networks across three decades.
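
Matching clusters across time slices, as in the sense tree above, can be sketched as follows. This is a hedged illustration only: the Jaccard overlap measure, the `threshold` value and the function names are assumptions, not the matching or scoring method used in the project.

```python
def jaccard(a, b):
    """Jaccard similarity of two word sets: |intersection| / |union|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_clusters(previous, current, threshold=0.2):
    """Pair each cluster of the current time slice with its best-overlapping
    cluster from the previous slice. A cluster whose best overlap is below
    `threshold` is paired with None and is a candidate new sense."""
    matches = []
    for cluster in current:
        best = max(previous, key=lambda p: jaccard(cluster, p), default=None)
        if best is not None and jaccard(cluster, best) >= threshold:
            matches.append((cluster, best))
        else:
            matches.append((cluster, None))
    return matches


# Toy, made-up clusters for the ego network of one word in two time slices.
earlier = [{"rain", "sky", "storm"}]
later = [{"rain", "sky", "thunder"}, {"server", "storage", "computing"}]

matches = match_clusters(earlier, later)
new_senses = [cluster for cluster, matched in matches if matched is None]
```

Here the weather cluster of the later slice is matched to its predecessor, while the computing cluster has no predecessor and would start a new branch in the tree.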

Actual sense tree for the word "acid" in COHA. Yellow nodes are decades of the corpus, subtrees branching off them are new senses. Branches within blue subtrees can be considered as "sub-senses".

This project is a great foundation for future work towards the vision presented above. The analysis of the visualised sense graphs already provides valuable insights into the usage and context of terms over time. In collaboration with the Theodor-Fontane-Archiv and the Digital Humanities Network, we are working on applying our model to other historical corpora.

Data

As stated in our LWDA 2020 paper, we conducted a survey in which linguists annotated a list of words. They were asked to state whether a word has gained an additional sense since 1800. If you use the data, please cite our work. Feel free to contact us with any questions you might have.

Survey as PDF
Spreadsheet of results [full xlsx, selective tsv]
(the first few rows are by us, the last rows are by linguists)
Some statistics (script)

Contact

This work is part of the Mímir project. If you have any questions, please contact Tim Repke.

References

Schwanhold, R., Repke, T., Krestel, R. (2021) ‘Modeling the Evolution of Word Senses with Force-Directed Layouts of Co-occurrence Networks’, in Proceedings of the 2nd International Workshop on Computational Approaches to Historical Language Change (LChange@ACL 2021), 1–6.


Ehmüller, J., Kohlmeyer, L., McKee, H., Paeschke, D., Repke, T., Krestel, R., Naumann, F. (2020) ‘Sense Tree: Discovery of New Word Senses with Graph-based Scoring’, in Proceedings of the Conference on "Lernen, Wissen, Daten, Analysen" (LWDA), 1–12.


