Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
  
 

Mapping and Understanding the Evolution of Language

Natural language is constantly evolving. Not only do the words we use change over time, but so do their meanings and the contexts in which we use them. A single word can even have several senses: “apple”, for example, is both a fruit and a company. In the research area of Natural Language Processing (NLP), there are already models that analyze evolving language and models that automatically identify words with multiple senses. However, doing both at the same time, or successfully using such models in applications for automated text processing, is only now moving into the focus of current research.

A mockup of our envisioned language model visualization: a timeline in which similar words are close to one another and lines split or merge as multiple senses arise.

Master's Project Winter 2019/20

Team: Jan Ehmüller, Lasse Kohlmeyer, Holly McKee, Daniel Paeschke

As language evolves, a word can gain new senses. For instance, the word “cloud” was once only used in the weather forecast section of newspapers. Nowadays, it is increasingly used in the context of “Cloud Computing”. In this master’s project, we develop a novel approach that represents the usage of words over time as a graph. We address the following research question: How can we identify words with new senses? Our approach ranks words by how likely they are to have gained a new sense over the last 200 years. We applied it to the “Corpus of Historical American English” (COHA).
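To make the idea of representing word usage over time as a graph more concrete, here is a minimal sketch that builds one co-occurrence graph per decade. It is an illustration under our own assumptions, not the project's implementation: the toy sentences, the decade-sized time slices, sentence-level co-occurrence, and the helper names `decade_of`, `build_cooccurrence_graphs` and `neighbours` are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Toy corpus of (year, tokenized sentence). Illustrative only, not taken from COHA.
SENTENCES = [
    (1905, ["dark", "cloud", "rain", "sky"]),
    (1908, ["cloud", "storm", "rain"]),
    (2005, ["cloud", "server", "storage"]),
    (2008, ["cloud", "rain", "sky"]),
    (2009, ["cloud", "server", "computing"]),
]

def decade_of(year):
    """Map a year to the first year of its decade (assumed slice size)."""
    return year - year % 10

def build_cooccurrence_graphs(sentences):
    """Return {decade: Counter({(word_a, word_b): count})}.

    Each key is an alphabetically sorted word pair that appeared
    in the same sentence; the count acts as the edge weight.
    """
    graphs = defaultdict(Counter)
    for year, tokens in sentences:
        for a, b in combinations(sorted(set(tokens)), 2):
            graphs[decade_of(year)][(a, b)] += 1
    return dict(graphs)

def neighbours(graph, word):
    """Words that co-occur with `word` within one time slice."""
    return {b if a == word else a for (a, b) in graph if word in (a, b)}

graphs = build_cooccurrence_graphs(SENTENCES)
print(sorted(neighbours(graphs[1900], "cloud")))
# ['dark', 'rain', 'sky', 'storm']
print(sorted(neighbours(graphs[2000], "cloud")))
# ['computing', 'rain', 'server', 'sky', 'storage']
```

In this toy data, the context words of “cloud” that appear only in the later slice (“computing”, “server”, “storage”) hint at the kind of signal a ranking of new senses could build on.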

To learn more, watch our 13-minute presentation video.

Data processing pipeline: raw text input, tagging, term selection, graph building, graph filtering, sense clustering, matching across time slices, and scoring.
Example sense tree over 170 years for the word "MOUSE". It shows the matched clusters of the ego networks across three decades.
Actual sense tree for the word "acid" in COHA. Yellow nodes are decades of the corpus, subtrees branching off them are new senses. Branches within blue subtrees can be considered as "sub-senses".
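The clustering and matching steps named in the captions above can be sketched roughly as follows. This is a simplified illustration, not the method from the paper: we assume that senses are approximated by connected components of a word's ego network, that clusters are matched across time slices by Jaccard similarity, and that a threshold of 0.3 separates matched from unmatched clusters. The edge lists and the function names `ego_clusters`, `jaccard` and `new_sense_candidates` are hypothetical.

```python
def ego_clusters(edges, ego):
    """Cluster the ego network of `ego` into connected components.

    The ego network consists of the neighbours of `ego` and the edges
    among them; the ego node itself is removed before clustering.
    """
    nbrs = {b if a == ego else a for a, b in edges if ego in (a, b)}
    adj = {n: set() for n in nbrs}
    for a, b in edges:
        if a in nbrs and b in nbrs:
            adj[a].add(b)
            adj[b].add(a)
    clusters, seen = [], set()
    for start in sorted(nbrs):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adj[node] - component)
        seen |= component
        clusters.append(frozenset(component))
    return clusters

def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)

def new_sense_candidates(old_clusters, new_clusters, threshold=0.3):
    """Clusters of the later slice that match no cluster of the earlier slice."""
    return [c for c in new_clusters
            if all(jaccard(c, o) < threshold for o in old_clusters)]

# Hypothetical co-occurrence edges for two time slices.
EDGES_EARLY = [("mouse", "cat"), ("mouse", "rat"), ("cat", "rat"),
               ("mouse", "trap"), ("rat", "trap")]
EDGES_LATE = EDGES_EARLY + [("mouse", "keyboard"), ("mouse", "click"),
                            ("keyboard", "click")]

early = ego_clusters(EDGES_EARLY, "mouse")
late = ego_clusters(EDGES_LATE, "mouse")
candidates = new_sense_candidates(early, late)
print([sorted(c) for c in candidates])
# [['click', 'keyboard']]
```

In a sense tree, a matched cluster would continue an existing branch, while an unmatched cluster such as the one printed here would start a new one.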

This project is a great foundation for future work towards the vision presented above. The analysis of the visualised sense graphs already provides valuable insights into the usage and context of terms over time. In collaboration with the Theodor-Fontane-Archiv and the Digital Humanities Network, we are working on applying our model to other historical corpora.

Data

As stated in our LWDA 2020 paper, we conducted a survey in which linguists annotated a list of words. They were asked to state whether or not a word has gained an additional sense since 1800. If you use the data, please cite our work, and contact us with any questions you might have.

Contact

This project is part of the Mímir project. If you have any questions, please contact Tim Repke.

References

  • Ehmüller, J., Kohlmeyer, L., McKee, H., Paeschke, D., Repke, T., Krestel, R., Naumann, F. (2020). “Sense Tree: Discovery of New Word Senses with Graph-based Scoring”, in Proceedings of the Conference on "Lernen, Wissen, Daten, Analysen" (LWDA), 1–12.