Machine learning methods can help acquire this metadata automatically, without manual
effort. One such method is language identification. It is useful both on its own, e.g.,
for filtering documents by language, and as a preprocessing step for other tasks,
such as optical character recognition (OCR).
Many state-of-the-art machine learning methods rely on large amounts of labeled training
data. However, although our project partner, the Wildenstein Plattner Institute (WPI), has
provided us with a large dataset of historical documents, no labels are available for it.
This project therefore aims to solve the language identification task without labels.