Ilia Afanasev (HSE University/MTS AI) A new method for genetic language distance measurement between closely related lects
Measuring the distance between language varieties (lects) generally relies on extensive linguistic research, including the collection of wordlists and of information on the evolution of the phonetic system (Campbell, 2013). However, gathering this kind of data is sometimes impossible for lack of material: all that researchers are left with is a small sample of surviving texts. This is most often the case for small historical territorial varieties. Such scarcity rules out a reliable automatic classification, but a preliminary one remains possible.
The talk proposes a new method for measuring language distance between small, closely related historical lects. The method combines frequency-based methods with string similarity measures and introduces a corpus-based string similarity measure intended to imitate more advanced phonetic-based scores. The evaluation material consists of modern and historical Slavic lects: the Slovak, Slovenian and Croatian standards; the Belogornoje, Megra and Zialionka dialects; and legal texts from Novgorod, Smolensk and Polack of the XII–XIV centuries. The key evaluation technique is cross-evaluation against more traditional dialectometry methods, where this is possible. A Python implementation of the methods is available as a Python package.
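To make the general idea of combining a frequency-based measure with a string similarity measure concrete, here is a minimal Python sketch. It is not the method or the package presented in the talk: the choice of character trigram profiles with cosine distance, plain normalized Levenshtein distance over the most frequent words, and the equal weighting are all assumptions made for illustration, and every function name (`char_ngrams`, `lexical_distance`, `lect_distance`, and so on) is hypothetical.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two sparse count vectors (1.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def levenshtein(s, t):
    """Classic edit distance with unit costs, computed row by row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cs != ct)))  # substitution or match
        prev = cur
    return prev[-1]


def normalized_edit_distance(s, t):
    """Edit distance divided by the length of the longer string, in [0, 1]."""
    longest = max(len(s), len(t))
    return levenshtein(s, t) / longest if longest else 0.0


def top_words(text, top_k):
    """Return the top_k most frequent whitespace-separated tokens."""
    return [w for w, _ in Counter(text.lower().split()).most_common(top_k)]


def lexical_distance(text_a, text_b, top_k=50):
    """Symmetrized mean distance from each frequent word to its nearest counterpart."""
    words_a, words_b = top_words(text_a, top_k), top_words(text_b, top_k)
    if not words_a or not words_b:
        return 1.0

    def one_way(xs, ys):
        return sum(min(normalized_edit_distance(x, y) for y in ys) for x in xs) / len(xs)

    return (one_way(words_a, words_b) + one_way(words_b, words_a)) / 2


def lect_distance(text_a, text_b, weight=0.5):
    """Weighted mix of the frequency-based and string-based components (assumed weighting)."""
    freq = cosine_distance(char_ngrams(text_a), char_ngrams(text_b))
    lex = lexical_distance(text_a, text_b)
    return weight * freq + (1 - weight) * lex
```

Calling `lect_distance(text_a, text_b)` on two text samples yields a score between 0 (identical samples) and 1. Nearest-neighbour word matching is used here because small surviving text samples rarely come with aligned wordlists; a phonetically informed measure would replace the unit substitution cost in `levenshtein` with sound-class-sensitive costs.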