George Moroz (HSE), Olga Gich (FEFU), Anna Grishanova (HSE), Natalia Koshelyuk (HSE), Chiara Naccarato (HSE), Anna Panova (HSE), Anastasia Yakovleva (HSE), Svetlana Zemicheva (HSE)
The DiaL2 project: pipeline, results, news and future work
The Linguistic Convergence Laboratory hosts 24 dialectal and 8 bilingual corpora of Russian (see the resources page), and more are in preparation. The DiaL2 project was launched two years ago with the aim of studying the linguistic variation found in these corpora. We parsed the corpora morphologically and syntactically with UDPipe, manually annotated a set of linguistic features (sometimes re-listening to the recordings to check the transcriptions), and fitted a statistical model for each feature that predicts the probability of divergence from Standard Russian. In the talk we will discuss our results for several features:
- non-standard marking in numeral constructions (dva dom [two.M house.SG] ‘two houses’);
- preposition drop (rodilas' [v] tridcat' devjatom godu ‘(she) was born (in) nineteen thirty-nine’);
- non-standard marking in negative existential constructions (ranše sadiki ne byli ‘there were no kindergartens before’).
As candidate predictors in the models, we used sociolinguistic variables (gender, year of birth, years of education), measures of collocationality, and some relevant linguistic features. In the course of this work we found numerous typos and inconsistent or incorrect transcriptions, and corrected many of them. This prompted a parallel project dedicated to automatic correction of the Lab’s corpora, which will also be discussed in the talk.
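To make the modelling step concrete, here is a minimal sketch of how a per-feature model of this kind could look. It is not the project's actual implementation: the abstract does not say which model family or software was used, so plain logistic regression fitted by gradient descent stands in for it. The rows in `TOY_DATA` are invented purely for illustration, and the names `scale`, `fit_logistic`, and `predict_proba` are hypothetical. Only the three sociolinguistic predictors named above are included.

```python
import math

# Invented toy observations (NOT corpus data): one row per annotated token.
# Columns: female (0/1), year of birth, years of education,
# outcome (1 = non-standard realization, 0 = standard).
TOY_DATA = [
    (1, 1925, 4, 1), (1, 1930, 4, 1), (0, 1932, 7, 1), (1, 1938, 7, 0),
    (0, 1941, 8, 1), (1, 1947, 10, 0), (0, 1950, 10, 0), (1, 1955, 11, 0),
    (0, 1960, 15, 0), (1, 1936, 4, 1),
]


def scale(female, year_of_birth, years_of_education):
    """Roughly centre and rescale predictors so gradient descent is stable."""
    return [float(female), (year_of_birth - 1940) / 10.0,
            (years_of_education - 8) / 4.0]


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit weights (intercept first) by batch gradient descent on the log-loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            p = sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))
            err = p - y
            grad[0] += err
            for j, v in enumerate(x):
                grad[j + 1] += err * v
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights


def predict_proba(weights, x):
    """Predicted probability of a non-standard realization for one speaker."""
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))


ROWS = [scale(f, y, e) for f, y, e, _ in TOY_DATA]
LABELS = [label for _, _, _, label in TOY_DATA]

if __name__ == "__main__":
    w = fit_logistic(ROWS, LABELS)
    for speaker in [(1, 1930, 4), (1, 1955, 11)]:
        print(speaker, round(predict_proba(w, scale(*speaker)), 3))
```

In a real analysis one would more likely use an established statistics package, and possibly a mixed-effects model with speaker or corpus as a grouping factor; whether the project does so is not stated here.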