Research Projects

Spoken corpora of nonstandard varieties of Russian and other languages

The laboratory creates spoken corpora of dialects and regional varieties of Russian and other languages. Each corpus contains audio files and a transcription in standardized orthography. Using the search function, you can listen to fragments of texts that contain a word or collocation of interest. All corpora are publicly accessible, and for many of them the full texts are available.

Daghestanian Stops

The aim of the project is to describe the variation in the acoustic features of stops in East Caucasian languages. It is probable that the acoustic features of the sounds that fill ‘identical’ slots in the phonetic inventories of East Caucasian languages (such as ejectives in Archi vs. ejectives in Lak) are slightly but consistently different. The immediate goal is to demonstrate that such differences exist and are statistically significant. The ultimate goal, ideally, is to show that the differences are areally distributed (in a macro perspective, e.g. South Daghestan vs. North Daghestan, and in a local perspective, e.g. showing the influence of neighbouring languages on different lects of the same language). Accounting for acoustic differences and similarities in areal terms is, as far as we know, a truly innovative research challenge. The project involves annotating the recorded data, acoustic analysis, and collecting more data during future fieldwork.

Participants: Sven Grawunder, George Moroz, Michael Daniel, Vasilisa Zhigulskaya

Daghestanian Loans

The Daghestanian Loans project studies the lexical influence of different languages in Daghestan on a microlevel, i.e. on the level of granularity that is sensitive to the difference between village varieties. Data from the project on multilingualism in Daghestan show that the conditions and the degree of language contact for each village are unique. Our aim is to discover the lexical correlates of these differences. For this purpose, we compiled a shortlist of 160 concepts for cross-linguistic comparison, and developed a method for quick data collection in the field. Using a fixed list of concepts for comparison allows us to find the quantitative correlates of qualitative differences between areas, such as the spread of a certain lingua franca, the presence and degree of contact with particular languages, as well as migratory processes.

Collecting data in neighboring villages allows us to show variation between villages on the map, and it reveals the contours of various zones of influence for specific L2s. For example, lexical influence of local Turkic languages (Azerbaijani, Kumyk and Nogai) is found throughout Daghestan. In the south, however, where Azerbaijani served as a lingua franca for a long time, this influence is much stronger. In the north of Daghestan, bilingualism with Turkic languages was not common, and almost all Turkic borrowings in minor local languages are shared with Avar, a major native language. Turkic influence in the north was thus most likely mediated by Avar. Our first paper (currently in the final stages of preparation) details how we can detect different zones by comparing lexical samples from villages and major neighboring languages.
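As a rough illustration of how a fixed concept list turns qualitative contact differences into numbers, the sketch below computes, for each village, the share of elicited concepts whose word is a borrowing from a given donor language. The village names, concepts and donor labels are invented toy data, and the function `donor_shares` is a hypothetical helper, not the project's actual procedure or database format.

```python
from collections import defaultdict

# Toy data, invented for illustration only:
# (village, concept, donor language of the elicited word, or None if native)
RECORDS = [
    ("village_A", "bread", "Azerbaijani"),
    ("village_A", "city", "Azerbaijani"),
    ("village_A", "water", None),
    ("village_A", "hand", None),
    ("village_B", "bread", None),
    ("village_B", "city", "Avar"),
    ("village_B", "water", None),
    ("village_B", "hand", None),
]


def donor_shares(records):
    """Return {village: {donor: share of the village's concepts borrowed from donor}}."""
    concepts = defaultdict(set)                       # all concepts elicited per village
    borrowed = defaultdict(lambda: defaultdict(set))  # borrowed concepts per village and donor
    for village, concept, donor in records:
        concepts[village].add(concept)
        if donor is not None:
            borrowed[village][donor].add(concept)
    return {
        village: {
            donor: len(items) / len(concepts[village])
            for donor, items in borrowed[village].items()
        }
        for village in concepts
    }


if __name__ == "__main__":
    for village, shares in sorted(donor_shares(RECORDS).items()):
        print(village, shares)
```

Because every village is measured against the same list, the resulting shares are directly comparable across villages and can be plotted on a map.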

At the moment our database contains translations of the shortlist in 14 different languages as spoken in 30 different villages in Daghestan and five villages in the Qax region of Azerbaijan. These 35 villages are distributed over five distinct geographical and linguistic areas. The data are available in an online database.

Participants: Michael Daniel, Ilya Chechuro, Samira Verhees, Nina Dobrushina

Ustja Corpus

The Ustja River Basin Corpus is a growing corpus of a northern Russian dialect (south of Arkhangelskaja oblastj) in which the normalized orthographic annotation is aligned with the audio of the interviews. Research based on this corpus aims to establish the dynamics of dialect loss: correlation between dialect variables, consistency of speakers, age outliers (people who are ahead of or behind their age peers), etc. It involves a vast amount of perceptual and sometimes instrumental acoustic data annotation. After the first publication (see below) we plan to study how the use of dialect correlates with gender within the same age group.

Participants: Ruprecht von Waldenfels, Nina Dobrushina, Michael Daniel

Daniel M., P. Kazakova, A. Ter-Avanesova et al. Dialect loss in the Russian North: modelling change across variables. Accepted for publication in “Language variation and change”.

Daghestanian Multilingualism

The research aims to capture the social and geographical specifics of multilingualism in Daghestan with statistical methods. We studied the distribution of multilingualism among men and women, and we wrote a paper on the hypothesis that the introduction of compulsory school education was instrumental in the spread of Russian as an L2. We are currently working on the statistical validation of data we acquired through indirect interviews, where people described the language repertoire of their deceased relatives.

Participants: Nina Dobrushina, Michael Daniel, George Moroz

Dobrushina N., Kozhukhar A. A., Moroz G. Gendered multilingualism in highland Daghestan: story of a loss // Journal of Multilingual and Multicultural Development. 2019. Vol. 40. No. 2. P. 115-132.
Dobrushina N., Daniel M. Field linguistics in Daghestan: A very personal account, in: Word hunters. Vol. 194. John Benjamins Publishing Company, 2018. P. 79-94.

Dialectal Differentiation of Even

Even is a Northern Tungusic language spoken in a number of small communities scattered across northeast Siberia. This dispersed mode of settlement has led to considerable dialectal fragmentation, with diversification at the lexical, phonological, morphological, and syntactic levels. This diversification can be assumed to be the result of multiple factors: differential retention of ancestral variation, independent innovation, and contact with typologically different languages. We want to elucidate the relative impact of these factors on the differentiation of the dialects, and especially the extent to which language contact played a role. That there was some contact in the history of the dialects is indicated by molecular genetic data showing intermarriage of different Even groups with their neighbors. This study focuses on two of the geographically most distant Even dialects: the westernmost still viable Even dialect, Lamunkhin, spoken in the village of Sebjan-Küöl in Yakutia, and one of the easternmost dialects, the Bystraja dialect spoken in Central Kamchatka. Oral corpora for both dialects have already been glossed: the Lamunkhin corpus comprises around 52,000 words and the Bystraja corpus around 34,000 words. An important prerequisite for answering the question of how these dialects diverged is to establish in what way they differ.

The study of dialectal differences usually focuses on categorical differences, i.e. the presence of a feature in one dialect which is absent in another. Often, however, features differ in frequency, having become less prominent in one dialect in the course of its diachronic evolution, or because a form has developed new functions. When working with smaller corpora, studying dialectal differences through variation in frequency is problematic: an observed difference in frequency might be the result of a single speaker’s preference rather than a feature of the dialect as a whole. In our first publication we proposed a statistical method that allows us to trace differences in frequency while taking into account the idiolectal heterogeneity of the corpora. Next, we plan to elaborate on the statistical models we use and to continue with the linguistic interpretation of the differences we find from the point of view of functional divergence, contact situations and the typology of grammaticalization processes.
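To make the problem of speaker preference concrete, here is a minimal sketch of one generic way to test a frequency difference while treating the speaker, not the token, as the unit of observation: a permutation test that shuffles whole speakers between the two dialect groups. This is not the method proposed in the publication mentioned above, which is not described here; the function names and the input format (one `(hits, tokens)` pair per speaker) are assumptions made for illustration.

```python
import random


def pooled_rate(speakers):
    """Relative frequency of a feature over all speakers in a group.

    Each speaker is a (hits, tokens) pair: occurrences of the feature
    and the size of that speaker's texts.
    """
    hits = sum(h for h, _ in speakers)
    tokens = sum(n for _, n in speakers)
    return hits / tokens


def speaker_permutation_test(dialect_a, dialect_b, n_perm=10000, seed=0):
    """Return (observed absolute rate difference, permutation p-value).

    Whole speakers are reassigned to dialects at random, so a single
    speaker with a strong personal preference cannot by itself produce
    a small p-value.
    """
    rng = random.Random(seed)
    observed = abs(pooled_rate(dialect_a) - pooled_rate(dialect_b))
    pooled = list(dialect_a) + list(dialect_b)
    k = len(dialect_a)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(pooled_rate(pooled[:k]) - pooled_rate(pooled[k:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (extreme + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Invented counts: five speakers per dialect, 100 tokens each.
    dialect_a = [(50, 100)] * 5
    dialect_b = [(5, 100)] * 5
    print(speaker_permutation_test(dialect_a, dialect_b, n_perm=2000))
```

With very few speakers per dialect the number of distinct reassignments is small, so the smallest attainable p-value is bounded from below; this is one reason why model-based approaches may be preferable for small corpora.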

Participants: Brigitte Pakendorf, Vasilisa Andriyanets, Michael Daniel

Andriyanets V., Daniel M., Pakendorf B. Discovering dialectal differences based on oral corpora, in: Компьютерная лингвистика и интеллектуальные технологии: По материалам ежегодной международной конференции «Диалог» (Москва, 30 мая – 2 июня 2018 г.) / Под общ. ред.: В. Селегей, И. М. Кобозева, Т. Е. Янко, И. Богуславский, Л. Л. Иомдин, М. А. Кронгауз, А. Ч. Пиперски. Вып. 17(24). М.: Издательский центр «Российский государственный гуманитарный университет», 2018. P. 28-38.

Non-standard Word Order in Daghestanian Russian

The aim of this project is to investigate non-standard word order in Daghestanian Russian. At the current stage, the research focuses on noun phrases with a genitive modifier. Whereas in Standard Russian the neutral word order in such phrases is noun + genitive, in Daghestanian Russian the opposite order is often used. Our hypothesis is that non-standard word order in such constructions is the result of contact with the speakers’ first languages (East Caucasian and Turkic), which show the order genitive + noun in such phrases. The alternative hypothesis is that the order genitive + noun is a general feature of spoken Russian discourse, in which constructions of this type are also admissible. To test these hypotheses, we analyze noun phrases with genitive modifiers in the corpus of spoken Daghestanian Russian and in the spoken subcorpus of the Russian National Corpus. The results of this study will contribute to a description of the syntactic properties of the variety of Russian spoken in Daghestan.

Participants: Chiara Naccarato, Natalia Stoynova, Anastasia Panova

Syntactic Annotation of the Corpora in the Universal Dependencies Format

The unified representation of dependency trees makes it possible to examine syntactic parallelism across languages and word order effects. In the long run, it would be particularly interesting to apply quantitative methods in order to study the effects of language convergence. At the moment, the collection of existing UD treebanks covers ca. 40-50 languages including Russian, Belarusian and Buryat (under a Creative Commons license). From the point of view of the UD treebanks, the main contribution will be the development of UD guidelines for ergative and polysynthetic languages based on the manual annotation of the corpora available in the Lab. The UD community provides tools for data annotation, validation, and visualization, as well as a number of online search engines. In this project, we are planning to work with the following language varieties: Mehweb, Adyghe, Even, Mari, spoken Russian and the spoken Russian of Daghestan.
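To illustrate the kind of word-order query that a unified dependency format supports, the following sketch reads text in the ten-column, tab-separated CoNLL-U format used by UD and counts whether genitive nominal modifiers (relation `nmod` with the feature `Case=Gen`) precede or follow their head. The two sample sentences are constructed for the example rather than taken from any of the Lab's corpora, and `genitive_orders` is a hypothetical helper; a real study would also need to check the part of speech of the head and treebank-specific annotation choices.

```python
# Two constructed noun phrases meaning 'father's house':
# noun + genitive (дом отца) and genitive + noun (отца дом).
SAMPLE = """\
# text = дом отца
1\tдом\tдом\tNOUN\t_\tCase=Nom|Number=Sing\t0\troot\t_\t_
2\tотца\tотец\tNOUN\t_\tCase=Gen|Number=Sing\t1\tnmod\t_\t_

# text = отца дом
1\tотца\tотец\tNOUN\t_\tCase=Gen|Number=Sing\t2\tnmod\t_\t_
2\tдом\tдом\tNOUN\t_\tCase=Nom|Number=Sing\t0\troot\t_\t_
"""


def genitive_orders(conllu_text):
    """Count genitive nmod dependents placed before vs. after their head."""
    counts = {"gen_before_noun": 0, "noun_before_gen": 0}
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # sentence boundaries and comment lines
        cols = line.split("\t")
        # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        if len(cols) != 10 or not cols[0].isdigit() or not cols[6].isdigit():
            continue  # skip multiword-token ranges, empty nodes, malformed lines
        token_id, head = int(cols[0]), int(cols[6])
        feats = cols[5].split("|")
        relation = cols[7].split(":")[0]  # drop subtypes such as nmod:poss
        if relation == "nmod" and "Case=Gen" in feats:
            # IDs are sentence-internal, so comparing them gives linear order.
            if token_id < head:
                counts["gen_before_noun"] += 1
            else:
                counts["noun_before_gen"] += 1
    return counts


if __name__ == "__main__":
    print(genitive_orders(SAMPLE))
```

Because the format is the same for every treebank, the same function could in principle be run on different varieties and the resulting proportions compared.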

Participants: Olga Lyashevskaya

Droganova, Kira, and Olga Lyashevskaya. Cross-tagset parsing evaluation for Russian. In: Digital Transformation and Global Society Third International Conference, DTGS 2018, St. Petersburg, Russia, May 30 – June 2, 2018, Revised Selected Papers, Part I / Ed. by Daniel A. Alexandrov, A. V. Boukhanovsky, A. V. Chugunov, Y. Kabanov, O. Koltsova. Issue 858. Cham: Springer, 2018. Ch. 31. P. 380-390.
Droganova, Kira, Olga Lyashevskaya, and Daniel Zeman. Data Conversion and Consistency of Monolingual Corpora: Russian UD Treebanks. In: Proceedings of TLT 2018 International Workshop on Treebanks and Linguistic Theories, 13-14 November 2018, Oslo, Norway. NEALT Proceedings Series. Linköping University Electronic Press, 2018. P. 52-65.

Intonation in regional varieties of Russian

The aim of this project is to document, annotate and perform quantitative analyses of intonational patterns in spontaneous speech in regional varieties of Russian. There is currently no comprehensive description available regarding the intonational patterns found in nonstandard varieties of Russian, and the mutual influence of regional varieties and contact languages in this respect is virtually unresearched. In addition, the data for the project are not tidy samples obtained in a laboratory setting, but field recordings. This ensures that the observed patterns are representative of real language use, but it also imposes some additional demands on the processing of the material before any analysis is carried out. Based on these data, we propose to develop multifactorial models of pitch movement and other characteristics of intonation, depending on the communicative type of the intonational construction, the gender, age and location of the speaker, as well as their individual peculiarities.

Participants: Olga Lyashevskaya, Ilya Chechuro

Circassian Isoglosses

The two Circassian languages of the Northwest Caucasian language family (West Circassian, also known as Adyghe, and Kabardian, also known as East Circassian) are considered to be one language by their speakers. However, this assumed linguistic continuum shows a lot of variation. The aim of the Circassian Isoglosses project is to survey various features and their distribution among regional varieties of Circassian, based on existing literature and fieldwork. The prospective result of the project will be a database of isoglosses that will allow us to compare Circassian idioms. At the present stage, the project focuses on varieties of West Circassian as spoken in the Republic of Adygea and the Krasnodar Kray. In addition, we carried out fieldwork with Israeli Circassians in the fall of 2017.

Participants: Yury Lander, George Moroz, Aleksei Fedorenko

Meadow Mari Corpus

Meadow Mari is a Uralic language spoken by about 375 thousand people. The aim of the project is to create a corpus of spoken Meadow Mari. The corpus will be based on the audio and video recordings made in 2000-2001 by a fieldwork party of the Moscow State University. Participants' tasks include technical support of the corpus (including glossing, annotating and aligning the orthographic annotation with the audio) as well as data analysis. The focus of the project is the influence of Russian on Meadow Mari.

Participants: Anna Volkova, Mikhail Voronov

Relativization in Nakh-Daghestanian in Intragenetic and Areal Perspective

In Nakh-Daghestanian languages, relative clauses are predominantly formed with a participle construction. Even though they can express different aspectual meanings, participles lack any syntactic orientation: there are no syntactic limitations on the target of relativization. The gap in the relative clause can correspond to a core argument, a peripheral participant or even a participant that is not part of the verb’s argument structure. The relativization of facts, places and time is also frequent. A pilot study on relativization targets in several Daghestanian languages revealed that preferences for the relativization of certain arguments differ. It is not a priori clear whether this is due to the counting method used, the particularities of certain corpora, or the grammar of specific languages. Within the project, relativization will be studied on the basis of more substantial corpus data, using a unified markup for relative clauses. Several Nakh-Daghestanian languages will be researched (Agul, Archi, Ingush, Udi and others), as well as other Caucasian languages that are typologically and/or genetically far removed from Nakh-Daghestanian (e.g. Adyghe). The resulting generalizations will allow us to verify claims about the hierarchy of arguments in relativization as proposed in current syntactic theories.

Participants: Anna Volkova, Michael Daniel, Yury Lander, Timur Maisak, Johanna Nichols

Nominal inflection typology

Verb inflection is one of the most useful parameters in the Autotyp database from a typological and geographical perspective. The goal of our project is to create a database similar to Autotyp for nouns. In 2017-2018 we created a database, and carried out a pilot investigation in Eurasia. This year we will add languages from other parts of the world to our database, and measure correlations between the complexity of verbal and nominal inflection systems.

Participants: Elena Sokur, Johanna Nichols

Typological atlas of Daghestan

The languages of Daghestan have a long descriptive tradition. The available grammars contain a wealth of data which, however, have not been analyzed from an areal point of view. The goal of this project is to develop a tool for the visualization of information about linguistic structures characteristic of Daghestan. The atlas is based almost exclusively on data from published grammars, and can therefore be used for bibliographical research and as a source of references on parameters of interest. A key task of the project is the creation of maps and visualizations that allow us to combine metadata and genealogical parameters with information on a particular feature. This task presupposes an inventory and evaluation of the available sources of grammatical information. Data from the Atlas can be used to formulate hypotheses about the area and scenarios for the distribution of certain phenomena. The Atlas will also allow a wider audience to become familiar with the linguistic diversity of Daghestan.

Participants: Konstantin Filatov, Michael Daniel, George Moroz


