Poetic texts pose a challenge to full morphological tagging and lemmatization, since authors seek to extend the vocabulary, employ morphologically and semantically deficient forms, go beyond standard syntactic templates, and use non-projective constructions and non-standard word order, among other techniques of the creative language game. In this paper we evaluate a number of probabilistic taggers based on decision trees, CRF, and neural network algorithms, as well as a state-of-the-art dictionary-based tagger. The taggers were trained on prose texts and tested on three poetic samples of different complexity. First, we suggest a method for compiling gold standard datasets for Russian poetry. Second, we focus on the taggers’ performance in identifying part-of-speech tags and lemmas. We reveal which POS classes, paradigm classes, and syntactic patterns most affect the quality of processing.
The paper compares two rival word-formation constructions giving rise to compound agent nouns in Russian, i.e., (para)synthetic compounds formed with the agentive suffixes -ec and -tel’, such as basnopisec ‘fable writer’ and bytopisatel’ ‘everyday-life writer’. To understand what makes these constructions different from one another, compounds in -ec and -tel’ are analyzed based on a number of formal and semantic criteria, i.e., the part of speech and semantic role of the non-verbal element of the compound, the transitivity and formal aspect of the verbal base of the compound, the animacy of the compound’s referent, and the semantics of the compound. The study is supported by statistical analyses, i.e., conditional inference trees and random forests, which help discriminate the behavior of rival constructions and determine which parameters are more relevant for the comparison. To understand whether diachronic and/or stylistic factors also affect the survival of rival constructions, the data are checked in the Russian National Corpus, which allows retrieving information about the texts in which compounds occur, such as their creation date and textual genre. Finally, the productivity of rival word-formation constructions in modern Russian is discussed both in terms of diachronic changes and in terms of restrictions that the two constructions are subject to. The analyses carried out demonstrate that the two constructions show significant differences regarding their semantics, but also their diachronic and stylistic distribution, as well as their productivity, which prevents one construction from completely ousting the other in modern Russian.
This paper discusses novel facts regarding adpositional agreement in Avar in light of recent theories of feature valuation. I show that the traditional notion of downward Agree/upward valuation is sufficient to account for the observed facts, rendering the competing mechanism of upward Agree/downward valuation superfluous.
The paper discusses the efforts to create a morphological standard for the Middle Russian corpus, which is part of the historical collection of the Russian National Corpus (RNC). To meet the needs of different categories of corpus researchers as well as NLP developers, we consider two styles of morphological annotation (the RNC schema and the Universal Dependencies schema). A number of specifications of the feature list are proposed to facilitate data reusability, linking, and conversion.
The paper reports a method to create a speaker’s prosodic fingerprint based on the global characteristics of the pitch movement. A prosodic fingerprint is the distribution of f0 in the low, middle, and high ranges and the distribution of pitch movements from one range into another [Šimko et al. 2017]. This fully automated method can be used to classify recordings and to provide a reference level for more sophisticated analysis of pitch movement and intonation strategies. We evaluate the method by applying it to spontaneous spoken Russian data recorded in different regions. We model the correlation between the fingerprint and sociolinguistic features such as age, gender, and region. The results of this analysis allow us to formulate several sociolinguistic hypotheses that can be further tested with a more detailed analytic technique.
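The fingerprint defined above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `prosodic_fingerprint`, the use of the speaker's own f0 terciles as range boundaries, and the treatment of unvoiced frames (simply dropped) are all assumptions made here for illustration; the cited method may define the ranges differently.

```python
from collections import Counter

RANGES = ("low", "mid", "high")


def prosodic_fingerprint(f0_hz):
    """Return (occupancy, transitions) for a sequence of f0 values in Hz.

    Assumptions (not taken from the paper): unvoiced frames are given as
    0 or None and are dropped; the low/mid/high boundaries are the
    terciles of the speaker's own voiced f0 values.
    """
    voiced = [v for v in f0_hz if v and v > 0]
    if len(voiced) < 2:
        raise ValueError("need at least two voiced frames")

    ordered = sorted(voiced)
    n = len(ordered)
    low_cut = ordered[n // 3]          # lower tercile boundary
    high_cut = ordered[(2 * n) // 3]   # upper tercile boundary

    def label(v):
        if v < low_cut:
            return "low"
        if v < high_cut:
            return "mid"
        return "high"

    labels = [label(v) for v in voiced]

    # Share of voiced frames falling into each range.
    counts = Counter(labels)
    occupancy = {r: counts[r] / len(labels) for r in RANGES}

    # Share of consecutive frame pairs moving from range a to range b.
    pairs = Counter(zip(labels, labels[1:]))
    total = len(labels) - 1
    transitions = {(a, b): pairs[(a, b)] / total
                   for a in RANGES for b in RANGES}
    return occupancy, transitions


if __name__ == "__main__":
    occupancy, transitions = prosodic_fingerprint(
        [100, 0, 110, 150, 160, 200, 210]
    )
    print(occupancy)
    print(transitions[("low", "mid")])
```

The two returned dictionaries together form one fixed-length vector per speaker (3 occupancy values plus 9 transition values), which is the kind of representation that could then be related to age, gender, or region.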
This paper surveys relative clause constructions in West Circassian (Adyghe) and Kabardian.
The paper provides linguistic explanations for the results of supervised machine learning experiments on the identification of verbal metaphor in Russian texts. We look at the classification accuracy of models based on different features (distributional semantics, lexical and morphosyntactic co-occurrence, etc.) and explore the behavior of verb constructions and the wider context in order to investigate the reasons behind the most and the least successful performances.
This paper presents a description of evidentiality marking in the Rikvani dialect of Andi. As a language spoken in the Caucasus, Andi is situated in the centre of a large area within Eurasia where evidentiality is frequently expressed with a perfect or resultative form of the verb (general indirective), and special particles marking hearsay (and sometimes also inference). Both are attested in Andi and form independent evidential paradigms. I will explore the way these forms are used in natural texts and elicitation and how they interact with each other. An important issue is to what extent evidentiality can be considered grammaticalized as part of the verbal paradigm in Andi. I will compare my observations on Andi to the systems found in other East Caucasian languages.
The paper traces the level of bilingualism in several highland villages of Daghestan (Northeast Caucasus) through the 20th century. We show that historically, men were more multilingual than women, but this was not true to the same extent for all languages. Highlanders’ repertoires suggest a correlation between the social function of the second language and the degree to which its command was gendered. We also explore the dynamics of multilingualism from the generation born at the end of the 19th century to the generation born in the 1990s. We show that during the 20th century local L2s were gradually displaced by Russian, and Daghestanian multilingualism lost its gendered character. We argue that these changes were caused by the introduction of Soviet schooling.
This paper describes the semantic and morphosyntactic properties of general converb constructions in Andi, a language of the Avar-Andic group of the East Caucasian language family. There are two general converbs in Andi, both of which are homophonous with a finite verb form (the aorist and the perfect, respectively). Each converb has a particular contextual meaning (manner and cause for the perfect converb, and means in the case of the aorist converb), while both can be used interchangeably to indicate the first stage of a complex event. The two constructions seem to be diachronically related, the aorist converbial construction being secondary and morphosyntactically more constrained. The aim of this paper is to describe and compare these two partially competing constructions in view of how similar forms are used in closely related languages.
The paper focuses on two aspectual morphemes in Moksha Mordvin (< Mordvin < Finno-Ugric). The first of them, the Frequentative, has four phonologically conditioned allomorphs, -ənd-, -n’ə-, -s’ə-, and -kšn’ə-. These affixes used to be separate morphemes in Proto-Finno-Ugric, but ended up having the same meaning and being complementarily distributed. A remnant of a more archaic stage of language evolution is the Avertive marker, -əkšn’ə-, which differs from one of the Frequentative allomorphs by only one phoneme, which can hardly be a coincidence. A diachronic hypothesis about how iterative-avertive polyfunctionality could have arisen is suggested.
This paper describes the distribution of colour adjectives in Russian poetry of the Silver Age and defines individual preferences with regard to poetic tradition, syllable structure, and metrical restrictions. The research method combines a lexico-semantic approach, formal literary analysis, and quantitative metrics obtained via the frequency database of the Russian Poetry Corpus (over 10 M words, incl. 1 M adjectives). The database allows the user to compare subcorpora and create graphs of timeline distribution, which demonstrate that the lexical diversity and relative frequencies of colour adjectives start to grow rapidly in the 1890s, as modernists employ colour adjectives to upgrade the poetic inventory. The adjectives referring to non-banal hues (e.g. fioletovyj ‘violet’, lazorevyj ‘azure’) belong to the middle part of the ranked wordlist. Correspondence analysis of the data reveals individual colour preferences and stylistic similarities among the most prominent poets of the Silver Age; for example, Anna Akhmatova and Alexander Blok are similar regarding their use of the white hues. The distribution of the selected colour hue adjectives across metrical types highlights the strong association of multi-syllabic adjectives with certain meters, although some words have a more complex distribution.
This paper describes the range of patterns used for the expression of ‘other’ in East Caucasian (Nakh-Daghestanian) languages, an indigenous language family of the Eastern Caucasus mainly spoken in the Republics of Daghestan, Chechnya and Ingushetia (Russian Federation), as well as in northern regions of Azerbaijan and eastern parts of Georgia.
This paper describes the repetitive prefix in Agul (Lezgic, East Caucasian), focusing on the grammaticalization path of this morpheme. The main question addressed is whether the prefix has been copied from the closely related Lezgian language.
In this paper, we discuss the most recent trends in the study of space and time. We consider four volumes, [Filipović and Jaszczolt 2012], [Vulchanova and van der Zee 2013], [Moore 2014], and [Luraghi et al. 2017], that cover a relatively broad set of topics and approaches. The main topics the authors focus on are: language-specific systems of space and time conceptualization, cultural differences in understanding time, space and time (dis)analogy, granularity, frame of reference, verbs of motion, and Source vs. Goal asymmetry. The methods that the contributors apply are diverse, ranging from formal and experimental approaches to anthropological participant observation and lexical typology. Many of the papers collected in these volumes deal with similar problems while applying different frameworks, which makes it possible to compare how different approaches handle these problems and thus reveal how they may be combined. This reflects one of the strongest trends in modern linguistics, namely the tendency to conduct interdisciplinary studies that allow the same data to be viewed from different angles simultaneously.
This chapter presents an overview of the Northwest Caucasian (West Caucasian, Abkhaz-Adyghe) family.
In polysynthetic West Caucasian languages, the morphological verbal complex amounts to a clause, with all kinds of participants cross-referenced by affixes. Relativization is performed by introducing a relative affix in the cross-reference slot which corresponds to the relativized participant. However, these languages display several cross-linguistically rare features of relativization. Firstly, while under the view of the verbal complex as a clause this affix appears to be a relative pronoun, it is an unusual relative pronoun because it remains in situ. Secondly, relative affixes may appear several times in the same clause. Thirdly, relative pronouns are not expected to occur in languages with prenominal relative clauses. Fourthly, in the Circassian branch, relative pronouns are identical to reflexive pronouns. These features are explained by considering relative prefixes to be resumptive pronouns. This interpretation finds a parallel in the neighboring East Caucasian languages, where reflexive pronouns also show resumptive usages. Finally, since in some West Caucasian languages the relative affix is a morpheme with a dedicated relative function but still shows properties of a resumptive pronoun, our data suggest that the distinction between relative pronouns and resumptive pronouns may not be so clear as is usually assumed.
The article deals with different aspects of language interaction in a group of neighboring languages in the Akhvakh district of Daghestan, in particular those spoken in the villages of Karata, Tukita, Tad-Magitl’ and Tlibisho (this zone is henceforth referred to as the Karata cluster). The villages of the Karata cluster are all located within a short walking distance of 30–120 min from each other, and a different language is spoken in each of the four villages: Karata, Tukita, Akhvakh and Bagvalal respectively. Qualitative and quantitative data was collected during a field trip in March 2018 as part of a long-term project focussing on neighbor multilingualism in highland Daghestan. The research employed the method of retrospective family interviews. Respondents were interviewed about their language repertoire and the repertoire of their close relatives that they remembered, which enabled the researchers to conclude which languages were used in the interaction between neighboring villages before the Russification and which languages are used today. We found that interaction between neighboring villages employed and still employs Avar, that is, the lingua franca model is the common strategy in the Karata cluster. Today more than 90% of the population of the four villages concerned have command of Avar, which is different from many other areas of highland Daghestan. In other parts of Daghestan the most common model for neighbor interaction was the use of the language of one of the neighbors (asymmetrical bilingualism). Symmetrical bilingualism (when both sides have command of each other’s languages) and lingua franca were less common. Whereas the level of command of Avar is high, the level of active multilingualism in the languages of the Karata cluster remains low. Passive knowledge of the neighboring languages is more widespread. We also found that passive knowledge is asymmetrical for several reasons, which are discussed in the article.
A suggestion is put forward that the level of understanding of neighboring languages depends not only on the genetic affinity of the languages but also on the direction of socio-economic contact. As in other regions of Daghestan, the command of Russian has grown in Karata; however, unlike in many other places, Avar as a lingua franca has not yet been displaced by Russian.
This paper presents an overview of existing Russian and foreign approaches to the compilation of lexical minima. Special attention is paid to the most influential English-speaking tradition, as well as the German-speaking tradition. The purpose of the review is to follow the development of lexical list science and also to define the criteria that list compilers should be guided by in order to compose the best lexical minima for the modern user. The first chapter of the article discusses Russian approaches to the compilation of lexical minima, the second chapter discusses the approaches used abroad, and the third chapter compares the domestic and foreign traditions and summarizes the review. The review gives grounds for the conclusion that the creation of lexical minima requires a combination of both statistical and communicatively oriented methods. In addition, to compile an up-to-date and reliable corpus it is necessary to balance the proportions of the data analyzed: in addition to a corpus of fiction texts, authors should refer to oral corpus data, as well as sources diverse in style and genre, such as newspaper, art and academic corpora, and an internet speech corpus.