The laboratory organized a grant competition for creating spoken language corpora
From May 30th to June 25th, the International Linguistic Convergence Laboratory organized a grant competition for creating spoken language corpora, specifically aimed at regional varieties of Russian in areas of intensive contact with other languages, dialects of Russian, and other languages spoken in Russia and the former republics of the Soviet Union. The organizer of the event and head of the laboratory, Nina Dobrushina, talks about the goals of the event and why recording spoken language is important.
What are spoken language corpora used for? Why is it important that a corpus contains spoken language?
Many objectives in modern science demand the automatic processing of large quantities of text. This is relatively easy to achieve with written texts: you can just enter them online (although there are some challenges and limitations here as well). With spoken texts this is a lot more difficult. First, linguists need high-quality recordings made with a decent voice recorder. Second, the texts have to be transcribed, a task that is still hard to do automatically. There are, of course, programmes that can recognize speech and convert spoken language to written text, but even for major standard languages they have a high error rate. With non-standard varieties, dialects and accented speech they do not manage at all. The following is the output of an attempt to automatically transcribe a recording of a woman born in 1922 speaking a northern Russian dialect:
"теперь в лондон только надо уроки до пруда для здоровья только путину другой мир был бы мир если только мир мир где все тут гибель аль амир все хорошо будет она сильная еще хотел спросить вот о как раньше сорван италии чечен как частными лис был велик был великий его но вы то я так видит его белье погоди а как частными или утром или вечером выметали а избили как воздуха а говорят что то хотела вечером плохо а мы привыкли к нам так нагло теперь в твоем магазине editor редактор газеты принесшей меня найдет я будок а выкидывать худородным идешь куда выкидывали мусор детскому другом а зачем мешок вот моцарта"
For now these texts have to be processed manually. It takes a highly qualified researcher to transcribe them – someone who is familiar with the dialect and aware of the peculiarities of its lexicon, phonetics and morphology. In addition, the transcription should be carried out with a special programme that connects the sound with the text. With this programme the researcher has to manually cut out fragments of sound along with the corresponding transcription in order to create a corpus that can return text as well as sound in response to search queries, which is important for many research objectives. Because manual transcription is a very painstaking task, there are few good spoken text corpora available. Our laboratory aims to increase the number of such corpora available while simultaneously working on research connected to spoken corpora.
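As a rough illustration of the idea described above, a corpus in which each stretch of sound is linked to its transcription can be modelled as a list of time-aligned segments. The sketch below is not the laboratory's actual software; the `Segment` fields, the file name and the sample texts are all hypothetical and serve only to show how one query can return both the text and the location of the matching sound.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    audio_file: str  # hypothetical path to the recording
    start: float     # start of the fragment, in seconds
    end: float       # end of the fragment, in seconds
    text: str        # manual transcription of this fragment


def search(corpus, query):
    """Return every segment whose transcription contains the query word."""
    q = query.lower()
    return [seg for seg in corpus if q in seg.text.lower().split()]


# Invented sample data: two fragments cut from the same recording.
corpus = [
    Segment("speaker01.wav", 0.0, 3.2, "first transcribed fragment"),
    Segment("speaker01.wav", 3.2, 7.5, "second fragment of the same recording"),
]

# Each hit carries the text and the time span needed to play the sound.
for seg in search(corpus, "fragment"):
    print(f"{seg.audio_file} [{seg.start:.1f}-{seg.end:.1f} s]: {seg.text}")
```

Because every hit keeps its file name and time span, a search interface built on such a structure could play back exactly the fragment that matched, in addition to displaying its transcription.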
Are there any language varieties that are currently of particular interest?
The main scientific interest of our laboratory is language contact. Therefore, we are mostly interested in dialects located on the border of linguistic areas, which experience influence from other languages and dialects. Russian dialects are heavily influenced by the literary language, as are all of the languages spoken in Russia. We are also interested in features of local languages that are reflected in the Russian of those who live in these areas. Soon we will launch a corpus of the Russian spoken in Daghestan and start researching the peculiarities of this variety.
Are any such corpora already available that were created and are used in the laboratory's projects?
We have one example of such a corpus: the Ustja River Basin Corpus, which was created at the School of Linguistics with the active participation of students. At present this is the largest corpus featuring a Russian dialect. It is available online and it allows for search queries that return both text and sound. Several articles have been written based on the data from this corpus, and one article has recently been submitted to a very good journal.
Did the competition spark a lot of interest? How many applications did you receive?
The first competition was only a small try-out. We did not advertise it, because it was not our aim to receive a large number of applications. We wanted to see whether there was interest in projects like these and to evaluate the quality of the applications we would receive. We will not reveal the results prematurely, but we can say that the competition has taken place and that we received some very interesting applications.