Automatic language profiling of a dialect speaker: the case of the Timok variety spoken in the village of Berčinovac (Eastern Serbia)

DOI:10.30842/alp2306573716207

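Makarova A. L., Konior D. V., Vuković T., Sobolev A. N., Winistörfer O. Avtomaticheskiy metod yazykovogo profilirovaniya nositelya dialekta (na materiale vostochnoserbskogo idioma sela Berchinovats) [Automatic language profiling of a dialect speaker: the case of the Timok variety spoken in the village of Berčinovac (Eastern Serbia)]. Acta Linguistica Petropolitana. 2020. XVI(2): 160–180.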

In a previously published paper [Konior et al. 2019], which thematically led up to the present article, we explored the possibility of developing a quantitative tool for assessing the intrasystemic dialectal coherence and the degree of dialectal authenticity (preservation) of a particular variety of Slavic (and, more broadly, Balkan) dialectal speech. To do so, we analysed and manually counted all cases of the presence or absence of specific phonemes, direct and indirect object reduplication, ways of expressing peripheral case meanings, the postpositive article, and some other linguistic features. The data used for that purpose were extracted from the “Linguistic Atlas of Eastern Serbia and Western Bulgaria” [SAOSWB]; the idiolect of a native speaker of the Timok dialect spoken in the village of Berčinovac (near the town of Knjaževac in the Zaječar district, Eastern Serbia) was chosen for analysis. This raised the following question: how can modern technologies for automatic text processing increase the efficiency of dialectologists’ work, and what technical obstacles must be overcome along the way? In this article, we present a method for the (semi-)automatic analysis of phonetic and morphosyntactic features in a dialect text, using morphological annotation (a tagger model based on the ReLDI tagger [Ljubešić et al. 2016]) and custom Python scripts. An algorithm that searches for some important dialect features is described and exemplified. By attempting to imitate and automate historical and structural linguistic analysis, we open a discussion of the advantages and disadvantages of computer analysis of dialect data as compared with manual analysis. In the future, the automatic method is expected to help in handling larger amounts of dialect data.
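The counting idea behind such a profile can be illustrated with a minimal Python sketch. It is not the authors' script: the `Token` layout, the `profile` function, the `postpositive_article` rule, and the four tokens are all hypothetical placeholders chosen for illustration. The only assumption carried over from the text is that each token comes with a morphosyntactic tag (MULTEXT-East-style tags begin with `N` for nouns).

```python
from collections import Counter
from typing import Callable, Dict, Iterable, NamedTuple, Optional


class Token(NamedTuple):
    form: str   # word form as transcribed
    lemma: str  # lemma assigned by the tagger
    msd: str    # MULTEXT-East-style morphosyntactic tag, e.g. "Ncfsn"


# A detector returns True (dialectal realization), False (non-dialectal
# realization), or None (the feature is not applicable to this token).
Detector = Callable[[Token], Optional[bool]]


def profile(tokens: Iterable[Token],
            detectors: Dict[str, Detector]) -> Dict[str, float]:
    """Return, per feature, the share of dialectal realizations."""
    counts = {name: Counter() for name in detectors}
    for tok in tokens:
        for name, detect in detectors.items():
            verdict = detect(tok)
            if verdict is not None:
                counts[name]["dialectal" if verdict else "other"] += 1
    shares = {}
    for name, c in counts.items():
        total = c["dialectal"] + c["other"]
        if total:  # skip features that never applied
            shares[name] = c["dialectal"] / total
    return shares


def postpositive_article(tok: Token) -> Optional[bool]:
    # Hypothetical placeholder rule, NOT a validated linguistic criterion:
    # only nouns are considered; a noun counts as "articled" if its form
    # differs from the lemma and ends in one of a few assumed suffixes.
    if not tok.msd.startswith("N"):
        return None
    return tok.form != tok.lemma and tok.form.endswith(("ta", "to", "te"))


# Invented toy tokens, not taken from the atlas or any corpus.
tokens = [
    Token("ženata", "žena", "Ncfsn"),
    Token("žena", "žena", "Ncfsn"),
    Token("ide", "ići", "Vmr3s"),
    Token("seloto", "selo", "Ncnsn"),
]

result = profile(tokens, {"postpositive_article": postpositive_article})
print(result)
```

Keeping every feature as a separate detector function makes it possible to add or refine rules one at a time and to check each against manual counts, which is the comparison between computer and manual analysis that the abstract raises.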

Keywords

statistical methods in linguistics, machine text analysis, linguistic profiling, dialect speakers, Balkan Slavic languages, Serbian dialects, Timok dialect, idiolect of dialect speaker, village of Berčinovac, Eastern Serbia

About the authors

Makarova A. L.

anastasia.makarova@uzh.ch

Konior D. V.

dsuetina@yandex.ru

Vuković T.

teodora.vukovic2@uzh.ch

Sobolev A. N.

sobolev@staff.uni-marburg.de

Winistörfer O.

olivier-andreas.winistoerfer@uzh.ch

References

Birkner 2015

V. Birkner. The advantages and disadvantages of employing corpus evidence in sociolinguistic studies. The Teacher Magazine. 2015. Vol. 2. P. 11–17.

Dash 2012

N. S. Dash. Etymological Annotation: a New Concept of Corpus Annotation. Proceedings of the 34th All India Conference of Linguists (34-AICL). Shillong, India, 2012. P. 100–104.

Dash, Arulmozi 2018

N. S. Dash, S. Arulmozi. Limitations of language corpora. N. Dash, S. Arulmozi. History, features, and typology of language corpora. Singapore: Springer Singapore, 2018. P. 259–272.

Dash, Hussain 2013

N. S. Dash, M. M. Hussain. Designing a Generic Scheme for Etymological Annotation: a New Type of Language Corpora Annotation. P. Bhattacharayya, K.-S. Choi (eds.). Proceedings of the 11th Workshop on Asian Language Resources. Nagoya: Asian Federation of Natural Language Processing, 2013. P. 64–71.

Deemter, Kibble 1999

K. van Deemter, R. Kibble. What is coreference, and what should coreference annotation be? A. Bagga, B. Baldwin, S. Shelton (eds.). Proceedings of the Workshop on Coreference and Its Applications. Stroudsburg, PA: Association for Computational Linguistics, 1999. P. 90–96.

Erjavec et al. 2003

T. Erjavec, C. Krstev, V. Petkevic, K. Simov, M. Tadic, D. Vitas. The MULTEXT-east morphosyntactic specifications for Slavic languages. T. Erjavec, D. Vitas (eds.). Proceedings of the Workshop on Morphological Processing of Slavic Languages, EACL 2003. Stroudsburg, PA: Association for Computational Linguistics, 2003. P. 25–32.

Escher 2021

A. L. Escher. Double argument marking in Timok dialect texts (in Balkan Slavic context). Zeitschrift für Slawistik. Forthcoming.

Goedertier et al. 2000

W. Goedertier, S. Goddijn, J.-P. Martens. Orthographic transcription of the spoken Dutch corpus. M. Gavrilidou, G. Carayannis, S. Markantonatou, S. Piperidis, G. Stainhouer (eds.). Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000), Athens, Greece. Athens: National Technical University of Athens Press, 2000. P. 909–914.

Konior et al. 2019

D. V. Konior, A. L. Makarova, A. N. Sobolev. Statisticheskiy metod yazykovogo profilirovaniya nositelya dialekta (na materiale vostochnoserbskogo idioma sela Berchinovats) [Quantitative method of language profiling of a dialect speaker (based on the material of the East Serbian idiom of the village of Bercinovac)]. Tomsk State University Journal of Philology. 2019. No. 58. P. 17–33.

Ljubešić et al. 2016

N. Ljubešić, F. Klubička, Ž. Agić, I.-P. Jazbec. New Inflectional Lexicons and Training Corpora for Improved Morphosyntactic Annotation of Croatian and Serbian. N. Calzolari, Kh. Choukri, Th. Declerck, S. Goggi, M. Grobelnik, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis (eds.). Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). Paris: European Language Resources Association, 2016. P. 4264–4270.

Sikimić, Sobolev 2020

B. Sikimić, A. N. Sobolev. Processy divergentcii v razdelennom gosudarstvennoy granitcey zapadnoyuzhnoslavyanskom dialekte (na materiale sovremennoy dialektnoy rechi Vostochnoy Serbii i Zapadnoy Bolgarii) [Divergence Processes in the West South Slavic Dialect Divided by the State Border (Based on the Modern Dialect Speech of Eastern Serbia and Western Bulgaria)]. Tomsk State University Journal of Philology. 2020. No. 66. P. 158–176. DOI: 10.17223/19986645/66/9.

Sobolev 1998

A. N. Sobolev. O dialektologicheskom atlase Vostochnoy Serbii i Zapadnoy Bolgarii [On the dialectological atlas of Eastern Serbia and Western Bulgaria]. G. P. Klepikova (ed.). Issledovaniya po slavyanskoy dialektologii [Studies in Slavic Dialectology]. Iss. 5. Moscow: Institute of Slavic Studies RAS, 1998. P. 106–167.

Vuković et al. 2019

T. Vuković, N. Muheim, O. Winistörfer, I. Simko, A. Makarova, S. Bradjan. Corpora and Processing Tools for Non-Standard Contemporary and Diachronic Balkan Slavic. I. Temnikova, I. Nikolova, N. Konstantinova (eds.). Proceedings of the Student Research Workshop associated with The 12th International Conference on Recent Advances in Natural Language Processing (RANLP 2019). Shoumen: Incoma, 2019. P. 62–68.

Vuković et al. 2020

T. Vuković, B. Sonnenhauser, A. Escher. Degrees of non-standardness. Feature-based analysis of variation in a Torlak dialect corpus. Manuscript.
