Automatic language profiling of a dialect
speaker: the case of the Timok variety spoken in the village of
Berčinovac (Eastern Serbia)
DOI:10.30842/alp2306573716207
Makarova A. L., Koner D. V., Vukovich T.,
Sobolev A. N., Vinistorfer O. Avtomaticheskiy metod yazykovogo
profilirovaniya nositelya dialekta (na materiale vostochnoserbskogo
idioma sela Berchinovats). Acta Linguistica Petropolitana.
2020. XVI(2): 160–180.
In a previously published paper [Konior et al. 2019], which
thematically led up to the present article, we explored the
possibility of developing a quantitative tool for assessing the
intrasystemic dialectal coherence and the degree of dialectal
authenticity (preservation) for a particular variety of Slavic (and
more broadly Balkan) dialectal speech. To do so, we analysed and
manually counted all instances of the presence or absence of
specific phonemes, direct and indirect object reduplication, ways
of expressing peripheral case meanings, the postpositive article,
and some other linguistic features. The data used for that purpose
was extracted from the “Linguistic Atlas of Eastern Serbia and
Western Bulgaria” [SAOSWB]; the idiolect of a native speaker of the
Timok dialect spoken in the village of Berčinovac (near the town of
Knjaževac in the Zaječar district, Eastern Serbia) was chosen for
analysis. Subsequently, the following question arose: how can the
use of modern technologies for automatic text processing increase
the efficiency of dialectologists’ work, and what technical
obstacles must be overcome in this regard? In this article, we
present a method for the (semi-)automatic analysis of phonetic and
morphosyntactic features in a dialect text using morphological
annotation (the tagger model is based on the ReLDI tagger
[Ljubešić et al. 2016] and custom Python scripts). An algorithm
that searches for some important dialect features is described and
exemplified. By attempting to imitate and automate historical and
structural linguistic analysis, we open a discussion of the
advantages and disadvantages of computer-based analysis of dialect
data compared with manual analysis. In the future, the automatic
method is expected to help in processing larger amounts of dialect
data.
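The kind of feature counting described above can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the names `Token`, `Feature`, `profile` and `share_present`, the list of article endings, and the toy tokens are all assumptions made for this illustration, and the tokens are invented rather than taken from the Berčinovac material. The sketch assumes that each token already carries a lemma and a MULTEXT-East-style tag from a tagger.

```python
from collections import Counter
from typing import Callable, Iterable, List, NamedTuple


class Token(NamedTuple):
    form: str   # word form as transcribed
    lemma: str  # lemma assigned by the tagger
    msd: str    # morphosyntactic tag, MULTEXT-East style


class Feature(NamedTuple):
    name: str
    applies: Callable[[Token], bool]       # could the feature occur on this token?
    is_dialectal: Callable[[Token], bool]  # is the dialectal variant realised?


# Hypothetical article endings, assumed only for this illustration.
ARTICLE_ENDINGS = ("ta", "to", "te", "at")


def is_noun(tok: Token) -> bool:
    return tok.msd.startswith("N")


def has_postpositive_article(tok: Token) -> bool:
    """Crude test: the form equals the lemma plus one of the assumed endings."""
    return any(tok.form == tok.lemma + end for end in ARTICLE_ENDINGS)


def profile(tokens: Iterable[Token], features: List[Feature]) -> dict:
    """Count presence and absence of each feature over the tokens it applies to."""
    counts = {f.name: Counter() for f in features}
    for tok in tokens:
        for f in features:
            if f.applies(tok):
                key = "present" if f.is_dialectal(tok) else "absent"
                counts[f.name][key] += 1
    return counts


def share_present(counter: Counter) -> float:
    """Share of dialectal realisations among all applicable tokens."""
    total = counter["present"] + counter["absent"]
    return counter["present"] / total if total else 0.0


# Invented toy tokens, not corpus data.
toy_tokens = [
    Token("ženata", "žena", "Ncfsn"),
    Token("dete", "dete", "Ncnsn"),
    Token("deteto", "dete", "Ncnsn"),
    Token("ide", "ići", "Vmr3s"),
]
features = [Feature("postpositive_article", is_noun, has_postpositive_article)]
result = profile(toy_tokens, features)
article_share = share_present(result["postpositive_article"])
```

Treating every noun as a possible context for the article is a deliberate simplification. A real detector would need to restrict the count to definite contexts and would probably still require manual checking of the hits.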
Keywords
statistical methods in linguistics, machine
text analysis, linguistic profiling, dialect speakers, Balkan
Slavic languages, Serbian dialects, Timok dialect, idiolect of
dialect speaker, village of Berčinovac, Eastern Serbia
References
Birkner 2015
V. Birkner. The advantages and
disadvantages of employing corpus evidence in sociolinguistic
studies. The Teacher Magazine. 2015. Vol. 2. P.
11–17.
Dash 2012
N. S. Dash. Etymological Annotation:
a New Concept of Corpus Annotation. Proceedings of the 34th All
India Conference of Linguists (34-AICL). Shillong, India, 2012. P.
100–104.
Dash, Arulmozi 2018
N. S. Dash, S. Arulmozi. Limitations
of language corpora. N. Dash, S. Arulmozi. History, features,
and typology of language corpora. Singapore: Springer
Singapore, 2018. P. 259–272.
Dash, Hussain 2013
N. S. Dash, M. M. Hussain. Designing
a Generic Scheme for Etymological Annotation: a New Type of
Language Corpora Annotation. P. Bhattacharayya, K.-S. Choi (eds.).
Proceedings of the 11th Workshop on Asian Language Resources.
Nagoya: Asian Federation of Natural Language Processing, 2013. P.
64–71.
Deemter, Kibble 1999
K. van Deemter, R. Kibble. What is
coreference, and what should coreference annotation be? A. Bagga,
B. Baldwin, S. Shelton (eds.). Proceedings of the Workshop on
Coreference and Its Applications. Stroudsburg, PA: Association
for Computational Linguistics, 1999. P. 90–96.
Erjavec et al. 2003
T. Erjavec, C. Krstev, V. Petkevic,
K. Simov, M. Tadic, D. Vitas. The MULTEXT-East morphosyntactic
specifications for Slavic languages. T. Erjavec, D. Vitas (eds.).
Proceedings of the Workshop on Morphological Processing of
Slavic Languages, EACL 2003. Stroudsburg, PA: Association for
Computational Linguistics, 2003. P. 25–32.
Escher 2021
A. L. Escher. Double argument marking
in Timok dialect texts (in Balkan Slavic context). Zeitschrift
für Slawistik. Forthcoming.
Goedertier et al. 2000
W. Goedertier, S. Goddijn, J.-P.
Martens. Orthographic transcription of the spoken Dutch corpus. M.
Gavrilidou, G. Carayannis, S. Markantonatou, S. Piperidis, G.
Stainhouer (eds.). Proceedings of the Second International
Conference on Language Resources and Evaluation (LREC 2000),
Athens, Greece. Athens: National Technical University of Athens
Press, 2000. P. 909–914.
Konior et al. 2019
D. V. Konior, A. L. Makarova, A. N.
Sobolev. Statisticheskiy metod yazykovogo profilirovaniya nositelya
dialekta (na materiale vostochnoserbskogo idioma sela Berchinovats)
[Quantitative method of language profiling of a dialect speaker
(based on the material of the East Serbian idiom of the village of
Bercinovac)]. Tomsk State University Journal of Philology.
2019. No. 58. P. 17–33.
Ljubešić et al. 2016
N. Ljubešić, F. Klubička, Ž. Agić,
I.-P. Jazbec. New Inflectional Lexicons and Training Corpora for
Improved Morphosyntactic Annotation of Croatian and Serbian. N.
Calzolari, Kh. Choukri, Th. Declerck, S. Goggi, M. Grobelnik, B.
Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis
(eds.). Proceedings of the Tenth International Conference on
Language Resources and Evaluation (LREC 2016). Paris:
European Language Resources Association, 2016. P. 4264–4270.
Sikimić, Sobolev 2020
B. Sikimić, A. N. Sobolev. Processy
divergentcii v razdelennom gosudarstvennoy granitcey
zapadnoyuzhnoslavyanskom dialekte (na materiale sovremennoy
dialektnoy rechi Vostochnoy Serbii i Zapadnoy Bolgarii) [Divergence
Processes in the West South Slavic Dialect Divided by the State
Border (Based on the Modern Dialect Speech of Eastern Serbia and
Western Bulgaria)]. Tomsk State University Journal of
Philology. 2020. No. 66. P. 158–176.
DOI: 10.17223/19986645/66/9.
Sobolev 1998
A. N. Sobolev. O dialektologicheskom
atlase Vostochnoy Serbii i Zapadnoy Bolgarii [On the
dialectological atlas of Eastern Serbia and Western Bulgaria]. G.
P. Klepikova (ed.). Issledovaniya po slavyanskoy
dialektologii [Studies in Slavic Dialectology]. Iss. 5.
Moscow: Institute of Slavic Studies RAS, 1998. P. 106–167.
Vuković et al. 2019
T. Vuković, N. Muheim, O.
Winistörfer, I. Simko, A. Makarova, S. Bradjan. Corpora and
Processing Tools for Non-Standard Contemporary and Diachronic
Balkan Slavic. I. Temnikova, I. Nikolova, N. Konstantinova (eds.).
Proceedings of the Student Research Workshop associated with The
12th International Conference on Recent Advances in Natural
Language Processing (RANLP 2019). Shoumen: Incoma, 2019. P.
62–68.
Vuković et al. 2020
T. Vuković, B. Sonnenhauser, A.
Escher. Degrees of non-standardness. Feature-based analysis of
variation in a Torlak dialect corpus. Manuscript.