Search | arXiv e-print repository

arXiv:2405.01972 [pdf, other]

doi 10.5287/ora-8gv0b4qyo

A quantitative and typological study of Early Slavic participle clauses and their competition

Abstract: This thesis is a corpus-based, quantitative, and typological analysis of the functions of Early Slavic participle constructions and their finite competitors ($jegda$-'when'-clauses). The first part leverages detailed linguistic annotation on Early Slavic corpora at the morphosyntactic, dependency, information-structural, and lexical levels to obtain indirect evidence for different potential functions of participle clauses and their main finite competitor and understand the roles of compositionality and default discourse reasoning as explanations for the distribution of participle constructions and $jegda$-clauses in the corpus. The second part uses massively parallel data to analyze typological variation in how languages express the semantic space of English $when$, whose scope encompasses that of Early Slavic participle constructions and $jegda$-clauses. Probabilistic semantic maps are generated and statistical methods (including Kriging, Gaussian Mixture Modelling, precision and recall analysis) are used to induce cross-linguistically salient dimensions from the parallel corpus and to study conceptual variation within the semantic space of the hypothetical concept WHEN.

Submitted 8 May, 2024; v1 submitted 3 May, 2024; originally announced May 2024.

Comments: 259 pages, 138 figures. DPhil Thesis in Linguistics submitted and defended at the University of Oxford (December 2023). This manuscript is a version formatted for improved readability and broader dissemination

MSC Class: 68T50; 68U15; 68T35 (Primary); 86A32; 15A03 (Secondary)

ACM Class: I.2.7

arXiv:2404.18257 [pdf, other]

Mapping 'when'-clauses in Latin American and Caribbean languages: an experiment in subtoken-based typology

Authors: Nilo Pedrazzini

Abstract: Languages can encode temporal subordination lexically, via subordinating conjunctions, and morphologically, by marking the relation on the predicate. Systematic cross-linguistic variation among the former can be studied using well-established token-based typological approaches to token-aligned parallel corpora. Variation among different morphological means is instead much harder to tackle and therefore more poorly understood, despite being predominant in several language groups. This paper explores variation in the expression of generic temporal subordination ('when'-clauses) among the languages of Latin America and the Caribbean, where morphological marking is particularly common. It presents probabilistic semantic maps computed on the basis of the languages of the region, thus avoiding bias towards the world's many languages that exclusively use lexified connectors, incorporating associations between character $n$-grams and English $when$. The approach allows capturing morphological clause-linkage devices in addition to lexified connectors, paving the way for larger-scale, strategy-agnostic analyses of typological variation in temporal subordination.

Submitted 28 April, 2024; originally announced April 2024.

Comments: 10 pages, 6 figures. To be published in the 2024 Proceedings of the Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP)

MSC Class: 68T50; 68U15; 68T35 (Primary); 86A32; 15A03 (Secondary)

ACM Class: I.2.7

arXiv:2011.06467 [pdf, ps, other]

Exploiting Cross-Dialectal Gold Syntax for Low-Resource Historical Languages: Towards a Generic Parser for Pre-Modern Slavic

Authors: Nilo Pedrazzini

Abstract: This paper explores the possibility of improving the performance of specialized parsers for pre-modern Slavic by training them on data from different related varieties. Because of their linguistic heterogeneity, pre-modern Slavic varieties are treated as low-resource historical languages, whereby cross-dialectal treebank data may be exploited to overcome data scarcity and attempt the training of a variety-agnostic parser. Previous experiments on early Slavic dependency parsing are discussed, particularly with regard to their ability to tackle different orthographic, regional and stylistic features. A generic pre-modern Slavic parser and two specialized parsers -- one for East Slavic and one for South Slavic -- are trained using jPTDP (Nguyen & Verspoor 2018), a neural network model for joint part-of-speech (POS) tagging and dependency parsing which had shown promising results on a number of Universal Dependencies (UD) treebanks, including Old Church Slavonic (OCS). With these experiments, a new state of the art is obtained for both OCS (83.79\% unlabelled attachment score (UAS) and 78.43\% labelled attachment score (LAS)) and Old East Slavic (OES) (85.7\% UAS and 80.16\% LAS).

Submitted 12 November, 2020; originally announced November 2020.

Comments: Edited by Folgert Karsdorp, Barbara McGillivray, Adina Nerghes & Melvin Wevers. Conference paper (Preprint version). 11 pages. A link to the repository with the datasets used in the paper can be found in the relevant footnotes

MSC Class: 68T50; 68T07 (Primary); 91F20 (Secondary)

ACM Class: I.2.7

Journal ref: Proceedings of the Workshop on Computational Humanities Research, 18-20 November 2020 (CEUR Workshop Proceedings, Vol. 2723), 237-247

Showing 1–3 of 3 results for author: Pedrazzini, N