-
Set-to-Sequence Methods in Machine Learning: a Review
Abstract: Machine learning on sets towards sequential output is an important and ubiquitous task, with applications ranging from language modeling and meta-learning to multi-agent strategy games and power grid optimization. Combining elements of representation learning and structured prediction, its two primary challenges include obtaining a meaningful, permutation invariant set representation and subsequen…
Submitted 16 August, 2021; v1 submitted 17 March, 2021; originally announced March 2021.
Comments: 46 pages of text, with 10 pages of references. Contains 2 tables and 4 figures. Updated version includes expanded notes on method comparison
MSC Class: 68T07 (Primary) 68T01 (Secondary)
ACM Class: A.1; I.2.6
Journal ref: Journal of Artificial Intelligence Research 71 (2021): 885 - 924
-
The Danish Gigaword Project
Abstract: Danish language technology has been hindered by a lack of broad-coverage corpora at the scale modern NLP prefers. This paper describes the Danish Gigaword Corpus, the result of a focused effort to provide a diverse and freely-available one billion word corpus of Danish text. The Danish Gigaword corpus covers a wide array of time periods, domains, speakers' socio-economic status, and Danish dialect…
Submitted 12 May, 2021; v1 submitted 7 May, 2020; originally announced May 2020.
Comments: Identical to the NoDaLiDa 2021 version