-
arXiv:1911.07555
Short Text Language Identification for Under Resourced Languages
Abstract: The paper presents a hierarchical naive Bayesian and lexicon based classifier for short text language identification (LID) useful for under resourced languages. The algorithm is evaluated on short pieces of text for the 11 official South African languages some of which are similar languages. The algorithm is compared to recent approaches using test sets from previous works on South African languag…
Submitted 21 November, 2019; v1 submitted 18 November, 2019; originally announced November 2019.
Comments: Presented at NeurIPS 2019 Workshop on Machine Learning for the Developing World
MSC Class: 68T50
-
Improved Text Language Identification for the South African Languages
Abstract: Virtual assistants and text chatbots have recently been gaining popularity. Given the short message nature of text-based chat interactions, the language identification systems of these bots might only have 15 or 20 characters to make a prediction. However, accurate text language identification is important, especially in the early stages of many multilingual natural language processing pipelines.…
Submitted 1 November, 2017; originally announced November 2017.
Comments: Accepted to appear in the proceedings of The 28th Annual Symposium of the Pattern Recognition Association of South Africa, 2017