-
NLPStatTest: A Toolkit for Comparing NLP System Performance
Abstract: Statistical significance testing centered on p-values is commonly used to compare NLP system performance, but p-values alone are insufficient because statistical significance differs from practical significance. The latter can be measured by estimating effect size. In this paper, we propose a three-stage procedure for comparing NLP system performance and provide a toolkit, NLPStatTest, that automa…
Submitted 26 November, 2020; originally announced November 2020.
Comments: Will appear in AACL-IJCNLP 2020
Journal ref: Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing: System Demonstrations (2020) 40-46
-
Probing for Multilingual Numerical Understanding in Transformer-Based Language Models
Abstract: Natural language numbers are an example of compositional structures, where larger numbers are composed of operations on smaller numbers. Given that compositional reasoning is a key to natural language understanding, we propose novel multilingual probing tasks tested on DistilBERT, XLM, and BERT to investigate for evidence of compositional reasoning over numerical data in various natural language n…
Submitted 13 October, 2020; originally announced October 2020.
Comments: BlackboxNLP (EMNLP 2020)