LINGOLY: A Benchmark of Olympiad-Level Linguistic Reasoning Puzzles in Low-Resource and Extinct Languages

Authors: Andrew M. Bean, Simi Hellsten, Harry Mayne, Jabez Magomere, Ethan A. Chi, Ryan Chi, Scott A. Hale, Hannah Rose Kirk

Abstract: In this paper, we present the LingOly benchmark, a novel benchmark for advanced reasoning abilities in large language models. Using challenging Linguistic Olympiad puzzles, we evaluate (i) capabilities for in-context identification and generalisation of linguistic patterns in very low-resource or extinct languages, and (ii) abilities to follow complex task instructions. The LingOly benchmark covers more than 90 mostly low-resource languages, minimising issues of data contamination, and contains 1,133 problems across 6 formats and 5 levels of human difficulty. We assess performance with both direct accuracy and comparison to a no-context baseline to penalise memorisation. Scores from 11 state-of-the-art LLMs demonstrate the benchmark to be challenging, and models perform poorly on the higher difficulty problems. On harder problems, even the top model only achieved 38.7% accuracy, 24.7% improvement over the no-context baseline. Large closed models typically outperform open models, and in general, the higher resource the language, the better the scores. These results indicate, in absence of memorisation, true multi-step out-of-domain reasoning remains a challenge for current language models.

Submitted 11 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

Comments: 9 pages, 5 figures, 16 pages supplemental materials

arXiv:2403.15475

Large language models can help boost food production, but be mindful of their risks

Authors: Djavan De Clercq, Elias Nehring, Harry Mayne, Adam Mahdi

Abstract: Coverage of ChatGPT-style large language models (LLMs) in the media has focused on their eye-catching achievements, including solving advanced mathematical problems and reaching expert proficiency in medical examinations. But the gradual adoption of LLMs in agriculture, an industry which touches every human life, has received much less public scrutiny. In this short perspective, we examine risks and opportunities related to more widespread adoption of language models in food production systems. While LLMs can potentially enhance agricultural efficiency, drive innovation, and inform better policies, challenges like agricultural misinformation, collection of vast amounts of farmer data, and threats to agricultural jobs are important concerns. The rapid evolution of the LLM landscape underscores the need for agricultural policymakers to think carefully about frameworks and guidelines that ensure the responsible use of LLMs in food production before these technologies become so ingrained that policy intervention becomes challenging.

Submitted 20 March, 2024; originally announced March 2024.

arXiv:2403.02945

Unsupervised Learning Approaches for Identifying ICU Patient Subgroups: Do Results Generalise?

Authors: Harry Mayne, Guy Parsons, Adam Mahdi

Abstract: The use of unsupervised learning to identify patient subgroups has emerged as a potentially promising direction to improve the efficiency of Intensive Care Units (ICUs). By identifying subgroups of patients with similar levels of medical resource need, ICUs could be restructured into a collection of smaller subunits, each catering to a specific group. However, it is unclear whether common patient subgroups exist across different ICUs, which would determine whether ICU restructuring could be operationalised in a standardised manner. In this paper, we tested the hypothesis that common ICU patient subgroups exist by examining whether the results from one existing study generalise to a different dataset. We extracted 16 features representing medical resource need and used consensus clustering to derive patient subgroups, replicating the previous study. We found limited similarities between our results and those of the previous study, providing evidence against the hypothesis. Our findings imply that there is significant variation between ICUs; thus, a standardised restructuring approach is unlikely to be appropriate. Instead, potential efficiency gains might be greater when the number and nature of the subunits are tailored to each ICU individually.

Submitted 5 March, 2024; originally announced March 2024.

arXiv:2303.04932

Team Northeastern's Approach to ANA XPRIZE Avatar Final Testing: A Holistic Approach to Telepresence and Lessons Learned

Authors: Rui Luo, Chunpeng Wang, Colin Keil, David Nguyen, Henry Mayne, Stephen Alt, Eric Schwarm, Evelyn Mendoza, Taşkın Padır, John Peter Whitney

Abstract: This paper reports on Team Northeastern's Avatar system for telepresence, and our holistic approach to meet the ANA Avatar XPRIZE Final testing task requirements. The system features a dual-arm configuration with hydraulically actuated glove-gripper pair for haptic force feedback. Our proposed Avatar system was evaluated in the ANA Avatar XPRIZE Finals and completed all 10 tasks, scored 14.5 points out of 15.0, and received the 3rd Place Award. We provide the details of improvements over our first generation Avatar, covering manipulation, perception, locomotion, power, network, and controller design. We also extensively discuss the major lessons learned during our participation in the competition.

Submitted 8 March, 2023; originally announced March 2023.

Comments: 7 pages, submitted to IROS 2023

Showing 1–4 of 4 results for author: Mayne, H