
Bryndza at ClimateActivism 2024: Stance, Target and Hate Event Detection via Retrieval-Augmented GPT-4 and LLaMA

Authors: Marek Šuppa, Daniel Skala, Daniela Jašš, Samuel Sučík, Andrej Švec, Peter Hraška

Abstract: This study details our approach for the CASE 2024 Shared Task on Climate Activism Stance and Hate Event Detection, focusing on Hate Speech Detection, Hate Speech Target Identification, and Stance Detection as classification challenges. We explored the capability of Large Language Models (LLMs), particularly GPT-4, in zero- or few-shot settings enhanced by retrieval augmentation and re-ranking for Tweet classification. Our goal was to determine if LLMs could match or surpass traditional methods in this context. We conducted an ablation study with LLaMA for comparison, and our results indicate that our models significantly outperformed the baselines, securing second place in the Target Detection task. The code for our submission is available at https://github.com/NaiveNeuron/bryndza-case-2024.

Submitted 9 February, 2024; originally announced February 2024.

Comments: Accepted to the 7th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2024)

arXiv:2311.09122

Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark

Authors: Stephen Mayhew, Terra Blevins, Shuheng Liu, Marek Šuppa, Hila Gonen, Joseph Marvin Imperial, Börje F. Karlsson, Peiqin Lin, Nikola Ljubešić, LJ Miranda, Barbara Plank, Arij Riabi, Yuval Pinter

Abstract: We introduce Universal NER (UNER), an open, community-driven project to develop gold-standard NER benchmarks in many languages. The overarching goal of UNER is to provide high-quality, cross-lingually consistent annotations to facilitate and standardize multilingual NER research. UNER v1 contains 18 datasets annotated with named entities in a cross-lingually consistent schema across 12 diverse languages. In this paper, we detail the dataset creation and composition of UNER; we also provide initial modeling baselines on both in-language and cross-lingual learning settings. We release the data, code, and fitted models to the public.

Submitted 29 June, 2024; v1 submitted 15 November, 2023; originally announced November 2023.

Comments: NAACL 2024 Camera-ready

arXiv:2304.04026

WikiGoldSK: Annotated Dataset, Baselines and Few-Shot Learning Experiments for Slovak Named Entity Recognition

Authors: Dávid Šuba, Marek Šuppa, Jozef Kubík, Endre Hamerlik, Martin Takáč

Abstract: Named Entity Recognition (NER) is a fundamental NLP task with a wide range of practical applications. The performance of state-of-the-art NER methods depends on high-quality manually annotated datasets, which still do not exist for some languages. In this work we aim to remedy this situation in Slovak by introducing WikiGoldSK, the first sizable human-labelled Slovak NER dataset. We benchmark it by evaluating state-of-the-art multilingual Pretrained Language Models and comparing it to the existing silver-standard Slovak NER dataset. We also conduct few-shot experiments and show that training on a silver-standard dataset yields better results. To enable future work that can be based on Slovak NER, we release the dataset, code, as well as the trained models publicly under permissible licensing terms at https://github.com/NaiveNeuron/WikiGoldSK.

Submitted 8 April, 2023; originally announced April 2023.

Comments: BSNLP 2023 Workshop at EACL 2023

arXiv:2105.01753

WaveGlove: Transformer-based hand gesture recognition using multiple inertial sensors

Authors: Matej Králik, Marek Šuppa

Abstract: Hand Gesture Recognition (HGR) based on inertial data has grown considerably in recent years, with the state-of-the-art approaches utilizing a single handheld sensor and a vocabulary comprised of simple gestures. In this work we explore the benefits of using multiple inertial sensors. Using WaveGlove, a custom hardware prototype in the form of a glove with five inertial sensors, we acquire two datasets consisting of over 11,000 samples. To make them comparable with prior work, they are normalized along with 9 other publicly available datasets, and subsequently used to evaluate a range of Machine Learning approaches for gesture recognition, including a newly proposed Transformer-based architecture. Our results show that even complex gestures involving different fingers can be recognized with high accuracy. An ablation study performed on the acquired datasets demonstrates the importance of multiple sensors, with an increase in performance when using up to three sensors and no significant improvements beyond that.

Submitted 4 May, 2021; originally announced May 2021.

Comments: Accepted to EUSIPCO 2021

arXiv:2104.05456

doi 10.1145/3430665.3456387

TermAdventure: Interactively Teaching UNIX Command Line, Text Adventure Style

Authors: Marek Šuppa, Ondrej Jariabka, Adrián Matejov, Marek Nagy

Abstract: Introductory UNIX courses are typically organized as lectures, accompanied by a set of exercises, whose solutions are submitted to and reviewed by the lecturers. While this arrangement has become standard practice, it often requires the use of an external tool or interface for submission and does not automatically check its correctness. That in turn leads to increased workload and makes it difficult to deal with potential plagiarism. In this work we present TermAdventure (TA), a suite of tools for creating interactive UNIX exercises. These resemble text adventure games, which immerse the user in a text environment and let them interact with it using textual commands. In our case the "adventure" takes place inside a UNIX system and the user interaction happens via the standard UNIX command line. The adventure is a set of exercises, which are presented and automatically evaluated by the system, all from within the command line environment. The suite is released under an open source license, has minimal dependencies and can be used either on a UNIX-style server or a desktop computer running any major OS platform through Docker. We also reflect on our experience of using the presented suite as the primary teaching tool for an introductory UNIX course for Data Scientists and discuss the implications of its deployment in similar courses. The suite is released under the terms of an open-source license at https://github.com/NaiveNeuron/TermAdventure.

Submitted 12 April, 2021; originally announced April 2021.

Comments: Accepted at ITiCSE 2021

arXiv:2103.10673

Cost-effective Deployment of BERT Models in Serverless Environment

Authors: Katarína Benešová, Andrej Švec, Marek Šuppa

Abstract: In this study we demonstrate the viability of deploying BERT-style models to serverless environments in a production setting. Since the freely available pre-trained models are too large to be deployed in this way, we utilize knowledge distillation and fine-tune the models on proprietary datasets for two real-world tasks: sentiment analysis and semantic textual similarity. As a result, we obtain models that are tuned for a specific domain and deployable in serverless environments. The subsequent performance analysis shows that this solution results in latency levels acceptable for production use and that it is also a cost-effective approach for small-to-medium size deployments of BERT models, all without any infrastructure overhead.

Submitted 19 April, 2021; v1 submitted 19 March, 2021; originally announced March 2021.

Comments: NAACL-HLT 2021 Industry Track Camera Ready

Showing 1–6 of 6 results for author: Šuppa, M