
Constructing Multilingual Visual-Text Datasets Revealing Visual Multilingual Ability of Vision Language Models

Authors: Jesse Atuhurra, Iqra Ali, Tatsuya Hiraoka, Hidetaka Kamigaito, Tomoya Iwakura, Taro Watanabe

Abstract: Large language models (LLMs) have increased interest in vision language models (VLMs), which process image-text pairs as input. Studies investigating the visual understanding ability of VLMs have been proposed, but such studies are still preliminary because existing datasets do not permit a comprehensive evaluation of the fine-grained visual linguistic abilities of VLMs across multiple languages. To further explore the strengths of VLMs, such as GPT-4V \cite{openai2023GPT4}, we developed new datasets for the systematic and qualitative analysis of VLMs. Our contribution is four-fold: 1) we introduced nine vision-and-language (VL) tasks (including object recognition, image-text matching, and more) and constructed multilingual visual-text datasets in four languages (English, Japanese, Swahili, and Urdu) by using templates containing questions and prompting GPT-4V to generate the answers and the rationales, 2) introduced a new VL task named unrelatedness, 3) introduced rationales to enable human understanding of the VLM reasoning process, and 4) employed human evaluation to measure the suitability of the proposed datasets for VL tasks. We show that VLMs can be fine-tuned on our datasets. Our work is the first to conduct such analyses in Swahili and Urdu. Also, it introduces rationales in VL analysis, which played a vital role in the evaluation.

Submitted 29 March, 2024; originally announced June 2024.

arXiv:2406.15358

Introducing Syllable Tokenization for Low-resource Languages: A Case Study with Swahili

Authors: Jesse Atuhurra, Hiroyuki Shindo, Hidetaka Kamigaito, Taro Watanabe

Abstract: Many attempts have been made in multilingual NLP to ensure that pre-trained language models, such as mBERT or GPT2, get better and become applicable to low-resource languages. To achieve multilingualism for pre-trained language models (PLMs), we need techniques to create word embeddings that capture the linguistic characteristics of any language. Tokenization is one such technique because it allows words to be split based on characters or subwords, creating word embeddings that best represent the structure of the language. Creating such word embeddings is essential to applying PLMs to languages on which the model was not trained, enabling multilingual NLP. However, most PLMs use generic tokenization methods like BPE, WordPiece, or Unigram, which may not suit specific languages. We hypothesize that tokenization based on syllables within the input text, which we call syllable tokenization, should facilitate the development of syllable-aware language models. Syllable-aware language models make it possible to apply PLMs to languages that are rich in syllables, for instance, Swahili. Previous works introduced subword tokenization. Our work extends such efforts. Notably, we propose a syllable tokenizer and adopt an experiment-centric approach to validate the proposed tokenizer based on the Swahili language. We conducted text-generation experiments with GPT2 to evaluate the effectiveness of the syllable tokenizer. Our results show that the proposed syllable tokenizer generates syllable embeddings that effectively represent the Swahili language.

Submitted 26 March, 2024; originally announced June 2024.

arXiv:2405.00693

Large Language Models for Human-Robot Interaction: Opportunities and Risks

Authors: Jesse Atuhurra

Abstract: The tremendous development in large language models (LLMs) has led to a new wave of innovations and applications and yielded research results that were initially forecast to take longer. In this work, we tap into these recent developments and present a meta-study about the potential of large language models if deployed in social robots. We place particular emphasis on the applications of social robots: education, healthcare, and entertainment. We also study how these language models could be safely trained, before being deployed in social robots, to "understand" societal norms and issues, such as trust, bias, ethics, cognition, and teamwork. We hope this study provides a resourceful guide to other robotics researchers interested in incorporating language models in their robots.

Submitted 26 March, 2024; originally announced May 2024.

arXiv:2404.14415

Domain Adaptation in Intent Classification Systems: A Review

Authors: Jesse Atuhurra, Hidetaka Kamigaito, Taro Watanabe, Eric Nichols

Abstract: Dialogue agents, which perform specific tasks, are part of the long-term goal of NLP researchers to build intelligent agents that communicate with humans in natural language. Such systems should adapt easily from one domain to another to assist users in completing tasks. Researchers have developed a broad range of techniques, objectives, and datasets for intent classification to achieve such systems. Despite the progress in developing intent classification systems (ICS), a systematic review of the progress from a technical perspective is yet to be conducted. In effect, important implementation details of intent classification remain restricted and unclear, making it hard for natural language processing (NLP) researchers to develop new methods. To fill this gap, we review contemporary works in intent classification. Specifically, we conduct a thorough technical review of the datasets, domains, tasks, and methods needed to train the intent classification part of dialogue systems. Our structured analysis describes why intent classification is difficult and studies the limitations to domain adaptation while presenting opportunities for future work.

Submitted 26 March, 2024; originally announced April 2024.

arXiv:2404.08666

Revealing Trends in Datasets from the 2022 ACL and EMNLP Conferences

Authors: Jesse Atuhurra, Hidetaka Kamigaito

Abstract: Natural language processing (NLP) has grown significantly since the advent of the Transformer architecture. Transformers have given birth to pre-trained large language models (PLMs). There has been tremendous improvement in the performance of NLP systems across several tasks. NLP systems are on par with or, in some cases, better than humans at accomplishing specific tasks. However, it remains the norm that better quality datasets at the time of pretraining enable PLMs to achieve better performance, regardless of the task. The need to have quality datasets has prompted NLP researchers to continue creating new datasets to satisfy particular needs. For example, in 2022 the two top NLP conferences, ACL and EMNLP, accepted ninety-two papers introducing new datasets. This work aims to uncover the trends and insights mined within these datasets. Moreover, we provide valuable suggestions to researchers interested in curating datasets in the future.

Submitted 31 March, 2024; originally announced April 2024.

arXiv:2403.18989

Dealing with Imbalanced Classes in Bot-IoT Dataset

Authors: Jesse Atuhurra, Takanori Hara, Yuanyu Zhang, Masahiro Sasabe, Shoji Kasahara

Abstract: With the rapidly spreading usage of Internet of Things (IoT) devices, a network intrusion detection system (NIDS) plays an important role in detecting and protecting against various types of attacks in the IoT network. To evaluate the robustness of the NIDS in the IoT network, the existing work proposed a realistic botnet dataset in the IoT network (Bot-IoT dataset) and applied it to machine learning-based anomaly detection. This dataset contains imbalanced normal and attack packets because the number of normal packets is much smaller than that of attack ones. The nature of imbalanced data may make it difficult to identify the minority class correctly. In this thesis, to address the class imbalance problem in the Bot-IoT dataset, we propose a binary classification method with the synthetic minority over-sampling technique (SMOTE). The proposed classifier aims to detect attack packets and overcome the class imbalance problem using the SMOTE algorithm. Through numerical results, we demonstrate the proposed classifier's fundamental characteristics and the impact of imbalanced data on its performance.

Submitted 27 March, 2024; originally announced March 2024.

arXiv:2403.15430

Distilling Named Entity Recognition Models for Endangered Species from Large Language Models

Authors: Jesse Atuhurra, Seiveright Cargill Dujohn, Hidetaka Kamigaito, Hiroyuki Shindo, Taro Watanabe

Abstract: Natural language processing (NLP) practitioners are leveraging large language models (LLMs) to create structured datasets from semi-structured and unstructured data sources such as patents, papers, and theses, without having domain-specific knowledge. At the same time, ecological experts are searching for a variety of means to preserve biodiversity. To contribute to these efforts, we focused on endangered species and, through in-context learning, distilled knowledge from GPT-4. In effect, we created datasets for both named entity recognition (NER) and relation extraction (RE) via a two-stage process: 1) we generated synthetic data from GPT-4 for four classes of endangered species, 2) humans verified the factual accuracy of the synthetic data, resulting in gold data. Eventually, our novel dataset contains a total of 3.6K sentences, evenly divided between 1.8K NER and 1.8K RE sentences. The constructed dataset was then used to fine-tune both general BERT and domain-specific BERT variants, completing the knowledge distillation process from GPT-4 to BERT, because GPT-4 is resource intensive. Experiments show that our knowledge transfer approach is effective at creating an NER model suitable for detecting endangered species from texts.

Submitted 13 March, 2024; originally announced March 2024.

Showing 1–7 of 7 results for author: Atuhurra, J