
Showing 1–21 of 21 results for author: Imperial, J

  1. arXiv:2406.10118  [pdf, other]

    cs.CL

    SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages

    Authors: Holy Lovenia, Rahmad Mahendra, Salsabil Maulana Akbar, Lester James V. Miranda, Jennifer Santoso, Elyanah Aco, Akhdan Fadhilah, Jonibek Mansurov, Joseph Marvin Imperial, Onno P. Kampman, Joel Ruben Antony Moniz, Muhammad Ravi Shulthan Habibi, Frederikus Hudi, Railey Montalan, Ryan Ignatius, Joanito Agili Lopo, William Nixon, Börje F. Karlsson, James Jaya, Ryandito Diandaru, Yuze Gao, Patrick Amadeus, Bin Wang, Jan Christian Blaise Cruz, Chenxi Whitehouse, et al. (36 additional authors not shown)

    Abstract: Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people. However, prevailing AI models suffer from a significant lack of representation of texts, images, and audio datasets from SEA, compromising the quality of AI models for SEA languages. Evaluating models for SEA languages is challenging due t…

    Submitted 8 July, 2024; v1 submitted 14 June, 2024; originally announced June 2024.

    Comments: https://github.com/SEACrowd

  2. arXiv:2405.08597  [pdf, other]

    cs.LG

    Risks and Opportunities of Open-Source Generative AI

    Authors: Francisco Eiras, Aleksandar Petrov, Bertie Vidgen, Christian Schroeder, Fabio Pizzati, Katherine Elkins, Supratik Mukhopadhyay, Adel Bibi, Aaron Purewal, Csaba Botos, Fabro Steibel, Fazel Keshtkar, Fazl Barez, Genevieve Smith, Gianluca Guadagni, Jon Chun, Jordi Cabot, Joseph Imperial, Juan Arturo Nolazco, Lori Landay, Matthew Jackson, Phillip H. S. Torr, Trevor Darrell, Yong Lee, Jakob Foerster

    Abstract: Applications of Generative AI (Gen AI) are expected to revolutionize a number of different areas, ranging from science & medicine to education. The potential for these seismic changes has triggered a lively debate about the potential risks of the technology, and resulted in calls for tighter regulation, in particular from some of the major tech companies who are leading in AI development. This reg…

    Submitted 29 May, 2024; v1 submitted 14 May, 2024; originally announced May 2024.

    Comments: Extension of arXiv:2404.17047

  3. arXiv:2404.17047  [pdf, other]

    cs.LG

    Near to Mid-term Risks and Opportunities of Open-Source Generative AI

    Authors: Francisco Eiras, Aleksandar Petrov, Bertie Vidgen, Christian Schroeder de Witt, Fabio Pizzati, Katherine Elkins, Supratik Mukhopadhyay, Adel Bibi, Botos Csaba, Fabro Steibel, Fazl Barez, Genevieve Smith, Gianluca Guadagni, Jon Chun, Jordi Cabot, Joseph Marvin Imperial, Juan A. Nolazco-Flores, Lori Landay, Matthew Jackson, Paul Röttger, Philip H. S. Torr, Trevor Darrell, Yong Suk Lee, Jakob Foerster

    Abstract: In the next few years, applications of Generative AI are expected to revolutionize a number of different areas, ranging from science & medicine to education. The potential for these seismic changes has triggered a lively debate about potential risks and resulted in calls for tighter regulation, in particular from some of the major tech companies who are leading in AI development. This regulation i…

    Submitted 24 May, 2024; v1 submitted 25 April, 2024; originally announced April 2024.

    Comments: Accepted to ICML'24 as a position paper

  4. arXiv:2404.12241  [pdf, other]

    cs.CL cs.AI

    Introducing v0.5 of the AI Safety Benchmark from MLCommons

    Authors: Bertie Vidgen, Adarsh Agrawal, Ahmed M. Ahmed, Victor Akinwande, Namir Al-Nuaimi, Najla Alfaraj, Elie Alhajjar, Lora Aroyo, Trupti Bavalatti, Max Bartolo, Borhane Blili-Hamelin, Kurt Bollacker, Rishi Bomassani, Marisa Ferrara Boston, Siméon Campos, Kal Chakra, Canyu Chen, Cody Coleman, Zacharie Delpierre Coudert, Leon Derczynski, Debojyoti Dutta, Ian Eisenberg, James Ezick, Heather Frase, Brian Fuller, et al. (75 additional authors not shown)

    Abstract: This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the MLCommons AI Safety Working Group. The AI Safety Benchmark has been designed to assess the safety risks of AI systems that use chat-tuned language models. We introduce a principled approach to specifying and constructing the benchmark, which for v0.5 covers only a single use case (an adult chatting to a general-pu…

    Submitted 13 May, 2024; v1 submitted 18 April, 2024; originally announced April 2024.

  5. arXiv:2402.12593  [pdf, other]

    cs.CL

    Standardize: Aligning Language Models with Expert-Defined Standards for Content Generation

    Authors: Joseph Marvin Imperial, Gail Forey, Harish Tayyar Madabushi

    Abstract: Domain experts across engineering, healthcare, and education follow strict standards for producing quality content such as technical manuals, medication instructions, and children's reading materials. However, current works in controllable text generation have yet to explore using these standards as references for control. Towards this end, we introduce Standardize, a retrieval-style in-context le…

    Submitted 19 February, 2024; originally announced February 2024.

  6. arXiv:2311.09122  [pdf, other]

    cs.CL

    Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark

    Authors: Stephen Mayhew, Terra Blevins, Shuheng Liu, Marek Šuppa, Hila Gonen, Joseph Marvin Imperial, Börje F. Karlsson, Peiqin Lin, Nikola Ljubešić, LJ Miranda, Barbara Plank, Arij Riabi, Yuval Pinter

    Abstract: We introduce Universal NER (UNER), an open, community-driven project to develop gold-standard NER benchmarks in many languages. The overarching goal of UNER is to provide high-quality, cross-lingually consistent annotations to facilitate and standardize multilingual NER research. UNER v1 contains 18 datasets annotated with named entities in a cross-lingual consistent schema across 12 diverse langu…

    Submitted 29 June, 2024; v1 submitted 15 November, 2023; originally announced November 2023.

    Comments: NAACL 2024 Camera-ready

  7. arXiv:2310.11584  [pdf, other]

    cs.CL

    BasahaCorpus: An Expanded Linguistic Resource for Readability Assessment in Central Philippine Languages

    Authors: Joseph Marvin Imperial, Ekaterina Kochmar

    Abstract: Current research on automatic readability assessment (ARA) has focused on improving the performance of models in high-resource languages such as English. In this work, we introduce and release BasahaCorpus as part of an initiative aimed at expanding available corpora and baseline models for readability assessment in lower resource languages in the Philippines. We compiled a corpus of short fiction…

    Submitted 17 October, 2023; originally announced October 2023.

    Comments: Final camera-ready paper for EMNLP 2023 (Main)

  8. arXiv:2310.00679  [pdf, other]

    cs.CL

    CebuaNER: A New Baseline Cebuano Named Entity Recognition Model

    Authors: Ma. Beatrice Emanuela Pilar, Ellyza Mari Papas, Mary Loise Buenaventura, Dane Dedoroy, Myron Darrel Montefalcon, Jay Rhald Padilla, Lany Maceda, Mideth Abisado, Joseph Marvin Imperial

    Abstract: Despite being one of the most linguistically diverse groups of countries, computational linguistics and language processing research in Southeast Asia has struggled to match the level of countries from the Global North. Thus, initiatives such as open-sourcing corpora and the development of baseline models for basic language processing tasks are important stepping stones to encourage the growth of…

    Submitted 1 October, 2023; originally announced October 2023.

    Comments: Accepted for PACLIC2023

  9. arXiv:2309.05454  [pdf, other]

    cs.CL

    Flesch or Fumble? Evaluating Readability Standard Alignment of Instruction-Tuned Language Models

    Authors: Joseph Marvin Imperial, Harish Tayyar Madabushi

    Abstract: Readability metrics and standards such as Flesch Kincaid Grade Level (FKGL) and the Common European Framework of Reference for Languages (CEFR) exist to guide teachers and educators to properly assess the complexity of educational materials before administering them for classroom use. In this study, we select a diverse set of open and closed-source instruction-tuned language models and investigate…

    Submitted 3 November, 2023; v1 submitted 11 September, 2023; originally announced September 2023.

    Comments: Final camera-ready for EMNLP GEM Workshop 2023

  10. arXiv:2305.13478  [pdf, other]

    cs.CL

    Automatic Readability Assessment for Closely Related Languages

    Authors: Joseph Marvin Imperial, Ekaterina Kochmar

    Abstract: In recent years, the main focus of research on automatic readability assessment (ARA) has shifted towards using expensive deep learning-based methods with the primary goal of increasing models' accuracy. This, however, is rarely applicable for low-resource languages where traditional handcrafted features are still widely used due to the lack of existing NLP tools to extract deeper linguistic repre…

    Submitted 25 May, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

    Comments: Camera-ready version for ACL 2023

  11. arXiv:2204.05185  [pdf, other]

    cs.CL cs.LG

    Uniform Complexity for Text Generation

    Authors: Joseph Marvin Imperial, Harish Tayyar Madabushi

    Abstract: Large language models (LLMs) have shown promising results in a wide array of generative NLP tasks, such as summarization and machine translation. In the context of narrative generation, however, existing models still do not capture factors that contribute to producing consistent text. For instance, it is logical that a piece of text or a story should be uniformly readable throughout and that this…

    Submitted 19 October, 2023; v1 submitted 11 April, 2022; originally announced April 2022.

    Comments: Final camera-ready for EMNLP 2023

  12. arXiv:2203.17225  [pdf, other]

    cs.CL

    A Baseline Readability Model for Cebuano

    Authors: Lloyd Lois Antonie Reyes, Michael Antonio Ibañez, Ranz Sapinit, Mohammed Hussien, Joseph Marvin Imperial

    Abstract: In this study, we developed the first baseline readability model for the Cebuano language. Cebuano is the second most-used native language in the Philippines with about 27.5 million speakers. As the baseline, we extracted traditional or surface-based features, syllable patterns based from Cebuano's documented orthography, and neural embeddings from the multilingual BERT model. Results show that th…

    Submitted 20 May, 2022; v1 submitted 31 March, 2022; originally announced March 2022.

    Comments: Accepted to BEA Workshop at NAACL 2022

  13. arXiv:2202.10855  [pdf, other]

    cs.CL cs.LG

    NU HLT at CMCL 2022 Shared Task: Multilingual and Crosslingual Prediction of Human Reading Behavior in Universal Language Space

    Authors: Joseph Marvin Imperial

    Abstract: In this paper, we present a unified model that works for both multilingual and crosslingual prediction of reading times of words in various languages. The secret behind the success of this model is in the preprocessing step where all words are transformed to their universal language representation via the International Phonetic Alphabet (IPA). To the best of our knowledge, this is the first study…

    Submitted 26 February, 2022; v1 submitted 22 February, 2022; originally announced February 2022.

  14. arXiv:2110.00157  [pdf, other]

    cs.CL cs.LG

    Under the Microscope: Interpreting Readability Assessment Models for Filipino

    Authors: Joseph Marvin Imperial, Ethel Ong

    Abstract: Readability assessment is the process of identifying the level of ease or difficulty of a certain piece of text for its intended audience. Approaches have evolved from the use of arithmetic formulas to more complex pattern-recognizing models trained using machine learning algorithms. While using these approaches provide competitive results, limited work is done on analyzing how linguistic variable…

    Submitted 30 September, 2021; originally announced October 2021.

    Comments: Accepted for oral presentation at PACLIC 2021

  15. arXiv:2108.00241  [pdf]

    cs.CL cs.LG

    Diverse Linguistic Features for Assessing Reading Difficulty of Educational Filipino Texts

    Authors: Joseph Marvin Imperial, Ethel Ong

    Abstract: In order to ensure quality and effective learning, fluency, and comprehension, the proper identification of the difficulty levels of reading materials should be observed. In this paper, we describe the development of automatic machine learning-based readability assessment models for educational Filipino texts using the most diverse set of linguistic features for the language. Results show that usi…

    Submitted 31 July, 2021; originally announced August 2021.

    Comments: Accepted at ICCE 2021

  16. arXiv:2107.09881  [pdf, other]

    cs.CL

    How Do Pedophiles Tweet? Investigating the Writing Styles and Online Personas of Child Cybersex Traffickers in the Philippines

    Authors: Joseph Marvin Imperial

    Abstract: One of the most important humanitarian responsibility of every individual is to protect the future of our children. This entails not only protection of physical welfare but also from ill events that can potentially affect the mental well-being of a child such as sexual coercion and abuse which, in worst-case scenarios, can result to lifelong trauma. In this study, we perform a preliminary investig…

    Submitted 21 July, 2021; originally announced July 2021.

    Comments: Submitted as a short paper for a conference

  17. arXiv:2106.07935  [pdf, other]

    cs.CL

    BERT Embeddings for Automatic Readability Assessment

    Authors: Joseph Marvin Imperial

    Abstract: Automatic readability assessment (ARA) is the task of evaluating the level of ease or difficulty of text documents for a target audience. For researchers, one of the many open problems in the field is to make such models trained for the task show efficacy even for low-resource languages. In this study, we propose an alternative way of utilizing the information-rich embeddings of BERT models with h…

    Submitted 30 July, 2021; v1 submitted 15 June, 2021; originally announced June 2021.

    Comments: Accepted at RANLP 2021

  18. arXiv:2103.07277  [pdf, other]

    cs.CL

    A Simple Post-Processing Technique for Improving Readability Assessment of Texts using Word Mover's Distance

    Authors: Joseph Marvin Imperial, Ethel Ong

    Abstract: Assessing the proper difficulty levels of reading materials or texts in general is the first step towards effective comprehension and learning. In this study, we improve the conventional methodology of automatic readability assessment by incorporating the Word Mover's Distance (WMD) of ranked texts as an additional post-processing technique to further ground the difficulty level given by a model.…

    Submitted 19 September, 2021; v1 submitted 12 March, 2021; originally announced March 2021.

  19. arXiv:2101.10537  [pdf]

    cs.CL cs.LG

    Application of Lexical Features Towards Improvement of Filipino Readability Identification of Children's Literature

    Authors: Joseph Marvin Imperial, Ethel Ong

    Abstract: Proper identification of grade levels of children's reading materials is an important step towards effective learning. Recent studies in readability assessment for the English domain applied modern approaches in natural language processing (NLP) such as machine learning (ML) techniques to automate the process. There is also a need to extract the correct linguistic features when modeling readabilit…

    Submitted 22 January, 2021; originally announced January 2021.

    Comments: 8 tables, 1 figure. Presented at the Philippine Computing Science Congress 2020

  20. arXiv:2101.10014  [pdf, other]

    cs.CL

    A Simple Disaster-Related Knowledge Base for Intelligent Agents

    Authors: Clark Emmanuel Paulo, Arvin Ken Ramirez, David Clarence Reducindo, Rannie Mark Mateo, Joseph Marvin Imperial

    Abstract: In this paper, we describe our efforts in establishing a simple knowledge base by building a semantic network composed of concepts and word relationships in the context of disasters in the Philippines. Our primary source of data is a collection of news articles scraped from various Philippine news websites. Using word embeddings, we extract semantically similar and co-occurring words from an initi…

    Submitted 25 January, 2021; originally announced January 2021.

    Comments: 7 tables, 1 figure, presented at 34th Pacific Asia Conference on Language, Information and Computation

  21. arXiv:1908.01765  [pdf]

    cs.NE cs.CL

    Sentiment Analysis of Typhoon Related Tweets using Standard and Bidirectional Recurrent Neural Networks

    Authors: Joseph Marvin Imperial, Jeyrome Orosco, Shiela Mae Mazo, Lany Maceda

    Abstract: The Philippines is a common ground to natural calamities like typhoons, floods, volcanic eruptions and earthquakes. With Twitter as one of the most used social media platform in the Philippines, a total of 39,867 preprocessed tweets were obtained given a time frame starting from November 1, 2013 to January 31, 2014. Sentiment analysis determines the underlying emotion given a series of words. The…

    Submitted 3 August, 2019; originally announced August 2019.

    Comments: 5 figures, 2 tables, presented at the 14th National Natural Language Processing Research Symposium - Student Research Workshop