Showing 1–50 of 154 results for author: Arora, A

Searching in archive cs.

  1. arXiv:2406.19238  [pdf, other]

    cs.CL cs.CY cs.LG

    Revealing Fine-Grained Values and Opinions in Large Language Models

    Authors: Dustin Wright, Arnav Arora, Nadav Borenstein, Srishti Yadav, Serge Belongie, Isabelle Augenstein

    Abstract: Uncovering latent values and opinions in large language models (LLMs) can help identify biases and mitigate potential harm. Recently, this has been approached by presenting LLMs with survey questions and quantifying their stances towards morally and politically charged statements. However, the stances generated by LLMs can vary greatly depending on how they are prompted, and there are many ways to…

    Submitted 27 June, 2024; originally announced June 2024.

    Comments: 28 pages, 20 figures, 7 tables

  2. arXiv:2406.15593  [pdf, other]

    cs.CL econ.GN

    News Deja Vu: Connecting Past and Present with Semantic Search

    Authors: Brevin Franklin, Emily Silcock, Abhishek Arora, Tom Bryan, Melissa Dell

    Abstract: Social scientists and the general public often analyze contemporary events by drawing parallels with the past, a process complicated by the vast, noisy, and unstructured nature of historical texts. For example, hundreds of millions of page scans from historical newspapers have been noisily transcribed. Traditional sparse methods for searching for relevant material in these vast corpora, e.g., with…

    Submitted 21 June, 2024; originally announced June 2024.

  3. arXiv:2406.15576  [pdf, other]

    cs.CL econ.GN

    Contrastive Entity Coreference and Disambiguation for Historical Texts

    Authors: Abhishek Arora, Emily Silcock, Leander Heldring, Melissa Dell

    Abstract: Massive-scale historical document collections are crucial for social science research. Despite increasing digitization, these documents typically lack unique cross-document identifiers for individuals mentioned within the texts, as well as individual identifiers from external knowledgebases like Wikipedia/Wikidata. Existing entity disambiguation methods often fall short in accuracy for historical…

    Submitted 21 June, 2024; originally announced June 2024.

  4. arXiv:2406.15556  [pdf, other]

    cs.CV

    Open-Vocabulary Temporal Action Localization using Multimodal Guidance

    Authors: Akshita Gupta, Aditya Arora, Sanath Narayan, Salman Khan, Fahad Shahbaz Khan, Graham W. Taylor

    Abstract: Open-Vocabulary Temporal Action Localization (OVTAL) enables a model to recognize any desired action category in videos without the need to explicitly curate training data for all categories. However, this flexibility poses significant challenges, as the model must recognize not only the action categories seen during training but also novel categories specified at inference. Unlike standard tempor…

    Submitted 21 June, 2024; originally announced June 2024.

  5. arXiv:2406.09490  [pdf, other]

    cs.CL econ.GN

    Newswire: A Large-Scale Structured Database of a Century of Historical News

    Authors: Emily Silcock, Abhishek Arora, Luca D'Amico-Wong, Melissa Dell

    Abstract: In the U.S. historically, local newspapers drew their content largely from newswires like the Associated Press. Historians argue that newswires played a pivotal role in creating a national identity and shared understanding of the world, but there is no comprehensive archive of the content sent over newswires. We reconstruct such an archive by applying a customized deep learning pipeline to hundred…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: arXiv admin note: text overlap with arXiv:2306.17810, arXiv:2308.12477

  6. arXiv:2406.05190  [pdf, other]

    cs.LG cs.AI cs.CL

    Evaluating the Effectiveness of Data Augmentation for Emotion Classification in Low-Resource Settings

    Authors: Aashish Arora, Elsbeth Turcan

    Abstract: Data augmentation has the potential to improve the performance of machine learning models by increasing the amount of training data available. In this study, we evaluated the effectiveness of different data augmentation techniques for a multi-label emotion classification task using a low-resource dataset. Our results showed that Back Translation outperformed autoencoder-based approaches and that g…

    Submitted 7 June, 2024; originally announced June 2024.

    Comments: The first author contributed significantly

  7. arXiv:2405.16661  [pdf, other]

    cs.CL cs.AI cs.LG cs.LO

    RLSF: Reinforcement Learning via Symbolic Feedback

    Authors: Piyush Jha, Prithwish Jana, Arnav Arora, Vijay Ganesh

    Abstract: In recent years, large language models (LLMs) have had a dramatic impact on various sub-fields of AI, most notably on natural language understanding tasks. However, there is widespread agreement that the logical reasoning capabilities of contemporary LLMs are, at best, fragmentary (i.e., may work well on some problem instances but fail dramatically on others). While traditional LLM fine-tuning app…

    Submitted 26 May, 2024; originally announced May 2024.

  8. arXiv:2405.15152  [pdf, other]

    cs.CL cs.AI

    Machine Unlearning in Large Language Models

    Authors: Saaketh Koundinya Gundavarapu, Shreya Agarwal, Arushi Arora, Chandana Thimmalapura Jagadeeshaiah

    Abstract: Machine unlearning, a novel area within artificial intelligence, focuses on addressing the challenge of selectively forgetting or reducing undesirable knowledge or behaviors in machine learning models, particularly in the context of large language models (LLMs). This paper introduces a methodology to align LLMs, such as Open Pre-trained Transformer Language Models, with ethical, privacy, and safet…

    Submitted 23 May, 2024; originally announced May 2024.

    Comments: 10 pages

  9. arXiv:2405.07284  [pdf]

    cs.CV cs.AI

    Zero Shot Context-Based Object Segmentation using SLIP (SAM+CLIP)

    Authors: Saaketh Koundinya Gundavarapu, Arushi Arora, Shreya Agarwal

    Abstract: We present SLIP (SAM+CLIP), an enhanced architecture for zero-shot object segmentation. SLIP combines the Segment Anything Model (SAM) with the Contrastive Language-Image Pretraining (CLIP). By incorporating text prompts into SAM using CLIP, SLIP enables object segmentation without prior training on specific classes or categories. We fine-tune…

    Submitted 12 May, 2024; originally announced May 2024.

    Comments: 5 pages, 3 figures

  10. arXiv:2405.06787  [pdf, other]

    quant-ph cs.CR

    A computational test of quantum contextuality, and even simpler proofs of quantumness

    Authors: Atul Singh Arora, Kishor Bharti, Alexandru Cojocaru, Andrea Coladangelo

    Abstract: Bell non-locality is a fundamental feature of quantum mechanics whereby measurements performed on "spatially separated" quantum systems can exhibit correlations that cannot be understood as revealing predetermined values. This is a special case of the more general phenomenon of "quantum contextuality", which says that such correlations can occur even when the measurements are not necessarily on se…

    Submitted 10 May, 2024; originally announced May 2024.

    Comments: 69 pages, 6 figures. For updates see https://atulsingharora.github.io/PoC

  11. arXiv:2405.06691  [pdf, other]

    cs.CL cs.AI cs.LG cs.NE

    Fleet of Agents: Coordinated Problem Solving with Large Language Models using Genetic Particle Filtering

    Authors: Akhil Arora, Lars Klein, Nearchos Potamitis, Roland Aydin, Caglar Gulcehre, Robert West

    Abstract: Large language models (LLMs) have significantly evolved, moving from simple output generation to complex reasoning and from stand-alone usage to being embedded into broader frameworks. In this paper, we introduce Fleet of Agents (FoA), a novel framework utilizing LLMs as agents to navigate through dynamic tree searches, employing a genetic-type particle filtering approach. FoA spawns a mult…

    Submitted 7 May, 2024; originally announced May 2024.

    Comments: 11 pages, 1 figure, 4 tables

  12. arXiv:2405.00820  [pdf, other]

    cs.AR cs.LG

    HLSFactory: A Framework Empowering High-Level Synthesis Datasets for Machine Learning and Beyond

    Authors: Stefan Abi-Karam, Rishov Sarkar, Allison Seigler, Sean Lowe, Zhigang Wei, Hanqiu Chen, Nanditha Rao, Lizy John, Aman Arora, Cong Hao

    Abstract: Machine learning (ML) techniques have been applied to high-level synthesis (HLS) flows for quality-of-result (QoR) prediction and design space exploration (DSE). Nevertheless, the scarcity of accessible high-quality HLS datasets and the complexity of building such datasets present challenges. Existing datasets have limitations in terms of benchmark coverage, design space enumeration, vendor extens…

    Submitted 17 May, 2024; v1 submitted 1 May, 2024; originally announced May 2024.

    Comments: Edit to "Section V.E" for proper attribution of open-source HLSyn, AutoDSE, and the Merlin compiler

  13. arXiv:2404.17079  [pdf, other]

    quant-ph cs.CR

    Improving device-independent weak coin flipping protocols

    Authors: Atul Singh Arora, Jamie Sikora, Thomas Van Himbeeck

    Abstract: Weak coin flipping is the cryptographic task where Alice and Bob remotely flip a coin but want opposite outcomes. This work studies this task in the device-independent regime where Alice and Bob neither trust each other, nor their quantum devices. The best protocol was devised over a decade ago by Silman, Chailloux, Aharon, Kerenidis, Pironio, and Massar with bias $\varepsilon \approx 0.33664$, wh…

    Submitted 25 April, 2024; originally announced April 2024.

    Comments: 25 pages, 7 figures

  14. arXiv:2404.11066  [pdf, other]

    cs.AR

    Efficient Approaches for GEMM Acceleration on Leading AI-Optimized FPGAs

    Authors: Endri Taka, Dimitrios Gourounas, Andreas Gerstlauer, Diana Marculescu, Aman Arora

    Abstract: FPGAs are a promising platform for accelerating Deep Learning (DL) applications, due to their high performance, low power consumption, and reconfigurability. Recently, the leading FPGA vendors have enhanced their architectures to more efficiently support the computational demands of DL workloads. However, the two most prominent AI-optimized FPGAs, i.e., AMD/Xilinx Versal ACAP and Intel Stratix 10…

    Submitted 17 April, 2024; originally announced April 2024.

    Comments: Accepted as full paper at FCCM 2024

  15. arXiv:2404.10076  [pdf, other]

    cs.AR

    Field-Programmable Gate Array Architecture for Deep Learning: Survey & Future Directions

    Authors: Andrew Boutros, Aman Arora, Vaughn Betz

    Abstract: Deep learning (DL) is becoming the cornerstone of numerous applications both in datacenters and at the edge. Specialized hardware is often necessary to meet the performance requirements of state-of-the-art DL models, but the rapid pace of change in DL models and the wide variety of systems integrating DL make it impossible to create custom computer chips for all but the largest markets. Field-prog…

    Submitted 15 April, 2024; originally announced April 2024.

  16. arXiv:2404.03592  [pdf, other]

    cs.CL cs.AI cs.LG

    ReFT: Representation Finetuning for Language Models

    Authors: Zhengxuan Wu, Aryaman Arora, Zheng Wang, Atticus Geiger, Dan Jurafsky, Christopher D. Manning, Christopher Potts

    Abstract: Parameter-efficient finetuning (PEFT) methods seek to adapt large neural models via updates to a small number of weights. However, much prior interpretability work has shown that representations encode rich semantic information, suggesting that editing representations might be a more powerful alternative. We pursue this hypothesis by developing a family of Representation Finetuning (ReFT) methods.…

    Submitted 22 May, 2024; v1 submitted 4 April, 2024; originally announced April 2024.

    Comments: preprint

  17. arXiv:2403.07809  [pdf, other]

    cs.LG cs.CL

    pyvene: A Library for Understanding and Improving PyTorch Models via Interventions

    Authors: Zhengxuan Wu, Atticus Geiger, Aryaman Arora, Jing Huang, Zheng Wang, Noah D. Goodman, Christopher D. Manning, Christopher Potts

    Abstract: Interventions on model-internal states are fundamental operations in many areas of AI, including model editing, steering, robustness, and interpretability. To facilitate such research, we introduce pyvene, an open-source Python library that supports customizable interventions on a range of different PyTorch modules. pyvene supports complex intervention schemes with an intuiti…

    Submitted 12 March, 2024; originally announced March 2024.

    Comments: 8 pages, 3 figures

  18. arXiv:2402.19334  [pdf, other]

    cs.CL

    Here's a Free Lunch: Sanitizing Backdoored Models with Model Merge

    Authors: Ansh Arora, Xuanli He, Maximilian Mozes, Srinibas Swain, Mark Dras, Qiongkai Xu

    Abstract: The democratization of pre-trained language models through open-source initiatives has rapidly advanced innovation and expanded access to cutting-edge technologies. However, this openness also brings significant security risks, including backdoor attacks, where hidden malicious behaviors are triggered by specific inputs, compromising natural language processing (NLP) system integrity and reliabili…

    Submitted 3 June, 2024; v1 submitted 29 February, 2024; originally announced February 2024.

    Comments: accepted to ACL2024 (Findings)

  19. arXiv:2402.15855  [pdf, other]

    quant-ph cs.CR

    Protocols for Quantum Weak Coin Flipping

    Authors: Atul Singh Arora, Jérémie Roland, Chrysoula Vlachou, Stephan Weis

    Abstract: Weak coin flipping is an important cryptographic primitive -- it is the strongest known secure two-party computation primitive that classically becomes secure only under certain assumptions (e.g. computational hardness), while quantumly there exist protocols that achieve arbitrarily close to perfect security. This breakthrough result was established by Mochon in 2007 [arXiv:0711.4114]. However, hi…

    Submitted 24 February, 2024; originally announced February 2024.

    Comments: 51 pages (+ 9 appendix), 12 figures. This is a self-contained, concise version of our main results in arXiv:1811.02984 (STOC '19) and arXiv:1911.13283v2 (SODA '21). The Cryptology ePrint 2022/1101 is the comprehensive version, subsuming the above

  20. arXiv:2402.14177  [pdf, other]

    cs.SI cs.CY

    Investigating Human Values in Online Communities

    Authors: Nadav Borenstein, Arnav Arora, Lucie-Aimée Kaffee, Isabelle Augenstein

    Abstract: Human values play a vital role as an analytical tool in social sciences, enabling the study of diverse dimensions within society as a whole and among individual communities. This paper addresses the limitations of traditional survey-based studies of human values by proposing a computational application of Schwartz's values framework to Reddit, a platform organized into distinct online communities.…

    Submitted 17 June, 2024; v1 submitted 21 February, 2024; originally announced February 2024.

  21. arXiv:2402.12560  [pdf, other]

    cs.CL cs.AI

    CausalGym: Benchmarking causal interpretability methods on linguistic tasks

    Authors: Aryaman Arora, Dan Jurafsky, Christopher Potts

    Abstract: Language models (LMs) have proven to be powerful tools for psycholinguistic research, but most prior work has focused on purely behavioural measures (e.g., surprisal comparisons). At the same time, research in model interpretability has begun to illuminate the abstract causal mechanisms shaping LM behavior. To help bring these strands of research closer together, we introduce CausalGym. We adapt a…

    Submitted 19 February, 2024; originally announced February 2024.

    Comments: 9 pages main text, 26 pages total

    ACM Class: I.2.7

  22. arXiv:2402.05244  [pdf, ps, other]

    cs.DC

    CRIU -- Checkpoint Restore in Userspace for computational simulations and scientific applications

    Authors: Fabio Andrijauskas, Igor Sfiligoi, Diego Davila, Aashay Arora, Jonathan Guiang, Brian Bockelman, Greg Thain, Frank Wurthwein

    Abstract: Creating new materials, discovering new drugs, and simulating systems are essential processes for research and innovation and require substantial computational power. While many applications can be split into many smaller independent tasks, some cannot and may take hours or weeks to run to completion. To better manage those longer-running jobs, it would be desirable to stop them at any arbitrary p…

    Submitted 7 February, 2024; originally announced February 2024.

    Comments: 26TH INTERNATIONAL CONFERENCE ON COMPUTING IN HIGH ENERGY & NUCLEAR PHYSICS - 2023

  23. arXiv:2402.02302  [pdf, other]

    eess.AS cs.CL

    Predicting positive transfer for improved low-resource speech recognition using acoustic pseudo-tokens

    Authors: Nay San, Georgios Paraskevopoulos, Aryaman Arora, Xiluo He, Prabhjot Kaur, Oliver Adams, Dan Jurafsky

    Abstract: While massively multilingual speech models like wav2vec 2.0 XLSR-128 can be directly fine-tuned for automatic speech recognition (ASR), downstream performance can still be relatively poor on languages that are under-represented in the pre-training data. Continued pre-training on 70-200 hours of untranscribed speech in these languages can help -- but what about languages without that much recorded…

    Submitted 3 February, 2024; originally announced February 2024.

    Comments: Accepted for SIGTYP2024

  24. arXiv:2401.14560  [pdf]

    cs.CY

    The Role of Intelligent Transportation Systems and Artificial Intelligence in Energy Efficiency and Emission Reduction

    Authors: Omar Rinchi, Ahmad Alsharoa, Ibrahem Shatnawi, Anvita Arora

    Abstract: Despite the technological advancements in the transportation sector, the industry continues to grapple with increasing energy consumption and vehicular emissions, which intensify environmental degradation and climate change. The inefficient management of traffic flow, the underutilization of transport network interconnectivity, and the limited implementation of artificial intelligence (AI)-driven…

    Submitted 25 January, 2024; originally announced January 2024.

    Comments: 25 pages, 4 figures

  25. arXiv:2401.13851  [pdf, ps, other]

    cs.SD cs.LG eess.AS

    Scaling NVIDIA's Multi-speaker Multi-lingual TTS Systems with Zero-Shot TTS to Indic Languages

    Authors: Akshit Arora, Rohan Badlani, Sungwon Kim, Rafael Valle, Bryan Catanzaro

    Abstract: In this paper, we describe the TTS models developed by NVIDIA for the MMITS-VC (Multi-speaker, Multi-lingual Indic TTS with Voice Cloning) 2024 Challenge. In Tracks 1 and 2, we utilize RAD-MMM to perform few-shot TTS by training additionally on 5 minutes of target speaker data. In Track 3, we utilize P-Flow to perform zero-shot TTS by training on the challenge dataset as well as external datasets.…

    Submitted 29 January, 2024; v1 submitted 24 January, 2024; originally announced January 2024.

    Comments: Presentation accepted at ICASSP 2024

  26. arXiv:2401.12631  [pdf, other]

    cs.LG cs.AI cs.CL

    A Reply to Makelov et al. (2023)'s "Interpretability Illusion" Arguments

    Authors: Zhengxuan Wu, Atticus Geiger, Jing Huang, Aryaman Arora, Thomas Icard, Christopher Potts, Noah D. Goodman

    Abstract: We respond to the recent paper by Makelov et al. (2023), which reviews subspace interchange intervention methods like distributed alignment search (DAS; Geiger et al. 2023) and claims that these methods potentially cause "interpretability illusions". We first review Makelov et al. (2023)'s technical notion of what an "interpretability illusion" is, and then we show that even intuitive and desirabl…

    Submitted 23 January, 2024; originally announced January 2024.

    Comments: 20 pages, 14 figures

  27. arXiv:2401.03677  [pdf, other]

    cs.CL cs.LG

    Overview of the 2023 ICON Shared Task on Gendered Abuse Detection in Indic Languages

    Authors: Aatman Vaidya, Arnav Arora, Aditya Joshi, Tarunima Prabhakar

    Abstract: This paper reports the findings of the ICON 2023 on Gendered Abuse Detection in Indic Languages. The shared task deals with the detection of gendered abuse in online text. The shared task was conducted as a part of ICON 2023, based on a novel dataset in Hindi, Tamil and the Indian dialect of English. The participants were given three subtasks with the train dataset consisting of approximately 6500…

    Submitted 8 January, 2024; originally announced January 2024.

    Comments: This paper has been accepted at 20th International Conference on Natural Language Processing (ICON), it is of 5 pages

  28. arXiv:2312.12589  [pdf, other]

    cs.NI

    400Gbps benchmark of XRootD HTTP-TPC

    Authors: Aashay Arora, Jonathan Guiang, Diego Davila, Frank Würthwein, Justas Balcas, Harvey Newman

    Abstract: Due to the increased demand of network traffic expected during the HL-LHC era, the T2 sites in the USA will be required to have 400Gbps of available bandwidth to their storage solution. With the above in mind we are pursuing a scale test of XRootD software when used to perform Third Party Copy transfers using the HTTP protocol. Our main objective is to understand the possible limitations in the so…

    Submitted 19 December, 2023; originally announced December 2023.

    Comments: 8 pages, 4 figures, submitted to CHEP'23

  29. arXiv:2311.12396  [pdf, other]

    cs.AR

    GreenFPGA: Evaluating FPGAs as Environmentally Sustainable Computing Solutions

    Authors: Chetan Choppali Sudarshan, Aman Arora, Vidya A. Chhabria

    Abstract: Growing global concerns about climate change highlight the need for environmentally sustainable computing. The ecological impact of computing, including operational and embodied, is a key consideration. Field Programmable Gate Arrays (FPGAs) stand out as promising sustainable computing platforms due to their reconfigurability across various applications. This paper introduces GreenFPGA, a tool est…

    Submitted 21 November, 2023; originally announced November 2023.

    Comments: Under review at DAC 2024

  30. arXiv:2311.11384  [pdf, other]

    cs.AR

    PIMSAB: A Processing-In-Memory System with Spatially-Aware Communication and Bit-Serial-Aware Computation

    Authors: Aman Arora, Jian Weng, Siyuan Ma, Tony Nowatzki, Lizy K. John

    Abstract: Bit-serial Processing-In-Memory (PIM) is an attractive paradigm for accelerator architectures, for parallel workloads such as Deep Learning (DL), because of its capability to achieve massive data parallelism at a low area overhead and provide orders-of-magnitude data movement savings by moving computational resources closer to the data. While many PIM architectures have been proposed, improvements…

    Submitted 19 November, 2023; originally announced November 2023.

    Comments: Aman Arora and Jian Weng are co-first authors with equal contribution

  31. arXiv:2311.09086  [pdf, other]

    cs.CL cs.AI cs.SI

    The Uli Dataset: An Exercise in Experience Led Annotation of oGBV

    Authors: Arnav Arora, Maha Jinadoss, Cheshta Arora, Denny George, Brindaalakshmi, Haseena Dawood Khan, Kirti Rawat, Div, Ritash, Seema Mathur, Shivani Yadav, Shehla Rashid Shora, Rie Raut, Sumit Pawar, Apurva Paithane, Sonia, Vivek, Dharini Priscilla, Khairunnisha, Grace Banu, Ambika Tandon, Rishav Thakker, Rahul Dev Korra, Aatman Vaidya, Tarunima Prabhakar

    Abstract: Online gender based violence has grown concomitantly with adoption of the internet and social media. Its effects are worse in the Global majority where many users use social media in languages other than English. The scale and volume of conversations on the internet has necessitated the need for automated detection of hate speech, and more specifically gendered abuse. There is, however, a lack of…

    Submitted 24 June, 2024; v1 submitted 15 November, 2023; originally announced November 2023.

  32. arXiv:2311.09000  [pdf, other]

    cs.CL

    Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers

    Authors: Yuxia Wang, Revanth Gangi Reddy, Zain Muhammad Mujahid, Arnav Arora, Aleksandr Rubashevskii, Jiahui Geng, Osama Mohammed Afzal, Liangming Pan, Nadav Borenstein, Aditya Pillai, Isabelle Augenstein, Iryna Gurevych, Preslav Nakov

    Abstract: The increased use of large language models (LLMs) across a variety of real-world applications calls for mechanisms to verify the factual accuracy of their outputs. In this work, we present a holistic end-to-end solution for annotating the factuality of LLM-generated responses, which encompasses a multi-stage annotation scheme designed to yield detailed labels concerning the verifiability and factu…

    Submitted 16 April, 2024; v1 submitted 15 November, 2023; originally announced November 2023.

    Comments: 30 pages, 13 figures

  33. arXiv:2311.07804  [pdf, other]

    cs.CL

    IruMozhi: Automatically classifying diglossia in Tamil

    Authors: Kabilan Prasanna, Aryaman Arora

    Abstract: Tamil, a Dravidian language of South Asia, is a highly diglossic language with two very different registers in everyday use: Literary Tamil (preferred in writing and formal communication) and Spoken Tamil (confined to speech and informal media). Spoken Tamil is under-supported in modern NLP systems. In this paper, we release IruMozhi, a human-annotated dataset of parallel text in Literary and Spok…

    Submitted 13 November, 2023; originally announced November 2023.

    Comments: 4 pages main text, 7 total

    ACM Class: I.2.7

  34. arXiv:2311.04980  [pdf, other]

    cs.AR

    MaxEVA: Maximizing the Efficiency of Matrix Multiplication on Versal AI Engine

    Authors: Endri Taka, Aman Arora, Kai-Chiang Wu, Diana Marculescu

    Abstract: The increasing computational and memory requirements of Deep Learning (DL) workloads has led to outstanding innovations in hardware architectures. An archetype of such architectures is the novel Versal AI Engine (AIE) by AMD/Xilinx. The AIE comprises multiple programmable processors optimized for vector-based algorithms. An AIE array consisting of 400 processor cores, operating at 1.25 GHz is able…

    Submitted 13 November, 2023; v1 submitted 8 November, 2023; originally announced November 2023.

    Comments: Accepted as full paper at FPT 2023

  35. arXiv:2310.10050  [pdf, other]

    cs.CV cs.CL econ.GN

    EfficientOCR: An Extensible, Open-Source Package for Efficiently Digitizing World Knowledge

    Authors: Tom Bryan, Jacob Carlson, Abhishek Arora, Melissa Dell

    Abstract: Billions of public domain documents remain trapped in hard copy or lack an accurate digitization. Modern natural language processing methods cannot be used to index, retrieve, and summarize their texts; conduct computational textual analyses; or extract information for statistical analyses, and these texts cannot be incorporated into language model training. Given the diversity and sheer quantity…

    Submitted 16 October, 2023; originally announced October 2023.

  36. arXiv:2310.05779  [pdf, other]

    cs.LG

    Why Should This Article Be Deleted? Transparent Stance Detection in Multilingual Wikipedia Editor Discussions

    Authors: Lucie-Aimée Kaffee, Arnav Arora, Isabelle Augenstein

    Abstract: The moderation of content on online platforms is usually non-transparent. On Wikipedia, however, this discussion is carried out publicly and the editors are encouraged to use the content moderation policies as explanations for making moderation decisions. Currently, only a few comments explicitly mention those policies -- 20% of the English ones, but as few as 2% of the German and Turkish comments…

    Submitted 23 October, 2023; v1 submitted 9 October, 2023; originally announced October 2023.

    Comments: This submission has been accepted to 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023)

  37. arXiv:2309.00789  [pdf, other]

    cs.CL

    LinkTransformer: A Unified Package for Record Linkage with Transformer Language Models

    Authors: Abhishek Arora, Melissa Dell

    Abstract: Linking information across sources is fundamental to a variety of analyses in social science, business, and government. While large language models (LLMs) offer enormous promise for improving record linkage in noisy datasets, in many domains approximate string matching packages in popular softwares such as R and Stata remain predominant. These packages have clean, simple interfaces and can be easi…

    Submitted 24 June, 2024; v1 submitted 1 September, 2023; originally announced September 2023.

  38. arXiv:2308.14179  [pdf, other]

    cs.CL cs.AI cs.CV

    Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP

    Authors: Vedant Palit, Rohan Pandey, Aryaman Arora, Paul Pu Liang

    Abstract: Mechanistic interpretability seeks to understand the neural mechanisms that enable specific behaviors in Large Language Models (LLMs) by leveraging causality-based methods. While these approaches have identified neural circuits that copy spans of text, capture factual knowledge, and more, they remain unusable for multimodal models since adapting these tools to the vision-language domain requires c…

    Submitted 27 August, 2023; originally announced August 2023.

    Comments: Final version for 5th Workshop on Closing the Loop Between Vision and Language (CLVL) @ ICCV 2023. 4 pages, 5 figures

  39. arXiv:2308.12477  [pdf, other]

    cs.CL cs.CV econ.GN

    American Stories: A Large-Scale Structured Text Dataset of Historical U.S. Newspapers

    Authors: Melissa Dell, Jacob Carlson, Tom Bryan, Emily Silcock, Abhishek Arora, Zejiang Shen, Luca D'Amico-Wong, Quan Le, Pablo Querubin, Leander Heldring

    Abstract: Existing full text datasets of U.S. public domain newspapers do not recognize the often complex layouts of newspaper scans, and as a result the digitized content scrambles texts from articles, headlines, captions, advertisements, and other layout regions. OCR quality can also be low. This study develops a novel, deep learning pipeline for extracting full article texts from newspaper images and app…

    Submitted 23 August, 2023; originally announced August 2023.

  40. arXiv:2308.09829  [pdf, other]

    cs.LG cs.NI

    Learning from A Single Graph is All You Need for Near-Shortest Path Routing in Wireless Networks

    Authors: Yung-Fu Chen, Sen Lin, Anish Arora

    Abstract: We propose a learning algorithm for local routing policies that needs only a few data samples obtained from a single graph while generalizing to all random graphs in a standard model of wireless networks. We thus solve the all-pairs near-shortest path problem by training deep neural networks (DNNs) that efficiently and scalably learn routing policies that are local, i.e., they only consider node s…

    Submitted 18 August, 2023; originally announced August 2023.

  41. arXiv:2308.02582  [pdf, other]

    cs.CL cs.AI cs.LG

    Adapt and Decompose: Efficient Generalization of Text-to-SQL via Domain Adapted Least-To-Most Prompting

    Authors: Aseem Arora, Shabbirhussain Bhaisaheb, Harshit Nigam, Manasi Patwardhan, Lovekesh Vig, Gautam Shroff

    Abstract: Cross-domain and cross-compositional generalization of Text-to-SQL semantic parsing is a challenging task. Existing Large Language Model (LLM) based solutions rely on inference-time retrieval of few-shot exemplars from the training set to synthesize a run-time prompt for each Natural Language (NL) test query. In contrast, we devise an algorithm which performs offline sampling of a minimal set-of f…

    Submitted 9 August, 2023; v1 submitted 1 August, 2023; originally announced August 2023.

    Comments: 22 Pages

  42. arXiv:2306.03940  [pdf, other]

    cs.SI cs.CY cs.DL

    Orphan Articles: The Dark Matter of Wikipedia

    Authors: Akhil Arora, Robert West, Martin Gerlach

    Abstract: With 60M articles in more than 300 language versions, Wikipedia is the largest platform for open and freely accessible knowledge. While the available content has been growing continuously at a rate of around 200K new articles each month, very little attention has been paid to the accessibility of the content. One crucial aspect of accessibility is the integration of hyperlinks into the network so…

    Submitted 6 June, 2023; originally announced June 2023.

  43. arXiv:2306.02514  [pdf, other]

    cs.CL

    Jambu: A historical linguistic database for South Asian languages

    Authors: Aryaman Arora, Adam Farris, Samopriya Basu, Suresh Kolichala

    Abstract: We introduce Jambu, a cognate database of South Asian languages which unifies dozens of previous sources in a structured and accessible format. The database includes 287k lemmata from 602 lects, grouped together in 23k sets of cognates. We outline the data wrangling necessary to compile the dataset and train neural models for reflex prediction on the Indo-Aryan subset of the data. We hope that Jam…

    Submitted 4 June, 2023; originally announced June 2023.

    Comments: 5 pages main text, 10 pages total. To appear at SIGMORPHON

  44. arXiv:2306.00765  [pdf, other]

    cs.CL cs.AI cs.IR stat.CO stat.ML

    Topic-Guided Sampling For Data-Efficient Multi-Domain Stance Detection

    Authors: Erik Arakelyan, Arnav Arora, Isabelle Augenstein

    Abstract: Stance Detection is concerned with identifying the attitudes expressed by an author towards a target of interest. This task spans a variety of domains ranging from social media opinion identification to detecting the stance for a legal claim. However, the framing of the task varies within these domains, in terms of the data collection protocol, the label dictionary and the number of available anno…

    Submitted 1 June, 2023; originally announced June 2023.

    Comments: ACL 2023 (Oral)

  45. arXiv:2305.17347  [pdf, other]

    cs.CL

    CGELBank Annotation Manual v1.1

    Authors: Brett Reynolds, Nathan Schneider, Aryaman Arora

    Abstract: CGELBank is a treebank and associated tools based on a syntactic formalism for English derived from the Cambridge Grammar of the English Language. This document lays out the particularities of the CGELBank annotation scheme.

    Submitted 4 June, 2024; v1 submitted 26 May, 2023; originally announced May 2023.

  46. arXiv:2305.15041  [pdf, other]

    cs.CL

    Generating Faithful Synthetic Data with Large Language Models: A Case Study in Computational Social Science

    Authors: Veniamin Veselovsky, Manoel Horta Ribeiro, Akhil Arora, Martin Josifoski, Ashton Anderson, Robert West

    Abstract: Large Language Models (LLMs) have democratized synthetic data generation, which in turn has the potential to simplify and broaden a wide gamut of NLP tasks. Here, we tackle a pervasive problem in synthetic data generation: its generative distribution often differs from the distribution of real-world data researchers care about (in other words, it is unfaithful). In a case study on sarcasm detectio…

    Submitted 24 May, 2023; originally announced May 2023.

    Comments: 8 pages

  47. arXiv:2305.14672  [pdf, other]

    cs.CL cs.CV econ.GN

    Quantifying Character Similarity with Vision Transformers

    Authors: Xinmei Yang, Abhishek Arora, Shao-Yu Jheng, Melissa Dell

    Abstract: Record linkage is a bedrock of quantitative social science, as analyses often require linking data from multiple, noisy sources. Off-the-shelf string matching methods are widely used, as they are straightforward and cheap to implement and scale. Not all character substitutions are equally probable, and for some settings there are widely used handcrafted lists denoting which string substitutions ar…

    Submitted 23 May, 2023; originally announced May 2023.

  48. arXiv:2305.05718  [pdf, other]

    cs.NI

    QF-Geo: Capacity Aware Geographic Routing using Bounded Regions of Wireless Meshes

    Authors: Yung-Fu Chen, Kenneth W. Parker, Anish Arora

    Abstract: Routing in wireless meshes must detour around holes. Extant routing protocols often underperform in minimally connected networks where holes are larger and more frequent. Minimal density networks are common in practice due to deployment cost constraints, mobility dynamics, and/or adversarial jamming. Protocols that use global search to determine optimal paths incur search overhead that limits scal…

    Submitted 9 May, 2023; originally announced May 2023.

  49. arXiv:2304.10618  [pdf, other]

    cs.AR eess.SP

    ULEEN: A Novel Architecture for Ultra Low-Energy Edge Neural Networks

    Authors: Zachary Susskind, Aman Arora, Igor D. S. Miranda, Alan T. L. Bacellar, Luis A. Q. Villon, Rafael F. Katopodis, Leandro S. de Araujo, Diego L. C. Dutra, Priscila M. V. Lima, Felipe M. G. Franca, Mauricio Breternitz Jr., Lizy K. John

    Abstract: The deployment of AI models on low-power, real-time edge devices requires accelerators for which energy, latency, and area are all first-order concerns. There are many approaches to enabling deep neural networks (DNNs) in this domain, including pruning, quantization, compression, and binary neural networks (BNNs), but with the emergence of the "extreme edge", there is now a demand for even more ef…

    Submitted 20 April, 2023; originally announced April 2023.

    Comments: 14 pages, 14 figures. Portions of this article draw heavily from arXiv:2203.01479, most notably sections 5E and 5F.2

  50. arXiv:2304.08315  [pdf, other]

    cs.CL cs.AI

    Thorny Roses: Investigating the Dual Use Dilemma in Natural Language Processing

    Authors: Lucie-Aimée Kaffee, Arnav Arora, Zeerak Talat, Isabelle Augenstein

    Abstract: Dual use, the intentional, harmful reuse of technology and scientific artefacts, is a problem yet to be well-defined within the context of Natural Language Processing (NLP). However, as NLP technologies continue to advance and become increasingly widespread in society, their inner workings have become increasingly opaque. Therefore, understanding dual use concerns and potential ways of limiting th…

    Submitted 30 October, 2023; v1 submitted 17 April, 2023; originally announced April 2023.