Search | arXiv e-print repository

Fuzzing at Scale: The Untold Story of the Scheduler

Abstract: How to search for bugs in 1,000 programs using a pre-existing fuzzer and a standard PC? We consider this problem and show that a well-designed strategy that determines which programs to fuzz and for how long can greatly impact the number of bugs found across the programs. In fact, the impact of employing an effective strategy is comparable to that of utilizing a state-of-the-art fuzzer. The consid… ▽ More How to search for bugs in 1,000 programs using a pre-existing fuzzer and a standard PC? We consider this problem and show that a well-designed strategy that determines which programs to fuzz and for how long can greatly impact the number of bugs found across the programs. In fact, the impact of employing an effective strategy is comparable to that of utilizing a state-of-the-art fuzzer. The considered problem is referred to as fuzzing at scale, and the strategy as scheduler. We show that besides a naive scheduler, that allocates equal fuzz time to all programs, we can consider dynamic schedulers that adjust time allocation based on the ongoing fuzzing progress of individual programs. Such schedulers are superior because they lead both to higher number of total found bugs and to higher number of found bugs for most programs. The performance gap between naive and dynamic schedulers can be as wide (or even wider) as the gap between two fuzzers. Our findings thus suggest that the problem of advancing schedulers is fundamental for fuzzing at scale. We develop several schedulers and leverage the most sophisticated one to fuzz simultaneously our newly compiled benchmark of around 5,000 Ubuntu programs, and detect 4908 bugs. △ Less

Submitted 26 June, 2024; originally announced June 2024.

arXiv:2406.15754 [pdf, other]

Multimodal Segmentation for Vocal Tract Modeling

Authors: Rishi Jain, Bohan Yu, Peter Wu, Tejas Prabhune, Gopala Anumanchipalli

Abstract: Accurate modeling of the vocal tract is necessary to construct articulatory representations for interpretable speech processing and linguistics. However, vocal tract modeling is challenging because many internal articulators are occluded from external motion capture technologies. Real-time magnetic resonance imaging (RT-MRI) allows measuring precise movements of internal articulators during speech… ▽ More Accurate modeling of the vocal tract is necessary to construct articulatory representations for interpretable speech processing and linguistics. However, vocal tract modeling is challenging because many internal articulators are occluded from external motion capture technologies. Real-time magnetic resonance imaging (RT-MRI) allows measuring precise movements of internal articulators during speech, but annotated datasets of MRI are limited in size due to time-consuming and computationally expensive labeling methods. We first present a deep labeling strategy for the RT-MRI video using a vision-only segmentation approach. We then introduce a multimodal algorithm using audio to improve segmentation of vocal articulators. Together, we set a new benchmark for vocal tract modeling in MRI video segmentation and use this to release labels for a 75-speaker RT-MRI dataset, increasing the amount of labeled public RT-MRI data of the vocal tract by over a factor of 9. The code and dataset labels can be found at \url{rishiraij.github.io/multimodal-mri-avatar/}. △ Less

Submitted 22 June, 2024; originally announced June 2024.

Comments: Interspeech 2024

arXiv:2406.15563 [pdf, ps, other]

Exponential Time Approximation for Coloring 3-Colorable Graphs

Authors: Venkatesan Guruswami, Rhea Jain

Abstract: The problem of efficiently coloring $3$-colorable graphs with few colors has received much attention on both the algorithmic and inapproximability fronts. We consider exponential time approximations, in which given a parameter $r$, we aim to develop an $r$-approximation algorithm with the best possible runtime, providing a tradeoff between runtime and approximation ratio. In this vein, an algorith… ▽ More The problem of efficiently coloring $3$-colorable graphs with few colors has received much attention on both the algorithmic and inapproximability fronts. We consider exponential time approximations, in which given a parameter $r$, we aim to develop an $r$-approximation algorithm with the best possible runtime, providing a tradeoff between runtime and approximation ratio. In this vein, an algorithm to $O(n^\varepsilon)$-color a 3-colorable graphs in time $2^{Θ(n^{1-2\varepsilon}\log(n))}$ is given in (Atserias and Dalmau, SODA 2022.) We build on tools developed in (Bansal et al., Algorithmic, 2019) to obtain an algorithm to color $3$-colorable graphs with $O(r)$ colors in $\exp\left(\tilde{O}\left(\frac {n\log^{11/2}r} {r^3}\right)\right)$ time, asymptotically improving upon the bound given by Atserias and Dalmau. △ Less

Submitted 21 June, 2024; originally announced June 2024.

arXiv:2406.14290 [pdf, ps, other]

Examining the Implications of Deepfakes for Election Integrity

Authors: Hriday Ranka, Mokshit Surana, Neel Kothari, Veer Pariawala, Pratyay Banerjee, Aditya Surve, Sainath Reddy Sankepally, Raghav Jain, Jhagrut Lalwani, Swapneel Mehta

Abstract: It is becoming cheaper to launch disinformation operations at scale using AI-generated content, in particular 'deepfake' technology. We have observed instances of deepfakes in political campaigns, where generated content is employed to both bolster the credibility of certain narratives (reinforcing outcomes) and manipulate public perception to the detriment of targeted candidates or causes (advers… ▽ More It is becoming cheaper to launch disinformation operations at scale using AI-generated content, in particular 'deepfake' technology. We have observed instances of deepfakes in political campaigns, where generated content is employed to both bolster the credibility of certain narratives (reinforcing outcomes) and manipulate public perception to the detriment of targeted candidates or causes (adversarial outcomes). We discuss the threats from deepfakes in politics, highlight model specifications underlying different types of deepfake generation methods, and contribute an accessible evaluation of the efficacy of existing detection methods. We provide this as a summary for lawmakers and civil society actors to understand how the technology may be applied in light of existing policies regulating its use. We highlight the limitations of existing detection mechanisms and discuss the areas where policies and regulations are required to address the challenges of deepfakes. △ Less

Submitted 20 June, 2024; originally announced June 2024.

Comments: Accepted at the AAAI 2024 conference, AI for Credible Elections Workshop-AI4CE 2024

arXiv:2406.09574 [pdf, other]

Online Bandit Learning with Offline Preference Data

Authors: Akhil Agnihotri, Rahul Jain, Deepak Ramachandran, Zheng Wen

Abstract: Reinforcement Learning with Human Feedback (RLHF) is at the core of fine-tuning methods for generative AI models for language and images. Such feedback is often sought as rank or preference feedback from human raters, as opposed to eliciting scores since the latter tends to be very noisy. On the other hand, RL theory and algorithms predominantly assume that a reward feedback is available. In parti… ▽ More Reinforcement Learning with Human Feedback (RLHF) is at the core of fine-tuning methods for generative AI models for language and images. Such feedback is often sought as rank or preference feedback from human raters, as opposed to eliciting scores since the latter tends to be very noisy. On the other hand, RL theory and algorithms predominantly assume that a reward feedback is available. In particular, approaches for online learning that can be helpful in adaptive data collection via active learning cannot incorporate offline preference data. In this paper, we adopt a finite-armed linear bandit model as a prototypical model of online learning. We consider an offline preference dataset to be available generated by an expert of unknown 'competence'. We propose $\texttt{warmPref-PS}$, a posterior sampling algorithm for online learning that can be warm-started with an offline dataset with noisy preference feedback. We show that by modeling the competence of the expert that generated it, we are able to use such a dataset most effectively. We support our claims with novel theoretical analysis of its Bayesian regret, as well as extensive empirical evaluation of an approximate algorithm which performs substantially better (almost 25 to 50% regret reduction in our studies) as compared to baselines. △ Less

Submitted 13 June, 2024; originally announced June 2024.

arXiv:2406.09563 [pdf, other]

e-COP : Episodic Constrained Optimization of Policies

Authors: Akhil Agnihotri, Rahul Jain, Deepak Ramachandran, Sahil Singla

Abstract: In this paper, we present the $\texttt{e-COP}$ algorithm, the first policy optimization algorithm for constrained Reinforcement Learning (RL) in episodic (finite horizon) settings. Such formulations are applicable when there are separate sets of optimization criteria and constraints on a system's behavior. We approach this problem by first establishing a policy difference lemma for the episodic se… ▽ More In this paper, we present the $\texttt{e-COP}$ algorithm, the first policy optimization algorithm for constrained Reinforcement Learning (RL) in episodic (finite horizon) settings. Such formulations are applicable when there are separate sets of optimization criteria and constraints on a system's behavior. We approach this problem by first establishing a policy difference lemma for the episodic setting, which provides the theoretical foundation for the algorithm. Then, we propose to combine a set of established and novel solution ideas to yield the $\texttt{e-COP}$ algorithm that is easy to implement and numerically stable, and provide a theoretical guarantee on optimality under certain scaling assumptions. Through extensive empirical analysis using benchmarks in the Safety Gym suite, we show that our algorithm has similar or better performance than SoTA (non-episodic) algorithms adapted for the episodic setting. The scalability of the algorithm opens the door to its application in safety-constrained Reinforcement Learning from Human Feedback for Large Language or Diffusion Models. △ Less

Submitted 13 June, 2024; originally announced June 2024.

arXiv:2406.08354 [pdf, other]

DocSynthv2: A Practical Autoregressive Modeling for Document Generation

Authors: Sanket Biswas, Rajiv Jain, Vlad I. Morariu, Jiuxiang Gu, Puneet Mathur, Curtis Wigington, Tong Sun, Josep Lladós

Abstract: While the generation of document layouts has been extensively explored, comprehensive document generation encompassing both layout and content presents a more complex challenge. This paper delves into this advanced domain, proposing a novel approach called DocSynthv2 through the development of a simple yet effective autoregressive structured model. Our model, distinct in its integration of both la… ▽ More While the generation of document layouts has been extensively explored, comprehensive document generation encompassing both layout and content presents a more complex challenge. This paper delves into this advanced domain, proposing a novel approach called DocSynthv2 through the development of a simple yet effective autoregressive structured model. Our model, distinct in its integration of both layout and textual cues, marks a step beyond existing layout-generation approaches. By focusing on the relationship between the structural elements and the textual content within documents, we aim to generate cohesive and contextually relevant documents without any reliance on visual components. Through experimental studies on our curated benchmark for the new task, we demonstrate the ability of our model combining layout and textual information in enhancing the generation quality and relevance of documents, opening new pathways for research in document creation and automated design. Our findings emphasize the effectiveness of autoregressive models in handling complex document generation tasks. △ Less

Submitted 12 June, 2024; originally announced June 2024.

Comments: Spotlight (Oral) Acceptance to CVPR 2024 Workshop for Graphic Design Understanding and Generation (GDUG)

arXiv:2406.05344 [pdf, other]

MemeGuard: An LLM and VLM-based Framework for Advancing Content Moderation via Meme Intervention

Authors: Prince Jha, Raghav Jain, Konika Mandal, Aman Chadha, Sriparna Saha, Pushpak Bhattacharyya

Abstract: In the digital world, memes present a unique challenge for content moderation due to their potential to spread harmful content. Although detection methods have improved, proactive solutions such as intervention are still limited, with current research focusing mostly on text-based content, neglecting the widespread influence of multimodal content like memes. Addressing this gap, we present \textit… ▽ More In the digital world, memes present a unique challenge for content moderation due to their potential to spread harmful content. Although detection methods have improved, proactive solutions such as intervention are still limited, with current research focusing mostly on text-based content, neglecting the widespread influence of multimodal content like memes. Addressing this gap, we present \textit{MemeGuard}, a comprehensive framework leveraging Large Language Models (LLMs) and Visual Language Models (VLMs) for meme intervention. \textit{MemeGuard} harnesses a specially fine-tuned VLM, \textit{VLMeme}, for meme interpretation, and a multimodal knowledge selection and ranking mechanism (\textit{MKS}) for distilling relevant knowledge. This knowledge is then employed by a general-purpose LLM to generate contextually appropriate interventions. Another key contribution of this work is the \textit{\textbf{I}ntervening} \textit{\textbf{C}yberbullying in \textbf{M}ultimodal \textbf{M}emes (ICMM)} dataset, a high-quality, labeled dataset featuring toxic memes and their corresponding human-annotated interventions. We leverage \textit{ICMM} to test \textit{MemeGuard}, demonstrating its proficiency in generating relevant and effective responses to toxic memes. △ Less

Submitted 8 June, 2024; originally announced June 2024.

arXiv:2406.00833 [pdf, other]

Harvard Undergraduate Survey on Generative AI

Authors: Shikoh Hirabayashi, Rishab Jain, Nikola Jurković, Gabriel Wu

Abstract: How has generative AI impacted the experiences of college students? We study the influence of AI on the study habits, class choices, and career prospects of Harvard undergraduates (n=326), finding that almost 90% of students use generative AI. For roughly 25% of these students, AI has begun to substitute for attending office hours and completing required readings. Half of students are concerned th… ▽ More How has generative AI impacted the experiences of college students? We study the influence of AI on the study habits, class choices, and career prospects of Harvard undergraduates (n=326), finding that almost 90% of students use generative AI. For roughly 25% of these students, AI has begun to substitute for attending office hours and completing required readings. Half of students are concerned that AI will negatively impact their job prospects, and over half of students wish that Harvard had more classes on the future impacts of AI. We also investigate students' outlook on the broader social implications of AI, finding that half of students are worried that AI will increase economic inequality, and 40% believe that extinction risk from AI should be treated as a global priority with the same urgency as pandemics and nuclear war. Around half of students who have taken a class on AI expect AI to exceed human capabilities on almost all tasks within 30 years. We make some recommendations to the Harvard community in light of these results. △ Less

Submitted 2 June, 2024; originally announced June 2024.

arXiv:2405.20648 [pdf, other]

Shotluck Holmes: A Family of Efficient Small-Scale Large Language Vision Models For Video Captioning and Summarization

Authors: Richard Luo, Austin Peng, Adithya Vasudev, Rishabh Jain

Abstract: Video is an increasingly prominent and information-dense medium, yet it poses substantial challenges for language models. A typical video consists of a sequence of shorter segments, or shots, that collectively form a coherent narrative. Each shot is analogous to a word in a sentence where multiple data streams of information (such as visual and auditory data) must be processed simultaneously. Comp… ▽ More Video is an increasingly prominent and information-dense medium, yet it poses substantial challenges for language models. A typical video consists of a sequence of shorter segments, or shots, that collectively form a coherent narrative. Each shot is analogous to a word in a sentence where multiple data streams of information (such as visual and auditory data) must be processed simultaneously. Comprehension of the entire video requires not only understanding the visual-audio information of each shot but also requires that the model links the ideas between each shot to generate a larger, all-encompassing story. Despite significant progress in the field, current works often overlook videos' more granular shot-by-shot semantic information. In this project, we propose a family of efficient large language vision models (LLVMs) to boost video summarization and captioning called Shotluck Holmes. By leveraging better pretraining and data collection strategies, we extend the abilities of existing small LLVMs from being able to understand a picture to being able to understand a sequence of frames. Specifically, we show that Shotluck Holmes achieves better performance than state-of-the-art results on the Shot2Story video captioning and summary task with significantly smaller and more computationally efficient models. △ Less

Submitted 31 May, 2024; originally announced May 2024.

arXiv:2405.15090 [pdf, other]

Pure Exploration for Constrained Best Mixed Arm Identification with a Fixed Budget

Authors: Dengwang Tang, Rahul Jain, Ashutosh Nayyar, Pierluigi Nuzzo

Abstract: In this paper, we introduce the constrained best mixed arm identification (CBMAI) problem with a fixed budget. This is a pure exploration problem in a stochastic finite armed bandit model. Each arm is associated with a reward and multiple types of costs from unknown distributions. Unlike the unconstrained best arm identification problem, the optimal solution for the CBMAI problem may be a randomiz… ▽ More In this paper, we introduce the constrained best mixed arm identification (CBMAI) problem with a fixed budget. This is a pure exploration problem in a stochastic finite armed bandit model. Each arm is associated with a reward and multiple types of costs from unknown distributions. Unlike the unconstrained best arm identification problem, the optimal solution for the CBMAI problem may be a randomized mixture of multiple arms. The goal thus is to find the best mixed arm that maximizes the expected reward subject to constraints on the expected costs with a given learning budget $N$. We propose a novel, parameter-free algorithm, called the Score Function-based Successive Reject (SFSR) algorithm, that combines the classical successive reject framework with a novel score-function-based rejection criteria based on linear programming theory to identify the optimal support. We provide a theoretical upper bound on the mis-identification (of the the support of the best mixed arm) probability and show that it decays exponentially in the budget $N$ and some constants that characterize the hardness of the problem instance. We also develop an information theoretic lower bound on the error probability that shows that these constants appropriately characterize the problem difficulty. We validate this empirically on a number of average and hard instances. △ Less

Submitted 23 May, 2024; originally announced May 2024.

Comments: 7 pages, 5 figures, 1 table

arXiv:2405.13333 [pdf]

Service Mesh: Architectures, Applications, and Implementations

Authors: Behrooz Farkiani, Raj Jain

Abstract: The scalability and flexibility of microservice architecture have led to major changes in cloud-native application architectures. However, the complexity of managing thousands of small services written in different languages and handling the exchange of data between them have caused significant management challenges. Service mesh is a promising solution that could mitigate these problems by introd… ▽ More The scalability and flexibility of microservice architecture have led to major changes in cloud-native application architectures. However, the complexity of managing thousands of small services written in different languages and handling the exchange of data between them have caused significant management challenges. Service mesh is a promising solution that could mitigate these problems by introducing an overlay layer on top of the services. In this paper, we first study the architecture and components of service mesh architecture. Then, we review two important service mesh implementations and discuss how the service mesh could be helpful in other areas, including 5G. △ Less

Submitted 22 May, 2024; originally announced May 2024.

Comments: 20 pages

arXiv:2405.04777 [pdf, other]

Empathy Through Multimodality in Conversational Interfaces

Authors: Mahyar Abbasian, Iman Azimi, Mohammad Feli, Amir M. Rahmani, Ramesh Jain

Abstract: Agents represent one of the most emerging applications of Large Language Models (LLMs) and Generative AI, with their effectiveness hinging on multimodal capabilities to navigate complex user environments. Conversational Health Agents (CHAs), a prime example of this, are redefining healthcare by offering nuanced support that transcends textual analysis to incorporate emotional intelligence. This pa… ▽ More Agents represent one of the most emerging applications of Large Language Models (LLMs) and Generative AI, with their effectiveness hinging on multimodal capabilities to navigate complex user environments. Conversational Health Agents (CHAs), a prime example of this, are redefining healthcare by offering nuanced support that transcends textual analysis to incorporate emotional intelligence. This paper introduces an LLM-based CHA engineered for rich, multimodal dialogue-especially in the realm of mental health support. It adeptly interprets and responds to users' emotional states by analyzing multimodal cues, thus delivering contextually aware and empathetically resonant verbal responses. Our implementation leverages the versatile openCHA framework, and our comprehensive evaluation involves neutral prompts expressed in diverse emotional tones: sadness, anger, and joy. We evaluate the consistency and repeatability of the planning capability of the proposed CHA. Furthermore, human evaluators critique the CHA's empathic delivery, with findings revealing a striking concordance between the CHA's outputs and evaluators' assessments. These results affirm the indispensable role of vocal (soon multimodal) emotion recognition in strengthening the empathetic connection built by CHAs, cementing their place at the forefront of interactive, compassionate digital health solutions. △ Less

Submitted 7 May, 2024; originally announced May 2024.

Comments: 7 pages, 2 figures, 2 tables, conference paper

arXiv:2404.18353 [pdf, other]

Do Neutral Prompts Produce Insecure Code? FormAI-v2 Dataset: Labelling Vulnerabilities in Code Generated by Large Language Models

Authors: Norbert Tihanyi, Tamas Bisztray, Mohamed Amine Ferrag, Ridhi Jain, Lucas C. Cordeiro

Abstract: This study provides a comparative analysis of state-of-the-art large language models (LLMs), analyzing how likely they generate vulnerabilities when writing simple C programs using a neutral zero-shot prompt. We address a significant gap in the literature concerning the security properties of code produced by these models without specific directives. N. Tihanyi et al. introduced the FormAI dataset… ▽ More This study provides a comparative analysis of state-of-the-art large language models (LLMs), analyzing how likely they generate vulnerabilities when writing simple C programs using a neutral zero-shot prompt. We address a significant gap in the literature concerning the security properties of code produced by these models without specific directives. N. Tihanyi et al. introduced the FormAI dataset at PROMISE '23, containing 112,000 GPT-3.5-generated C programs, with over 51.24% identified as vulnerable. We expand that work by introducing the FormAI-v2 dataset comprising 265,000 compilable C programs generated using various LLMs, including robust models such as Google's GEMINI-pro, OpenAI's GPT-4, and TII's 180 billion-parameter Falcon, to Meta's specialized 13 billion-parameter CodeLLama2 and various other compact models. Each program in the dataset is labelled based on the vulnerabilities detected in its source code through formal verification using the Efficient SMT-based Context-Bounded Model Checker (ESBMC). This technique eliminates false positives by delivering a counterexample and ensures the exclusion of false negatives by completing the verification process. Our study reveals that at least 63.47% of the generated programs are vulnerable. The differences between the models are minor, as they all display similar coding errors with slight variations. Our research highlights that while LLMs offer promising capabilities for code generation, deploying their output in a production environment requires risk assessment and validation. △ Less

Submitted 28 April, 2024; originally announced April 2024.

arXiv:2404.16870 [pdf, ps, other]

LEMDA: A Novel Feature Engineering Method for Intrusion Detection in IoT Systems

Authors: Ali Ghubaish, Zebo Yang, Aiman Erbad, Raj Jain

Abstract: Intrusion detection systems (IDS) for the Internet of Things (IoT) systems can use AI-based models to ensure secure communications. IoT systems tend to have many connected devices producing massive amounts of data with high dimensionality, which requires complex models. Complex models have notorious problems such as overfitting, low interpretability, and high computational complexity. Adding model… ▽ More Intrusion detection systems (IDS) for the Internet of Things (IoT) systems can use AI-based models to ensure secure communications. IoT systems tend to have many connected devices producing massive amounts of data with high dimensionality, which requires complex models. Complex models have notorious problems such as overfitting, low interpretability, and high computational complexity. Adding model complexity penalty (i.e., regularization) can ease overfitting, but it barely helps interpretability and computational efficiency. Feature engineering can solve these issues; hence, it has become critical for IDS in large-scale IoT systems to reduce the size and dimensionality of data, resulting in less complex models with excellent performance, smaller data storage, and fast detection. This paper proposes a new feature engineering method called LEMDA (Light feature Engineering based on the Mean Decrease in Accuracy). LEMDA applies exponential decay and an optional sensitivity factor to select and create the most informative features. The proposed method has been evaluated and compared to other feature engineering methods using three IoT datasets and four AI/ML models. The results show that LEMDA improves the F1 score performance of all the IDS models by an average of 34% and reduces the average training and detection times in most cases. △ Less

Submitted 20 April, 2024; originally announced April 2024.

arXiv:2404.16725 [pdf, ps, other]

Approximation Algorithms for Hop Constrained and Buy-at-Bulk Network Design via Hop Constrained Oblivious Routing

Authors: Chandra Chekuri, Rhea Jain

Abstract: We consider two-cost network design models in which edges of the input graph have an associated cost and length. We build upon recent advances in hop-constrained oblivious routing to obtain two sets of results. We address multicommodity buy-at-bulk network design in the nonuniform setting. Existing poly-logarithmic approximations are based on the junction tree approach [CHKS09,KN11]. We obtain a… ▽ More We consider two-cost network design models in which edges of the input graph have an associated cost and length. We build upon recent advances in hop-constrained oblivious routing to obtain two sets of results. We address multicommodity buy-at-bulk network design in the nonuniform setting. Existing poly-logarithmic approximations are based on the junction tree approach [CHKS09,KN11]. We obtain a new polylogarithmic approximation via a natural LP relaxation. This establishes an upper bound on its integrality gap and affirmatively answers an open question raised in [CHKS09]. The rounding is based on recent results in hop-constrained oblivious routing [GHZ21], and this technique yields a polylogarithmic approximation in more general settings such as set connectivity. Our algorithm for buy-at-bulk network design is based on an LP-based reduction to hop constrained network design for which we obtain LP-based bicriteria approximation algorithms. We also consider a fault-tolerant version of hop constrained network design where one wants to design a low-cost network to guarantee short paths between a given set of source-sink pairs even when k-1 edges can fail. This model has been considered in network design [GL17,GML18,AJL20] but no approximation algorithms were known. We obtain polylogarithmic bicriteria approximation algorithms for the single-source setting for any fixed k. We build upon the single-source algorithm and the junction-tree approach to obtain an approximation algorithm for the multicommodity setting when at most one edge can fail. △ Less

Submitted 25 April, 2024; originally announced April 2024.

arXiv:2404.06768 [pdf, ps, other]

A new approach to construct minimal linear codes over $\mathbb{F}_{3}$

Authors: Wajid M. Shaikh, Rupali S. Jain, B. Surendranath Reddy, Bhagyashri S. Patil, Sahar M. A. Maqbol

Abstract: In this article, we present two new approaches to construct minimal linear codes of dimension $n+1$ over $\mathbb{F}_{3}$ using characteristic and ternary functions. We also obtain the weight distributions of these constructed minimal linear codes. We further show that a specific class of these codes violates Ashikhmin-Barg condition. In this article, we present two new approaches to construct minimal linear codes of dimension $n+1$ over $\mathbb{F}_{3}$ using characteristic and ternary functions. We also obtain the weight distributions of these constructed minimal linear codes. We further show that a specific class of these codes violates Ashikhmin-Barg condition. △ Less

Submitted 10 April, 2024; originally announced April 2024.

Journal ref: MJMS-2024-0154

arXiv:2404.03220 [pdf, ps, other]

Commitments are equivalent to one-way state generators

Authors: Rishabh Batra, Rahul Jain

Abstract: One-way state generators (OWSG) are natural quantum analogs to classical one-way functions. We show that $O\left(\frac{n}{\log(n)}\right)$-copy OWSGs ($n$ represents the input length) are equivalent to $poly(n)$-copy OWSG and to quantum commitments. Since known results show that $o\left(\frac{n}{\log(n)}\right)$-copy OWSG cannot imply commitments, this shows that $O\left(\frac{n}{\log(n)}\right)$-… ▽ More One-way state generators (OWSG) are natural quantum analogs to classical one-way functions. We show that $O\left(\frac{n}{\log(n)}\right)$-copy OWSGs ($n$ represents the input length) are equivalent to $poly(n)$-copy OWSG and to quantum commitments. Since known results show that $o\left(\frac{n}{\log(n)}\right)$-copy OWSG cannot imply commitments, this shows that $O\left(\frac{n}{\log(n)}\right)$-copy OWSGs are the weakest OWSGs from which we can get commitments (and hence much of quantum cryptography). Our construction follows along the lines of Håstad, Impagliazzo, Levin and Luby [HILL], who obtained classical pseudorandom generators (PRG) from classical one-way functions (OWF), however with crucial modifications. Our construction, when applied to the classical case, provides an alternative to the construction provided by [HILL]. Since we do not argue conditioned on the output of the one-way function, our construction and analysis are arguably simpler and may be of independent interest. △ Less

Submitted 17 April, 2024; v1 submitted 4 April, 2024; originally announced April 2024.

Comments: minor changes to previous version

arXiv:2404.03150 [pdf, other]

NLP at UC Santa Cruz at SemEval-2024 Task 5: Legal Answer Validation using Few-Shot Multi-Choice QA

Authors: Anish Pahilajani, Samyak Rajesh Jain, Devasha Trivedi

Abstract: This paper presents our submission to the SemEval 2024 Task 5: The Legal Argument Reasoning Task in Civil Procedure. We present two approaches to solving the task of legal answer validation, given an introduction to the case, a question and an answer candidate. Firstly, we fine-tuned pre-trained BERT-based models and found that models trained on domain knowledge perform better. Secondly, we perfor… ▽ More This paper presents our submission to the SemEval 2024 Task 5: The Legal Argument Reasoning Task in Civil Procedure. We present two approaches to solving the task of legal answer validation, given an introduction to the case, a question and an answer candidate. Firstly, we fine-tuned pre-trained BERT-based models and found that models trained on domain knowledge perform better. Secondly, we performed few-shot prompting on GPT models and found that reformulating the answer validation task to be a multiple-choice QA task remarkably improves the performance of the model. Our best submission is a BERT-based model that achieved the 7th place out of 20. △ Less

Submitted 3 April, 2024; originally announced April 2024.

arXiv:2404.00477 [pdf, other]

DE-HNN: An effective neural model for Circuit Netlist representation

Authors: Zhishang Luo, Truong Son Hy, Puoya Tabaghi, Donghyeon Koh, Michael Defferrard, Elahe Rezaei, Ryan Carey, Rhett Davis, Rajeev Jain, Yusu Wang

Abstract: The run-time for optimization tools used in chip design has grown with the complexity of designs to the point where it can take several days to go through one design cycle which has become a bottleneck. Designers want fast tools that can quickly give feedback on a design. Using the input and output data of the tools from past designs, one can attempt to build a machine learning model that predicts… ▽ More The run-time for optimization tools used in chip design has grown with the complexity of designs to the point where it can take several days to go through one design cycle which has become a bottleneck. Designers want fast tools that can quickly give feedback on a design. Using the input and output data of the tools from past designs, one can attempt to build a machine learning model that predicts the outcome of a design in significantly shorter time than running the tool. The accuracy of such models is affected by the representation of the design data, which is usually a netlist that describes the elements of the digital circuit and how they are connected. Graph representations for the netlist together with graph neural networks have been investigated for such models. However, the characteristics of netlists pose several challenges for existing graph learning frameworks, due to the large number of nodes and the importance of long-range interactions between nodes. To address these challenges, we represent the netlist as a directed hypergraph and propose a Directional Equivariant Hypergraph Neural Network (DE-HNN) for the effective learning of (directed) hypergraphs. Theoretically, we show that our DE-HNN can universally approximate any node or hyperedge based function that satisfies certain permutation equivariant and invariant properties natural for directed hypergraphs. We compare the proposed DE-HNN with several State-of-the-art (SOTA) machine learning models for (hyper)graphs and netlists, and show that the DE-HNN significantly outperforms them in predicting the outcome of optimized place-and-route tools directly from the input netlists. Our source code and the netlists data used are publicly available at https://github.com/YusuLab/chips.git △ Less

Submitted 16 April, 2024; v1 submitted 30 March, 2024; originally announced April 2024.

arXiv:2403.15547 [pdf, ps, other]

Approximation Algorithms for Network Design in Non-Uniform Fault Models

Authors: Chandra Chekuri, Rhea Jain

Abstract: The Survivable Network Design problem (SNDP) is a well-studied problem, motivated by the design of networks that are robust to faults under the assumption that any subset of edges up to a specific number can fail. We consider non-uniform fault models where the subset of edges that fail can be specified in different ways. Our primary interest is in the flexible graph connectivity model, in which th… ▽ More The Survivable Network Design problem (SNDP) is a well-studied problem, motivated by the design of networks that are robust to faults under the assumption that any subset of edges up to a specific number can fail. We consider non-uniform fault models where the subset of edges that fail can be specified in different ways. Our primary interest is in the flexible graph connectivity model, in which the edge set is partitioned into safe and unsafe edges. The goal is to design a network that has desired connectivity properties under the assumption that only unsafe edges up to a specific number can fail. We also discuss the bulk-robust model and the relative survivable network design model. While SNDP admits a 2-approximation, the approximability of problems in these more complex models is much less understood even in special cases. We make two contributions. Our first set of results are in the flexible graph connectivity model. Motivated by a conjecture that a constant factor approximation is feasible when the robustness parameters are fixed constants, we consider two important special cases, namely the single pair case, and the global connectivity case. For both these, we obtain constant factor approximations in several parameter ranges of interest. These are based on an augmentation framework and via decomposing the families of cuts that need to be covered into a small number of uncrossable families. Our second set of results are poly-logarithmic approximations for the bulk-robust model when the "width" of the given instance (the maximum number of edges that can fail in any particular scenario) is fixed. Via this, we derive corresponding approximations for the flexible graph connectivity model and the relative survivable network design model. The results are obtained via two algorithmic approaches and they have different tradeoffs in terms of the approximation ratio and generality. △ Less

Submitted 22 March, 2024; originally announced March 2024.

Comments: A preliminary version of this paper appeared in the Proc. of ICALP 2023 (10.4230/LIPIcs.ICALP.2023.36), which combines and extends results from two earlier versions: arXiv:2209.12273 (for the first set of results) and arXiv:2211.08324 (for the second set of results)

arXiv:2403.14416 [pdf, other]

Quantum Channel Simulation in Fidelity is no more difficult than State Splitting

Authors: Michael X. Cao, Rahul Jain, Marco Tomamichel

Abstract: Characterizing the minimal communication needed for the quantum channel simulation is a fundamental task in the quantum information theory. In this paper, we show that, in fidelity, the quantum channel simulation can be directly achieved via quantum state splitting without using a technique known as the de~Finetti reduction, and thus provide a pair of tighter one-shot bounds. Using the bounds, we… ▽ More Characterizing the minimal communication needed for the quantum channel simulation is a fundamental task in the quantum information theory. In this paper, we show that, in fidelity, the quantum channel simulation can be directly achieved via quantum state splitting without using a technique known as the de~Finetti reduction, and thus provide a pair of tighter one-shot bounds. Using the bounds, we also recover the quantum reverse Shannon theorem in a much simpler way. △ Less

Submitted 24 June, 2024; v1 submitted 21 March, 2024; originally announced March 2024.

arXiv:2403.13350 [pdf, ps, other]

Construction of Minimal Binary Linear Codes of dimension $n+3$

Authors: Wajid M. Shaikh, Rupali S. Jain, B. Surendranath Reddy, Bhagyashri S. Patil

Abstract: In this paper, we will give the generic construction of a binary linear code of dimension $n+3$ and derive the necessary and sufficient conditions for the constructed code to be minimal. Using generic construction, a new family of minimal binary linear code will be constructed from a special class of Boolean functions violating the Ashikhmin-Barg condition. We also obtain the weight distribution o… ▽ More In this paper, we will give the generic construction of a binary linear code of dimension $n+3$ and derive the necessary and sufficient conditions for the constructed code to be minimal. Using generic construction, a new family of minimal binary linear code will be constructed from a special class of Boolean functions violating the Ashikhmin-Barg condition. We also obtain the weight distribution of the constructed minimal binary linear code. △ Less

Submitted 20 March, 2024; originally announced March 2024.

MSC Class: 94B05; 94C10; 94A60

arXiv:2403.13106 [pdf, other]

Knowing Your Nonlinearities: Shapley Interactions Reveal the Underlying Structure of Data

Authors: Divyansh Singhvi, Andrej Erkelens, Raghav Jain, Diganta Misra, Naomi Saphra

Abstract: Measuring nonlinear feature interaction is an established approach to understanding complex patterns of attribution in many models. In this paper, we use Shapley Taylor interaction indices (STII) to analyze the impact of underlying data structure on model representations in a variety of modalities, tasks, and architectures. Considering linguistic structure in masked and auto-regressive language mo… ▽ More Measuring nonlinear feature interaction is an established approach to understanding complex patterns of attribution in many models. In this paper, we use Shapley Taylor interaction indices (STII) to analyze the impact of underlying data structure on model representations in a variety of modalities, tasks, and architectures. Considering linguistic structure in masked and auto-regressive language models (MLMs and ALMs), we find that STII increases within idiomatic expressions and that MLMs scale STII with syntactic distance, relying more on syntax in their nonlinear structure than ALMs do. Our speech model findings reflect the phonetic principal that the openness of the oral cavity determines how much a phoneme varies based on its context. Finally, we study image classifiers and illustrate that feature interactions intuitively reflect object boundaries. Our wide range of results illustrates the benefits of interdisciplinary work and domain expertise in interpretability research. △ Less

Submitted 19 March, 2024; originally announced March 2024.

arXiv:2403.05530 [pdf, other]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Authors: Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, Soroosh Mariooryad, Yifan Ding, Xinyang Geng, Fred Alcober, Roy Frostig, Mark Omernick, Lexi Walker, Cosmin Paduraru, Christina Sorokin, Andrea Tacchetti, Colin Gaffney, Samira Daruki, Olcan Sercinoglu, Zach Gleicher, Juliette Love , et al. (1092 additional authors not shown)

Abstract: In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February… ▽ More In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content. △ Less

Submitted 14 June, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

arXiv:2403.01317 [pdf, other]

Less is More: Hop-Wise Graph Attention for Scalable and Generalizable Learning on Circuits

Authors: Chenhui Deng, Zichao Yue, Cunxi Yu, Gokce Sarar, Ryan Carey, Rajeev Jain, Zhiru Zhang

Abstract: While graph neural networks (GNNs) have gained popularity for learning circuit representations in various electronic design automation (EDA) tasks, they face challenges in scalability when applied to large graphs and exhibit limited generalizability to new designs. These limitations make them less practical for addressing large-scale, complex circuit problems. In this work we propose HOGA, a novel… ▽ More While graph neural networks (GNNs) have gained popularity for learning circuit representations in various electronic design automation (EDA) tasks, they face challenges in scalability when applied to large graphs and exhibit limited generalizability to new designs. These limitations make them less practical for addressing large-scale, complex circuit problems. In this work we propose HOGA, a novel attention-based model for learning circuit representations in a scalable and generalizable manner. HOGA first computes hop-wise features per node prior to model training. Subsequently, the hop-wise features are solely used to produce node representations through a gated self-attention module, which adaptively learns important features among different hops without involving the graph topology. As a result, HOGA is adaptive to various structures across different circuits and can be efficiently trained in a distributed manner. To demonstrate the efficacy of HOGA, we consider two representative EDA tasks: quality of results (QoR) prediction and functional reasoning. Our experimental results indicate that (1) HOGA reduces estimation error over conventional GNNs by 46.76% for predicting QoR after logic synthesis; (2) HOGA improves 10.0% reasoning accuracy over GNNs for identifying functional blocks on unseen gate-level netlists after complex technology map**; (3) The training time for HOGA almost linearly decreases with an increase in computing resources. △ Less

Submitted 10 April, 2024; v1 submitted 2 March, 2024; originally announced March 2024.

Comments: Published as a conference paper at Design Automation Conference (DAC) 2024

arXiv:2403.00781 [pdf, other]

ChatDiet: Empowering Personalized Nutrition-Oriented Food Recommender Chatbots through an LLM-Augmented Framework

Authors: Zhongqi Yang, Elahe Khatibi, Nitish Nagesh, Mahyar Abbasian, Iman Azimi, Ramesh Jain, Amir M. Rahmani

Abstract: The profound impact of food on health necessitates advanced nutrition-oriented food recommendation services. Conventional methods often lack the crucial elements of personalization, explainability, and interactivity. While Large Language Models (LLMs) bring interpretability and explainability, their standalone use falls short of achieving true personalization. In this paper, we introduce ChatDiet,… ▽ More The profound impact of food on health necessitates advanced nutrition-oriented food recommendation services. Conventional methods often lack the crucial elements of personalization, explainability, and interactivity. While Large Language Models (LLMs) bring interpretability and explainability, their standalone use falls short of achieving true personalization. In this paper, we introduce ChatDiet, a novel LLM-powered framework designed specifically for personalized nutrition-oriented food recommendation chatbots. ChatDiet integrates personal and population models, complemented by an orchestrator, to seamlessly retrieve and process pertinent information. The personal model leverages causal discovery and inference techniques to assess personalized nutritional effects for a specific user, whereas the population model provides generalized information on food nutritional content. The orchestrator retrieves, synergizes and delivers the output of both models to the LLM, providing tailored food recommendations designed to support targeted health outcomes. The result is a dynamic delivery of personalized and explainable food recommendations, tailored to individual user preferences. Our evaluation of ChatDiet includes a compelling case study, where we establish a causal personal model to estimate individual nutrition effects. Our assessments, including a food recommendation test showcasing a 92\% effectiveness rate, coupled with illustrative dialogue examples, underscore ChatDiet's strengths in explainability, personalization, and interactivity. △ Less

Submitted 16 March, 2024; v1 submitted 18 February, 2024; originally announced March 2024.

Comments: Accepted by The IEEE/ACM international conference on Connected Health: Applications, Systems and Engineering Technologies (CHASE) 2024

arXiv:2403.00141 [pdf, other]

EROS: Entity-Driven Controlled Policy Document Summarization

Authors: Joykirat Singh, Sehban Fazili, Rohan Jain, Md Shad Akhtar

Abstract: Privacy policy documents have a crucial role in educating individuals about the collection, usage, and protection of users' personal data by organizations. However, they are notorious for their lengthy, complex, and convoluted language especially involving privacy-related entities. Hence, they pose a significant challenge to users who attempt to comprehend organization's data usage policy. In this… ▽ More Privacy policy documents have a crucial role in educating individuals about the collection, usage, and protection of users' personal data by organizations. However, they are notorious for their lengthy, complex, and convoluted language especially involving privacy-related entities. Hence, they pose a significant challenge to users who attempt to comprehend organization's data usage policy. In this paper, we propose to enhance the interpretability and readability of policy documents by using controlled abstractive summarization -- we enforce the generated summaries to include critical privacy-related entities (e.g., data and medium) and organization's rationale (e.g.,target and reason) in collecting those entities. To achieve this, we develop PD-Sum, a policy-document summarization dataset with marked privacy-related entity labels. Our proposed model, EROS, identifies critical entities through a span-based entity extraction model and employs them to control the information content of the summaries using proximal policy optimization (PPO). Comparison shows encouraging improvement over various baselines. Furthermore, we furnish qualitative and human evaluations to establish the efficacy of EROS. △ Less

Submitted 29 February, 2024; originally announced March 2024.

Comments: Accepted in LREC-COLING 2024

arXiv:2402.11477 [pdf, other]

Studying Differential Mental Health Expressions in India

Authors: Khushi Shelat, Sunny Rai, Devansh R Jain, Kishen Sivabalan, Young Min Cho, Maitreyi Redkar, Samindara Sawant, Sharath Chandra Guntuku

Abstract: Psychosocial stressors and the symptomatology of mental disorders vary across cultures. However, current understandings of mental health expressions on social media are predominantly derived from studies in WEIRD (Western, Educated, Industrialized, Rich, and Democratic) contexts. In this paper, we analyze mental health posts on Reddit made by individuals in India, to identify variations in online… ▽ More Psychosocial stressors and the symptomatology of mental disorders vary across cultures. However, current understandings of mental health expressions on social media are predominantly derived from studies in WEIRD (Western, Educated, Industrialized, Rich, and Democratic) contexts. In this paper, we analyze mental health posts on Reddit made by individuals in India, to identify variations in online depression language specific to the Indian context compared to users from the Rest of the World (ROW). Unlike in Western samples, we observe that mental health discussions in India additionally express sadness, use negation, are present-focused, and are related to work and achievement. Illness is uniquely correlated to India, indicating the association between depression and physical health in Indian patients. Two clinical psychologists validated the findings from social media posts and found 95% of the top 20 topics associated with mental health discussions as prevalent in Indians. Significant linguistic variations in online mental health-related language in India compared to ROW, emphasize the importance of develo** precision-targeted interventions that are culturally appropriate. △ Less

Submitted 16 June, 2024; v1 submitted 18 February, 2024; originally announced February 2024.

arXiv:2402.10400 [pdf, other]

Chain of Logic: Rule-Based Reasoning with Large Language Models

Authors: Sergio Servantez, Joe Barrow, Kristian Hammond, Rajiv Jain

Abstract: Rule-based reasoning, a fundamental type of legal reasoning, enables us to draw conclusions by accurately applying a rule to a set of facts. We explore causal language models as rule-based reasoners, specifically with respect to compositional rules - rules consisting of multiple elements which form a complex logical expression. Reasoning about compositional rules is challenging because it requires… ▽ More Rule-based reasoning, a fundamental type of legal reasoning, enables us to draw conclusions by accurately applying a rule to a set of facts. We explore causal language models as rule-based reasoners, specifically with respect to compositional rules - rules consisting of multiple elements which form a complex logical expression. Reasoning about compositional rules is challenging because it requires multiple reasoning steps, and attending to the logical relationships between elements. We introduce a new prompting method, Chain of Logic, which elicits rule-based reasoning through decomposition (solving elements as independent threads of logic), and recomposition (recombining these sub-answers to resolve the underlying logical expression). This method was inspired by the IRAC (Issue, Rule, Application, Conclusion) framework, a sequential reasoning approach used by lawyers. We evaluate chain of logic across eight rule-based reasoning tasks involving three distinct compositional rules from the LegalBench benchmark and demonstrate it consistently outperforms other prompting methods, including chain of thought and self-ask, using open-source and commercial language models. △ Less

Submitted 23 February, 2024; v1 submitted 15 February, 2024; originally announced February 2024.

arXiv:2402.10153 [pdf, other]

Knowledge-Infused LLM-Powered Conversational Health Agent: A Case Study for Diabetes Patients

Authors: Mahyar Abbasian, Zhongqi Yang, Elahe Khatibi, Pengfei Zhang, Nitish Nagesh, Iman Azimi, Ramesh Jain, Amir M. Rahmani

Abstract: Effective diabetes management is crucial for maintaining health in diabetic patients. Large Language Models (LLMs) have opened new avenues for diabetes management, facilitating their efficacy. However, current LLM-based approaches are limited by their dependence on general sources and lack of integration with domain-specific knowledge, leading to inaccurate responses. In this paper, we propose a k… ▽ More Effective diabetes management is crucial for maintaining health in diabetic patients. Large Language Models (LLMs) have opened new avenues for diabetes management, facilitating their efficacy. However, current LLM-based approaches are limited by their dependence on general sources and lack of integration with domain-specific knowledge, leading to inaccurate responses. In this paper, we propose a knowledge-infused LLM-powered conversational health agent (CHA) for diabetic patients. We customize and leverage the open-source openCHA framework, enhancing our CHA with external knowledge and analytical capabilities. This integration involves two key components: 1) incorporating the American Diabetes Association dietary guidelines and the Nutritionix information and 2) deploying analytical tools that enable nutritional intake calculation and comparison with the guidelines. We compare the proposed CHA with GPT4. Our evaluation includes 100 diabetes-related questions on daily meal choices and assessing the potential risks associated with the suggested diet. Our findings show that the proposed agent demonstrates superior performance in generating responses to manage essential nutrients. △ Less

Submitted 28 February, 2024; v1 submitted 15 February, 2024; originally announced February 2024.

Comments: 4 pages, 3 figures, and 2 tables, conference paper

arXiv:2402.07688 [pdf, other]

CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge

Authors: Norbert Tihanyi, Mohamed Amine Ferrag, Ridhi Jain, Tamas Bisztray, Merouane Debbah

Abstract: Large Language Models (LLMs) are increasingly used across various domains, from software development to cyber threat intelligence. Understanding all the different fields of cybersecurity, which includes topics such as cryptography, reverse engineering, and risk assessment, poses a challenge even for human experts. To accurately test the general knowledge of LLMs in cybersecurity, the research comm… ▽ More Large Language Models (LLMs) are increasingly used across various domains, from software development to cyber threat intelligence. Understanding all the different fields of cybersecurity, which includes topics such as cryptography, reverse engineering, and risk assessment, poses a challenge even for human experts. To accurately test the general knowledge of LLMs in cybersecurity, the research community needs a diverse, accurate, and up-to-date dataset. To address this gap, we present CyberMetric-80, CyberMetric-500, CyberMetric-2000, and CyberMetric-10000, which are multiple-choice Q&A benchmark datasets comprising 80, 500, 2000, and 10,000 questions respectively. By utilizing GPT-3.5 and Retrieval-Augmented Generation (RAG), we collected documents, including NIST standards, research papers, publicly accessible books, RFCs, and other publications in the cybersecurity domain, to generate questions, each with four possible answers. The results underwent several rounds of error checking and refinement. Human experts invested over 200 hours validating the questions and solutions to ensure their accuracy and relevance, and to filter out any questions unrelated to cybersecurity. We have evaluated and compared 25 state-of-the-art LLM models on the CyberMetric datasets. In addition to our primary goal of evaluating LLMs, we involved 30 human participants to solve CyberMetric-80 in a closed-book scenario. The results can serve as a reference for comparing the general cybersecurity knowledge of humans and LLMs. The findings revealed that GPT-4o, GPT-4-turbo, Mixtral-8x7B-Instruct, Falcon-180B-Chat, and GEMINI-pro 1.0 were the best-performing LLMs. Additionally, the top LLMs were more accurate than humans on CyberMetric-80, although highly experienced human experts still outperformed small models such as Llama-3-8B, Phi-2 or Gemma-7b. △ Less

Submitted 3 June, 2024; v1 submitted 12 February, 2024; originally announced February 2024.

arXiv:2402.07477 [pdf, other]

Food Recommendation as Language Processing (F-RLP): A Personalized and Contextual Paradigm

Authors: Ali Rostami, Ramesh Jain, Amir M. Rahmani

Abstract: State-of-the-art rule-based and classification-based food recommendation systems face significant challenges in becoming practical and useful. This difficulty arises primarily because most machine learning models struggle with problems characterized by an almost infinite number of classes and a limited number of samples within an unbalanced dataset. Conversely, the emergence of Large Language Mode… ▽ More State-of-the-art rule-based and classification-based food recommendation systems face significant challenges in becoming practical and useful. This difficulty arises primarily because most machine learning models struggle with problems characterized by an almost infinite number of classes and a limited number of samples within an unbalanced dataset. Conversely, the emergence of Large Language Models (LLMs) as recommendation engines offers a promising avenue. However, a general-purpose Recommendation as Language Processing (RLP) approach lacks the critical components necessary for effective food recommendations. To address this gap, we introduce Food Recommendation as Language Processing (F-RLP), a novel framework that offers a food-specific, tailored infrastructure. F-RLP leverages the capabilities of LLMs to maximize their potential, thereby paving the way for more accurate, personalized food recommendations. △ Less

Submitted 14 February, 2024; v1 submitted 12 February, 2024; originally announced February 2024.

arXiv:2402.04754 [pdf, other]

Towards Aligned Layout Generation via Diffusion Model with Aesthetic Constraints

Authors: Jian Chen, Ruiyi Zhang, Yufan Zhou, Rajiv Jain, Zhiqiang Xu, Ryan Rossi, Changyou Chen

Abstract: Controllable layout generation refers to the process of creating a plausible visual arrangement of elements within a graphic design (e.g., document and web designs) with constraints representing design intentions. Although recent diffusion-based models have achieved state-of-the-art FID scores, they tend to exhibit more pronounced misalignment compared to earlier transformer-based models. In this… ▽ More Controllable layout generation refers to the process of creating a plausible visual arrangement of elements within a graphic design (e.g., document and web designs) with constraints representing design intentions. Although recent diffusion-based models have achieved state-of-the-art FID scores, they tend to exhibit more pronounced misalignment compared to earlier transformer-based models. In this work, we propose the $\textbf{LA}$yout $\textbf{C}$onstraint diffusion mod$\textbf{E}$l (LACE), a unified model to handle a broad range of layout generation tasks, such as arranging elements with specified attributes and refining or completing a coarse layout design. The model is based on continuous diffusion models. Compared with existing methods that use discrete diffusion models, continuous state-space design can enable the incorporation of differentiable aesthetic constraint functions in training. For conditional generation, we introduce conditions via masked input. Extensive experiment results show that LACE produces high-quality layouts and outperforms existing state-of-the-art baselines. △ Less

Submitted 15 May, 2024; v1 submitted 7 February, 2024; originally announced February 2024.

Comments: Accepted by ICLR 2024

arXiv:2401.09899 [pdf, other]

Meme-ingful Analysis: Enhanced Understanding of Cyberbullying in Memes Through Multimodal Explanations

Authors: Prince Jha, Krishanu Maity, Raghav Jain, Apoorv Verma, Sriparna Saha, Pushpak Bhattacharyya

Abstract: Internet memes have gained significant influence in communicating political, psychological, and sociocultural ideas. While memes are often humorous, there has been a rise in the use of memes for trolling and cyberbullying. Although a wide variety of effective deep learning-based models have been developed for detecting offensive multimodal memes, only a few works have been done on explainability a… ▽ More Internet memes have gained significant influence in communicating political, psychological, and sociocultural ideas. While memes are often humorous, there has been a rise in the use of memes for trolling and cyberbullying. Although a wide variety of effective deep learning-based models have been developed for detecting offensive multimodal memes, only a few works have been done on explainability aspect. Recent laws like "right to explanations" of General Data Protection Regulation, have spurred research in develo** interpretable models rather than only focusing on performance. Motivated by this, we introduce {\em MultiBully-Ex}, the first benchmark dataset for multimodal explanation from code-mixed cyberbullying memes. Here, both visual and textual modalities are highlighted to explain why a given meme is cyberbullying. A Contrastive Language-Image Pretraining (CLIP) projection-based multimodal shared-private multitask approach has been proposed for visual and textual explanation of a meme. Experimental results demonstrate that training with multimodal explanations improves performance in generating textual justifications and more accurately identifying the visual evidence supporting a decision with reliable performance improvements. △ Less

Submitted 18 January, 2024; originally announced January 2024.

Comments: EACL2024

arXiv:2401.09023 [pdf, other]

Explain Thyself Bully: Sentiment Aided Cyberbullying Detection with Explanation

Authors: Krishanu Maity, Prince Jha, Raghav Jain, Sriparna Saha, Pushpak Bhattacharyya

Abstract: Cyberbullying has become a big issue with the popularity of different social media networks and online communication apps. While plenty of research is going on to develop better models for cyberbullying detection in monolingual language, there is very little research on the code-mixed languages and explainability aspect of cyberbullying. Recent laws like "right to explanations" of General Data Pro… ▽ More Cyberbullying has become a big issue with the popularity of different social media networks and online communication apps. While plenty of research is going on to develop better models for cyberbullying detection in monolingual language, there is very little research on the code-mixed languages and explainability aspect of cyberbullying. Recent laws like "right to explanations" of General Data Protection Regulation, have spurred research in develo** interpretable models rather than focusing on performance. Motivated by this we develop the first interpretable multi-task model called {\em mExCB} for automatic cyberbullying detection from code-mixed languages which can simultaneously solve several tasks, cyberbullying detection, explanation/rationale identification, target group detection and sentiment analysis. We have introduced {\em BullyExplain}, the first benchmark dataset for explainable cyberbullying detection in code-mixed language. Each post in {\em BullyExplain} dataset is annotated with four labels, i.e., {\em bully label, sentiment label, target and rationales (explainability)}, i.e., which phrases are being responsible for annotating the post as a bully. The proposed multitask framework (mExCB) based on CNN and GRU with word and sub-sentence (SS) level attention is able to outperform several baselines and state of the art models when applied on {\em BullyExplain} dataset. △ Less

Submitted 17 January, 2024; originally announced January 2024.

Comments: ICDAR 2023

arXiv:2401.06413 [pdf]

Why Doesn't Microsoft Let Me Sleep? How Automaticity of Windows Updates Impacts User Autonomy

Authors: Sanju Ahuja, Ridhi Jain, Jyoti Kumar

Abstract: 'Automating the user away' has been designated as a dark pattern in literature for performing tasks without user consent or confirmation. However, limited studies have been reported on how users experience the sense of autonomy when digital systems fully or partially bypass consent. More research is required to understand what makes automaticity a threat to autonomy. To address this gap, a qualita… ▽ More 'Automating the user away' has been designated as a dark pattern in literature for performing tasks without user consent or confirmation. However, limited studies have been reported on how users experience the sense of autonomy when digital systems fully or partially bypass consent. More research is required to understand what makes automaticity a threat to autonomy. To address this gap, a qualitative interview study with 10 users was conducted to investigate the user experience of Microsoft Windows updates. It was found that ten design features of Windows updates impact the autonomy experience. For each design feature, the contextual factors which influence its impact on autonomy were also noted. The findings of this paper can help designers understand the ethical concerns posed by automaticity in design and identify measures to mitigate these concerns. △ Less

Submitted 12 January, 2024; originally announced January 2024.

Comments: 6 pages, 2 figures

arXiv:2401.04120 [pdf, other]

Generation Z's Ability to Discriminate Between AI-generated and Human-Authored Text on Discord

Authors: Dhruv Ramu, Rishab Jain, Aditya Jain

Abstract: The growing popularity of generative artificial intelligence (AI) chatbots such as ChatGPT is having transformative effects on social media. As the prevalence of AI-generated content grows, concerns have been raised regarding privacy and misinformation online. Among social media platforms, Discord enables AI integrations -- making their primarily "Generation Z" userbase particularly exposed to AI-… ▽ More The growing popularity of generative artificial intelligence (AI) chatbots such as ChatGPT is having transformative effects on social media. As the prevalence of AI-generated content grows, concerns have been raised regarding privacy and misinformation online. Among social media platforms, Discord enables AI integrations -- making their primarily "Generation Z" userbase particularly exposed to AI-generated content. We surveyed Generation Z aged individuals (n = 335) to evaluate their proficiency in discriminating between AI-generated and human-authored text on Discord. The investigation employed one-shot prompting of ChatGPT, disguised as a text message received on the Discord.com platform. We explore the influence of demographic factors on ability, as well as participants' familiarity with Discord and artificial intelligence technologies. We find that Generation Z individuals are unable to discern between AI and human-authored text (p = 0.011), and that those with lower self-reported familiarity with Discord demonstrated an improved ability in identifying human-authored compared to those with self-reported experience with AI (p << 0.0001). Our results suggest that there is a nuanced relationship between AI technology and popular modes of communication for Generation Z, contributing valuable insights into human-computer interactions, digital communication, and artificial intelligence literacy. △ Less

Submitted 31 December, 2023; originally announced January 2024.

arXiv:2401.01596 [pdf, other]

MedSumm: A Multimodal Approach to Summarizing Code-Mixed Hindi-English Clinical Queries

Authors: Akash Ghosh, Arkadeep Acharya, Prince Jha, Aniket Gaudgaul, Rajdeep Majumdar, Sriparna Saha, Aman Chadha, Raghav Jain, Setu Sinha, Shivani Agarwal

Abstract: In the healthcare domain, summarizing medical questions posed by patients is critical for improving doctor-patient interactions and medical decision-making. Although medical data has grown in complexity and quantity, the current body of research in this domain has primarily concentrated on text-based methods, overlooking the integration of visual cues. Also prior works in the area of medical quest… ▽ More In the healthcare domain, summarizing medical questions posed by patients is critical for improving doctor-patient interactions and medical decision-making. Although medical data has grown in complexity and quantity, the current body of research in this domain has primarily concentrated on text-based methods, overlooking the integration of visual cues. Also prior works in the area of medical question summarisation have been limited to the English language. This work introduces the task of multimodal medical question summarization for codemixed input in a low-resource setting. To address this gap, we introduce the Multimodal Medical Codemixed Question Summarization MMCQS dataset, which combines Hindi-English codemixed medical queries with visual aids. This integration enriches the representation of a patient's medical condition, providing a more comprehensive perspective. We also propose a framework named MedSumm that leverages the power of LLMs and VLMs for this task. By utilizing our MMCQS dataset, we demonstrate the value of integrating visual information from images to improve the creation of medically detailed summaries. This multimodal strategy not only improves healthcare decision-making but also promotes a deeper comprehension of patient queries, paving the way for future exploration in personalized and responsive medical care. Our dataset, code, and pre-trained models will be made publicly available. △ Less

Submitted 3 January, 2024; originally announced January 2024.

Comments: ECIR 2024

arXiv:2312.14300 [pdf, other]

doi 10.1116/5.0172819

Asynchronous Entanglement Routing for the Quantum Internet

Authors: Zebo Yang, Ali Ghubaish, Raj Jain, Hassan Shapourian, Alireza Shabani

Abstract: With the emergence of the Quantum Internet, the need for advanced quantum networking techniques has significantly risen. Various models of quantum repeaters have been presented, each delineating a unique strategy to ensure quantum communication over long distances. We focus on repeaters that employ entanglement generation and swap**. This revolves around establishing remote end-to-end entangleme… ▽ More With the emergence of the Quantum Internet, the need for advanced quantum networking techniques has significantly risen. Various models of quantum repeaters have been presented, each delineating a unique strategy to ensure quantum communication over long distances. We focus on repeaters that employ entanglement generation and swap**. This revolves around establishing remote end-to-end entanglement through repeaters, a concept we denote as the "quantum-native" repeaters (also called "first-generation" repeaters in some literature). The challenges in routing with quantum-native repeaters arise from probabilistic entanglement generation and restricted coherence time. Current approaches use synchronized time slots to search for entanglement-swap** paths, resulting in inefficiencies. Here, we propose a new set of asynchronous routing protocols for quantum networks by incorporating the idea of maintaining a dynamic topology in a distributed manner, which has been extensively studied in classical routing for lossy networks, such as using a destination-oriented directed acyclic graph (DODAG) or a spanning tree. The protocols update the entanglement-link topology asynchronously, identify optimal entanglement-swap** paths, and preserve unused direct-link entanglements. Our results indicate that asynchronous protocols achieve a larger upper bound with an appropriate setting and significantly higher entanglement rate than existing synchronous approaches, and the rate increases with coherence time, suggesting that it will have a much more profound impact on quantum networks as technology advances. △ Less

Submitted 21 December, 2023; originally announced December 2023.

Comments: This article has been accepted for publication in the AVS Quantum Science journal

Report number: 013801

Journal ref: AVS Quantum Sci. 6 (2024)

arXiv:2312.11805 [pdf, other]

Gemini: A Family of Highly Capable Multimodal Models

Authors: Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee , et al. (1325 additional authors not shown)

Abstract: This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultr… ▽ More This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI. △ Less

Submitted 17 June, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

arXiv:2312.11541 [pdf, other]

CLIPSyntel: CLIP and LLM Synergy for Multimodal Question Summarization in Healthcare

Authors: Akash Ghosh, Arkadeep Acharya, Raghav Jain, Sriparna Saha, Aman Chadha, Setu Sinha

Abstract: In the era of modern healthcare, swiftly generating medical question summaries is crucial for informed and timely patient care. Despite the increasing complexity and volume of medical data, existing studies have focused solely on text-based summarization, neglecting the integration of visual information. Recognizing the untapped potential of combining textual queries with visual representations of… ▽ More In the era of modern healthcare, swiftly generating medical question summaries is crucial for informed and timely patient care. Despite the increasing complexity and volume of medical data, existing studies have focused solely on text-based summarization, neglecting the integration of visual information. Recognizing the untapped potential of combining textual queries with visual representations of medical conditions, we introduce the Multimodal Medical Question Summarization (MMQS) Dataset. This dataset, a major contribution to our work, pairs medical queries with visual aids, facilitating a richer and more nuanced understanding of patient needs. We also propose a framework, utilizing the power of Contrastive Language Image Pretraining(CLIP) and Large Language Models(LLMs), consisting of four modules that identify medical disorders, generate relevant context, filter medical concepts, and craft visually aware summaries. Our comprehensive framework harnesses the power of CLIP, a multimodal foundation model, and various general-purpose LLMs, comprising four main modules: the medical disorder identification module, the relevant context generation module, the context filtration module for distilling relevant medical concepts and knowledge, and finally, a general-purpose LLM to generate visually aware medical question summaries. Leveraging our MMQS dataset, we showcase how visual cues from images enhance the generation of medically nuanced summaries. This multimodal approach not only enhances the decision-making process in healthcare but also fosters a more nuanced understanding of patient queries, laying the groundwork for future research in personalized and responsive medical care △ Less

Submitted 15 December, 2023; originally announced December 2023.

Comments: AAAI 2024

arXiv:2312.10057 [pdf, other]

Generative AI in Writing Research Papers: A New Type of Algorithmic Bias and Uncertainty in Scholarly Work

Authors: Rishab Jain, Aditya Jain

Abstract: The use of artificial intelligence (AI) in research across all disciplines is becoming ubiquitous. However, this ubiquity is largely driven by hyperspecific AI models developed during scientific studies for accomplishing a well-defined, data-dense task. These AI models introduce apparent, human-recognizable biases because they are trained with finite, specific data sets and parameters. However, th… ▽ More The use of artificial intelligence (AI) in research across all disciplines is becoming ubiquitous. However, this ubiquity is largely driven by hyperspecific AI models developed during scientific studies for accomplishing a well-defined, data-dense task. These AI models introduce apparent, human-recognizable biases because they are trained with finite, specific data sets and parameters. However, the efficacy of using large language models (LLMs) -- and LLM-powered generative AI tools, such as ChatGPT -- to assist the research process is currently indeterminate. These generative AI tools, trained on general and imperceptibly large datasets along with human feedback, present challenges in identifying and addressing biases. Furthermore, these models are susceptible to goal misgeneralization, hallucinations, and adversarial attacks such as red teaming prompts -- which can be unintentionally performed by human researchers, resulting in harmful outputs. These outputs are reinforced in research -- where an increasing number of individuals have begun to use generative AI to compose manuscripts. Efforts into AI interpretability lag behind development, and the implicit variations that occur when prompting and providing context to a chatbot introduce uncertainty and irreproducibility. We thereby find that incorporating generative AI in the process of writing research manuscripts introduces a new type of context-induced algorithmic bias and has unintended side effects that are largely detrimental to academia, knowledge production, and communicating research. △ Less

Submitted 3 December, 2023; originally announced December 2023.

Comments: 10 pages, 6 figures

ACM Class: I.2.m

arXiv:2311.14712 [pdf, ps, other]

Multiagent Simulators for Social Networks

Authors: Aditya Surve, Archit Rathod, Mokshit Surana, Gautam Malpani, Aneesh Shamraj, Sainath Reddy Sankepally, Raghav Jain, Swapneel S Mehta

Abstract: Multiagent social network simulations are an avenue that can bridge the communication gap between the public and private platforms in order to develop solutions to a complex array of issues relating to online safety. While there are significant challenges relating to the scale of multiagent simulations, efficient learning from observational and interventional data to accurately model micro and mac… ▽ More Multiagent social network simulations are an avenue that can bridge the communication gap between the public and private platforms in order to develop solutions to a complex array of issues relating to online safety. While there are significant challenges relating to the scale of multiagent simulations, efficient learning from observational and interventional data to accurately model micro and macro-level emergent effects, there are equally promising opportunities not least with the advent of large language models that provide an expressive approximation of user behavior. In this position paper, we review prior art relating to social network simulation, highlighting challenges and opportunities for future work exploring multiagent security using agent-based models of social networks △ Less

Submitted 16 November, 2023; originally announced November 2023.

arXiv:2311.11030 [pdf]

Data Center Audio/Video Intelligence on Device (DAVID) -- An Edge-AI Platform for Smart-Toys

Authors: Gabriel Cosache, Francisco Salgado, Cosmin Rotariu, George Sterpu, Rishabh Jain, Peter Corcoran

Abstract: An overview is given of the DAVID Smart-Toy platform, one of the first Edge AI platform designs to incorporate advanced low-power data processing by neural inference models co-located with the relevant image or audio sensors. There is also on-board capability for in-device text-to-speech generation. Two alternative embodiments are presented: a smart Teddy-bear, and a roving dog-like robot. The pla… ▽ More An overview is given of the DAVID Smart-Toy platform, one of the first Edge AI platform designs to incorporate advanced low-power data processing by neural inference models co-located with the relevant image or audio sensors. There is also on-board capability for in-device text-to-speech generation. Two alternative embodiments are presented: a smart Teddy-bear, and a roving dog-like robot. The platform offers a speech-driven user interface and can observe and interpret user actions and facial expressions via its computer vision sensor node. A particular benefit of this design is that no personally identifiable information passes beyond the neural inference nodes thus providing inbuilt compliance with data protection regulations. △ Less

Submitted 18 November, 2023; originally announced November 2023.

Comments: The 12th IEEE Conference on Speech Technology and Human Dialogue (SpeD 2023) URL: https://sped.pub.ro/

arXiv:2311.10731 [pdf]

Gender-Based Comparative Study of Type 2 Diabetes Risk Factors in Kolkata, India: A Machine Learning Approach

Authors: Rahul Jain, Anoushka Saha, Gourav Daga, Durba Bhattacharya, Madhura Das Gupta, Sourav Chowdhury, Suparna Roychowdhury

Abstract: Type 2 diabetes mellitus represents a prevalent and widespread global health concern, necessitating a comprehensive assessment of its risk factors. This study aimed towards learning whether there is any differential impact of age, Lifestyle, BMI and Waist to height ratio on the risk of Type 2 diabetes mellitus in males and females in Kolkata, West Bengal, India based on a sample observed from the… ▽ More Type 2 diabetes mellitus represents a prevalent and widespread global health concern, necessitating a comprehensive assessment of its risk factors. This study aimed towards learning whether there is any differential impact of age, Lifestyle, BMI and Waist to height ratio on the risk of Type 2 diabetes mellitus in males and females in Kolkata, West Bengal, India based on a sample observed from the out-patient consultation department of Belle Vue Clinic in Kolkata. Various machine learning models like Logistic Regression, Random Forest, and Support Vector Classifier, were used to predict the risk of diabetes, and performance was compared based on different predictors. Our findings indicate a significant age-related increase in risk of diabetes for both males and females. Although exercising and BMI was found to have significant impact on the risk of Type 2 diabetes in males, in females both turned out to be statistically insignificant. For both males and females, predictive models based on WhtR demonstrated superior performance in risk assessment compared to those based on BMI. This study sheds light on the gender-specific differences in the risk factors for Type 2 diabetes, offering valuable insights that can be used towards more targeted healthcare interventions and public health strategies. △ Less

Submitted 14 October, 2023; originally announced November 2023.

Comments: 10 pages, 7 tables,3 figures, submitted to a conference

arXiv:2311.06307 [pdf]

Synthetic Speaking Children -- Why We Need Them and How to Make Them

Authors: Muhammad Ali Farooq, Dan Bigioi, Rishabh Jain, Wang Yao, Mariam Yiwere, Peter Corcoran

Abstract: Contemporary Human Computer Interaction (HCI) research relies primarily on neural network models for machine vision and speech understanding of a system user. Such models require extensively annotated training datasets for optimal performance and when building interfaces for users from a vulnerable population such as young children, GDPR introduces significant complexities in data collection, mana… ▽ More Contemporary Human Computer Interaction (HCI) research relies primarily on neural network models for machine vision and speech understanding of a system user. Such models require extensively annotated training datasets for optimal performance and when building interfaces for users from a vulnerable population such as young children, GDPR introduces significant complexities in data collection, management, and processing. Motivated by the training needs of an Edge AI smart toy platform this research explores the latest advances in generative neural technologies and provides a working proof of concept of a controllable data generation pipeline for speech driven facial training data at scale. In this context, we demonstrate how StyleGAN2 can be finetuned to create a gender balanced dataset of children's faces. This dataset includes a variety of controllable factors such as facial expressions, age variations, facial poses, and even speech-driven animations with realistic lip synchronization. By combining generative text to speech models for child voice synthesis and a 3D landmark based talking heads pipeline, we can generate highly realistic, entirely synthetic, talking child video clips. These video clips can provide valuable, and controllable, synthetic training data for neural network models, bridging the gap when real data is scarce or restricted due to privacy regulations. △ Less

Submitted 8 November, 2023; originally announced November 2023.

Comments: Presented at SpeD 23

arXiv:2311.04936 [pdf]

A comparative analysis between Conformer-Transducer, Whisper, and wav2vec2 for improving the child speech recognition

Authors: Andrei Barcovschi, Rishabh Jain, Peter Corcoran

Abstract: Automatic Speech Recognition (ASR) systems have progressed significantly in their performance on adult speech data; however, transcribing child speech remains challenging due to the acoustic differences in the characteristics of child and adult voices. This work aims to explore the potential of adapting state-of-the-art Conformer-transducer models to child speech to improve child speech recognitio… ▽ More Automatic Speech Recognition (ASR) systems have progressed significantly in their performance on adult speech data; however, transcribing child speech remains challenging due to the acoustic differences in the characteristics of child and adult voices. This work aims to explore the potential of adapting state-of-the-art Conformer-transducer models to child speech to improve child speech recognition performance. Furthermore, the results are compared with those of self-supervised wav2vec2 models and semi-supervised multi-domain Whisper models that were previously finetuned on the same data. We demonstrate that finetuning Conformer-transducer models on child speech yields significant improvements in ASR performance on child speech, compared to the non-finetuned models. We also show Whisper and wav2vec2 adaptation on different child speech datasets. Our detailed comparative analysis shows that wav2vec2 provides the most consistent performance improvements among the three methods studied. △ Less

Submitted 7 November, 2023; originally announced November 2023.

Comments: Presented at SpeD 23

arXiv:2311.04313 [pdf]

Improved Child Text-to-Speech Synthesis through Fastpitch-based Transfer Learning

Authors: Rishabh Jain, Peter Corcoran

Abstract: Speech synthesis technology has witnessed significant advancements in recent years, enabling the creation of natural and expressive synthetic speech. One area of particular interest is the generation of synthetic child speech, which presents unique challenges due to children's distinct vocal characteristics and developmental stages. This paper presents a novel approach that leverages the Fastpitch… ▽ More Speech synthesis technology has witnessed significant advancements in recent years, enabling the creation of natural and expressive synthetic speech. One area of particular interest is the generation of synthetic child speech, which presents unique challenges due to children's distinct vocal characteristics and developmental stages. This paper presents a novel approach that leverages the Fastpitch text-to-speech (TTS) model for generating high-quality synthetic child speech. This study uses the transfer learning training pipeline. The approach involved finetuning a multi-speaker TTS model to work with child speech. We use the cleaned version of the publicly available MyST dataset (55 hours) for our finetuning experiments. We also release a prototype dataset of synthetic speech samples generated from this research together with model code to support further research. By using a pretrained MOSNet, we conducted an objective assessment that showed a significant correlation between real and synthetic child voices. Additionally, to validate the intelligibility of the generated speech, we employed an automatic speech recognition (ASR) model to compare the word error rates (WER) of real and synthetic child voices. The speaker similarity between the real and generated speech is also measured using a pretrained speaker encoder. △ Less

Submitted 7 November, 2023; originally announced November 2023.

Comments: Presented in SpeD 23

arXiv:2311.02400 [pdf, other]

From Plate to Production: Artificial Intelligence in Modern Consumer-Driven Food Systems

Authors: Weiqing Min, Pengfei Zhou, Leyi Xu, Tao Liu, Tianhao Li, Mingyu Huang, Ying **, Yifan Yi, Min Wen, Shuqiang Jiang, Ramesh Jain

Abstract: Global food systems confront the urgent challenge of supplying sustainable, nutritious diets in the face of escalating demands. The advent of Artificial Intelligence (AI) is bringing in a personal choice revolution, wherein AI-driven individual decisions transform food systems from dinner tables, to the farms, and back to our plates. In this context, AI algorithms refine personal dietary choices,… ▽ More Global food systems confront the urgent challenge of supplying sustainable, nutritious diets in the face of escalating demands. The advent of Artificial Intelligence (AI) is bringing in a personal choice revolution, wherein AI-driven individual decisions transform food systems from dinner tables, to the farms, and back to our plates. In this context, AI algorithms refine personal dietary choices, subsequently sha** agricultural outputs, and promoting an optimized feedback loop from consumption to cultivation. Initially, we delve into AI tools and techniques spanning the food supply chain, and subsequently assess how AI subfields$\unicode{x2013}$encompassing machine learning, computer vision, and speech recognition$\unicode{x2013}$are harnessed within the AI-enabled Food System (AIFS) framework, which increasingly leverages Internet of Things, multimodal sensors and real-time data exchange. We spotlight the AIFS framework, emphasizing its fusion of AI with technologies such as digitalization, big data analytics, biotechnology, and IoT extensively used in modern food systems in every component. This paradigm shifts the conventional "farm to fork" narrative to a cyclical "consumer-driven farm to fork" model for better achieving sustainable, nutritious diets. This paper explores AI's promise and the intrinsic challenges it poses within the food domain. By championing stringent AI governance, uniform data architectures, and cross-disciplinary partnerships, we argue that AI, when synergized with consumer-centric strategies, holds the potential to steer food systems toward a sustainable trajectory. We furnish a comprehensive survey for the state-of-the-art in diverse facets of food systems, subsequently pinpointing gaps and advocating for the judicious and efficacious deployment of emergent AI methodologies. △ Less

Submitted 4 November, 2023; originally announced November 2023.

Showing 1–50 of 385 results for author: Jain, R