-
Directed Domain Fine-Tuning: Tailoring Separate Modalities for Specific Training Tasks
Authors:
Daniel Wen,
Nafisa Hussain
Abstract:
Large language models (LLMs) and large visual language models (LVLMs) have been at the forefront of the artificial intelligence field, particularly for tasks like text generation, video captioning, and question-answering. Typically, such models are trained on broader knowledge bases or datasets to increase generalizability, learn relationships between topics, and recognize patterns. Instead, we propose to provide instructional datasets specific to the task of each modality within a distinct domain and then fine-tune the parameters of the model using LoRA. With our approach, we can eliminate all noise irrelevant to the given task while also ensuring that the model generates with enhanced precision. For this work, we use Video-LLaVA to generate recipes given cooking videos without transcripts. Video-LLaVA's multimodal architecture allows us to provide cooking images to its image encoder, cooking videos to its video encoder, and general cooking questions to its text encoder. Thus, we aim to remove all noise unrelated to cooking while improving our model's capabilities to generate specific ingredient lists and detailed instructions. As a result, our approach to fine-tuning Video-LLaVA leads to a 2% gain over the baseline Video-LLaVA on the YouCook2 dataset. While this may seem like a marginal increase, our model trains on an image instruction dataset 2.5% the size of Video-LLaVA's and a video instruction dataset 23.76% the size of Video-LLaVA's.
Submitted 24 June, 2024;
originally announced June 2024.
-
A Survey on Human-AI Teaming with Large Pre-Trained Models
Authors:
Vanshika Vats,
Marzia Binta Nizam,
Minghao Liu,
Ziyuan Wang,
Richard Ho,
Mohnish Sai Prasad,
Vincent Titterton,
Sai Venkat Malreddy,
Riya Aggarwal,
Yanwen Xu,
Lei Ding,
Jay Mehta,
Nathan Grinnell,
Li Liu,
Sijia Zhong,
Devanathan Nallur Gandamani,
Xinyi Tang,
Rohan Ghosalkar,
Celeste Shen,
Rachel Shen,
Nafisa Hussain,
Kesav Ravichandran,
James Davis
Abstract:
In the rapidly evolving landscape of artificial intelligence (AI), the collaboration between human intelligence and AI systems, known as Human-AI (HAI) Teaming, has emerged as a cornerstone for advancing problem-solving and decision-making processes. The advent of Large Pre-trained Models (LPtM) has significantly transformed this landscape, offering unprecedented capabilities by leveraging vast amounts of data to understand and predict complex patterns. This paper surveys the pivotal integration of LPtMs with HAI, emphasizing how these models enhance collaborative intelligence beyond traditional approaches. It examines the potential of LPtMs in augmenting human capabilities, discussing this collaboration for AI model improvements, effective teaming, ethical considerations, and their broad applied implications in various sectors. Through this exploration, the study sheds light on the transformative impact of LPtM-enhanced HAI Teaming, providing insights for future research, policy development, and strategic implementations aimed at harnessing the full potential of this collaboration for research and societal benefit.
Submitted 26 June, 2024; v1 submitted 7 March, 2024;
originally announced March 2024.
-
Guilt Detection in Text: A Step Towards Understanding Complex Emotions
Authors:
Abdul Gafar Manuel Meque,
Nisar Hussain,
Grigori Sidorov,
Alexander Gelbukh
Abstract:
We introduce a novel Natural Language Processing (NLP) task called guilt detection, which focuses on detecting guilt in text. We identify guilt as a complex and vital emotion that has not been previously studied in NLP, and we aim to provide a more fine-grained analysis of it. To address the lack of publicly available corpora for guilt detection, we created VIC, a dataset containing 4622 texts from three existing emotion detection datasets that we binarized into guilt and no-guilt classes. We experimented with traditional machine learning methods using bag-of-words and term frequency-inverse document frequency features, achieving a 72% F1 score with the highest-performing model. Our study provides a first step towards understanding guilt in text and opens the door for future research in this area.
Submitted 6 March, 2023;
originally announced March 2023.
-
Heart Sound Segmentation using Bidirectional LSTMs with Attention
Authors:
Tharindu Fernando,
Houman Ghaemmaghami,
Simon Denman,
Sridha Sridharan,
Nayyar Hussain,
Clinton Fookes
Abstract:
This paper proposes a novel framework for the segmentation of phonocardiogram (PCG) signals into heart states, exploiting the temporal evolution of the PCG as well as considering the salient information that it provides for the detection of the heart state. We propose the use of recurrent neural networks and exploit recent advancements in attention-based learning to segment the PCG signal. This allows the network to identify the most salient aspects of the signal and disregard uninformative information. The proposed method attains state-of-the-art performance on multiple benchmarks including both human and animal heart recordings. Furthermore, we empirically analyse different feature combinations including envelope features, wavelet and Mel Frequency Cepstral Coefficients (MFCC), and provide quantitative measurements that explore the importance of different features in the proposed approach. We demonstrate that a recurrent neural network coupled with attention mechanisms can effectively learn from irregular and noisy PCG recordings. Our analysis of different feature combinations shows that MFCC features and their derivatives offer the best performance compared to classical wavelet and envelope features. Heart sound segmentation is a crucial pre-processing step for many diagnostic applications. The proposed method provides a cost-effective alternative to labour-intensive manual segmentation, and provides a more accurate segmentation than existing methods. As such, it can improve the performance of further analysis including the detection of murmurs and ejection clicks. The proposed method is also applicable to the detection and segmentation of other one-dimensional biomedical signals.
Submitted 1 April, 2020;
originally announced April 2020.
-
Batch Recurrent Q-Learning for Backchannel Generation Towards Engaging Agents
Authors:
Nusrah Hussain,
Engin Erzin,
T. Metin Sezgin,
Yucel Yemez
Abstract:
The ability of an agent to generate appropriate verbal and non-verbal backchannels during human-robot interaction greatly enhances the interaction experience. Backchannels are particularly important in applications like tutoring and counseling, which require constant attention and engagement of the user. We present here a method for training a robot for backchannel generation during human-robot interaction within the reinforcement learning (RL) framework, with the goal of maintaining a high engagement level. Since online learning by interaction with a human is highly time-consuming and impractical, we take advantage of a recorded human-to-human dataset and approach our problem as a batch reinforcement learning problem. The dataset is treated as batch data acquired by some behavior policy. We perform experiments with laughs as a backchannel and train an agent with value-based techniques. In particular, we demonstrate the effectiveness of recurrent layers in the approximate value function for this problem, which boost the performance in partially observable environments. Off-policy policy evaluation shows that the RL agents are expected to produce more engagement than an agent trained with imitation learning.
Submitted 6 August, 2019;
originally announced August 2019.
-
Speech Driven Backchannel Generation using Deep Q-Network for Enhancing Engagement in Human-Robot Interaction
Authors:
Nusrah Hussain,
Engin Erzin,
T. Metin Sezgin,
Yucel Yemez
Abstract:
We present a novel method for training a social robot to generate backchannels during human-robot interaction. We address the problem within an off-policy reinforcement learning framework, and show how a robot may learn to produce non-verbal backchannels like laughs when trained to maximize the engagement and attention of the user. A major contribution of this work is the formulation of the problem as a Markov decision process (MDP) with states defined by the speech activity of the user and rewards generated by quantified engagement levels. The problem that we address falls into the class of applications where unlimited interaction with the environment is not possible (our environment being a human) because it may be time-consuming, costly, impracticable or even dangerous if a bad policy is executed. Therefore, we introduce a deep Q-network (DQN) in a batch reinforcement learning framework, where an optimal policy is learned from batch data collected using a more controlled policy. We suggest the use of human-to-human dyadic interaction datasets as a batch of trajectories to train an agent for engaging interactions. Our experiments demonstrate the potential of our method to train a robot for engaging behaviors in an offline manner.
Submitted 5 August, 2019;
originally announced August 2019.
-
3D Facial Action Units Recognition for Emotional Expression
Authors:
N. Hussain,
H. Ujir,
I. Hipiny,
J-L Minoi
Abstract:
Muscular activity causes the activation of certain AUs for every facial expression over a certain duration of the expression. This paper presents methods to recognise facial Action Units (AUs) using the facial distances of the facial features that activate the muscles. The seven facial action units involved are AU1, AU4, AU6, AU12, AU15, AU17 and AU25, which characterise happy and sad expressions. Recognition is performed on each AU according to rules defined on the distances of the facial points. The facial distances chosen are extracted from twelve facial features. The facial distances are then used to train a Support Vector Machine (SVM) and a Neural Network (NN). Classification results using the SVM are presented for several different SVM kernels, while results using the NN are presented for the training, validation and testing phases.
Submitted 1 December, 2017;
originally announced December 2017.
-
LNOS - Live Network Operating System
Authors:
Sajjad Haider,
Mehboob Yasin,
Naveed Hussain,
Muhammad Imran
Abstract:
Operating systems have existed for as long as computers and have evolved continuously. In this paper we review the relatively new or unexplored topic of the Live OS. From a networking perspective, a Live OS is used for establishing clusters and firewalls, as a network security assessment tool, and so on. Our proposed concept is that a Live OS can be established or configured for an organization's specific network requirements with respect to its servers. When an important server fails because of a hardware or software fault, fixing the problem can take time, so in that situation a preconfigured server in the form of a Live OS on CD/DVD/USB can be used as an immediate solution. In a network of ten nodes, we stopped the server machine and, with the necessary adjustments, a Live OS replaced the server in less than five minutes. A Live OS in a network environment is a quick replacement for services that have failed due to server failure (hardware or software). It is a cost-effective solution for low-budget networks. The life of a Live OS starts when we boot it from CD/DVD/USB, and it remains in action for that session. As soon as the machine is rebooted, any work done in that session is lost (unless we store the information on permanent storage media). A Live CD/DVD/USB is normally used on systems that have no operating system installed, but a Live OS can also be used on systems that already have an installed OS. In terms of functionality, a Live OS can be used for many purposes and has some typical advantages that are not available in other operating systems. Vendors are releasing different Live OS distributions, which are becoming their sole identity in a particular domain such as networks, security, education or entertainment. There are many aspects to Live OS, but Linux-based Live OSes and their use in the field of networks are the main focus of this paper.
Submitted 27 December, 2012;
originally announced December 2012.