Showing 1–27 of 27 results for author: Guha, T

Searching in archive cs.
  1. arXiv:2405.00823

    cs.CL cs.AI cs.MA

    WorkBench: a Benchmark Dataset for Agents in a Realistic Workplace Setting

    Authors: Olly Styles, Sam Miller, Patricio Cerda-Mardini, Tanaya Guha, Victor Sanchez, Bertie Vidgen

    Abstract: We introduce WorkBench: a benchmark dataset for evaluating agents' ability to execute tasks in a workplace setting. WorkBench contains a sandbox environment with five databases, 26 tools, and 690 tasks. These tasks represent common business activities, such as sending emails and scheduling meetings. The tasks in WorkBench are challenging as they require planning, tool selection, and often multiple…

    Submitted 1 May, 2024; originally announced May 2024.

  2. arXiv:2403.09281

    cs.CV

    CLIP-EBC: CLIP Can Count Accurately through Enhanced Blockwise Classification

    Authors: Yiming Ma, Victor Sanchez, Tanaya Guha

    Abstract: The CLIP (Contrastive Language-Image Pretraining) model has exhibited outstanding performance in recognition problems, such as zero-shot image classification and object detection. However, its ability to count remains understudied due to the inherent challenges of transforming counting--a regression task--into a recognition task. In this paper, we investigate CLIP's potential in counting, focusing…

    Submitted 14 March, 2024; originally announced March 2024.

  3. arXiv:2307.12241

    cs.CV cs.LG

    Explainable Depression Detection via Head Motion Patterns

    Authors: Monika Gahalawat, Raul Fernandez Rojas, Tanaya Guha, Ramanathan Subramanian, Roland Goecke

    Abstract: While depression has been studied via multimodal non-verbal behavioural cues, head motion behaviour has not received much attention as a biomarker. This study demonstrates the utility of fundamental head-motion units, termed kinemes, for depression detection by adopting two distinct approaches, and employing distinctive features: (a) discovering kinemes from head motion data corresponding t…

    Submitted 23 July, 2023; originally announced July 2023.

  4. arXiv:2304.06370

    cs.CV

    Robust Multiview Multimodal Driver Monitoring System Using Masked Multi-Head Self-Attention

    Authors: Yiming Ma, Victor Sanchez, Soodeh Nikan, Devesh Upadhyay, Bhushan Atote, Tanaya Guha

    Abstract: Driver Monitoring Systems (DMSs) are crucial for safe hand-over actions in Level-2+ self-driving vehicles. State-of-the-art DMSs leverage multiple sensors mounted at different locations to monitor the driver and the vehicle's interior scene and employ decision-level fusion to integrate these heterogeneous data. However, this fusion method may not fully utilize the complementarity of different data…

    Submitted 13 April, 2023; originally announced April 2023.

    Comments: 9 pages (1 for reference); accepted by the 6th Multimodal Learning and Applications Workshop (MULA) at CVPR 2023

  5. arXiv:2303.02665

    cs.SD cs.LG cs.MM eess.AS

    Heterogeneous Graph Learning for Acoustic Event Classification

    Authors: Amir Shirian, Mona Ahmadian, Krishna Somandepalli, Tanaya Guha

    Abstract: Heterogeneous graphs provide a compact, efficient, and scalable way to model data involving multiple disparate modalities. This makes modeling audiovisual data using heterogeneous graphs an attractive option. However, graph structure does not appear naturally in audiovisual data. Graphs for audiovisual data are constructed manually which is both difficult and sub-optimal. In this work, we address…

    Submitted 12 March, 2023; v1 submitted 5 March, 2023; originally announced March 2023.

    Comments: arXiv admin note: text overlap with arXiv:2207.07935

  6. arXiv:2302.09817

    cs.LG cs.CV

    Explainable Human-centered Traits from Head Motion and Facial Expression Dynamics

    Authors: Surbhi Madan, Monika Gahalawat, Tanaya Guha, Roland Goecke, Ramanathan Subramanian

    Abstract: We explore the efficacy of multimodal behavioral cues for explainable prediction of personality and interview-specific traits. We utilize elementary head-motion units named kinemes, atomic facial movements termed action units and speech features to estimate these human-centered traits. Empirical results confirm that kinemes and action units enable discovery of multiple trait-specific behaviors whi…

    Submitted 23 February, 2023; v1 submitted 20 February, 2023; originally announced February 2023.

  7. arXiv:2210.09441

    cs.CV cs.HC cs.RO

    Real-Time Driver Monitoring Systems through Modality and View Analysis

    Authors: Yiming Ma, Victor Sanchez, Soodeh Nikan, Devesh Upadhyay, Bhushan Atote, Tanaya Guha

    Abstract: Driver distractions are known to be the dominant cause of road accidents. While monitoring systems can detect non-driving-related activities and facilitate reducing the risks, they must be accurate and efficient to be applicable. Unfortunately, state-of-the-art methods prioritize accuracy while ignoring latency because they leverage cross-view and multimodal videos in which consecutive frames are…

    Submitted 17 October, 2022; originally announced October 2022.

    Comments: Paper that summarizes our work on the DAD dataset

  8. arXiv:2207.07935

    cs.SD cs.LG cs.MM eess.AS

    Visually-aware Acoustic Event Detection using Heterogeneous Graphs

    Authors: Amir Shirian, Krishna Somandepalli, Victor Sanchez, Tanaya Guha

    Abstract: Perception of auditory events is inherently multimodal relying on both audio and visual cues. A large number of existing multimodal approaches process each modality using modality-specific models and then fuse the embeddings to encode the joint information. In contrast, we employ heterogeneous graphs to explicitly capture the spatial and temporal relationships between the modalities and represent…

    Submitted 16 July, 2022; originally announced July 2022.

  9. arXiv:2207.07783

    cs.CV

    Learning Long-Term Spatial-Temporal Graphs for Active Speaker Detection

    Authors: Kyle Min, Sourya Roy, Subarna Tripathi, Tanaya Guha, Somdeb Majumdar

    Abstract: Active speaker detection (ASD) in videos with multiple speakers is a challenging task as it requires learning effective audiovisual features and spatial-temporal correlations over long temporal windows. In this paper, we present SPELL, a novel spatial-temporal graph learning framework that can solve complex tasks such as ASD. To this end, each person in a video frame is first encoded in a unique n…

    Submitted 12 October, 2022; v1 submitted 15 July, 2022; originally announced July 2022.

    Comments: ECCV 2022 camera ready (Supplementary videos: on ECVA soon). This paper supersedes arXiv:2112.01479

  10. FusionCount: Efficient Crowd Counting via Multiscale Feature Fusion

    Authors: Yiming Ma, Victor Sanchez, Tanaya Guha

    Abstract: State-of-the-art crowd counting models follow an encoder-decoder approach. Images are first processed by the encoder to extract features. Then, to account for perspective distortion, the highest-level feature map is fed to extra components to extract multiscale features, which are the input to the decoder to generate crowd densities. However, in these methods, features extracted at earlier stages…

    Submitted 28 February, 2022; originally announced February 2022.

    Comments: 5 pages, 11 figures, submitted to ICIP

  11. Self-supervised Graphs for Audio Representation Learning with Limited Labeled Data

    Authors: Amir Shirian, Krishna Somandepalli, Tanaya Guha

    Abstract: Large scale databases with high-quality manual annotations are scarce in audio domain. We thus explore a self-supervised graph approach to learning audio representations from highly limited labelled data. Considering each audio sample as a graph node, we propose a subgraph-based framework with novel self-supervision tasks that can learn effective audio representations. During training, subgraphs a…

    Submitted 16 July, 2022; v1 submitted 31 January, 2022; originally announced February 2022.

  12. Head Matters: Explainable Human-centered Trait Prediction from Head Motion Dynamics

    Authors: Surbhi Madan, Monika Gahalawat, Tanaya Guha, Ramanathan Subramanian

    Abstract: We demonstrate the utility of elementary head-motion units termed kinemes for behavioral analytics to predict personality and interview traits. Transforming head-motion patterns into a sequence of kinemes facilitates discovery of latent temporal signatures characterizing the targeted traits, thereby enabling both efficient and explainable trait prediction. Utilizing Kinemes and Facial Action Codin…

    Submitted 15 December, 2021; originally announced December 2021.

    Comments: 10 pages, 10 figures, 6 tables. This paper is published in ICMI 2021

  13. arXiv:2112.01479

    cs.CV

    Learning Spatial-Temporal Graphs for Active Speaker Detection

    Authors: Sourya Roy, Kyle Min, Subarna Tripathi, Tanaya Guha, Somdeb Majumdar

    Abstract: We address the problem of active speaker detection through a new framework, called SPELL, that learns long-range multimodal graphs to encode the inter-modal relationship between audio and visual data. We cast active speaker detection as a node classification task that is aware of longer-term dependencies. We first construct a graph from a video so that each node corresponds to one person. Nodes re…

    Submitted 3 December, 2021; v1 submitted 2 December, 2021; originally announced December 2021.

    Comments: 10 pages

  14. arXiv:2108.04694

    cs.CV

    Multi-Camera Trajectory Forecasting with Trajectory Tensors

    Authors: Olly Styles, Tanaya Guha, Victor Sanchez

    Abstract: We introduce the problem of multi-camera trajectory forecasting (MCTF), which involves predicting the trajectory of a moving object across a network of cameras. While multi-camera setups are widespread for applications such as surveillance and traffic monitoring, existing trajectory forecasting methods typically focus on single-camera trajectory forecasting (SCTF), limiting their use for such appl…

    Submitted 24 August, 2021; v1 submitted 10 August, 2021; originally announced August 2021.

    Comments: To appear in IEEE Transactions on Pattern Analysis and Machine Intelligence (tPAMI)

  15. arXiv:2102.04990

    cs.CV cs.CL

    In Defense of Scene Graphs for Image Captioning

    Authors: Kien Nguyen, Subarna Tripathi, Bang Du, Tanaya Guha, Truong Q. Nguyen

    Abstract: The mainstream image captioning models rely on Convolutional Neural Network (CNN) image features to generate captions via recurrent models. Recently, image scene graphs have been used to augment captioning models so as to leverage their structural semantics, such as object entities, relationships and attributes. Several studies have noted that the naive use of scene graphs from a black-box scene g…

    Submitted 17 August, 2021; v1 submitted 9 February, 2021; originally announced February 2021.

    Comments: Accepted to ICCV 2021

  16. arXiv:2008.02661

    cs.CV cs.MM eess.AS

    Dynamic Emotion Modeling with Learnable Graphs and Graph Inception Network

    Authors: A. Shirian, S. Tripathi, T. Guha

    Abstract: Human emotion is expressed, perceived and captured using a variety of dynamic data modalities, such as speech (verbal), videos (facial expressions) and motion sensors (body gestures). We propose a generalized approach to emotion recognition that can adapt across modalities by modeling dynamic data as structured graphs. The motivation behind the graph approach is to build compact models without com…

    Submitted 8 February, 2021; v1 submitted 6 August, 2020; originally announced August 2020.

    Journal ref: 10.1109/TMM.2021.3059169

  17. arXiv:2008.02063

    cs.CV cs.LG eess.AS

    Compact Graph Architecture for Speech Emotion Recognition

    Authors: A. Shirian, T. Guha

    Abstract: We propose a deep graph approach to address the task of speech emotion recognition. A compact, efficient and scalable way to represent data is in the form of graphs. Following the theory of graph signal processing, we propose to model speech signal as a cycle graph or a line graph. Such graph structure enables us to construct a Graph Convolution Network (GCN)-based architecture that can perform an…

    Submitted 2 February, 2021; v1 submitted 5 August, 2020; originally announced August 2020.

  18. arXiv:2007.14913

    cs.CV cs.MM

    Dynamic Character Graph via Online Face Clustering for Movie Analysis

    Authors: Prakhar Kulshreshtha, Tanaya Guha

    Abstract: An effective approach to automated movie content analysis involves building a network (graph) of its characters. Existing work usually builds a static character graph to summarize the content using metadata, scripts or manual annotations. We propose an unsupervised approach to building a dynamic character graph that captures the temporal evolution of character interaction. We refer to this as the…

    Submitted 29 July, 2020; originally announced July 2020.

    Comments: accepted for publication in Multimedia Tools and Applications (MMTA)

  19. arXiv:2006.03898

    cs.CV cs.MM eess.IV

    Ensemble Network for Ranking Images Based on Visual Appeal

    Authors: Sachin Singh, Victor Sanchez, Tanaya Guha

    Abstract: We propose a computational framework for ranking images (group photos in particular) taken at the same event within a short time span. The ranking is expected to correspond with human perception of overall appeal of the images. We hypothesize and provide evidence through subjective analysis that the factors that appeal to humans are its emotional content, aesthetics and image quality. We propose a…

    Submitted 6 June, 2020; originally announced June 2020.

  20. arXiv:2005.00282

    cs.CV

    Multi-Camera Trajectory Forecasting: Pedestrian Trajectory Prediction in a Network of Cameras

    Authors: Olly Styles, Tanaya Guha, Victor Sanchez, Alex Kot

    Abstract: We introduce the task of multi-camera trajectory forecasting (MCTF), where the future trajectory of an object is predicted in a network of cameras. Prior works consider forecasting trajectories in a single camera view. Our work is the first to consider the challenging scenario of forecasting across multiple non-overlapping camera views. This has wide applicability in tasks such as re-identificatio…

    Submitted 1 May, 2020; originally announced May 2020.

    Comments: CVPR 2020 Precognition workshop

  21. arXiv:1910.08732

    cs.CV cs.MM cs.SD eess.AS

    Coordinated Joint Multimodal Embeddings for Generalized Audio-Visual Zeroshot Classification and Retrieval of Videos

    Authors: Kranti Kumar Parida, Neeraj Matiyali, Tanaya Guha, Gaurav Sharma

    Abstract: We present an audio-visual multimodal approach for the task of zeroshot learning (ZSL) for classification and retrieval of videos. ZSL has been studied extensively in the recent past but has primarily been limited to visual modality and to images. We demonstrate that both audio and visual modalities are important for ZSL for videos. Since a dataset to study the task is currently not available, we…

    Submitted 19 October, 2019; originally announced October 2019.

    Comments: To appear in WACV 2020, Project Page: https://cse.iitk.ac.in/users/kranti/avzsl.html

  22. arXiv:1909.11944

    cs.CV

    Multiple Object Forecasting: Predicting Future Object Locations in Diverse Environments

    Authors: Olly Styles, Tanaya Guha, Victor Sanchez

    Abstract: This paper introduces the problem of multiple object forecasting (MOF), in which the goal is to predict future bounding boxes of tracked objects. In contrast to existing works on object trajectory forecasting which primarily consider the problem from a birds-eye perspective, we formulate the problem from an object-level perspective and call for the prediction of full object bounding boxes, rather…

    Submitted 7 January, 2020; v1 submitted 26 September, 2019; originally announced September 2019.

    Comments: WACV 2020. Code & dataset: https://github.com/olly-styles/Multiple-Object-Forecasting

  23. arXiv:1904.00150

    cs.MM cs.LG cs.SD eess.AS

    Learning Affective Correspondence between Music and Image

    Authors: Gaurav Verma, Eeshan Gunesh Dhekane, Tanaya Guha

    Abstract: We introduce the problem of learning affective correspondence between audio (music) and visual data (images). For this task, a music clip and an image are considered similar (having true correspondence) if they have similar emotion content. In order to estimate this crossmodal, emotion-centric similarity, we propose a deep neural network architecture that learns to project the data from the two mo…

    Submitted 16 April, 2019; v1 submitted 30 March, 2019; originally announced April 2019.

    Comments: 5 pages, International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019

  24. arXiv:1712.04753

    eess.AS cs.CL cs.HC cs.SD

    Learning Spontaneity to Improve Emotion Recognition In Speech

    Authors: Karttikeya Mangalam, Tanaya Guha

    Abstract: We investigate the effect and usefulness of spontaneity (i.e. whether a given speech is spontaneous or not) in speech in the context of emotion recognition. We hypothesize that emotional content in speech is interrelated with its spontaneity, and use spontaneity classification as an auxiliary task to the problem of emotion recognition. We propose two supervised learning settings that utilize spont…

    Submitted 13 June, 2018; v1 submitted 12 December, 2017; originally announced December 2017.

    Comments: Accepted at Interspeech 2018

  25. arXiv:1707.06830

    cs.MM

    Multichannel Attention Network for Analyzing Visual Behavior in Public Speaking

    Authors: Rahul Sharma, Tanaya Guha, Gaurav Sharma

    Abstract: Public speaking is an important aspect of human communication and interaction. The majority of computational work on public speaking concentrates on analyzing the spoken content, and the verbal behavior of the speakers. While the success of public speaking largely depends on the content of the talk, and the verbal behavior, non-verbal (visual) cues, such as gestures and physical appearance also pl…

    Submitted 21 July, 2017; originally announced July 2017.

  26. arXiv:1306.2727

    cs.CV cs.MM eess.IV

    Sparse Representation-based Image Quality Assessment

    Authors: Tanaya Guha, Ehsan Nezhadarya, Rabab K Ward

    Abstract: A successful approach to image quality assessment involves comparing the structural information between a distorted and its reference image. However, extracting structural information that is perceptually important to our visual system is a challenging task. This paper addresses this issue by employing a sparse representation-based approach and proposes a new metric called the \emph{sparse represe…

    Submitted 12 June, 2013; originally announced June 2013.

    Comments: 10 pages, 3 figures, 3 tables, submitted to a journal

  27. Image Similarity Using Sparse Representation and Compression Distance

    Authors: Tanaya Guha, Rabab K. Ward

    Abstract: A new line of research uses compression methods to measure the similarity between signals. Two signals are considered similar if one can be compressed significantly when the information of the other is known. The existing compression-based similarity methods, although successful in the discrete one dimensional domain, do not work well in the context of images. This paper proposes a sparse represen…

    Submitted 7 May, 2013; v1 submitted 12 June, 2012; originally announced June 2012.

    Comments: submitted journal draft