Showing 1–20 of 20 results for author: Somandepalli, K

  1. arXiv:2405.13762  [pdf, other]

    cs.CV cs.LG cs.MM cs.SD eess.AS

    A Versatile Diffusion Transformer with Mixture of Noise Levels for Audiovisual Generation

    Authors: Gwanghyun Kim, Alonso Martinez, Yu-Chuan Su, Brendan Jou, José Lezama, Agrim Gupta, Lijun Yu, Lu Jiang, Aren Jansen, Jacob Walker, Krishna Somandepalli

    Abstract: Training diffusion models for audiovisual sequences allows for a range of generation tasks by learning conditional distributions of various input-output combinations of the two modalities. Nevertheless, this strategy often requires training a separate model for each task which is expensive. Here, we propose a novel training approach to effectively learn arbitrary conditional distributions in the a…

    Submitted 22 May, 2024; originally announced May 2024.

  2. arXiv:2312.14125  [pdf, other]

    cs.CV cs.AI

    VideoPoet: A Large Language Model for Zero-Shot Video Generation

    Authors: Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, Krishna Somandepalli, Hassan Akbari, Yair Alon, Yong Cheng, Josh Dillon, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, Mikhail Sirotenko, Kihyuk Sohn, Xuan Yang, Hartwig Adam, et al. (6 additional authors not shown)

    Abstract: We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs -- including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and tas…

    Submitted 4 June, 2024; v1 submitted 21 December, 2023; originally announced December 2023.

    Comments: To appear at ICML 2024; Project page: http://sites.research.google/videopoet/

  3. arXiv:2309.03978  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    LanSER: Language-Model Supported Speech Emotion Recognition

    Authors: Taesik Gong, Josh Belanich, Krishna Somandepalli, Arsha Nagrani, Brian Eoff, Brendan Jou

    Abstract: Speech emotion recognition (SER) models typically rely on costly human-labeled data for training, making scaling methods to large speech datasets and nuanced emotion taxonomies difficult. We present LanSER, a method that enables the use of unlabeled data by inferring weak emotion labels via pre-trained large language models through weakly-supervised learning. For inferring weak labels constrained…

    Submitted 7 September, 2023; originally announced September 2023.

    Comments: Presented at INTERSPEECH 2023

    Journal ref: INTERSPEECH (2023) 2408-2412

  4. arXiv:2308.14052  [pdf, other]

    cs.CV

    MM-AU: Towards Multimodal Understanding of Advertisement Videos

    Authors: Digbalay Bose, Rajat Hebbar, Tiantian Feng, Krishna Somandepalli, Anfeng Xu, Shrikanth Narayanan

    Abstract: Advertisement videos (ads) play an integral part in the domain of Internet e-commerce as they amplify the reach of particular products to a broad audience or can serve as a medium to raise awareness about specific issues through concise narrative structures. The narrative structures of advertisements involve several elements like reasoning about the broad content (topic and the underlying message)…

    Submitted 27 August, 2023; originally announced August 2023.

    Comments: Accepted to ACM Multimedia 2023

  5. arXiv:2303.06904  [pdf, other]

    cs.CV cs.AI cs.CL

    Contextually-rich human affect perception using multimodal scene information

    Authors: Digbalay Bose, Rajat Hebbar, Krishna Somandepalli, Shrikanth Narayanan

    Abstract: The process of human affect understanding involves the ability to infer person specific emotional states from various sources including images, speech, and language. Affect perception from images has predominantly focused on expressions extracted from salient face crops. However, emotions perceived by humans rely on multiple contextual cues including social settings, foreground interactions, and a…

    Submitted 13 March, 2023; originally announced March 2023.

    Comments: Accepted to IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023

  6. arXiv:2303.02665  [pdf, other]

    cs.SD cs.LG cs.MM eess.AS

    Heterogeneous Graph Learning for Acoustic Event Classification

    Authors: Amir Shirian, Mona Ahmadian, Krishna Somandepalli, Tanaya Guha

    Abstract: Heterogeneous graphs provide a compact, efficient, and scalable way to model data involving multiple disparate modalities. This makes modeling audiovisual data using heterogeneous graphs an attractive option. However, graph structure does not appear naturally in audiovisual data. Graphs for audiovisual data are constructed manually which is both difficult and sub-optimal. In this work, we address…

    Submitted 12 March, 2023; v1 submitted 5 March, 2023; originally announced March 2023.

    Comments: arXiv admin note: text overlap with arXiv:2207.07935

  7. arXiv:2302.07315  [pdf, other]

    eess.AS cs.LG cs.SD

    A dataset for Audio-Visual Sound Event Detection in Movies

    Authors: Rajat Hebbar, Digbalay Bose, Krishna Somandepalli, Veena Vijai, Shrikanth Narayanan

    Abstract: Audio event detection is a widely studied audio processing task, with applications ranging from self-driving cars to healthcare. In-the-wild datasets such as Audioset have propelled research in this field. However, many efforts typically involve manual annotation and verification, which is expensive to perform at scale. Movies depict various real-life and fictional scenarios which makes them a ric…

    Submitted 14 February, 2023; originally announced February 2023.

  8. arXiv:2210.11065  [pdf, other]

    cs.CV cs.CL cs.MM

    MovieCLIP: Visual Scene Recognition in Movies

    Authors: Digbalay Bose, Rajat Hebbar, Krishna Somandepalli, Haoyang Zhang, Yin Cui, Kree Cole-McLaughlin, Huisheng Wang, Shrikanth Narayanan

    Abstract: Longform media such as movies have complex narrative structures, with events spanning a rich variety of ambient visual scenes. Domain specific challenges associated with visual scenes in movies include transitions, person coverage, and a wide array of real-life and fictional scenarios. Existing visual scene datasets in movies have limited taxonomies and don't consider the visual scene transition w…

    Submitted 22 October, 2022; v1 submitted 20 October, 2022; originally announced October 2022.

    Comments: Accepted to 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2023). Project website with supplemental material: https://sail.usc.edu/~mica/MovieCLIP/. Revised version with updated author affiliations

  9. arXiv:2207.07935  [pdf, other]

    cs.SD cs.LG cs.MM eess.AS

    Visually-aware Acoustic Event Detection using Heterogeneous Graphs

    Authors: Amir Shirian, Krishna Somandepalli, Victor Sanchez, Tanaya Guha

    Abstract: Perception of auditory events is inherently multimodal relying on both audio and visual cues. A large number of existing multimodal approaches process each modality using modality-specific models and then fuse the embeddings to encode the joint information. In contrast, we employ heterogeneous graphs to explicitly capture the spatial and temporal relationships between the modalities and represent…

    Submitted 16 July, 2022; originally announced July 2022.

  10. arXiv:2206.12494  [pdf, other]

    cs.SD cs.LG eess.AS

    Multitask vocal burst modeling with ResNets and pre-trained paralinguistic Conformers

    Authors: Josh Belanich, Krishna Somandepalli, Brian Eoff, Brendan Jou

    Abstract: This technical report presents the modeling approaches used in our submission to the ICML Expressive Vocalizations Workshop & Competition multitask track (ExVo-MultiTask). We first applied image classification models of various sizes on mel-spectrogram representations of the vocal bursts, as is standard in sound event detection literature. Results from these models show an increase of 21.24% over…

    Submitted 24 June, 2022; originally announced June 2022.

    Comments: To be published in the ICML Expressive Vocalizations Workshop & Competition 2022 (https://www.competitions.hume.ai/exvo2022)

  11. Self-supervised Graphs for Audio Representation Learning with Limited Labeled Data

    Authors: Amir Shirian, Krishna Somandepalli, Tanaya Guha

    Abstract: Large scale databases with high-quality manual annotations are scarce in audio domain. We thus explore a self-supervised graph approach to learning audio representations from highly limited labelled data. Considering each audio sample as a graph node, we propose a subgraph-based framework with novel self-supervision tasks that can learn effective audio representations. During training, subgraphs a…

    Submitted 16 July, 2022; v1 submitted 31 January, 2022; originally announced February 2022.

  12. arXiv:2110.06486  [pdf, other]

    cs.CV cs.CL

    Understanding of Emotion Perception from Art

    Authors: Digbalay Bose, Krishna Somandepalli, Souvik Kundu, Rimita Lahiri, Jonathan Gratch, Shrikanth Narayanan

    Abstract: Computational modeling of the emotions evoked by art in humans is a challenging problem because of the subjective and nuanced nature of art and affective signals. In this paper, we consider the above-mentioned problem of understanding emotions evoked in viewers by artwork using both text and visual modalities. Specifically, we analyze images and the accompanying text captions from the viewers expr…

    Submitted 13 October, 2021; originally announced October 2021.

    Comments: 5 pages, 5 figures. Accepted at ICCV2021: 4th Workshop on Closing the loop between Vision and Language

  13. Representation of professions in entertainment media: Insights into frequency and sentiment trends through computational text analysis

    Authors: Sabyasachee Baruah, Krishna Somandepalli, Shrikanth Narayanan

    Abstract: Societal ideas and trends dictate media narratives and cinematic depictions which in turn influences people's beliefs and perceptions of the real world. Media portrayal of culture, education, government, religion, and family affect their function and evolution over time as people interpret and perceive these representations and incorporate them into their beliefs and actions. It is important to st…

    Submitted 11 October, 2021; v1 submitted 7 October, 2021; originally announced October 2021.

    Comments: 27 pages, 15 figures

  14. Robust Character Labeling in Movie Videos: Data Resources and Self-supervised Feature Adaptation

    Authors: Krishna Somandepalli, Rajat Hebbar, Shrikanth Narayanan

    Abstract: Robust face clustering is a vital step in enabling computational understanding of visual character portrayal in media. Face clustering for long-form content is challenging because of variations in appearance and lack of supporting large-scale labeled data. Our work in this paper focuses on two key aspects of this problem: the lack of domain-specific training or benchmark datasets, and adapting fac…

    Submitted 25 February, 2022; v1 submitted 25 August, 2020; originally announced August 2020.

    Journal ref: IEEE Transactions on Multimedia (2021)

  15. arXiv:2008.08225  [pdf]

    cs.CL

    Victim or Perpetrator? Analysis of Violent Characters Portrayals from Movie Scripts

    Authors: Victor R Martinez, Krishna Somandepalli, Karan Singla, Anil Ramanakrishna, Yalda T. Uhls, Shrikanth Narayanan

    Abstract: Violent content in the media can influence viewers' perception of the society. For example, frequent depictions of certain demographics as victims or perpetrators of violence can shape stereotyped attitudes. We propose that computational methods can aid in the large-scale analysis of violence in movies. The method we develop characterizes aspects of violent content solely from the language used in…

    Submitted 29 August, 2020; v1 submitted 18 August, 2020; originally announced August 2020.

    Comments: In 2nd workshop on Media Analytics for Societal Trends: Closing the loop with impact and affect in human-media interactions

  16. arXiv:2005.06038  [pdf, other]

    cs.LG cs.CV cs.SD eess.AS eess.SP stat.ML

    Generalized Multi-view Shared Subspace Learning using View Bootstrapping

    Authors: Krishna Somandepalli, Shrikanth Narayanan

    Abstract: A key objective in multi-view learning is to model the information common to multiple parallel views of a class of objects/events to improve downstream learning tasks. In this context, two open research questions remain: How can we model hundreds of views per event? Can we learn robust multi-view embeddings without any knowledge of how these views are acquired? We present a neural method based on…

    Submitted 12 May, 2020; originally announced May 2020.

  17. Cross modal video representations for weakly supervised active speaker localization

    Authors: Rahul Sharma, Krishna Somandepalli, Shrikanth Narayanan

    Abstract: An objective understanding of media depictions, such as inclusive portrayals of how much someone is heard and seen on screen such as in film and television, requires the machines to discern automatically who, when, how, and where someone is talking, and not. Speaker activity can be automatically discerned from the rich multimodal information present in the media content. This is however a challeng…

    Submitted 3 November, 2021; v1 submitted 9 March, 2020; originally announced March 2020.

  18. arXiv:2002.03520  [pdf, other]

    eess.AS cs.SD

    An empirical analysis of information encoded in disentangled neural speaker representations

    Authors: Raghuveer Peri, Haoqi Li, Krishna Somandepalli, Arindam Jati, Shrikanth Narayanan

    Abstract: The primary characteristic of robust speaker representations is that they are invariant to factors of variability not related to speaker identity. Disentanglement of speaker representations is one of the techniques used to improve robustness of speaker representations to both intrinsic factors that are acquired during speech production (e.g., emotion, lexical content) and extrinsic factors that ar…

    Submitted 7 April, 2020; v1 submitted 9 February, 2020; originally announced February 2020.

    Comments: Submitted to Speaker Odyssey 2020

  19. arXiv:1911.00940  [pdf, other]

    eess.AS cs.SD eess.SP

    Robust speaker recognition using unsupervised adversarial invariance

    Authors: Raghuveer Peri, Monisankha Pal, Arindam Jati, Krishna Somandepalli, Shrikanth Narayanan

    Abstract: In this paper, we address the problem of speaker recognition in challenging acoustic conditions using a novel method to extract robust speaker-discriminative speech representations. We adopt a recently proposed unsupervised adversarial invariance architecture to train a network that maps speaker embeddings extracted using a pre-trained model onto two lower dimensional embedding spaces. The embeddi…

    Submitted 3 November, 2019; originally announced November 2019.

    Comments: Submitted to ICASSP 2020

  20. Multimodal Representation Learning using Deep Multiset Canonical Correlation

    Authors: Krishna Somandepalli, Naveen Kumar, Ruchir Travadi, Shrikanth Narayanan

    Abstract: We propose Deep Multiset Canonical Correlation Analysis (dMCCA) as an extension to representation learning using CCA when the underlying signal is observed across multiple (more than two) modalities. We use deep learning framework to learn non-linear transformations from different modalities to a shared subspace such that the representations maximize the ratio of between- and within-modality covar…

    Submitted 3 April, 2019; originally announced April 2019.