Search | arXiv e-print repository

Enhancing Multimodal Compositional Reasoning of Visual Language Models with Generative Negative Mining

Authors: Ugur Sahin, Hang Li, Qadeer Khan, Daniel Cremers, Volker Tresp

Abstract: Contemporary large-scale visual language models (VLMs) exhibit strong representation capacities, making them ubiquitous for enhancing image and text understanding tasks. They are often trained in a contrastive manner on a large and diverse corpus of images and corresponding text captions scraped from the internet. Despite this, VLMs often struggle with compositional reasoning tasks which require a fine-grained understanding of the complex interactions of objects and their attributes. This failure can be attributed to two main factors: 1) Contrastive approaches have traditionally focused on mining negative examples from existing datasets. However, the mined negative examples might not be difficult for the model to discriminate from the positive. An alternative to mining would be negative sample generation. 2) But existing generative approaches primarily focus on generating hard negative texts associated with a given image. Mining in the other direction, i.e., generating negative image samples associated with a given text, has been ignored. To overcome both these limitations, we propose a framework that not only mines in both directions but also generates challenging negative samples in both modalities, i.e., images and texts. Leveraging these generative hard negative samples, we significantly enhance VLMs' performance in tasks involving multimodal compositional reasoning. Our code and dataset are released at https://ugorsahin.github.io/enhancing-multimodal-compositional-reasoning-of-vlm.html.

Submitted 7 November, 2023; originally announced November 2023.

Comments: Accepted to WACV

arXiv:2307.14913 [pdf, other]

ARC-NLP at PAN 2023: Transition-Focused Natural Language Inference for Writing Style Detection

Authors: Izzet Emre Kucukkaya, Umitcan Sahin, Cagri Toraman

Abstract: The task of multi-author writing style detection aims at finding any positions of writing style change in a given text document. We formulate the task as a natural language inference problem where two consecutive paragraphs are paired. Our approach focuses on transitions between paragraphs while truncating input tokens for the task. As backbone models, we employ different Transformer-based encoders with a warmup phase during training. We submit the model version that outperforms baselines and other proposed model versions in our experiments. For the easy and medium setups, we submit transition-focused natural language inference based on DeBERTa with warmup training, and the same model without transition for the hard setup.

Submitted 27 July, 2023; originally announced July 2023.

Comments: Accepted by PAN at CLEF 2023
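The pairing step described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `consecutive_pairs` and `transition_pair`, the whitespace tokenization, and the `max_words` parameter are all hypothetical. In particular, keeping the end of the first paragraph and the start of the second is only one possible reading of "transition-focused" truncation.

```python
from typing import List, Tuple


def consecutive_pairs(paragraphs: List[str]) -> List[Tuple[str, str]]:
    """Pair each paragraph with the paragraph that follows it."""
    return list(zip(paragraphs, paragraphs[1:]))


def transition_pair(first: str, second: str, max_words: int = 4) -> Tuple[str, str]:
    """Truncate a pair around the paragraph boundary (max_words >= 1).

    ASSUMPTION: the transition is modelled by keeping the last words of
    the first paragraph and the first words of the second one. The source
    abstract does not specify how truncation is done.
    """
    tail_of_first = first.split()[-max_words:]
    head_of_second = second.split()[:max_words]
    return " ".join(tail_of_first), " ".join(head_of_second)


doc = ["First paragraph.", "Second paragraph.", "Third paragraph."]
pairs = consecutive_pairs(doc)
truncated = transition_pair("a b c d e", "f g h i j", max_words=2)
```

Each resulting pair would then be passed to a sentence-pair classifier that predicts whether the writing style changes between the two paragraphs.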

arXiv:2307.14912 [pdf, other]

ARC-NLP at PAN 2023: Hierarchical Long Text Classification for Trigger Detection

Authors: Umitcan Sahin, Izzet Emre Kucukkaya, Cagri Toraman

Abstract: Fanfiction, a popular form of creative writing set within established fictional universes, has gained a substantial online following. However, ensuring the well-being and safety of participants has become a critical concern in this community. The detection of triggering content, material that may cause emotional distress or trauma to readers, poses a significant challenge. In this paper, we describe our approach for the Trigger Detection shared task at PAN CLEF 2023, where we want to detect multiple triggering content in a given Fanfiction document. For this, we build a hierarchical model that uses recurrence over Transformer-based language models. In our approach, we first split long documents into smaller sized segments and use them to fine-tune a Transformer model. Then, we extract feature embeddings from the fine-tuned Transformer model, which are used as input in the training of multiple LSTM models for trigger detection in a multi-label setting. Our model achieves an F1-macro score of 0.372 and F1-micro score of 0.736 on the validation set, which are higher than the baseline results shared at PAN CLEF 2023.

Submitted 27 July, 2023; originally announced July 2023.

Comments: Accepted by PAN at CLEF 2023

arXiv:2307.13829 [pdf, other]

ARC-NLP at Multimodal Hate Speech Event Detection 2023: Multimodal Methods Boosted by Ensemble Learning, Syntactical and Entity Features

Authors: Umitcan Sahin, Izzet Emre Kucukkaya, Oguzhan Ozcelik, Cagri Toraman

Abstract: Text-embedded images can serve as a means of spreading hate speech, propaganda, and extremist beliefs. Throughout the Russia-Ukraine war, both opposing factions heavily relied on text-embedded images as a vehicle for spreading propaganda and hate speech. Ensuring the effective detection of hate speech and propaganda is of utmost importance to mitigate the negative effect of hate speech dissemination. In this paper, we outline our methodologies for two subtasks of Multimodal Hate Speech Event Detection 2023. For the first subtask, hate speech detection, we utilize multimodal deep learning models boosted by ensemble learning and syntactical text attributes. For the second subtask, target detection, we employ multimodal deep learning models boosted by named entity features. Through experimentation, we demonstrate the superior performance of our models compared to all textual, visual, and text-visual baselines employed in multimodal hate speech detection. Furthermore, our models achieve first place in both subtasks on the final leaderboard of the shared task.

Submitted 25 July, 2023; originally announced July 2023.

Comments: Submitted to CASE at RANLP 2023

arXiv:2302.13403 [pdf, other]

Tweets Under the Rubble: Detection of Messages Calling for Help in Earthquake Disaster

Authors: Cagri Toraman, Izzet Emre Kucukkaya, Oguzhan Ozcelik, Umitcan Sahin

Abstract: The importance of social media is again exposed in the recent tragedy of the 2023 Turkey and Syria earthquake. Many victims who were trapped under the rubble called for help by posting messages on Twitter. We present an interactive tool to provide situational awareness for missing and trapped people, and disaster relief for rescue and donation efforts. The system (i) collects tweets, (ii) classifies the ones calling for help, (iii) extracts important entity tags, and (iv) visualizes them in an interactive map screen. Our initial experiments show that the performance in terms of the F1 score is up to 98.30 for tweet classification, and 84.32 for entity extraction. The demonstration, dataset, and other related files can be accessed at https://github.com/avaapm/deprem

Submitted 26 February, 2023; originally announced February 2023.

arXiv:2301.03206 [pdf, other]

doi 10.21437/SPSC.2022-3

Introducing Model Inversion Attacks on Automatic Speaker Recognition

Authors: Karla Pizzi, Franziska Boenisch, Ugur Sahin, Konstantin Böttinger

Abstract: Model inversion (MI) attacks allow the reconstruction of average per-class representations of a machine learning (ML) model's training data. It has been shown that in scenarios where each class corresponds to a different individual, such as face classifiers, this represents a severe privacy risk. In this work, we explore a new application for MI: the extraction of speakers' voices from a speaker recognition system. We present an approach to (1) reconstruct audio samples from a trained ML model and (2) extract intermediate voice feature representations which provide valuable insights into the speakers' biometrics. Therefore, we propose an extension of MI attacks which we call sliding model inversion. Our sliding MI extends standard MI by iteratively inverting overlapping chunks of the audio samples and thereby leveraging the sequential properties of audio data for enhanced inversion performance. We show that one can use the inverted audio data to generate spoofed audio samples to impersonate a speaker, and execute voice-protected commands for highly secured systems on their behalf. To the best of our knowledge, our work is the first one extending MI attacks to audio data, and our results highlight the security risks resulting from the extraction of the biometric data in that setup.

Submitted 9 January, 2023; originally announced January 2023.

Comments: for associated pdf, see https://www.isca-speech.org/archive/pdfs/spsc_2022/pizzi22_spsc.pdf

Journal ref: Proc. 2nd Symposium on Security and Privacy in Speech Communication, 2022
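The notion of overlapping chunks mentioned in the abstract above can be illustrated with a short sketch. It shows only how a sequence of samples can be cut into overlapping windows. It does not perform any model inversion, and the function name `overlapping_chunks` and the parameters `chunk_size` and `hop` are hypothetical, not taken from the paper.

```python
from typing import List, Sequence


def overlapping_chunks(samples: Sequence[float], chunk_size: int, hop: int) -> List[list]:
    """Cut a sample sequence into windows of chunk_size that start every hop samples.

    Consecutive windows overlap whenever hop < chunk_size. A trailing
    remainder that does not fill a whole window is dropped; an input
    shorter than chunk_size yields one short window.
    """
    if chunk_size <= 0 or hop <= 0:
        raise ValueError("chunk_size and hop must be positive")
    if not samples:
        return []
    last_start = max(len(samples) - chunk_size, 0)
    return [list(samples[s:s + chunk_size]) for s in range(0, last_start + 1, hop)]


# Ten samples, windows of four samples, a new window every two samples.
chunks = overlapping_chunks(list(range(10)), chunk_size=4, hop=2)
```

With `hop` equal to half of `chunk_size`, every sample except those at the edges appears in two windows, which is the kind of overlap a sliding procedure could exploit.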

arXiv:2212.03616 [pdf, other]

Image Compression With Learned Lifting-Based DWT and Learned Tree-Based Entropy Models

Authors: Ugur Berk Sahin, Fatih Kamisli

Abstract: This paper explores learned image compression based on traditional and learned discrete wavelet transform (DWT) architectures and learned entropy models for coding DWT subband coefficients. A learned DWT is obtained through the lifting scheme with learned nonlinear predict and update filters. Several learned entropy models are proposed to exploit inter and intra-DWT subband coefficient dependencies, akin to traditional EZW, SPIHT, or EBCOT algorithms. Experimental results show that when the proposed learned entropy models are combined with traditional wavelet filters, such as the CDF 9/7 filters, compression performance that far exceeds that of JPEG2000 can be achieved. When the learned entropy models are combined with the learned DWT, compression performance increases further. The computations in the learned DWT and all entropy models, except one, can be simply parallelized, and the systems provide practical encoding and decoding times on GPUs.

Submitted 7 December, 2022; originally announced December 2022.

Comments: 11 pages, 17 figures
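For readers unfamiliar with the lifting scheme named in the abstract above, the following is a minimal sketch of one lifting step with its exact inverse. It uses fixed, linear Haar-style predict and update filters as a stand-in; the paper's predict and update filters are learned and nonlinear, and are not reproduced here. The function names are hypothetical, and the input is assumed to have even length.

```python
from typing import List, Tuple


def lifting_forward(x: List[float]) -> Tuple[List[float], List[float]]:
    """One lifting step: split, predict, update (fixed Haar-style filters)."""
    even, odd = x[0::2], x[1::2]
    # Predict: estimate each odd sample from its even neighbour; keep the residual.
    detail = [o - e for e, o in zip(even, odd)]
    # Update: adjust the even samples so that they carry the pairwise mean.
    approx = [e + d / 2 for e, d in zip(even, detail)]
    return approx, detail


def lifting_inverse(approx: List[float], detail: List[float]) -> List[float]:
    """Undo the update, then the predict step, and merge the two halves."""
    even = [a - d / 2 for a, d in zip(approx, detail)]
    odd = [d + e for d, e in zip(detail, even)]
    x = [0.0] * (2 * len(even))
    x[0::2] = even
    x[1::2] = odd
    return x


signal = [2.0, 4.0, 6.0, 6.0]
approx, detail = lifting_forward(signal)
reconstructed = lifting_inverse(approx, detail)
```

Because the inverse simply reverses each step with the sign flipped, the structure stays invertible whatever predict and update functions are used, which is what makes it possible to replace them with learned ones.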

arXiv:2012.01736 [pdf, other]

Designing a Prospective COVID-19 Therapeutic with Reinforcement Learning

Authors: Marcin J. Skwark, Nicolás López Carranza, Thomas Pierrot, Joe Phillips, Slim Said, Alexandre Laterre, Amine Kerkeni, Uğur Şahin, Karim Beguir

Abstract: The SARS-CoV-2 pandemic has created a global race for a cure. One approach focuses on designing a novel variant of the human angiotensin-converting enzyme 2 (ACE2) that binds more tightly to the SARS-CoV-2 spike protein and diverts it from human cells. Here we formulate a novel protein design framework as a reinforcement learning problem. We generate new designs efficiently through the combination of a fast, biologically-grounded reward function and sequential action-space formulation. The use of Policy Gradients reduces the compute budget needed to reach consistent, high-quality designs by at least an order of magnitude compared to standard methods. Complexes designed by this method have been validated by molecular dynamics simulations, confirming their increased stability. This suggests that combining leading protein design methods with modern deep reinforcement learning is a viable path for discovering a Covid-19 cure and may accelerate design of peptide-based therapeutics for other diseases.

Submitted 3 December, 2020; originally announced December 2020.

arXiv:2007.13212 [pdf, other]

Demo: A Proof-of-Concept Implementation of Guard Secure Routing Protocol

Authors: Sanaz Taheri-Boshrooyeh, Ali Utkan Şahin, Yahya Hassanzadeh-Nazarabadi, Öznur Özkasap

Abstract: Skip Graphs belong to the family of Distributed Hash Table (DHT) structures that are utilized as routing overlays in various peer-to-peer applications including blockchains, cloud storage, and social networks. In a Skip Graph overlay, any misbehavior of peers during the routing of a query compromises the system functionality. Guard is the first authenticated search mechanism for Skip Graphs that enables reliable search operation in a fully decentralized manner. In this demo paper, we present a proof-of-concept implementation of Guard on Skip Graph nodes as well as a deployment demo scenario.

Submitted 26 July, 2020; originally announced July 2020.

Comments: 3 pages

arXiv:2007.13200 [pdf, other]

SkipSim: Scalable Skip Graph Simulator

Authors: Yahya Hassanzadeh-Nazarabadi, Ali Utkan Şahin, Öznur Özkasap, Alptekin Küpçü

Abstract: SkipSim is an offline Skip Graph simulator that enables Skip Graph-based algorithms including blockchains and P2P cloud storage to be simulated while preserving their scalability and decentralized nature. To the best of our knowledge, it is the first Skip Graph simulator that provides several features for experimentation on Skip Graph-based overlay networks. In this demo paper, we present SkipSim features, its architecture, as well as a sample blockchain demo scenario.

Submitted 26 July, 2020; originally announced July 2020.

Showing 1–10 of 10 results for author: Sahin, U