
Showing 1–23 of 23 results for author: Abdelaziz, A

Searching in archive cs.
  1. arXiv:2406.09617  [pdf, other]

    cs.CL cs.HC eess.AS

    Multimodal Large Language Models with Fusion Low Rank Adaptation for Device Directed Speech Detection

    Authors: Shruti Palaskar, Oggi Rudovic, Sameer Dharur, Florian Pesce, Gautam Krishna, Aswin Sivaraman, Jack Berkowitz, Ahmed Hussen Abdelaziz, Saurabh Adya, Ahmed Tewfik

    Abstract: Although Large Language Models (LLMs) have shown promise for human-like conversations, they are primarily pre-trained on text data. Incorporating audio or video improves performance, but collecting large-scale multimodal data and pre-training multimodal LLMs is challenging. To this end, we propose a Fusion Low Rank Adaptation (FLoRA) technique that efficiently adapts a pre-trained unimodal LLM to…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: Accepted at Interspeech 2024

  2. arXiv:2406.09443  [pdf, other]

    eess.AS cs.HC cs.LG

    Comparative Analysis of Personalized Voice Activity Detection Systems: Assessing Real-World Effectiveness

    Authors: Satyam Kumar, Sai Srujana Buddi, Utkarsh Oggy Sarawgi, Vineet Garg, Shivesh Ranjan, Ognjen Rudovic, Ahmed Hussen Abdelaziz, Saurabh Adya

    Abstract: Voice activity detection (VAD) is a critical component in various applications such as speech recognition, speech enhancement, and hands-free communication systems. With the increasing demand for personalized and context-aware technologies, the need for effective personalized VAD systems has become paramount. In this paper, we present a comparative analysis of Personalized Voice Activity Detection…

    Submitted 11 June, 2024; originally announced June 2024.

  3. arXiv:2402.00340  [pdf, other]

    cs.SD eess.AS

    Can you Remove the Downstream Model for Speaker Recognition with Self-Supervised Speech Features?

    Authors: Zakaria Aldeneh, Takuya Higuchi, Jee-weon Jung, Skyler Seto, Tatiana Likhomanenko, Stephen Shum, Ahmed Hussen Abdelaziz, Shinji Watanabe, Barry-John Theobald

    Abstract: Self-supervised features are typically used in place of filter-bank features in speaker verification models. However, these models were originally designed to ingest filter-bank features as inputs, and thus, training them on top of self-supervised features assumes that both feature types require the same amount of learning for the task. In this work, we observe that pre-trained self-supervised spe…

    Submitted 13 June, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

  4. arXiv:2401.17230  [pdf, other]

    cs.SD cs.AI eess.AS

    ESPnet-SPK: full pipeline speaker embedding toolkit with reproducible recipes, self-supervised front-ends, and off-the-shelf models

    Authors: Jee-weon Jung, Wangyou Zhang, Jiatong Shi, Zakaria Aldeneh, Takuya Higuchi, Barry-John Theobald, Ahmed Hussen Abdelaziz, Shinji Watanabe

    Abstract: This paper introduces ESPnet-SPK, a toolkit designed with several objectives for training speaker embedding extractors. First, we provide an open-source platform for researchers in the speaker recognition community to effortlessly build models. We provide several models, ranging from x-vector to recent SKA-TDNN. Through the modularized architecture design, variants can be developed easily. We also…

    Submitted 13 June, 2024; v1 submitted 30 January, 2024; originally announced January 2024.

    Comments: 5 pages, 3 figures, 7 tables, Interspeech 2024

  5. arXiv:2310.15261  [pdf, ps, other]

    cs.SD cs.HC cs.LG eess.AS

    Modality Dropout for Multimodal Device Directed Speech Detection using Verbal and Non-Verbal Features

    Authors: Gautam Krishna, Sameer Dharur, Oggi Rudovic, Pranay Dighe, Saurabh Adya, Ahmed Hussen Abdelaziz, Ahmed H Tewfik

    Abstract: Device-directed speech detection (DDSD) is the binary classification task of distinguishing between queries directed at a voice assistant versus side conversation or background speech. State-of-the-art DDSD systems use verbal cues, e.g., acoustic, text and/or automatic speech recognition system (ASR) features, to classify speech as device-directed or otherwise, and often have to contend with one or…

    Submitted 23 October, 2023; originally announced October 2023.

    Comments: 5 pages

  6. arXiv:2309.06006  [pdf, ps, other]

    cs.CV cs.AI

    SoccerNet 2023 Challenges Results

    Authors: Anthony Cioppa, Silvio Giancola, Vladimir Somers, Floriane Magera, Xin Zhou, Hassan Mkhallati, Adrien Deliège, Jan Held, Carlos Hinojosa, Amir M. Mansourian, Pierre Miralles, Olivier Barnich, Christophe De Vleeschouwer, Alexandre Alahi, Bernard Ghanem, Marc Van Droogenbroeck, Abdullah Kamal, Adrien Maglo, Albert Clapés, Amr Abdelaziz, Artur Xarles, Astrid Orcesi, Atom Scott, Bin Liu, Byoungkwon Lim, et al. (77 additional authors not shown)

    Abstract: The SoccerNet 2023 challenges were the third annual video understanding challenges organized by the SoccerNet team. For this third edition, the challenges were composed of seven vision-based tasks split into three main themes. The first theme, broadcast video understanding, is composed of three high-level tasks related to describing events occurring in the video broadcasts: (1) action spotting, fo…

    Submitted 12 September, 2023; originally announced September 2023.

  7. arXiv:2207.04521  [pdf, other]

    cs.MM cs.CR

    Information-Theoretic Bounds for Steganography in Multimedia

    Authors: Hassan Y. El Arsh, Amr Abdelaziz, Ahmed Elliethy, Hussein A. Aly, T. Aaron Gulliver

    Abstract: Steganography in multimedia aims to embed secret data into an innocent looking multimedia cover object. This embedding introduces some distortion to the cover object and produces a corresponding stego object. The embedding distortion is measured by a cost function that determines the detection probability of the existence of the embedded secret data. A cost function related to the maximum embeddin…

    Submitted 15 July, 2022; v1 submitted 10 July, 2022; originally announced July 2022.

    Comments: arXiv admin note: substantial text overlap with arXiv:2111.04960

  8. arXiv:2203.15975  [pdf, other]

    eess.AS cs.HC cs.LG cs.SD

    Device-Directed Speech Detection: Regularization via Distillation for Weakly-Supervised Models

    Authors: Vineet Garg, Ognjen Rudovic, Pranay Dighe, Ahmed H. Abdelaziz, Erik Marchi, Saurabh Adya, Chandra Dhir, Ahmed Tewfik

    Abstract: We address the problem of detecting speech directed to a device that does not contain a specific wake-word. Specifically, we focus on audio coming from a touch-based invocation. Mitigating virtual assistants (VAs) activation due to accidental button presses is critical for user experience. While the majority of approaches to false trigger mitigation (FTM) are designed to detect the presence of a t…

    Submitted 29 March, 2022; originally announced March 2022.

    Comments: Submitted to INTERSPEECH 2022

  9. arXiv:2111.04960  [pdf, ps, other]

    cs.CR cs.IT

    Information-Theoretic Limits for Steganography in Multimedia

    Authors: Hassan Y. El-Arsh, Amr Abdelaziz, Ahmed Elliethy, Hussein A. Aly

    Abstract: Steganography is the art and science of hiding data within innocent-looking objects (cover objects). Multimedia objects such as images and videos are an attractive type of cover object due to their high embedding rates. There exist many techniques for performing steganography in both the literature and the practical world. Meanwhile, the definition of the steganographic capacity for multimedia an…

    Submitted 9 November, 2021; originally announced November 2021.

    Comments: Manuscript posted on 03.07.2021, 23:19 at "https://www.techrxiv.org/articles/preprint/Information-Theoretic_Limits_for_Steganography_in_Multimedia/14867241"

  10. arXiv:2012.05225  [pdf, other]

    cs.CV cs.AI cs.CY cs.LG

    MorphGAN: One-Shot Face Synthesis GAN for Detecting Recognition Bias

    Authors: Nataniel Ruiz, Barry-John Theobald, Anurag Ranjan, Ahmed Hussein Abdelaziz, Nicholas Apostoloff

    Abstract: To detect bias in face recognition networks, it can be useful to probe a network under test using samples in which only specific attributes vary in some controlled way. However, capturing a sufficiently large dataset with specific control over the attributes of interest is difficult. In this work, we describe a simulator that applies specific head pose and facial expression adjustments to images o…

    Submitted 10 December, 2020; v1 submitted 9 December, 2020; originally announced December 2020.

  11. arXiv:2008.00620  [pdf, ps, other]

    eess.AS cs.CL cs.SD

    Audiovisual Speech Synthesis using Tacotron2

    Authors: Ahmed Hussen Abdelaziz, Anushree Prasanna Kumar, Chloe Seivwright, Gabriele Fanelli, Justin Binder, Yannis Stylianou, Sachin Kajarekar

    Abstract: Audiovisual speech synthesis is the problem of synthesizing a talking face while maximizing the coherency of the acoustic and visual speech. In this paper, we propose and compare two audiovisual speech synthesis systems for 3D face models. The first system is the AVTacotron2, which is an end-to-end text-to-audiovisual speech synthesizer based on the Tacotron2 architecture. AVTacotron2 converts a s…

    Submitted 29 August, 2021; v1 submitted 2 August, 2020; originally announced August 2020.

    Comments: This work has been submitted to the 23rd ACM International Conference on Multimodal Interaction for possible publication

  12. arXiv:2005.13616  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Modality Dropout for Improved Performance-driven Talking Faces

    Authors: Ahmed Hussen Abdelaziz, Barry-John Theobald, Paul Dixon, Reinhard Knothe, Nicholas Apostoloff, Sachin Kajareker

    Abstract: We describe our novel deep learning approach for driving animated faces using both acoustic and visual information. In particular, speech-related facial movements are generated using audiovisual information, and non-speech facial movements are generated using only visual information. To ensure that our model exploits both modalities during training, batches are generated that contain audio-only, v…

    Submitted 27 May, 2020; originally announced May 2020.

    Comments: Pre-print

  13. arXiv:2004.12031  [pdf, ps, other]

    cs.LG cs.CL cs.CV cs.SD eess.AS

    On the Role of Visual Cues in Audiovisual Speech Enhancement

    Authors: Zakaria Aldeneh, Anushree Prasanna Kumar, Barry-John Theobald, Erik Marchi, Sachin Kajarekar, Devang Naik, Ahmed Hussen Abdelaziz

    Abstract: We present an introspection of an audiovisual speech enhancement model. In particular, we focus on interpreting how a neural audiovisual speech enhancement model uses visual cues to improve the quality of the target speech signal. We show that visual cues provide not only high-level information about speech activity, i.e., speech/silence, but also fine-grained visual information about the place of…

    Submitted 25 February, 2021; v1 submitted 24 April, 2020; originally announced April 2020.

    Comments: ICASSP 2021

  14. arXiv:1912.05869  [pdf, other]

    eess.AS cs.NE cs.SD q-bio.NC

    On Neural Phone Recognition of Mixed-Source ECoG Signals

    Authors: Ahmed Hussen Abdelaziz, Shuo-Yiin Chang, Nelson Morgan, Erik Edwards, Dorothea Kolossa, Dan Ellis, David A. Moses, Edward F. Chang

    Abstract: The emerging field of neural speech recognition (NSR) using electrocorticography has recently attracted remarkable research interest for studying how human brains recognize speech in quiet and noisy surroundings. In this study, we demonstrate the utility of NSR systems to objectively prove the ability of human beings to attend to a single speech source while suppressing the interfering signals in…

    Submitted 12 December, 2019; originally announced December 2019.

    Comments: 5 pages, showing algorithms, results and references from our collaboration during a 2017 postdoc stay of the first author

  15. Achieving Positive Covert Capacity over MIMO AWGN Channels

    Authors: Ahmed Bendary, Amr Abdelaziz, C. Emre Koksal

    Abstract: We consider covert communication, i.e., hiding the presence of communication from an adversary for multiple-input multiple-output (MIMO) additive white Gaussian noise (AWGN) channels. We characterize the maximum covert coding rate under a variety of settings, including different regimes where either the number of transmit antennas or the blocklength is scaled up. We show that a non-zero covert cap…

    Submitted 21 January, 2021; v1 submitted 29 October, 2019; originally announced October 2019.

    Comments: Covert communication, low probability of detection communication, MIMO AWGN, square-root law, secrecy capacity, compound channels, unit-rank MIMO

    Journal ref: IEEE Journal on Selected Areas in Information Theory 2021

  16. arXiv:1906.07575  [pdf, other]

    cs.CY cs.LG stat.ML

    Trans-Sense: Real Time Transportation Schedule Estimation Using Smart Phones

    Authors: Ali AbdelAziz, Amin Shoukry, Walid Gomaa, Moustafa Youssef

    Abstract: Developing countries suffer from traffic congestion, poorly planned road/rail networks, and lack of access to public transportation facilities. This context results in an increase in fuel consumption, pollution level, monetary losses, massive delays, and less productivity. On the other hand, it has a negative impact on the commuters' feelings and moods. Availability of real-time transit information…

    Submitted 13 June, 2019; originally announced June 2019.

    Comments: 8 pages, 11 figures

  17. arXiv:1905.06860  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Speaker-Independent Speech-Driven Visual Speech Synthesis using Domain-Adapted Acoustic Models

    Authors: Ahmed Hussen Abdelaziz, Barry-John Theobald, Justin Binder, Gabriele Fanelli, Paul Dixon, Nicholas Apostoloff, Thibaut Weise, Sachin Kajareker

    Abstract: Speech-driven visual speech synthesis involves mapping features extracted from acoustic speech to the corresponding lip animation controls for a face model. This mapping can take many forms, but a powerful approach is to use deep neural networks (DNNs). However, a limitation is the lack of synchronized audio, video, and depth data required to reliably train the DNNs, especially for speaker-indepen…

    Submitted 14 May, 2019; originally announced May 2019.

    Comments: 9 pages, 2 figures, 2 tables

    ACM Class: I.2.m; I.3.8

  18. arXiv:1802.01231  [pdf, other]

    cs.IT

    MIMO with Energy Recycling

    Authors: Y. Ozan Basciftci, Amr Abdelaziz, C. Emre Koksal

    Abstract: We consider a Multiple Input Single Output (MISO) point-to-point communication system in which the transmitter is designed such that each antenna can transmit information or harvest energy at any given point in time. We evaluate the achievable rate by such an energy-recycling MISO system under an average transmission power constraint. Our achievable scheme carefully switches the mode of the anten…

    Submitted 20 March, 2018; v1 submitted 4 February, 2018; originally announced February 2018.

  19. arXiv:1705.02303  [pdf, other]

    cs.IT

    Fundamental Limits of Covert Communication over MIMO AWGN Channel

    Authors: Amr Abdelaziz, C. Emre Koksal

    Abstract: Fundamental limits of covert communication have been studied in the literature for different models of scalar channels. It was shown that, over $n$ independent channel uses, $\mathcal{O}(\sqrt{n})$ bits can be transmitted reliably over a public channel while achieving an arbitrarily low probability of detection (LPD) by other stations. This result is well known as the square-root law and even to achieve this…

    Submitted 13 March, 2018; v1 submitted 5 May, 2017; originally announced May 2017.

    Comments: Submitted to IEEE Transactions on Information Theory

  20. arXiv:1701.07518  [pdf, other]

    cs.CR cs.IT

    On The Compound MIMO Wiretap Channel with Mean Feedback

    Authors: Amr Abdelaziz, C. Emre Koksal, Hesham El Gamal, Ashraf D. Elbayoumy

    Abstract: Compound MIMO wiretap channel with double sided uncertainty is considered under channel mean information model. In mean information model, channel variations are centered around its mean value which is fed back to the transmitter. We show that the worst case main channel is anti-parallel to the channel mean information resulting in an overall unit rank channel. Further, the worst eavesdropper chan…

    Submitted 3 May, 2017; v1 submitted 25 January, 2017; originally announced January 2017.

    Comments: To appear at ISIT 2017 proceedings

  21. arXiv:1609.03109  [pdf, other]

    cs.CR

    Message Authentication and Secret Key Agreement in VANETs via Angle of Arrival

    Authors: Amr Abdelaziz, Ron Burton, C. Emre Koksal

    Abstract: In the scope of VANETs, nature of exchanged safety/warning messages renders itself highly location dependent as it is usually for incident reporting. Thus, vehicles are required to periodically exchange beacon messages that include speed, time and GPS location information. In this paper, we present a physical layer assisted message authentication scheme that uses Angle of Arrival (AoA) estim…

    Submitted 10 September, 2016; originally announced September 2016.

  22. arXiv:1607.00467  [pdf, other]

    cs.IT

    On The Security of AoA Estimation

    Authors: Amr Abdelaziz, C. Emre Koksal, Hesham El Gamal

    Abstract: Angle of Arrival (AoA) estimation has found its way to a wide range of applications. Much attention has been paid to study different techniques for AoA estimation and its applications for jamming suppression, however, security vulnerability issues of AoA estimation itself under hostile activity have not been paid the same attention. In this paper, the problem of AoA estimation in Rician flat fadi…

    Submitted 2 July, 2016; originally announced July 2016.

    Comments: 9 Pages

  23. arXiv:1502.01454  [pdf, other]

    cs.NI cs.CY

    The Diversity and Scale Matter: Ubiquitous Transportation Mode Detection using Single Cell Tower Information

    Authors: Ali Mohamed AbdelAziz, Moustafa Youssef

    Abstract: Detecting the transportation mode of a user is important for a wide range of applications. While a number of recent systems addressed the transportation mode detection problem using the ubiquitous mobile phones, these studies either leverage GPS, the inertial sensors, and/or multiple cell towers information. However, these different phone sensors have high energy consumption, limited to a small su…

    Submitted 5 February, 2015; originally announced February 2015.

    Comments: 5 pages, 6 figures