-
Multimodal Large Language Models with Fusion Low Rank Adaptation for Device Directed Speech Detection
Authors:
Shruti Palaskar,
Oggi Rudovic,
Sameer Dharur,
Florian Pesce,
Gautam Krishna,
Aswin Sivaraman,
Jack Berkowitz,
Ahmed Hussen Abdelaziz,
Saurabh Adya,
Ahmed Tewfik
Abstract:
Although Large Language Models (LLMs) have shown promise for human-like conversations, they are primarily pre-trained on text data. Incorporating audio or video improves performance, but collecting large-scale multimodal data and pre-training multimodal LLMs is challenging. To this end, we propose a Fusion Low Rank Adaptation (FLoRA) technique that efficiently adapts a pre-trained unimodal LLM to consume new, previously unseen modalities via low rank adaptation. For device-directed speech detection, using FLoRA, the multimodal LLM achieves 22% relative reduction in equal error rate (EER) over the text-only approach and attains performance parity with its full fine-tuning (FFT) counterpart while needing to tune only a fraction of its parameters. Furthermore, with the newly introduced adapter dropout, FLoRA is robust to missing data, improving over FFT by 20% lower EER and 56% lower false accept rate. The proposed approach scales well for model sizes from 16M to 3B parameters.
Submitted 13 June, 2024;
originally announced June 2024.
-
Comparative Analysis of Personalized Voice Activity Detection Systems: Assessing Real-World Effectiveness
Authors:
Satyam Kumar,
Sai Srujana Buddi,
Utkarsh Oggy Sarawgi,
Vineet Garg,
Shivesh Ranjan,
Ognjen Rudovic,
Ahmed Hussen Abdelaziz,
Saurabh Adya
Abstract:
Voice activity detection (VAD) is a critical component in various applications such as speech recognition, speech enhancement, and hands-free communication systems. With the increasing demand for personalized and context-aware technologies, the need for effective personalized VAD systems has become paramount. In this paper, we present a comparative analysis of Personalized Voice Activity Detection (PVAD) systems to assess their real-world effectiveness. We introduce a comprehensive approach to assess PVAD systems, incorporating various performance metrics such as frame-level and utterance-level error rates, detection latency and accuracy, alongside user-level analysis. Through extensive experimentation and evaluation, we provide a thorough understanding of the strengths and limitations of various PVAD variants. This paper advances the understanding of PVAD technology by offering insights into its efficacy and viability in practical applications using a comprehensive set of metrics.
Submitted 11 June, 2024;
originally announced June 2024.
-
Can you Remove the Downstream Model for Speaker Recognition with Self-Supervised Speech Features?
Authors:
Zakaria Aldeneh,
Takuya Higuchi,
Jee-weon Jung,
Skyler Seto,
Tatiana Likhomanenko,
Stephen Shum,
Ahmed Hussen Abdelaziz,
Shinji Watanabe,
Barry-John Theobald
Abstract:
Self-supervised features are typically used in place of filter-bank features in speaker verification models. However, these models were originally designed to ingest filter-bank features as inputs, and thus, training them on top of self-supervised features assumes that both feature types require the same amount of learning for the task. In this work, we observe that pre-trained self-supervised speech features inherently include information required for the downstream speaker verification task, and therefore, we can simplify the downstream model without sacrificing performance. To this end, we revisit the design of the downstream model for speaker verification using self-supervised features. We show that we can simplify the model to use 97.51% fewer parameters while achieving a 29.93% average improvement in performance on SUPERB. Consequently, we show that the simplified downstream model is more data efficient than the baseline: it achieves better performance with only 60% of the training data.
Submitted 13 June, 2024; v1 submitted 1 February, 2024;
originally announced February 2024.
-
ESPnet-SPK: full pipeline speaker embedding toolkit with reproducible recipes, self-supervised front-ends, and off-the-shelf models
Authors:
Jee-weon Jung,
Wangyou Zhang,
Jiatong Shi,
Zakaria Aldeneh,
Takuya Higuchi,
Barry-John Theobald,
Ahmed Hussen Abdelaziz,
Shinji Watanabe
Abstract:
This paper introduces ESPnet-SPK, a toolkit designed with several objectives for training speaker embedding extractors. First, we provide an open-source platform for researchers in the speaker recognition community to effortlessly build models. We provide several models, ranging from x-vector to recent SKA-TDNN. Through the modularized architecture design, variants can be developed easily. We also aspire to bridge developed models with other domains, facilitating the broad research community to effortlessly incorporate state-of-the-art embedding extractors. Pre-trained embedding extractors can be accessed in an off-the-shelf manner and we demonstrate the toolkit's versatility by showcasing its integration with two tasks. Another goal is to integrate with diverse self-supervised learning features. We release a reproducible recipe that achieves an equal error rate of 0.39% on the Vox1-O evaluation protocol using WavLM-Large with ECAPA-TDNN.
Submitted 13 June, 2024; v1 submitted 30 January, 2024;
originally announced January 2024.
-
Modality Dropout for Multimodal Device Directed Speech Detection using Verbal and Non-Verbal Features
Authors:
Gautam Krishna,
Sameer Dharur,
Oggi Rudovic,
Pranay Dighe,
Saurabh Adya,
Ahmed Hussen Abdelaziz,
Ahmed H Tewfik
Abstract:
Device-directed speech detection (DDSD) is the binary classification task of distinguishing between queries directed at a voice assistant versus side conversation or background speech. State-of-the-art DDSD systems use verbal cues, e.g., acoustic, text and/or automatic speech recognition system (ASR) features, to classify speech as device-directed or otherwise, and often have to contend with one or more of these modalities being unavailable when deployed in real-world settings. In this paper, we investigate fusion schemes for DDSD systems that can be made more robust to missing modalities. Concurrently, we study the use of non-verbal cues, specifically prosody features, in addition to verbal cues for DDSD. We present different approaches to combine scores and embeddings from prosody with the corresponding verbal cues, finding that prosody improves DDSD performance by up to 8.5% in terms of false acceptance rate (FA) at a given fixed operating point via non-linear intermediate fusion, while our use of modality dropout techniques improves the performance of these models by 7.4% in terms of FA when evaluated with missing modalities during inference time.
Submitted 23 October, 2023;
originally announced October 2023.
-
SoccerNet 2023 Challenges Results
Authors:
Anthony Cioppa,
Silvio Giancola,
Vladimir Somers,
Floriane Magera,
Xin Zhou,
Hassan Mkhallati,
Adrien Deliège,
Jan Held,
Carlos Hinojosa,
Amir M. Mansourian,
Pierre Miralles,
Olivier Barnich,
Christophe De Vleeschouwer,
Alexandre Alahi,
Bernard Ghanem,
Marc Van Droogenbroeck,
Abdullah Kamal,
Adrien Maglo,
Albert Clapés,
Amr Abdelaziz,
Artur Xarles,
Astrid Orcesi,
Atom Scott,
Bin Liu,
Byoungkwon Lim
, et al. (77 additional authors not shown)
Abstract:
The SoccerNet 2023 challenges were the third annual video understanding challenges organized by the SoccerNet team. For this third edition, the challenges were composed of seven vision-based tasks split into three main themes. The first theme, broadcast video understanding, is composed of three high-level tasks related to describing events occurring in the video broadcasts: (1) action spotting, focusing on retrieving all timestamps related to global actions in soccer, (2) ball action spotting, focusing on retrieving all timestamps related to the soccer ball change of state, and (3) dense video captioning, focusing on describing the broadcast with natural language and anchored timestamps. The second theme, field understanding, relates to the single task of (4) camera calibration, focusing on retrieving the intrinsic and extrinsic camera parameters from images. The third and last theme, player understanding, is composed of three low-level tasks related to extracting information about the players: (5) re-identification, focusing on retrieving the same players across multiple views, (6) multiple object tracking, focusing on tracking players and the ball through unedited video streams, and (7) jersey number recognition, focusing on recognizing the jersey number of players from tracklets. Compared to the previous editions of the SoccerNet challenges, tasks (2-3-7) are novel, including new annotations and data, task (4) was enhanced with more data and annotations, and task (6) now focuses on end-to-end approaches. More information on the tasks, challenges, and leaderboards is available on https://www.soccer-net.org. Baselines and development kits can be found on https://github.com/SoccerNet.
Submitted 12 September, 2023;
originally announced September 2023.
-
Information-Theoretic Bounds for Steganography in Multimedia
Authors:
Hassan Y. El Arsh,
Amr Abdelaziz,
Ahmed Elliethy,
Hussein A. Aly,
T. Aaron Gulliver
Abstract:
Steganography in multimedia aims to embed secret data into an innocent looking multimedia cover object. This embedding introduces some distortion to the cover object and produces a corresponding stego object. The embedding distortion is measured by a cost function that determines the detection probability of the existence of the embedded secret data. A cost function related to the maximum embedding rate is typically employed to evaluate a steganographic system. In addition, the distribution of multimedia sources follows the Gibbs distribution which is a complex statistical model that restricts analysis. Thus, previous multimedia steganographic approaches either assume a relaxed distribution or presume a proposition on the maximum embedding rate and then try to prove it is correct. Conversely, this paper introduces an analytic approach to determining the maximum embedding rate in multimedia cover objects through a constrained optimization problem concerning the relationship between the maximum embedding rate and the probability of detection by any steganographic detector. The KL-divergence between the distributions for the cover and stego objects is used as the cost function as it upper bounds the performance of the optimal steganographic detector. An equivalence between the Gibbs and correlated-multivariate-quantized-Gaussian distributions is established to solve this optimization problem. The solution provides an analytic form for the maximum embedding rate in terms of the WrightOmega function. Moreover, it is proven that the maximum embedding rate is in agreement with the commonly used Square Root Law (SRL) for steganography, but the solution presented here is more accurate. Finally, the theoretical results obtained are verified experimentally.
Submitted 15 July, 2022; v1 submitted 10 July, 2022;
originally announced July 2022.
-
Device-Directed Speech Detection: Regularization via Distillation for Weakly-Supervised Models
Authors:
Vineet Garg,
Ognjen Rudovic,
Pranay Dighe,
Ahmed H. Abdelaziz,
Erik Marchi,
Saurabh Adya,
Chandra Dhir,
Ahmed Tewfik
Abstract:
We address the problem of detecting speech directed to a device that does not contain a specific wake-word. Specifically, we focus on audio coming from a touch-based invocation. Mitigating virtual assistant (VA) activation due to accidental button presses is critical for user experience. While the majority of approaches to false trigger mitigation (FTM) are designed to detect the presence of a target keyword, inferring user intent in the absence of a keyword is difficult. This also poses a challenge when creating the training/evaluation data for such systems due to inherent ambiguity in the user's data. To this end, we propose a novel FTM approach that uses weakly-labeled training data obtained with a newly introduced data sampling strategy. While this sampling strategy reduces data annotation efforts, the data labels are noisy as the data are not annotated manually. We use these data to train an acoustics-only model for the FTM task by regularizing its loss function via knowledge distillation from an ASR-based (LatticeRNN) model. This improves the model decisions, resulting in a 66% gain in accuracy, as measured by equal-error-rate (EER), over the base acoustics-only model. We also show that the ensemble of the LatticeRNN and acoustic-distilled models brings a further accuracy improvement of 20%.
Submitted 29 March, 2022;
originally announced March 2022.
-
Information-Theoretic Limits for Steganography in Multimedia
Authors:
Hassan Y. El-Arsh,
Amr Abdelaziz,
Ahmed Elliethy,
Hussein A. Aly
Abstract:
Steganography is the art and science of hiding data within innocent-looking objects (cover objects). Multimedia objects such as images and videos are an attractive type of cover object due to their high embedding rates. There exist many techniques for performing steganography in both the literature and the practical world. Meanwhile, the definition of the steganographic capacity for multimedia, and how it should be calculated, has not received full attention. In this paper, for multivariate quantized-Gaussian-distributed multimedia, we study the maximum achievable embedding rate with respect to the statistical properties of cover objects against the maximum achievable performance by any steganalytic detector. Toward this goal, we evaluate the maximum allowed entropy of the hidden message source subject to the maximum probability of error of the steganalytic detector, which is bounded by the KL-divergence between the statistical distributions for the cover and the stego objects. We give the exact scaling constant that governs the relationship between the entropies of the hidden message and the cover object.
Submitted 9 November, 2021;
originally announced November 2021.
-
MorphGAN: One-Shot Face Synthesis GAN for Detecting Recognition Bias
Authors:
Nataniel Ruiz,
Barry-John Theobald,
Anurag Ranjan,
Ahmed Hussein Abdelaziz,
Nicholas Apostoloff
Abstract:
To detect bias in face recognition networks, it can be useful to probe a network under test using samples in which only specific attributes vary in some controlled way. However, capturing a sufficiently large dataset with specific control over the attributes of interest is difficult. In this work, we describe a simulator that applies specific head pose and facial expression adjustments to images of previously unseen people. The simulator first fits a 3D morphable model to a provided image, applies the desired head pose and facial expression controls, then renders the model into an image. Next, a conditional Generative Adversarial Network (GAN) conditioned on the original image and the rendered morphable model is used to produce the image of the original person with the new facial expression and head pose. We call this conditional GAN MorphGAN. Images generated using MorphGAN conserve the identity of the person in the original image, and the provided control over head pose and facial expression allows test sets to be created to identify robustness issues of a facial recognition deep network with respect to pose and expression. Images generated by MorphGAN can also serve as data augmentation when training data are scarce. We show that augmenting small datasets of faces with new poses and expressions improves the recognition performance by up to 9% depending on the augmentation and data scarcity.
Submitted 10 December, 2020; v1 submitted 9 December, 2020;
originally announced December 2020.
-
Audiovisual Speech Synthesis using Tacotron2
Authors:
Ahmed Hussen Abdelaziz,
Anushree Prasanna Kumar,
Chloe Seivwright,
Gabriele Fanelli,
Justin Binder,
Yannis Stylianou,
Sachin Kajarekar
Abstract:
Audiovisual speech synthesis is the problem of synthesizing a talking face while maximizing the coherency of the acoustic and visual speech. In this paper, we propose and compare two audiovisual speech synthesis systems for 3D face models. The first system is the AVTacotron2, which is an end-to-end text-to-audiovisual speech synthesizer based on the Tacotron2 architecture. AVTacotron2 converts a sequence of phonemes representing the sentence to synthesize into a sequence of acoustic features and the corresponding controllers of a face model. The output acoustic features are used to condition a WaveRNN to reconstruct the speech waveform, and the output facial controllers are used to generate the corresponding video of the talking face. The second audiovisual speech synthesis system is modular, where acoustic speech is synthesized from text using the traditional Tacotron2. The reconstructed acoustic speech signal is then used to drive the facial controls of the face model using an independently trained audio-to-facial-animation neural network. We further condition both the end-to-end and modular approaches on emotion embeddings that encode the required prosody to generate emotional audiovisual speech. We analyze the performance of the two systems and compare them to the ground truth videos using subjective evaluation tests. The end-to-end and modular systems are able to synthesize close to human-like audiovisual speech with mean opinion scores (MOS) of 4.1 and 3.9, respectively, compared to a MOS of 4.1 for the ground truth generated from professionally recorded videos. While the end-to-end system gives a better overall quality, the modular approach is more flexible and the quality of acoustic speech and visual speech synthesis is almost independent of each other.
Submitted 29 August, 2021; v1 submitted 2 August, 2020;
originally announced August 2020.
-
Modality Dropout for Improved Performance-driven Talking Faces
Authors:
Ahmed Hussen Abdelaziz,
Barry-John Theobald,
Paul Dixon,
Reinhard Knothe,
Nicholas Apostoloff,
Sachin Kajareker
Abstract:
We describe our novel deep learning approach for driving animated faces using both acoustic and visual information. In particular, speech-related facial movements are generated using audiovisual information, and non-speech facial movements are generated using only visual information. To ensure that our model exploits both modalities during training, batches are generated that contain audio-only, video-only, and audiovisual input features. The probability of dropping a modality allows control over the degree to which the model exploits audio and visual information during training. Our trained model runs in real-time on resource-limited hardware (e.g., a smartphone), is user agnostic, and is not dependent on a potentially error-prone transcription of the speech. We use subjective testing to demonstrate: 1) the improvement of audiovisual-driven animation over the equivalent video-only approach, and 2) the improvement in the animation of speech-related facial movements after introducing modality dropout. Before introducing dropout, viewers prefer audiovisual-driven animation in 51% of the test sequences compared with only 18% for video-driven. After introducing dropout, viewer preference for audiovisual-driven animation increases to 74%, but decreases to 8% for video-only.
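The modality dropout idea in this abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function name `apply_modality_dropout`, the two probabilities, and the choice to represent a dropped modality by zeroing its features. At most one modality is dropped per example, so each example ends up audio-only, video-only, or audiovisual.

```python
import random


def apply_modality_dropout(audio, video, p_drop_audio, p_drop_video, rng):
    """Return (audio, video) with at most one modality zeroed out.

    Hypothetical helper: zeroing is only one possible way to "drop" a
    modality; the paper's actual mechanism is not stated in the abstract.
    """
    r = rng.random()  # uniform draw in [0, 1)
    if r < p_drop_audio:
        # Video-only example: the audio features are replaced by zeros.
        audio = [0.0] * len(audio)
    elif r < p_drop_audio + p_drop_video:
        # Audio-only example: the video features are replaced by zeros.
        video = [0.0] * len(video)
    # Otherwise both modalities are kept (audiovisual example).
    return audio, video


if __name__ == "__main__":
    rng = random.Random(0)
    audio_features = [0.2, 0.5, 0.1]
    video_features = [0.9, 0.4]
    for _ in range(5):
        print(apply_modality_dropout(audio_features, video_features, 0.3, 0.3, rng))
```

Raising `p_drop_audio` exposes the model to more video-only examples, and raising `p_drop_video` to more audio-only ones, which mirrors the abstract's point that the drop probability controls how much each modality is exploited during training.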
Submitted 27 May, 2020;
originally announced May 2020.
-
On the Role of Visual Cues in Audiovisual Speech Enhancement
Authors:
Zakaria Aldeneh,
Anushree Prasanna Kumar,
Barry-John Theobald,
Erik Marchi,
Sachin Kajarekar,
Devang Naik,
Ahmed Hussen Abdelaziz
Abstract:
We present an introspection of an audiovisual speech enhancement model. In particular, we focus on interpreting how a neural audiovisual speech enhancement model uses visual cues to improve the quality of the target speech signal. We show that visual cues provide not only high-level information about speech activity, i.e., speech/silence, but also fine-grained visual information about the place of articulation. One byproduct of this finding is that the learned visual embeddings can be used as features for other visual speech applications. We demonstrate the effectiveness of the learned visual embeddings for classifying visemes (the visual analogy to phonemes). Our results provide insight into important aspects of audiovisual speech enhancement and demonstrate how such models can be used for self-supervision tasks for visual speech applications.
Submitted 25 February, 2021; v1 submitted 24 April, 2020;
originally announced April 2020.
-
On Neural Phone Recognition of Mixed-Source ECoG Signals
Authors:
Ahmed Hussen Abdelaziz,
Shuo-Yiin Chang,
Nelson Morgan,
Erik Edwards,
Dorothea Kolossa,
Dan Ellis,
David A. Moses,
Edward F. Chang
Abstract:
The emerging field of neural speech recognition (NSR) using electrocorticography has recently attracted remarkable research interest for studying how human brains recognize speech in quiet and noisy surroundings. In this study, we demonstrate the utility of NSR systems to objectively prove the ability of human beings to attend to a single speech source while suppressing the interfering signals in a simulated cocktail party scenario. The experimental results show that the relative degradation of the NSR system performance when tested in a mixed-source scenario is significantly lower than that of automatic speech recognition (ASR). In this paper, we have significantly enhanced the performance of our recently published framework by using manual alignments for initialization instead of the flat start technique. We have also improved the NSR system performance by accounting for the possible transcription mismatch between the acoustic and neural signals.
Submitted 12 December, 2019;
originally announced December 2019.
-
Achieving Positive Covert Capacity over MIMO AWGN Channels
Authors:
Ahmed Bendary,
Amr Abdelaziz,
C. Emre Koksal
Abstract:
We consider covert communication, i.e., hiding the presence of communication from an adversary, for multiple-input multiple-output (MIMO) additive white Gaussian noise (AWGN) channels. We characterize the maximum covert coding rate under a variety of settings, including different regimes where either the number of transmit antennas or the blocklength is scaled up. We show that a non-zero covert capacity can be achieved in the massive MIMO regime, in which the number of transmit antennas scales up, but only under specific conditions. Under such conditions, we show that the covert capacity of MIMO AWGN channels converges to the capacity of MIMO AWGN channels. Furthermore, we derive the order-optimal scaling of the number of covert bits in the regime where the covert capacity is zero. We provide an insightful comparative analysis of different cases in which secrecy and energy-undetectability constraints are imposed separately or jointly.
Submitted 21 January, 2021; v1 submitted 29 October, 2019;
originally announced October 2019.
-
Trans-Sense: Real Time Transportation Schedule Estimation Using Smart Phones
Authors:
Ali AbdelAziz,
Amin Shoukry,
Walid Gomaa,
Moustafa Youssef
Abstract:
Developing countries suffer from traffic congestion, poorly planned road/rail networks, and lack of access to public transportation facilities. This context results in an increase in fuel consumption, pollution level, monetary losses, massive delays, and less productivity. It also has a negative impact on commuters' feelings and moods. Availability of real-time transit information, by providing public transportation vehicle locations using GPS devices, helps in estimating a passenger's waiting time and addressing the above issues. However, such a solution is expensive for developing countries. This paper aims at designing and implementing a crowd-sourced, mobile-phone-based solution to estimate the expected waiting time of a passenger in public transit systems, predict the remaining time to get on/off a vehicle, and construct a real-time public transit schedule. Trans-Sense has been evaluated using real data collected for over 800 hours, on a daily basis, by different Android phones, and using different light rail transit lines at different time spans. The results show that Trans-Sense can achieve an average recall and precision of 95.35% and 90.1%, respectively, in discriminating light rail stations. Moreover, the empirical distributions governing the different time delays affecting a passenger's total trip time enable predicting the right time of arrival of a passenger to her destination with an accuracy of 91.81%. In addition, the system estimates the stations' dimensions with an accuracy of 95.71%.
Submitted 13 June, 2019;
originally announced June 2019.
-
Speaker-Independent Speech-Driven Visual Speech Synthesis using Domain-Adapted Acoustic Models
Authors:
Ahmed Hussen Abdelaziz,
Barry-John Theobald,
Justin Binder,
Gabriele Fanelli,
Paul Dixon,
Nicholas Apostoloff,
Thibaut Weise,
Sachin Kajareker
Abstract:
Speech-driven visual speech synthesis involves mapping features extracted from acoustic speech to the corresponding lip animation controls for a face model. This mapping can take many forms, but a powerful approach is to use deep neural networks (DNNs). However, a limitation is the lack of synchronized audio, video, and depth data required to reliably train the DNNs, especially for speaker-independent models. In this paper, we investigate adapting an automatic speech recognition (ASR) acoustic model (AM) for the visual speech synthesis problem. We train the AM on ten thousand hours of audio-only data. The AM is then adapted to the visual speech synthesis domain using ninety hours of synchronized audio-visual speech. Using a subjective assessment test, we compared the performance of the AM-initialized DNN to one with a random initialization. The results show that viewers significantly prefer animations generated from the AM-initialized DNN over the ones generated using the randomly initialized model. We conclude that visual speech synthesis can significantly benefit from the powerful representation of speech in the ASR acoustic models.
Submitted 14 May, 2019;
originally announced May 2019.
-
MIMO with Energy Recycling
Authors:
Y. Ozan Basciftci,
Amr Abdelaziz,
C. Emre Koksal
Abstract:
We consider a Multiple Input Single Output (MISO) point-to-point communication system in which the transmitter is designed such that each antenna can transmit information or harvest energy at any given point in time. We evaluate the rate achievable by such an energy-recycling MISO system under an average transmission power constraint. Our achievable scheme carefully switches the mode of the antennas between transmission and wireless harvesting, where most of the harvesting happens from the neighboring antennas' transmissions, i.e., recycling. We show that, with recycling, it is possible to exceed the capacity of the classical non-harvesting counterpart. As the complexity of the achievable algorithm is exponential in the number of antennas, we also provide an almost linear algorithm that has a minimal degradation in achievable rate. To address the major questions on the capability of recycling and the impacts of antenna coupling, we also develop a hardware setup and present experimental results for a 4-antenna transmitter, based on a uniform linear array (ULA). We demonstrate that the loss in the rate due to antenna coupling can be made negligible with sufficient antenna spacing and provide hardware measurements for the power recycled from the transmitting antennas and the power received at the target receiver, taken simultaneously. We provide refined performance measurement results, based on our actual measurements.
Submitted 20 March, 2018; v1 submitted 4 February, 2018;
originally announced February 2018.
-
Fundamental Limits of Covert Communication over MIMO AWGN Channel
Authors:
Amr Abdelaziz,
C. Emre Koksal
Abstract:
Fundamental limits of covert communication have been studied in the literature for different models of scalar channels. It was shown that, over $n$ independent channel uses, $\mathcal{O}(\sqrt{n})$ bits can be transmitted reliably over a public channel while achieving an arbitrarily low probability of detection (LPD) by other stations. This result is well known as the square-root law (SRL) and, even to achieve this diminishing rate of covert communication, some form of shared secret is needed between the transmitter and the receiver. In this paper, we establish the limits of LPD communication over the MIMO AWGN channel. We define the notion of $ε$-probability of detection ($ε$-PD) and provide a formulation to evaluate the maximum achievable rate under the $ε$-PD constraint. We first show that the capacity-achieving input distribution is the zero-mean Gaussian distribution. Then, assuming channel state information (CSI) on only the main channel at the transmitter, we derive the optimal input covariance matrix, hence establishing the $ε$-PD capacity. We evaluate $ε$-PD rates in the limiting regimes for the number of channel uses (asymptotic block length) and the number of antennas (massive MIMO). We show that, in the asymptotic block-length regime, while the SRL still holds for the MIMO AWGN channel, the number of bits that can be transmitted covertly scales exponentially with the number of transmitting antennas. Further, we derive the $ε$-PD capacity with no shared secret. For that scenario, in the massive MIMO limit, a higher covert rate, up to the non-LPD-constrained capacity, can still be achieved, yet with much slower scaling compared to the scenario with a shared secret. The practical implication of our result is that MIMO has the potential to provide a substantial increase in the file sizes that can be covertly communicated subject to a reasonably low delay.
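To make the "diminishing rate" of the square-root law concrete, here is a minimal numeric sketch. It assumes a covert payload of exactly `c * sqrt(n)` bits, where the constant `c` and the function names are hypothetical placeholders; the abstract states only the $\mathcal{O}(\sqrt{n})$ scaling, and nothing here models the MIMO results.

```python
import math


def covert_bits(n: int, c: float = 1.0) -> float:
    """Covert payload in bits over n channel uses, assuming c * sqrt(n).

    The constant c is a hypothetical placeholder, not a value from the paper.
    """
    return c * math.sqrt(n)


def covert_rate(n: int, c: float = 1.0) -> float:
    """Covert bits per channel use, c / sqrt(n), which shrinks as n grows."""
    return covert_bits(n, c) / n


# The total payload grows with n while the per-use rate tends to zero.
for n in (10**2, 10**4, 10**6):
    print(n, covert_bits(n), covert_rate(n))
```

The total number of covert bits grows without bound, but the rate per channel use decays like $1/\sqrt{n}$, which is why the covert capacity in the usual bits-per-use sense is zero under this law.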
Submitted 13 March, 2018; v1 submitted 5 May, 2017;
originally announced May 2017.
-
On The Compound MIMO Wiretap Channel with Mean Feedback
Authors:
Amr Abdelaziz,
C. Emre Koksal,
Hesham El Gamal,
Ashraf D. Elbayoumy
Abstract:
The compound MIMO wiretap channel with double-sided uncertainty is considered under the channel mean information model. In the mean information model, channel variations are centered around the channel's mean value, which is fed back to the transmitter. We show that the worst-case main channel is anti-parallel to the channel mean information, resulting in an overall unit-rank channel. Further, the worst eavesdropper channel is shown to be isotropic around its mean information. Accordingly, we provide the capacity-achieving beamforming direction. We show that the saddle point property holds under the mean information model, and thus the compound secrecy capacity equals the worst-case capacity over the class of uncertainty. Moreover, the capacity-achieving beamforming direction is found to require matrix inversion; thus, we derive null steering (NS) beamforming as an alternative suboptimal solution that does not require matrix inversion. The NS beamformer is in the direction orthogonal to the eavesdropper mean channel that maintains the maximum possible gain in the mean main channel direction. Extensive computer simulation reveals that NS performs very close to the optimal solution. It also verifies that NS beamforming outperforms both maximum ratio transmission (MRT) and zero forcing (ZF) beamforming approaches over the entire SNR range. Finally, an equivalence relation with the MIMO wiretap channel in a Rician fading environment is established.
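The null steering direction described above can be sketched as a projection. This is a simplified illustration under stated assumptions, not the paper's derivation: both the main and eavesdropper mean channels are taken to be plain complex vectors (a single receive antenna each), the received signal over a channel `h` is assumed to be `h^H w`, and the function names are hypothetical.

```python
import math


def inner(a, b):
    """Hermitian inner product a^H b for sequences of complex numbers."""
    return sum(complex(x).conjugate() * complex(y) for x, y in zip(a, b))


def null_steering_beamformer(h_main, h_eve):
    """Unit-norm beamformer orthogonal to h_eve with maximum gain toward h_main.

    Computes (I - g g^H / ||g||^2) h with g = h_eve and h = h_main, then
    normalizes. Only inner products are needed, so no matrix is inverted.
    """
    eve_power = inner(h_eve, h_eve).real
    if eve_power == 0.0:
        raise ValueError("eavesdropper mean channel must be non-zero")
    coeff = inner(h_eve, h_main) / eve_power
    # Remove the component of h_main that lies along h_eve.
    projected = [complex(hm) - coeff * complex(he) for hm, he in zip(h_main, h_eve)]
    norm = math.sqrt(inner(projected, projected).real)
    if norm < 1e-12:
        raise ValueError("main and eavesdropper mean channels are collinear")
    return [p / norm for p in projected]


# Hypothetical 4-antenna mean channels, chosen only for illustration.
h = [1 + 1j, 2 + 0j, 0.5j, -1 + 0j]
g = [1 + 0j, 1j, 0j, 0j]
w = null_steering_beamformer(h, g)
leakage = abs(inner(g, w))  # gain toward the eavesdropper mean, close to zero
gain = abs(inner(h, w))     # gain toward the main mean channel
```

Among all unit-norm vectors orthogonal to `g`, the normalized projection of `h` is the one with the largest inner product with `h`, which matches the "maximum possible gain" wording in the abstract for this vector case.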
Submitted 3 May, 2017; v1 submitted 25 January, 2017;
originally announced January 2017.
-
Message Authentication and Secret Key Agreement in VANETs via Angle of Arrival
Authors:
Amr Abdelaziz,
Ron Burton,
C. Emre Koksal
Abstract:
In the scope of VANETs, exchanged safety/warning messages are highly location dependent, as they are usually used for incident reporting. Thus, vehicles are required to periodically exchange beacon messages that include speed, time and GPS location information. In this paper, we present a physical layer assisted message authentication scheme that uses Angle of Arrival (AoA) estimation to verify the message originator's location based on the claimed location information. Within the considered vehicular communication settings, fundamental limits of AoA estimation are developed in terms of its Cramér-Rao Bound (CRB) and the existence of an efficient estimator. The problem of deciding whether the received signal originated from the claimed GPS location is formulated as a two-sided hypothesis testing problem whose solution is given by the Wald test statistic. Moreover, we use the correct decision, $P_D$, and false alarm, $P_F$, probabilities as quantitative performance measures. The observation posterior likelihood function is shown to satisfy the regularity conditions necessary for asymptotic normality of the ML-AoA estimator. Thus, we give $P_D$ and $P_F$ in closed form.
We extend the potential of the physical layer contribution in security to provide a physical layer assisted secret key agreement (SKA) protocol: a public key (PK) based SKA in which communicating vehicles are required to validate their respective physical locations. We show that the risk of the Man in the Middle attack, which is common in PK-SKA protocols without a trusted third party, is waived up to the literal meaning of the word "middle".
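The two-sided test on the claimed location can be sketched as a generic scalar Wald test. This is a minimal illustration, not the paper's closed-form result: it assumes the AoA estimate is approximately normal around the true angle with a known variance (in the paper's setting this would relate to the CRB, here it is a hypothetical input), it ignores angle wrap-around, and the function name and `alpha` parameter are assumptions.

```python
from statistics import NormalDist


def wald_aoa_check(theta_hat, theta_claimed, variance, alpha=0.01):
    """Two-sided Wald test of H0: the true AoA equals the claimed AoA.

    theta_hat     -- estimated angle of arrival (radians)
    theta_claimed -- angle implied by the claimed GPS location (radians)
    variance      -- assumed variance of the estimator (hypothetical input)
    alpha         -- significance level, i.e. probability of rejecting a true claim
    Returns (statistic, threshold, accepted).
    """
    if variance <= 0:
        raise ValueError("variance must be positive")
    statistic = (theta_hat - theta_claimed) ** 2 / variance
    # Under H0 the statistic is approximately chi-square with 1 degree of
    # freedom; its (1 - alpha) quantile is the squared standard-normal
    # (1 - alpha / 2) quantile.
    threshold = NormalDist().inv_cdf(1.0 - alpha / 2.0) ** 2
    return statistic, threshold, statistic <= threshold


# A small mismatch relative to the estimator spread is accepted ...
stat_ok, thr, accepted_ok = wald_aoa_check(0.50, 0.52, variance=0.0004)
# ... while a large mismatch is rejected as a likely false location claim.
stat_bad, _, accepted_bad = wald_aoa_check(0.50, 0.70, variance=0.0004)
```

Lowering `alpha` raises the threshold, which trades fewer rejections of honest vehicles for a weaker ability to detect false location claims.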
Submitted 10 September, 2016;
originally announced September 2016.
-
On The Security of AoA Estimation
Authors:
Amr Abdelaziz,
C. Emre Koksal,
Hesham El Gamal
Abstract:
Angle of Arrival (AoA) estimation has found its way into a wide range of applications. Much attention has been paid to studying different techniques for AoA estimation and its applications for jamming suppression; however, the security vulnerabilities of AoA estimation itself under hostile activity have not received the same attention. In this paper, the problem of AoA estimation in a Rician flat fading channel under jamming is investigated. We consider the scenario in which a receiver with multiple antennas is trying to estimate the AoA of the specular line of sight (LOS) component of the signal received from a given single-antenna transmitter using a predefined training sequence. A jammer equipped with multiple antennas is trying to interrupt the AoA estimation phase by sending an arbitrary signal. We derive the optimal jammer and receiver strategies in various scenarios based on the knowledge of the opponent's strategies and the available information about the communication channel. In all scenarios, we derive the optimal jammer signal design as well as its optimal power allocation policy. The results show the optimality of the training-based Maximum Likelihood (ML) AoA estimator in the case of a randomly generated jamming signal. We also show that the optimal jammer strategy is to emit a signal identical to the predefined training sequence, turning the estimation process into a highest-power competition in which the detected AoA is that of the transmitting entity with higher power. The obtained results are supported by the provided computer simulation.
Submitted 2 July, 2016;
originally announced July 2016.
-
The Diversity and Scale Matter: Ubiquitous Transportation Mode Detection using Single Cell Tower Information
Authors:
Ali Mohamed AbdelAziz,
Moustafa Youssef
Abstract:
Detecting the transportation mode of a user is important for a wide range of applications. While a number of recent systems addressed the transportation mode detection problem using the ubiquitous mobile phones, these studies leverage GPS, inertial sensors, and/or multiple cell tower information. However, these different phone sensors have high energy consumption, are limited to a small subset of phones (e.g. high-end phones or phones that support neighbouring cell tower information), cannot work in certain areas (e.g. inside tunnels for GPS), and/or work only from the user side.
In this paper, we present a transportation mode detection system, MonoSense, that leverages the phone's serving cell information only. The basic idea is that the phone's speed can be correlated with features extracted from both the serving cell tower ID and the received signal strength (RSS) from it. To achieve high detection accuracy with this limited information, MonoSense leverages diversity along multiple axes to extract novel features. Specifically, MonoSense extracts features from both the time and frequency domain information available from the serving cell tower over different sliding window sizes. More importantly, we also show that the logarithmic and linear RSS scales can provide different information about the movement of a phone, further enriching the feature space and leading to higher accuracy.
Evaluation of MonoSense using 135 hours of cellular traces covering 485 km and collected by four users using different Android phones shows that it can achieve an average precision and recall of 89.26% and 89.84%, respectively, in differentiating between the stationary, walking, and driving modes using only the serving cell tower information, highlighting MonoSense's ability to enable a wide set of intelligent transportation applications.
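The idea of extracting time-domain and frequency-domain features over several sliding window sizes, in both logarithmic (dBm) and linear (mW) RSS scales, can be sketched as follows. This is an illustrative sketch only: the specific features (mean, variance, number of serving-cell changes, strongest DFT bin), the window sizes, and all function names are assumptions, not MonoSense's actual feature set.

```python
import cmath
import math
from statistics import mean, pvariance


def dbm_to_mw(rss_dbm):
    """Convert a logarithmic RSS reading (dBm) to the linear scale (mW)."""
    return 10 ** (rss_dbm / 10)


def dominant_frequency_bin(samples):
    """Index of the strongest non-DC bin of a naive DFT (0 if the signal is flat)."""
    n = len(samples)
    avg = mean(samples)
    centered = [s - avg for s in samples]
    best_bin, best_mag = 0, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        if abs(coeff) > best_mag:
            best_bin, best_mag = k, abs(coeff)
    return best_bin


def window_features(cell_ids, rss_dbm):
    """Hypothetical feature vector for one window of serving-cell samples."""
    rss_mw = [dbm_to_mw(r) for r in rss_dbm]
    return {
        "cell_changes": sum(a != b for a, b in zip(cell_ids, cell_ids[1:])),
        "rss_dbm_mean": mean(rss_dbm),
        "rss_dbm_var": pvariance(rss_dbm),
        "rss_mw_mean": mean(rss_mw),
        "rss_mw_var": pvariance(rss_mw),
        "rss_dbm_peak_bin": dominant_frequency_bin(rss_dbm),
    }


def sliding_features(cell_ids, rss_dbm, window_sizes=(4, 8)):
    """Features for every window position at each (assumed) window size."""
    rows = []
    for size in window_sizes:
        for start in range(len(rss_dbm) - size + 1):
            stop = start + size
            rows.append((size, start,
                         window_features(cell_ids[start:stop], rss_dbm[start:stop])))
    return rows


# Hypothetical trace: serving cell ID and RSS (dBm) sampled at a fixed rate.
ids = [1, 1, 2, 2, 3, 3, 3, 3]
rss = [-80.0, -70.0, -80.0, -70.0, -80.0, -70.0, -80.0, -70.0]
rows = sliding_features(ids, rss)
```

Because the dBm-to-mW conversion is nonlinear, the variance in the linear scale is dominated by the strongest readings, while the dBm variance weights all fluctuations more evenly, which is one way the two scales can carry different information.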
Submitted 5 February, 2015;
originally announced February 2015.