Face-voice Association in Multilingual Environments (FAME) Challenge 2024 Evaluation Plan

Authors: Muhammad Saad Saeed, Shah Nawaz, Muhammad Salman Tahir, Rohan Kumar Das, Muhammad Zaigham Zaheer, Marta Moscati, Markus Schedl, Muhammad Haris Khan, Karthik Nandakumar, Muhammad Haroon Yousaf

Abstract: The advancements of technology have led to the use of multimodal systems in various real-world applications. Among them, audio-visual systems are among the most widely used. In recent years, associating the face and voice of a person has gained attention due to the presence of a unique correlation between them. The Face-voice Association in Multilingual Environments (FAME) Challenge 2024 focuses on exploring face-voice association under the unique condition of a multilingual scenario. This condition is inspired by the fact that half of the world's population is bilingual and people most often communicate in multilingual scenarios. The challenge uses the Multilingual Audio-Visual (MAV-Celeb) dataset for exploring face-voice association in multilingual environments. This report provides the details of the challenge, dataset, baselines, and tasks for the FAME Challenge.

Submitted 16 April, 2024; v1 submitted 14 April, 2024; originally announced April 2024.

Comments: ACM Multimedia Conference - Grand Challenge

arXiv:2309.17426 [pdf]

Classification of Potholes Based on Surface Area Using Pre-Trained Models of Convolutional Neural Network

Authors: Chauhdary Fazeel Ahmad, Abdullah Cheema, Waqas Qayyum, Rana Ehtisham, Muhammad Haroon Yousaf, Junaid Mir, Nasim Shakouri Mahmoudabadi, Afaq Ahmad

Abstract: Potholes can be fatal: they can cause severe damage to vehicles as well as deadly accidents. In South Asian countries, pavement distresses are the primary cause due to poor subgrade conditions, lack of subsurface drainage, and excessive rainfalls. The present research compares the performance of three pre-trained Convolutional Neural Network (CNN) models, i.e., ResNet 50, ResNet 18, and MobileNet. First, pavement images are classified to find whether they contain potholes, i.e., Potholes or Normal. Second, pavement images are classified into three categories, i.e., Small Pothole, Large Pothole, and Normal. Pavement images are taken from 3.5 feet (waist height) and 2 feet. MobileNet v2 has an accuracy of 98% for detecting a pothole. The classification of images taken at the height of 2 feet has an accuracy value of 87.33%, 88.67%, and 92% for classifying the large, small, and normal pavement, respectively. Similarly, the classification of the images taken from full of waist (FFW) height has an accuracy value of 98.67%, 98.67%, and 100%.

Submitted 29 September, 2023; originally announced September 2023.

Comments: 24 Pages, 26 Figures

arXiv:2303.06129 [pdf, other]

Single-branch Network for Multimodal Training

Authors: Muhammad Saad Saeed, Shah Nawaz, Muhammad Haris Khan, Muhammad Zaigham Zaheer, Karthik Nandakumar, Muhammad Haroon Yousaf, Arif Mahmood

Abstract: With the rapid growth of social media platforms, users are sharing billions of multimedia posts containing audio, images, and text. Researchers have focused on building autonomous systems capable of processing such multimedia data to solve challenging multimodal tasks including cross-modal retrieval, matching, and verification. Existing works use separate networks to extract embeddings of each modality to bridge the gap between them. The modular structure of their branched networks is fundamental in creating numerous multimodal applications and has become a de facto standard to handle multiple modalities. In contrast, we propose a novel single-branch network capable of learning discriminative representation of unimodal as well as multimodal tasks without changing the network. An important feature of our single-branch network is that it can be trained either using single or multiple modalities without sacrificing performance. We evaluated our proposed single-branch network on the challenging multimodal problem (face-voice association) for cross-modal verification and matching tasks with various loss formulations. Experimental results demonstrate the superiority of our proposed single-branch network over the existing methods in a wide range of experiments. Code: https://github.com/msaadsaeed/SBNet

Submitted 10 March, 2023; originally announced March 2023.

Comments: Accepted at ICASSP 2023

arXiv:2302.13033 [pdf, other]

Speaker Recognition in Realistic Scenario Using Multimodal Data

Authors: Saqlain Hussain Shah, Muhammad Saad Saeed, Shah Nawaz, Muhammad Haroon Yousaf

Abstract: In recent years, an association has been established between faces and voices of celebrities leveraging large scale audio-visual information from YouTube. The availability of large scale audio-visual datasets is instrumental in developing speaker recognition methods based on standard Convolutional Neural Networks. Thus, the aim of this paper is to leverage large scale audio-visual information to improve the speaker recognition task. To achieve this, we proposed a two-branch network to learn joint representations of faces and voices in a multimodal system. Afterwards, features are extracted from the two-branch network to train a classifier for speaker recognition. We evaluated our proposed framework on a large scale audio-visual dataset named VoxCeleb1. Our results show that addition of facial information improved the performance of speaker recognition. Moreover, our results indicate that there is an overlap between face and voice.

Submitted 25 February, 2023; originally announced February 2023.

Comments: Accepted at the International Conference on Artificial Intelligence (ICAI'2023)

arXiv:2208.10238 [pdf, other]

Learning Branched Fusion and Orthogonal Projection for Face-Voice Association

Authors: Muhammad Saad Saeed, Shah Nawaz, Muhammad Haris Khan, Sajid Javed, Muhammad Haroon Yousaf, Alessio Del Bue

Abstract: Recent years have seen an increased interest in establishing association between faces and voices of celebrities leveraging audio-visual information from YouTube. Prior works adopt metric learning methods to learn an embedding space that is amenable for associated matching and verification tasks. Albeit showing some progress, such formulations are, however, restrictive due to dependency on distance-dependent margin parameter, poor run-time training complexity, and reliance on carefully crafted negative mining procedures. In this work, we hypothesize that an enriched representation coupled with an effective yet efficient supervision is important towards realizing a discriminative joint embedding space for face-voice association tasks. To this end, we propose a light-weight, plug-and-play mechanism that exploits the complementary cues in both modalities to form enriched fused embeddings and clusters them based on their identity labels via orthogonality constraints. We coin our proposed mechanism as fusion and orthogonal projection (FOP) and instantiate it in a two-stream network. The overall resulting framework is evaluated on VoxCeleb1 and MAV-Celeb datasets with a multitude of tasks, including cross-modal verification and matching. Results reveal that our method performs favourably against the current state-of-the-art methods and our proposed formulation of supervision is more effective and efficient than the ones employed by the contemporary methods. In addition, we leverage cross-modal verification and matching tasks to analyze the impact of multiple languages on face-voice association. Code is available: https://github.com/msaadsaeed/FOP

Submitted 22 August, 2022; originally announced August 2022.

Comments: Submitted: IEEE Transactions on Multimedia. arXiv admin note: substantial text overlap with arXiv:2112.10483

arXiv:2112.10483 [pdf, other]

Fusion and Orthogonal Projection for Improved Face-Voice Association

Authors: Muhammad Saad Saeed, Muhammad Haris Khan, Shah Nawaz, Muhammad Haroon Yousaf, Alessio Del Bue

Abstract: We study the problem of learning association between face and voice, which is gaining interest in the computer vision community lately. Prior works adopt pairwise or triplet loss formulations to learn an embedding space amenable for associated matching and verification tasks. Albeit showing some progress, such loss formulations are, however, restrictive due to dependency on distance-dependent margin parameter, poor run-time training complexity, and reliance on carefully crafted negative mining procedures. In this work, we hypothesize that enriched feature representation coupled with an effective yet efficient supervision is necessary in realizing a discriminative joint embedding space for improved face-voice association. To this end, we propose a light-weight, plug-and-play mechanism that exploits the complementary cues in both modalities to form enriched fused embeddings and clusters them based on their identity labels via orthogonality constraints. We coin our proposed mechanism as fusion and orthogonal projection (FOP) and instantiate in a two-stream pipeline. The overall resulting framework is evaluated on a large-scale VoxCeleb dataset with a multitude of tasks, including cross-modal verification and matching. Results show that our method performs favourably against the current state-of-the-art methods and our proposed supervision formulation is more effective and efficient than the ones employed by the contemporary methods.

Submitted 20 December, 2021; originally announced December 2021.

arXiv:2004.13780 [pdf, other]

Cross-modal Speaker Verification and Recognition: A Multilingual Perspective

Authors: Muhammad Saad Saeed, Shah Nawaz, Pietro Morerio, Arif Mahmood, Ignazio Gallo, Muhammad Haroon Yousaf, Alessio Del Bue

Abstract: Recent years have seen a surge in finding association between faces and voices within a cross-modal biometric application along with speaker recognition. Inspired by this, we introduce a challenging task in establishing association between faces and voices across multiple languages spoken by the same set of persons. The aim of this paper is to answer two closely related questions: "Is face-voice association language independent?" and "Can a speaker be recognised irrespective of the spoken language?". These two questions are very important to understand effectiveness and to boost development of multilingual biometric systems. To answer them, we collected a Multilingual Audio-Visual dataset, containing human speech clips of 154 identities with 3 language annotations extracted from various videos uploaded online. Extensive experiments on the three splits of the proposed dataset have been performed to investigate and answer these novel research questions that clearly point out the relevance of the multilingual problem.

Submitted 22 April, 2021; v1 submitted 28 April, 2020; originally announced April 2020.

Comments: Accepted: CVPRW

arXiv:1809.09617 [pdf, other]

UAV-Empowered Disaster-Resilient Edge Architecture for Delay-Sensitive Communication

Authors: Zeeshan Kaleem, Muhammad Yousaf, Aamir Qamar, Ayaz Ahmad, Trung Q. Duong, Wan Choi, Abbas Jamalipour

Abstract: The fifth-generation (5G) communication systems will enable enhanced mobile broadband, ultra-reliable low latency, and massive connectivity services. The broadband and low-latency services are indispensable to public safety (PS) communication during natural or man-made disasters. Recently, the third generation partnership project long term evolution (3GPPLTE) has emerged as a promising candidate to enable broadband PS communications. In this article, first we present six major PS-LTE enabling services and the current status of PS-LTE in 3GPP releases. Then, we discuss the spectrum bands allocated for PS-LTE in major countries by international telecommunication union (ITU). Finally, we propose a disaster resilient three-layered architecture for PS-LTE (DR-PSLTE). This architecture consists of a software-defined network (SDN) layer to provide centralized control, an unmanned air vehicle (UAV) cloudlet layer to facilitate edge computing or to enable emergency communication link, and a radio access layer. The proposed architecture is flexible and combines the benefits of SDNs and edge computing to efficiently meet the delay requirements of various PS-LTE services. Numerical results verified that under the proposed DR-PSLTE architecture, delay is reduced by 20% as compared with the conventional centralized computing architecture.

Submitted 28 January, 2019; v1 submitted 26 September, 2018; originally announced September 2018.

Comments: 9,5

arXiv:1108.3708 [pdf, ps, other]

doi 10.1109/ISWTA.2011.6089558

Evaluating Impact of Mobility on Wireless Routing Protocols

Authors: N. Javaid, M. Yousaf, A. Ahmad, A. Naveed, K. Djouani

Abstract: In this paper, we evaluate, analyze, and compare the impact of mobility on the behavior of three reactive protocols (AODV, DSR, DYMO) and three proactive protocols (DSDV, FSR, OLSR) in multi-hop wireless networks. We take into account throughput, end-to-end delay, and normalized routing load as performance parameters. Based upon the extensive simulation results in NS-2, we rank all six protocols according to the performance parameters. Besides providing interesting facts regarding the response of each protocol to varying mobilities and speeds, we also study the trade-offs the routing protocols have to make. For example, to achieve throughput, a protocol has to pay some cost in the form of increased end-to-end delay or routing overhead.

Submitted 18 August, 2011; originally announced August 2011.

Journal ref: IEEE Symposium on Wireless Telecommunications Applications (ISWTA) 2011

Showing 1–9 of 9 results for author: Yousaf, M