-
Face-voice Association in Multilingual Environments (FAME) Challenge 2024 Evaluation Plan
Authors:
Muhammad Saad Saeed,
Shah Nawaz,
Muhammad Salman Tahir,
Rohan Kumar Das,
Muhammad Zaigham Zaheer,
Marta Moscati,
Markus Schedl,
Muhammad Haris Khan,
Karthik Nandakumar,
Muhammad Haroon Yousaf
Abstract:
Advances in technology have led to the use of multimodal systems in various real-world applications. Among them, audio-visual systems are some of the most widely used. In recent years, associating the face and voice of a person has gained attention due to the presence of a unique correlation between them. The Face-voice Association in Multilingual Environments (FAME) Challenge 2024 focuses on exploring face-voice association under the unique condition of a multilingual scenario. This condition is inspired by the fact that half of the world's population is bilingual and people most often communicate in multilingual scenarios. The challenge uses the Multilingual Audio-Visual (MAV-Celeb) dataset for exploring face-voice association in multilingual environments. This report provides the details of the challenge, dataset, baselines, and tasks for the FAME Challenge.
Submitted 16 April, 2024; v1 submitted 14 April, 2024;
originally announced April 2024.
-
Single-branch Network for Multimodal Training
Authors:
Muhammad Saad Saeed,
Shah Nawaz,
Muhammad Haris Khan,
Muhammad Zaigham Zaheer,
Karthik Nandakumar,
Muhammad Haroon Yousaf,
Arif Mahmood
Abstract:
With the rapid growth of social media platforms, users are sharing billions of multimedia posts containing audio, images, and text. Researchers have focused on building autonomous systems capable of processing such multimedia data to solve challenging multimodal tasks including cross-modal retrieval, matching, and verification. Existing works use separate networks to extract embeddings of each modality to bridge the gap between them. The modular structure of their branched networks is fundamental in creating numerous multimodal applications and has become a de facto standard for handling multiple modalities. In contrast, we propose a novel single-branch network capable of learning discriminative representations for unimodal as well as multimodal tasks without changing the network. An important feature of our single-branch network is that it can be trained using either single or multiple modalities without sacrificing performance. We evaluated our proposed single-branch network on a challenging multimodal problem (face-voice association) for cross-modal verification and matching tasks with various loss formulations. Experimental results demonstrate the superiority of our proposed single-branch network over existing methods in a wide range of experiments. Code: https://github.com/msaadsaeed/SBNet
Submitted 10 March, 2023;
originally announced March 2023.
-
Speaker Recognition in Realistic Scenario Using Multimodal Data
Authors:
Saqlain Hussain Shah,
Muhammad Saad Saeed,
Shah Nawaz,
Muhammad Haroon Yousaf
Abstract:
In recent years, an association has been established between the faces and voices of celebrities by leveraging large-scale audio-visual information from YouTube. The availability of large-scale audio-visual datasets is instrumental in developing speaker recognition methods based on standard Convolutional Neural Networks. Thus, the aim of this paper is to leverage large-scale audio-visual information to improve the speaker recognition task. To this end, we propose a two-branch network to learn joint representations of faces and voices in a multimodal system. Afterwards, features are extracted from the two-branch network to train a classifier for speaker recognition. We evaluated our proposed framework on a large-scale audio-visual dataset named VoxCeleb1. Our results show that the addition of facial information improved the performance of speaker recognition. Moreover, our results indicate that there is an overlap between face and voice.
Submitted 25 February, 2023;
originally announced February 2023.
-
Learning Branched Fusion and Orthogonal Projection for Face-Voice Association
Authors:
Muhammad Saad Saeed,
Shah Nawaz,
Muhammad Haris Khan,
Sajid Javed,
Muhammad Haroon Yousaf,
Alessio Del Bue
Abstract:
Recent years have seen an increased interest in establishing association between faces and voices of celebrities by leveraging audio-visual information from YouTube. Prior works adopt metric learning methods to learn an embedding space that is amenable to the associated matching and verification tasks. Although they show some progress, such formulations are restrictive due to their dependency on a distance-dependent margin parameter, poor run-time training complexity, and reliance on carefully crafted negative mining procedures. In this work, we hypothesize that an enriched representation coupled with an effective yet efficient supervision is important for realizing a discriminative joint embedding space for face-voice association tasks. To this end, we propose a light-weight, plug-and-play mechanism that exploits the complementary cues in both modalities to form enriched fused embeddings and clusters them based on their identity labels via orthogonality constraints. We call our proposed mechanism fusion and orthogonal projection (FOP) and instantiate it in a two-stream network. The overall resulting framework is evaluated on the VoxCeleb1 and MAV-Celeb datasets with a multitude of tasks, including cross-modal verification and matching. Results reveal that our method performs favourably against the current state-of-the-art methods and that our proposed formulation of supervision is more effective and efficient than the ones employed by contemporary methods. In addition, we leverage cross-modal verification and matching tasks to analyze the impact of multiple languages on face-voice association. Code is available: https://github.com/msaadsaeed/FOP
Submitted 22 August, 2022;
originally announced August 2022.
-
Fusion and Orthogonal Projection for Improved Face-Voice Association
Authors:
Muhammad Saad Saeed,
Muhammad Haris Khan,
Shah Nawaz,
Muhammad Haroon Yousaf,
Alessio Del Bue
Abstract:
We study the problem of learning association between face and voice, which has lately been gaining interest in the computer vision community. Prior works adopt pairwise or triplet loss formulations to learn an embedding space amenable to the associated matching and verification tasks. Although they show some progress, such loss formulations are restrictive due to their dependency on a distance-dependent margin parameter, poor run-time training complexity, and reliance on carefully crafted negative mining procedures. In this work, we hypothesize that an enriched feature representation coupled with an effective yet efficient supervision is necessary for realizing a discriminative joint embedding space for improved face-voice association. To this end, we propose a light-weight, plug-and-play mechanism that exploits the complementary cues in both modalities to form enriched fused embeddings and clusters them based on their identity labels via orthogonality constraints. We call our proposed mechanism fusion and orthogonal projection (FOP) and instantiate it in a two-stream pipeline. The overall resulting framework is evaluated on the large-scale VoxCeleb dataset with a multitude of tasks, including cross-modal verification and matching. Results show that our method performs favourably against the current state-of-the-art methods and that our proposed supervision formulation is more effective and efficient than the ones employed by contemporary methods.
Submitted 20 December, 2021;
originally announced December 2021.
-
Methanol and water maser observations separate disc and outflow sources in IRAS 19410+2336
Authors:
M. S. Darwish,
K. A. Edris,
A. M. S. Richards,
S. Etoka,
M. S. Saad,
M. M. Beheary,
G. A. Fuller
Abstract:
We investigate the kinematics of high-mass protostellar objects within the high-mass star-forming region IRAS 19410+2336. We performed high angular resolution observations of 6.7-GHz methanol and 22-GHz water masers using the MERLIN (Multi-Element Radio Linked Interferometer Network) and e-MERLIN interferometers. The 6.7-GHz methanol maser emission line was detected within the $\sim$16--27 km s$^{-1}$ velocity range with a peak flux density of $\sim$50 Jy. The maser spots are spread over $\sim$1.3 arcsec on the sky, corresponding to $\sim$2800 au at a distance of 2.16 kpc. These are the first astrometric measurements at 6.7 GHz in IRAS 19410+2336. The 22-GHz water maser line was imaged in 2005 and 2019 (the latter with good astrometry). Its velocities range from 13 to $\sim$29 km s$^{-1}$. The peak flux density was found to be 18.7 Jy and 13.487 Jy in 2005 and 2019, respectively. The water maser components are distributed over up to 165 mas, $\sim$350 au at 2.16 kpc. We find that the Eastern methanol masers most probably trace outflows from the region of millimetre source mm1. The water masers to the West lie in a disc (flared or interacting with outflow/infall) around another more evolved millimetre source (13-s). The maser distribution suggests that the disc lies at an angle of 60$^{\circ}$ or more to the plane of the sky and the observed line of sight velocities then suggest an enclosed mass between 44 M$_{\odot}$ and as little as 11 M$_{\odot}$ if the disc is edge-on. The Western methanol masers may be infalling.
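The angular-to-linear conversions quoted in this abstract can be reproduced with the small-angle relation, under which 1 arcsec at 1 pc subtends 1 au. The sketch below is illustrative only and not taken from the paper; the function name `projected_size_au` is a hypothetical helper, and the inputs are the rounded values stated in the abstract.

```python
# Illustrative check of the angular sizes quoted in the abstract.
# By definition of the parsec, an angle of 1 arcsec at a distance of 1 pc
# corresponds to a projected length of 1 au (small-angle approximation).

def projected_size_au(angle_arcsec: float, distance_pc: float) -> float:
    """Projected linear size in au (hypothetical helper, not from the paper)."""
    return angle_arcsec * distance_pc

DISTANCE_PC = 2160.0  # 2.16 kpc, as stated in the abstract

# Methanol maser spots: spread over about 1.3 arcsec.
methanol_extent_au = projected_size_au(1.3, DISTANCE_PC)

# Water maser components: up to 165 mas = 0.165 arcsec.
water_extent_au = projected_size_au(0.165, DISTANCE_PC)

print(f"methanol extent: {methanol_extent_au:.0f} au")
print(f"water extent: {water_extent_au:.0f} au")
```

The results, roughly 2808 au and 356 au, agree with the abstract's rounded figures of about 2800 au and about 350 au.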
Submitted 24 May, 2021;
originally announced May 2021.
-
Cross-modal Speaker Verification and Recognition: A Multilingual Perspective
Authors:
Muhammad Saad Saeed,
Shah Nawaz,
Pietro Morerio,
Arif Mahmood,
Ignazio Gallo,
Muhammad Haroon Yousaf,
Alessio Del Bue
Abstract:
Recent years have seen a surge in finding association between faces and voices within cross-modal biometric applications along with speaker recognition. Inspired by this, we introduce the challenging task of establishing association between faces and voices across multiple languages spoken by the same set of persons. The aim of this paper is to answer two closely related questions: "Is face-voice association language independent?" and "Can a speaker be recognised irrespective of the spoken language?". These two questions are very important for understanding the effectiveness of multilingual biometric systems and for boosting their development. To answer them, we collected a Multilingual Audio-Visual dataset, containing human speech clips of 154 identities with 3 language annotations extracted from various videos uploaded online. Extensive experiments on the three splits of the proposed dataset have been performed to investigate and answer these novel research questions, which clearly point out the relevance of the multilingual problem.
Submitted 22 April, 2021; v1 submitted 28 April, 2020;
originally announced April 2020.
-
Adaptive Artificial Intelligent Q&A Platform
Authors:
M. R. Akram,
C. P. Singhabahu,
M. S. M. Saad,
P. Deleepa,
Anupiya Nugaliyadde,
Yashas Mallawarachchi
Abstract:
The paper presents an approach to building a question and answer system that is capable of processing the information in a large dataset and allows the user to gain knowledge from this dataset by asking questions in natural language form. The key content of this research covers four dimensions: Corpus Preprocessing, Question Preprocessing, Deep Neural Network for Answer Extraction, and Answer Generation. The system is capable of understanding the question and also responds to the user's query in natural language form. The goal is to make the user feel as if they were interacting with a person rather than a machine.
Submitted 19 January, 2019;
originally announced February 2019.
-
Classification of Groups according to the number of end vertices in the coprime graph
Authors:
Tariq A. Alraqad,
Muhammad S. Saeed,
Etaf S. Alshawarbeh
Abstract:
In this paper we characterize groups according to the number of end vertices in the associated coprime graphs. An upper bound on the order of the group that depends on the number of end vertices is obtained. We also prove that $2$-groups are the only groups whose coprime graphs have an odd number of end vertices. Classifications of groups with a small number of end vertices in the coprime graphs are given. One of the results shows that $\mathbb{Z}_4$ and $\mathbb{Z}_2\times \mathbb{Z}_2$ are the only groups whose coprime graph has exactly three end vertices.
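The three-end-vertex claim can be checked by hand for the two groups named, and the sketch below does the same count programmatically. It assumes the usual definition of the coprime graph, which the abstract does not spell out: the vertices are the group elements, and two distinct elements are adjacent when their orders are coprime. The helpers `element_orders` and `count_end_vertices` are hypothetical names, and the code handles only direct products of cyclic groups, not arbitrary groups.

```python
from itertools import product
from math import gcd

def element_orders(moduli):
    """Orders of all elements of Z_{m1} x ... x Z_{mk} (illustrative helper)."""
    orders = []
    for elem in product(*(range(m) for m in moduli)):
        order = 1
        for x, m in zip(elem, moduli):
            component = m // gcd(m, x)  # order of x in Z_m; gcd(m, 0) == m gives 1
            order = order * component // gcd(order, component)  # running lcm
        orders.append(order)
    return orders

def count_end_vertices(moduli):
    """Count degree-1 vertices of the coprime graph under the assumed definition."""
    orders = element_orders(moduli)
    n = len(orders)
    degrees = [
        sum(1 for j in range(n) if j != i and gcd(orders[i], orders[j]) == 1)
        for i in range(n)
    ]
    return sum(1 for d in degrees if d == 1)

z4_end_vertices = count_end_vertices([4])        # Z_4
klein_end_vertices = count_end_vertices([2, 2])  # Z_2 x Z_2
print(z4_end_vertices, klein_end_vertices)
```

In both groups every non-identity element has even order, so each is adjacent only to the identity, which gives a star with three leaves.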
Submitted 6 March, 2018;
originally announced March 2018.
-
Magnus subgroups of one-relator surface groups
Authors:
James Howie,
Muhammad Sarwar Saeed
Abstract:
A one-relator surface group is the quotient of an orientable surface group by the normal closure of a single relator. A Magnus subgroup is the fundamental group of a suitable incompressible sub-surface. A number of results are proved about the intersections of such subgroups and their conjugates, analogous to results of Bagherzadeh, Brodskii, and Collins in classical one-relator group theory.
Submitted 27 September, 2007;
originally announced September 2007.
-
Freiheitssätze for one-relator quotients of surface groups and of limit groups
Authors:
James Howie,
Muhammad Sarwar Saeed
Abstract:
Three versions of the Freiheitssatz are proved in the context of one-relator quotients of limit groups, where the latter are equipped with 1-acylindrical splittings over cyclic subgroups. These are natural extensions of previously published corresponding statements for one-relator quotients of orientable surface groups. Two of the proofs are new even in that restricted context.
Submitted 4 September, 2007;
originally announced September 2007.