Search | arXiv e-print repository

ToSA: Token Selective Attention for Efficient Vision Transformers

Authors: Manish Kumar Singh, Rajeev Yasarla, Hong Cai, Mingu Lee, Fatih Porikli

Abstract: In this paper, we propose a novel token selective attention approach, ToSA, which can identify tokens that need to be attended as well as those that can skip a transformer layer. More specifically, a token selector parses the current attention maps and predicts the attention maps for the next layer, which are then used to select the important tokens that should participate in the attention operation. The remaining tokens simply bypass the next layer and are concatenated with the attended ones to re-form a complete set of tokens. In this way, we reduce the quadratic computation and memory costs as fewer tokens participate in self-attention while maintaining the features for all the image patches throughout the network, which allows it to be used for dense prediction tasks. Our experiments show that by applying ToSA, we can significantly reduce computation costs while maintaining accuracy on the ImageNet classification benchmark. Furthermore, we evaluate on the dense prediction task of monocular depth estimation on NYU Depth V2, and show that we can achieve similar depth prediction accuracy using a considerably lighter backbone with ToSA.
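
Illustrative sketch (not taken from the paper): the select, attend, and re-form pattern described in the abstract can be outlined in plain Python. The names `selective_layer` and `toy_layer`, the externally supplied `scores`, and the top-k rule are all assumptions made for illustration; in ToSA the scores would come from a learned token selector that predicts the next layer's attention maps, and the layer would be a real transformer block.

```python
from typing import Callable, List, Sequence

Token = List[float]


def selective_layer(
    tokens: Sequence[Token],
    scores: Sequence[float],
    keep: int,
    layer: Callable[[List[Token]], List[Token]],
) -> List[Token]:
    """Apply `layer` only to the `keep` highest-scoring tokens; the rest bypass it.

    Hypothetical helper: `scores` stands in for a predicted importance per token.
    """
    if len(tokens) != len(scores):
        raise ValueError("one score per token is required")
    keep = max(0, min(keep, len(tokens)))
    # Rank token indices by score, then restore the original order of the winners.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    selected = sorted(ranked[:keep])
    attended = layer([list(tokens[i]) for i in selected]) if selected else []
    # Re-form the complete token set: attended tokens return to their slots,
    # bypassed tokens are copied through unchanged.
    out = [list(t) for t in tokens]
    for idx, new_tok in zip(selected, attended):
        out[idx] = new_tok
    return out


def toy_layer(selected: List[Token]) -> List[Token]:
    """Stand-in for self-attention: adds the mean of the selected tokens to each."""
    dim = len(selected[0])
    mean = [sum(t[d] for t in selected) / len(selected) for d in range(dim)]
    return [[t[d] + mean[d] for d in range(dim)] for t in selected]


if __name__ == "__main__":
    tokens = [[1.0], [2.0], [3.0], [4.0]]
    scores = [0.1, 0.9, 0.2, 0.8]
    print(selective_layer(tokens, scores, keep=2, layer=toy_layer))
```

Because only `keep` tokens reach the layer, a self-attention layer's pairwise cost would scale with `keep` squared rather than with the full token count squared, while the output still holds one feature per input token.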

Submitted 13 June, 2024; originally announced June 2024.

Comments: Accepted at CVPRW 2024

arXiv:2406.05505 [pdf, other]

I-SIRch: AI-Powered Concept Annotation Tool For Equitable Extraction And Analysis Of Safety Insights From Maternity Investigations

Authors: Mohit Kumar Singh, Georgina Cosma, Patrick Waterson, Jonathan Back, Gyuchan Thomas Jun

Abstract: Maternity care is a complex system involving treatments and interactions between patients, providers, and the care environment. To improve patient safety and outcomes, understanding the human factors (e.g. individuals' decisions, local facilities) influencing healthcare delivery is crucial. However, most current tools for analysing healthcare data focus only on biomedical concepts (e.g. health conditions, procedures and tests), overlooking the importance of human factors. We developed a new approach called I-SIRch, using artificial intelligence to automatically identify and label human factors concepts in maternity healthcare investigation reports describing adverse maternity incidents produced by England's Healthcare Safety Investigation Branch (HSIB). These incident investigation reports aim to identify opportunities for learning and improving maternal safety across the entire healthcare system. I-SIRch was trained using real data and tested on both real and simulated data to evaluate its performance in identifying human factors concepts. When applied to real reports, the model achieved a high level of accuracy, correctly identifying relevant concepts in 90% of the sentences from 97 reports. Applying I-SIRch to analyse these reports revealed that certain human factors disproportionately affected mothers from different ethnic groups. Our work demonstrates the potential of using automated tools to identify human factors concepts in maternity incident investigation reports, rather than focusing solely on biomedical concepts. This approach opens up new possibilities for understanding the complex interplay between social, technical, and organisational factors influencing maternal safety and population health outcomes. By taking a more comprehensive view of maternal healthcare delivery, we can develop targeted interventions to address disparities and improve maternal outcomes.

Submitted 8 June, 2024; originally announced June 2024.

arXiv:2406.03822 [pdf, other]

SilentCipher: Deep Audio Watermarking

Authors: Mayank Kumar Singh, Naoya Takahashi, Weihsiang Liao, Yuki Mitsufuji

Abstract: In the realm of audio watermarking, it is challenging to simultaneously encode imperceptible messages while enhancing the message capacity and robustness. Although recent advancements in deep learning-based methods bolster the message capacity and robustness over traditional methods, the encoded messages introduce audible artefacts that restrict their usage in professional settings. In this study, we introduce three key innovations. Firstly, our work is the first deep learning-based model to integrate psychoacoustic model based thresholding to achieve imperceptible watermarks. Secondly, we introduce pseudo-differentiable compression layers, enhancing the robustness of our watermarking algorithm. Lastly, we introduce a method to eliminate the need for perceptual losses, enabling us to achieve SOTA in both robustness as well as imperceptible watermarking. Our contributions lead us to SilentCipher, a model enabling users to encode messages within audio signals sampled at 44.1kHz.

Submitted 6 June, 2024; originally announced June 2024.

arXiv:2403.12953 [pdf, other]

FutureDepth: Learning to Predict the Future Improves Video Depth Estimation

Authors: Rajeev Yasarla, Manish Kumar Singh, Hong Cai, Yunxiao Shi, Jisoo Jeong, Yinhao Zhu, Shizhong Han, Risheek Garrepalli, Fatih Porikli

Abstract: In this paper, we propose a novel video depth estimation approach, FutureDepth, which enables the model to implicitly leverage multi-frame and motion cues to improve depth estimation by making it learn to predict the future at training. More specifically, we propose a future prediction network, F-Net, which takes the features of multiple consecutive frames and is trained to predict multi-frame features one time step ahead iteratively. In this way, F-Net learns the underlying motion and correspondence information, and we incorporate its features into the depth decoding process. Additionally, to enrich the learning of multiframe correspondence cues, we further leverage a reconstruction network, R-Net, which is trained via adaptively masked auto-encoding of multiframe feature volumes. At inference time, both F-Net and R-Net are used to produce queries to work with the depth decoder, as well as a final refinement network. Through extensive experiments on several benchmarks, i.e., NYUDv2, KITTI, DDAD, and Sintel, which cover indoor, driving, and open-domain scenarios, we show that FutureDepth significantly improves upon baseline models, outperforms existing video depth estimation methods, and sets new state-of-the-art (SOTA) accuracy. Furthermore, FutureDepth is more efficient than existing SOTA video depth estimation models and has similar latencies when compared to monocular models.

Submitted 19 March, 2024; originally announced March 2024.

arXiv:2403.12202 [pdf, other]

DeCoTR: Enhancing Depth Completion with 2D and 3D Attentions

Authors: Yunxiao Shi, Manish Kumar Singh, Hong Cai, Fatih Porikli

Abstract: In this paper, we introduce a novel approach that harnesses both 2D and 3D attentions to enable highly accurate depth completion without requiring iterative spatial propagations. Specifically, we first enhance a baseline convolutional depth completion model by applying attention to 2D features in the bottleneck and skip connections. This effectively improves the performance of this simple network and sets it on par with the latest, complex transformer-based models. Leveraging the initial depths and features from this network, we uplift the 2D features to form a 3D point cloud and construct a 3D point transformer to process it, allowing the model to explicitly learn and exploit 3D geometric features. In addition, we propose normalization techniques to process the point cloud, which improves learning and leads to better accuracy than directly using point transformers off the shelf. Furthermore, we incorporate global attention on downsampled point cloud features, which enables long-range context while still being computationally feasible. We evaluate our method, DeCoTR, on established depth completion benchmarks, including NYU Depth V2 and KITTI, showcasing that it sets new state-of-the-art performance. We further conduct zero-shot evaluations on ScanNet and DDAD benchmarks and demonstrate that DeCoTR has superior generalizability compared to existing approaches.

Submitted 18 March, 2024; originally announced March 2024.

Comments: Accepted at CVPR 2024

arXiv:2311.10794 [pdf, other]

Text-to-Sticker: Style Tailoring Latent Diffusion Models for Human Expression

Authors: Animesh Sinha, Bo Sun, Anmol Kalia, Arantxa Casanova, Elliot Blanchard, David Yan, Winnie Zhang, Tony Nelli, Jiahui Chen, Hardik Shah, Licheng Yu, Mitesh Kumar Singh, Ankit Ramchandani, Maziar Sanjabi, Sonal Gupta, Amy Bearman, Dhruv Mahajan

Abstract: We introduce Style Tailoring, a recipe to finetune Latent Diffusion Models (LDMs) in a distinct domain with high visual quality, prompt alignment and scene diversity. We choose sticker image generation as the target domain, as the images significantly differ from photorealistic samples typically generated by large-scale LDMs. We start with a competent text-to-image model, like Emu, and show that relying on prompt engineering with a photorealistic model to generate stickers leads to poor prompt alignment and scene diversity. To overcome these drawbacks, we first finetune Emu on millions of sticker-like images collected using weak supervision to elicit diversity. Next, we curate human-in-the-loop (HITL) Alignment and Style datasets from model generations, and finetune to improve prompt alignment and style alignment respectively. Sequential finetuning on these datasets poses a tradeoff between better style alignment and prompt alignment gains. To address this tradeoff, we propose a novel fine-tuning method called Style Tailoring, which jointly fits the content and style distribution and achieves best tradeoff. Evaluation results show our method improves visual quality by 14%, prompt alignment by 16.2% and scene diversity by 15.3%, compared to prompt engineering the base Emu model for stickers generation.

Submitted 16 November, 2023; originally announced November 2023.

Comments: 10 pages, 5 figures

arXiv:2309.15807 [pdf, other]

Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack

Authors: Xiaoliang Dai, Ji Hou, Chih-Yao Ma, Sam Tsai, Jialiang Wang, Rui Wang, Peizhao Zhang, Simon Vandenhende, Xiaofang Wang, Abhimanyu Dubey, Matthew Yu, Abhishek Kadian, Filip Radenovic, Dhruv Mahajan, Kunpeng Li, Yue Zhao, Vladan Petrovic, Mitesh Kumar Singh, Simran Motwani, Yi Wen, Yiwen Song, Roshan Sumbaly, Vignesh Ramanathan, Zijian He, Peter Vajda, et al. (1 additional author not shown)

Abstract: Training text-to-image models with web scale image-text pairs enables the generation of a wide range of visual concepts from text. However, these pre-trained models often face challenges when it comes to generating highly aesthetic images. This creates the need for aesthetic alignment post pre-training. In this paper, we propose quality-tuning to effectively guide a pre-trained model to exclusively generate highly visually appealing images, while maintaining generality across visual concepts. Our key insight is that supervised fine-tuning with a set of surprisingly small but extremely visually appealing images can significantly improve the generation quality. We pre-train a latent diffusion model on 1.1 billion image-text pairs and fine-tune it with only a few thousand carefully selected high-quality images. The resulting model, Emu, achieves a win rate of 82.9% compared with its pre-trained only counterpart. Compared to the state-of-the-art SDXLv1.0, Emu is preferred 68.4% and 71.3% of the time on visual appeal on the standard PartiPrompts and our Open User Input benchmark based on the real-world usage of text-to-image models. In addition, we show that quality-tuning is a generic approach that is also effective for other architectures, including pixel diffusion and masked generative transformer models.

Submitted 27 September, 2023; originally announced September 2023.

arXiv:2305.15055 [pdf, other]

Iteratively Improving Speech Recognition and Voice Conversion

Authors: Mayank Kumar Singh, Naoya Takahashi, Onoe Naoyuki

Abstract: Many existing works on voice conversion (VC) tasks use automatic speech recognition (ASR) models for ensuring linguistic consistency between source and converted samples. However, for low-data resource domains, training a high-quality ASR model remains a challenging task. In this work, we propose a novel iterative way of improving both the ASR and VC models. We first train an ASR model which is used to ensure content preservation while training a VC model. In the next iteration, the VC model is used as a data augmentation method to further fine-tune the ASR model and generalize it to diverse speakers. By iteratively leveraging the improved ASR model to train the VC model and vice-versa, we experimentally show improvement in both models. Our proposed framework outperforms the ASR and one-shot VC baseline models on English singing and Hindi speech domains in subjective and objective evaluations in low-data resource settings.

Submitted 24 May, 2023; originally announced May 2023.

arXiv:2302.13838 [pdf, other]

Cross-modal Face- and Voice-style Transfer

Authors: Naoya Takahashi, Mayank K. Singh, Yuki Mitsufuji

Abstract: Image-to-image translation and voice conversion enable the generation of a new facial image and voice while maintaining some of the semantics such as a pose in an image and linguistic content in audio, respectively. They can aid in the content-creation process in many applications. However, as they are limited to the conversion within each modality, matching the impression of the generated face and voice remains an open question. We propose a cross-modal style transfer framework called XFaVoT that jointly learns four tasks: image translation and voice conversion tasks with audio or image guidance, which enables the generation of "face that matches given voice" and "voice that matches given face", and intra-modality translation tasks with a single framework. Experimental results on multiple datasets show that XFaVoT achieves cross-modal style translation of image and voice, outperforming baselines in terms of quality, diversity, and face-voice correspondence.

Submitted 1 March, 2023; v1 submitted 27 February, 2023; originally announced February 2023.

arXiv:2302.10536 [pdf, other]

Nonparallel Emotional Voice Conversion For Unseen Speaker-Emotion Pairs Using Dual Domain Adversarial Network & Virtual Domain Pairing

Authors: Nirmesh Shah, Mayank Kumar Singh, Naoya Takahashi, Naoyuki Onoe

Abstract: The primary goal of an emotional voice conversion (EVC) system is to convert the emotion of a given speech signal from one style to another style without modifying the linguistic content of the signal. Most of the state-of-the-art approaches convert emotions for seen speaker-emotion combinations only. In this paper, we tackle the problem of converting the emotion of speakers whose only neutral data are present during the time of training and testing (i.e., unseen speaker-emotion combinations). To this end, we extend a recently proposed StarGANv2-VC architecture by utilizing dual encoders for learning the speaker and emotion style embeddings separately along with dual domain source classifiers. For achieving the conversion to unseen speaker-emotion combinations, we propose a Virtual Domain Pairing (VDP) training strategy, which virtually incorporates the speaker-emotion pairs that are not present in the real data without compromising the min-max game of a discriminator and generator in adversarial training. We evaluate the proposed method using a Hindi emotional database.

Submitted 21 February, 2023; originally announced February 2023.

Comments: Demo Samples at https://demosamplesites.github.io/EVCUP/

arXiv:2301.09776 [pdf, ps, other]

doi 10.1109/PCS56426.2022.10018001

Differentiable bit-rate estimation for neural-based video codec enhancement

Authors: Amir Said, Manish Kumar Singh, Reza Pourreza

Abstract: Neural networks (NN) can improve standard video compression by pre- and post-processing the encoded video. For optimal NN training, the standard codec needs to be replaced with a codec proxy that can provide derivatives of estimated bit-rate and distortion, which are used for gradient back-propagation. Since entropy coding of standard codecs is designed to take into account non-linear dependencies between transform coefficients, bit-rates cannot be well approximated with simple per-coefficient estimators. This paper presents a new approach for bit-rate estimation that is similar to the type employed in training end-to-end neural codecs, and able to efficiently take into account those statistical dependencies. It is defined from a mathematical model that provides closed-form formulas for the estimates and their gradients, reducing the computational complexity. Experimental results demonstrate the method's accuracy in estimating HEVC/H.265 codec bit-rates.

Submitted 23 January, 2023; originally announced January 2023.

Journal ref: Picture Coding Symposium (PCS), San Jose, CA, USA, 2022, pp. 379-383

arXiv:2301.01380 [pdf, other]

Ego-Only: Egocentric Action Detection without Exocentric Transferring

Authors: Huiyu Wang, Mitesh Kumar Singh, Lorenzo Torresani

Abstract: We present Ego-Only, the first approach that enables state-of-the-art action detection on egocentric (first-person) videos without any form of exocentric (third-person) transferring. Despite the content and appearance gap separating the two domains, large-scale exocentric transferring has been the default choice for egocentric action detection. This is because prior works found that egocentric models are difficult to train from scratch and that transferring from exocentric representations leads to improved accuracy. However, in this paper, we revisit this common belief. Motivated by the large gap separating the two domains, we propose a strategy that enables effective training of egocentric models without exocentric transferring. Our Ego-Only approach is simple. It trains the video representation with a masked autoencoder finetuned for temporal segmentation. The learned features are then fed to an off-the-shelf temporal action localization method to detect actions. We find that this renders exocentric transferring unnecessary by showing remarkably strong results achieved by this simple Ego-Only approach on three established egocentric video datasets: Ego4D, EPIC-Kitchens-100, and Charades-Ego. On both action detection and action recognition, Ego-Only outperforms previous best exocentric transferring methods that use orders of magnitude more labels. Ego-Only sets new state-of-the-art results on these datasets and benchmarks without exocentric data.

Submitted 19 May, 2023; v1 submitted 3 January, 2023; originally announced January 2023.

arXiv:2210.11096 [pdf, other]

Robust One-Shot Singing Voice Conversion

Authors: Naoya Takahashi, Mayank Kumar Singh, Yuki Mitsufuji

Abstract: Recent progress in deep generative models has improved the quality of voice conversion in the speech domain. However, high-quality singing voice conversion (SVC) of unseen singers remains challenging due to the wider variety of musical expressions in pitch, loudness, and pronunciation. Moreover, singing voices are often recorded with reverb and accompaniment music, which make SVC even more challenging. In this work, we present a robust one-shot SVC (ROSVC) that performs any-to-any SVC robustly even on such distorted singing voices. To this end, we first propose a one-shot SVC model based on generative adversarial networks that generalizes to unseen singers via partial domain conditioning and learns to accurately recover the target pitch via pitch distribution matching and AdaIN-skip conditioning. We then propose a two-stage training method called Robustify that trains the one-shot SVC model in the first stage on clean data to ensure high-quality conversion, and introduces enhancement modules to the encoders of the model in the second stage to enhance the feature extraction from distorted singing voices. To further improve the voice quality and pitch reconstruction accuracy, we finally propose a hierarchical diffusion model for singing voice neural vocoders. Experimental results show that the proposed method outperforms state-of-the-art one-shot SVC baselines for both seen and unseen singers and significantly improves the robustness against distortions.

Submitted 6 October, 2023; v1 submitted 20 October, 2022; originally announced October 2022.

arXiv:2203.11556 [pdf, other]

VQ-Flows: Vector Quantized Local Normalizing Flows

Authors: Sahil Sidheekh, Chris B. Dock, Tushar Jain, Radu Balan, Maneesh K. Singh

Abstract: Normalizing flows provide an elegant approach to generative modeling that allows for efficient sampling and exact density evaluation of unknown data distributions. However, current techniques have significant limitations in their expressivity when the data distribution is supported on a low-dimensional manifold or has a non-trivial topology. We introduce a novel statistical framework for learning a mixture of local normalizing flows as "chart maps" over the data manifold. Our framework augments the expressivity of recent approaches while preserving the signature property of normalizing flows, that they admit exact density evaluation. We learn a suitable atlas of charts for the data manifold via a vector quantized auto-encoder (VQ-AE) and the distributions over them using a conditional flow. We validate experimentally that our probabilistic framework enables existing approaches to better model data distributions over complex manifolds.

Submitted 18 June, 2022; v1 submitted 22 March, 2022; originally announced March 2022.

Comments: Accepted to The 38th Conference on Uncertainty in Artificial Intelligence (UAI) 2022

arXiv:2110.05054 [pdf, other]

Source Mixing and Separation Robust Audio Steganography

Authors: Naoya Takahashi, Mayank Kumar Singh, Yuki Mitsufuji

Abstract: Audio steganography aims at concealing secret information in carrier audio with imperceptible modification on the carrier. Although previous works addressed the robustness of concealed message recovery against distortions introduced during transmission, they do not address the robustness against aggressive editing such as mixing of other audio sources and source separation. In this work, we propose for the first time a steganography method that can embed information into individual sound sources in a mixture such as instrumental tracks in music. To this end, we propose a time-domain model and curriculum learning essential to learn to decode the concealed message from the separated sources. Experimental results show that the proposed method successfully conceals the information in an imperceptible perturbation and that the information can be correctly recovered even after mixing of other sources and separation by a source separation algorithm. Furthermore, we show that the proposed method can be applied to multiple sources simultaneously without interfering with the decoder for other sources even after the sources are mixed and separated.

Submitted 17 February, 2022; v1 submitted 11 October, 2021; originally announced October 2021.

Comments: Accepted to ICASSP 2022

arXiv:2101.06842 [pdf, other]

Hierarchical disentangled representation learning for singing voice conversion

Authors: Naoya Takahashi, Mayank Kumar Singh, Yuki Mitsufuji

Abstract: Conventional singing voice conversion (SVC) methods often suffer from operating in high-resolution audio owing to a high dimensionality of data. In this paper, we propose a hierarchical representation learning that enables the learning of disentangled representations with multiple resolutions independently. With the learned disentangled representations, the proposed method progressively performs SVC from low to high resolutions. Experimental results show that the proposed method outperforms baselines that operate with a single resolution in terms of mean opinion score (MOS), similarity score, and pitch accuracy.

Submitted 25 April, 2021; v1 submitted 17 January, 2021; originally announced January 2021.

Comments: accepted at IJCNN 2021

arXiv:2010.15390 [pdf, other]

Multitask Bandit Learning Through Heterogeneous Feedback Aggregation

Authors: Zhi Wang, Chicheng Zhang, Manish Kumar Singh, Laurel D. Riek, Kamalika Chaudhuri

Abstract: In many real-world applications, multiple agents seek to learn how to perform highly related yet slightly different tasks in an online bandit learning protocol. We formulate this problem as the $ε$-multi-player multi-armed bandit problem, in which a set of players concurrently interact with a set of arms, and for each arm, the reward distributions for all players are similar but not necessarily identical. We develop an upper confidence bound-based algorithm, RobustAgg$(ε)$, that adaptively aggregates rewards collected by different players. In the setting where an upper bound on the pairwise similarities of reward distributions between players is known, we achieve instance-dependent regret guarantees that depend on the amenability of information sharing across players. We complement these upper bounds with nearly matching lower bounds. In the setting where pairwise similarities are unknown, we provide a lower bound, as well as an algorithm that trades off minimax regret guarantees for adaptivity to unknown similarity structure.

Submitted 19 July, 2021; v1 submitted 29 October, 2020; originally announced October 2020.

Journal ref: In International Conference on Artificial Intelligence and Statistics (pp. 1531-1539). PMLR (2021, March)

arXiv:2007.13524 [pdf, other]

Dynamic Relational Inference in Multi-Agent Trajectories

Authors: Ruichao Xiao, Manish Kumar Singh, Rose Yu

Abstract: Inferring interactions from multi-agent trajectories has broad applications in physics, vision and robotics. Neural relational inference (NRI) is a deep generative model that can reason about relations in complex dynamics without supervision. In this paper, we take a careful look at this approach for relational inference in multi-agent trajectories. First, we discover that NRI can be fundamentally limited without sufficient long-term observations. Its ability to accurately infer interactions degrades drastically for short output sequences. Next, we consider a more general setting of relational inference when interactions are changing over time. We propose an extension of NRI, which we call the DYnamic multi-Agent Relational Inference (DYARI) model, that can reason about dynamic relations. We conduct exhaustive experiments to study the effect of model architecture, underlying dynamics and training scheme on the performance of dynamic relational inference using a simulated physics system. We also showcase the usage of our model on real-world multi-agent basketball trajectories.

Submitted 8 October, 2020; v1 submitted 16 July, 2020; originally announced July 2020.

Comments: submitted to ICLR 2021

arXiv:2005.12147 [pdf, other]

NENET: An Edge Learnable Network for Link Prediction in Scene Text

Authors: Mayank Kumar Singh, Sayan Banerjee, Shubhasis Chaudhuri

Abstract: Text detection in scenes based on deep neural networks has shown promising results. Instead of using word bounding box regression, recent state-of-the-art methods have started focusing on character bounding box and pixel-level prediction. This makes it necessary to link adjacent characters, which we propose to do in this paper using a novel Graph Neural Network (GNN) architecture that allows us to learn both node and edge features, as opposed to only the node features under the typical GNN. The main advantage of using a GNN for link prediction lies in its ability to connect characters which are spatially separated and have an arbitrary orientation. We demonstrate our concept on the well-known SynthText dataset, achieving top results as compared to state-of-the-art methods.

Submitted 25 May, 2020; originally announced May 2020.

Comments: 9 pages

arXiv:2004.13945 [pdf, other]

Linguistic Resources for Bhojpuri, Magahi and Maithili: Statistics about them, their Similarity Estimates, and Baselines for Three Applications

Authors: Rajesh Kumar Mundotiya, Manish Kumar Singh, Rahul Kapur, Swasti Mishra, Anil Kumar Singh

Abstract: Corpus preparation for low-resource languages and for development of human language technology to analyze or computationally process them is a laborious task, primarily due to the unavailability of expert linguists who are native speakers of these languages and also due to the time and resources required. Bhojpuri, Magahi, and Maithili, languages of the Purvanchal region of India (in the north-eastern parts), are low-resource languages belonging to the Indo-Aryan (or Indic) family. They are closely related to Hindi, which is a relatively high-resource language, which is why we compare with Hindi. We collected corpora for these three languages from various sources and cleaned them to the extent possible, without changing the data in them. The text belongs to different domains and genres. We calculated some basic statistical measures for these corpora at character, word, syllable, and morpheme levels. These corpora were also annotated with parts-of-speech (POS) and chunk tags. The basic statistical measures were both absolute and relative and were expected to be indicative of linguistic properties such as morphological, lexical, phonological, and syntactic complexities (or richness). The results were compared with a standard Hindi corpus. For most of the measures, we tried to keep the corpus size the same across the languages to avoid the effect of corpus size, but in some cases it turned out that using the full corpus was better, even if sizes were very different.
Although the results are not very clear, we try to draw some conclusions about the languages and the corpora. For POS tagging and chunking, the BIS tagset was used to manually annotate the data. The POS tagged data sizes are 16067, 14669 and 12310 sentences, respectively, for Bhojpuri, Magahi and Maithili. The sizes for chunking are 9695 and 1954 sentences for Bhojpuri and Maithili, respectively.

Submitted 17 August, 2021; v1 submitted 28 April, 2020; originally announced April 2020.

Comments: ACM Transactions on Asian and Low-Resource Language Information Processing (Accepted)

arXiv:1911.12928 [pdf, other]

Improving Voice Separation by Incorporating End-to-end Speech Recognition

Authors: Naoya Takahashi, Mayank Kumar Singh, Sakya Basak, Parthasaarathy Sudarsanam, Sriram Ganapathy, Yuki Mitsufuji

Abstract: Despite recent advances in voice separation methods, many challenges remain in realistic scenarios such as noisy recording and the limits of available data. In this work, we propose to explicitly incorporate the phonetic and linguistic nature of speech by taking a transfer learning approach using an end-to-end automatic speech recognition (E2EASR) system. The voice separation is conditioned on deep features extracted from E2EASR to cover the long-term dependence of phonetic aspects. Experimental results on speech separation and enhancement tasks on the AVSpeech dataset show that the proposed method significantly improves the signal-to-distortion ratio over the baseline model and even outperforms an audio-visual model that utilizes visual information of lip movements.

Submitted 3 May, 2020; v1 submitted 28 November, 2019; originally announced November 2019.

Comments: Accepted in ICASSP 2020

arXiv:1808.00948 [pdf, other]

Diverse Image-to-Image Translation via Disentangled Representations

Authors: Hsin-Ying Lee, Hung-Yu Tseng, Jia-Bin Huang, Maneesh Kumar Singh, Ming-Hsuan Yang

Abstract: Image-to-image translation aims to learn the mapping between two visual domains. There are two main challenges for many applications: 1) the lack of aligned training pairs and 2) multiple possible outputs from a single input image. In this work, we present an approach based on disentangled representation for producing diverse outputs without paired training images. To achieve diversity, we propose to embed images onto two spaces: a domain-invariant content space capturing shared information across domains and a domain-specific attribute space. Our model takes the encoded content features extracted from a given input and the attribute vectors sampled from the attribute space to produce diverse outputs at test time. To handle unpaired training data, we introduce a novel cross-cycle consistency loss based on disentangled representations. Qualitative results show that our model can generate diverse and realistic images on a wide range of tasks without paired training data. For quantitative comparisons, we measure realism with user study and diversity with a perceptual distance metric. We apply the proposed model to domain adaptation and show competitive performance when compared to the state-of-the-art on the MNIST-M and the LineMod datasets.

Submitted 2 August, 2018; originally announced August 2018.

Comments: ECCV 2018 (Oral). Project page: http://vllab.ucmerced.edu/hylee/DRIT/ Code: https://github.com/HsinYingLee/DRIT/

arXiv:1006.3609 [pdf]

doi 10.5121/ijngn.2010.2206

The Forecasting of 3G Market in India Based on Revised Technology Acceptance Model

Authors: Sudha Singh, D. K. Singh, M. K. Singh, Sujeet Kumar Singh

Abstract: 3G, the successor of 2G services, is a family of standards for mobile telecommunications defined by the International Telecommunication Union [1]. 3G services include wide-area wireless voice telephony, video calls, and wireless data, all in a mobile environment. It allows simultaneous use of speech and data services and higher data rates. 3G is defined to facilitate growth, increase bandwidth and support more diverse applications. The focus of this study is to examine the factors affecting the adoption of 3G services among Indian people. The study adopts the revised Technology Acceptance Model by adding five antecedents: perceived risks, cost of adoption, perceived service quality, subjective norms, and perceived lack of knowledge. Data were collected from more than 400 school/college/institution students and employees of various government/private sectors using interviews and various convenience sampling procedures, and analyzed using MS Excel and MATLAB. The results show that perceived usefulness has the most significant influence on attitude towards using 3G services, which is consistent with prior studies. Of the five antecedents, perceived risk and cost of adoption are found to significantly influence attitude towards use. The outcome of this study would be beneficial to private and public telecommunication organizations, various service providers, the business community, banking services and the people of India. Research findings and suggestions for future research are also discussed.

Submitted 18 June, 2010; originally announced June 2010.

Comments: 8 Pages

Journal ref: International Journal of Next-Generation Networks 2.2 (2010) 61-68

arXiv:1004.1708 [pdf]

Mathematical Principles in Software Quality Engineering

Authors: Manoranjan Kumar Singh, Rakesh. L

Abstract: Mathematics has many useful properties for developing complex software systems. One is that it can exactly describe a physical situation of an object or the outcome of an action. Mathematics supports abstraction and is an excellent medium for modeling; since it is an exact medium, there is little possibility of ambiguity. This paper demonstrates that mathematics provides a high level of validation when it is used as a software medium. It also outlines distinguishing characteristics of structural testing, which is based on the source code of the program tested. Structural testing methods are very amenable to rigorous definition, mathematical analysis and precise measurement. Finally, it discusses the functional versus structural testing debate to give a sense of complete testing. Any program can be considered to be a function in the sense that program inputs form its domain and program outputs form its range. In general, discrete mathematics is more applicable to functional testing, while graph theory pertains more to structural testing.

Submitted 10 April, 2010; originally announced April 2010.

Comments: IEEE Publication format, ISSN 1947 5500, http://sites.google.com/site/ijcsis/

Journal ref: IJCSIS, Vol. 7 No. 3, March 2010, 178-184

arXiv:1003.3090 [pdf]

doi 10.5121/ijcnc.2010.2202

Node Isolation Probability of Wireless Adhoc Networks in Nagakami Fading Channel

Authors: A. V. Babu, Mukesh Kumar Singh

Abstract: This paper investigates the issue of connectivity of a wireless ad hoc network in the presence of channel impairments. We derive analytical expressions for the node isolation probability in an ad hoc network in the presence of Nakagami-m fading with superimposed lognormal shadowing. The node isolation probability is the probability that a randomly chosen node is not able to communicate with any of the other nodes in the network. An extensive investigation into the impact of path loss exponent, lognormal shadowing, Nakagami fading severity index, node density, and diversity order on the node isolation probability is conducted. The presented results are beneficial for the practical design of ad hoc networks.

Submitted 16 March, 2010; originally announced March 2010.

Comments: 16 pages, IJCNC Journal

Journal ref: International Journal of Computer Networks & Communications 2.2 (2010) 21-36

arXiv:1002.4004 [pdf]

Nature inspired artificial intelligence based adaptive traffic flow distribution in computer network

Authors: Manoj Kumar Singh

Abstract: Because of the stochastic nature of the traffic requirement matrix, it is very difficult to obtain the optimal traffic distribution that minimizes delay, even with an adaptive routing protocol, in a fixed-connection network where the capacity of each link is already defined. Hence there is a need for a method that can generate the optimal solution quickly and efficiently. This paper presents a new concept for providing adaptive optimal traffic distribution under dynamic traffic-matrix conditions using nature-based intelligence methods. With a defined load and fixed link capacities, the average packet delay is minimized using several variations of evolutionary programming and particle swarm optimization. A comparative study of their performance in terms of convergence speed is given. The universal approximation capability, the key feature of feed-forward neural networks, is applied to predict the flow distribution on each link that minimizes the average delay for the total load currently on the network. For any variation in the total load, the neural network can immediately generate a new flow distribution that yields minimum delay in the network. With the inclusion of this information, the performance of the routing protocol can be greatly improved.

Submitted 21 February, 2010; originally announced February 2010.

Journal ref: Journal of Computing, Volume 2, Issue 2, February 2010, https://sites.google.com/site/journalofcomputing/

arXiv:0910.1838 [pdf]

Password Based a Generalize Robust Security System Design Using Neural Network

Authors: Manoj Kumar Singh

Abstract: Among the various means of resource protection, including biometrics, password-based systems are the simplest, most user-friendly, most cost-effective and most commonly used. However, this method is highly sensitive to attacks. Most advanced password-based authentication methods encrypt the contents of the password before storing or transmitting it in the physical domain. But all conventional cryptographic encryption methods have their own limitations, generally in terms of either complexity or efficiency. The use of passwords across multiple applications today forces users to rely on memory aids, which itself degrades the level of security. In this paper, a method is given that exploits an artificial neural network to develop a more secure means of authentication, one that is more efficient in providing authentication while remaining simple in design. Apart from protection, a step toward perfect security is taken by adding intruder detection to the protection system. This is made possible by analysis of several logical parameters associated with user activities. A new method of designing a security system centrally based on a neural network with intrusion detection capability, which handles the challenges of present solutions for any kind of resource, is presented.

Submitted 9 October, 2009; originally announced October 2009.

Comments: International Journal of Computer Science Issues, IJCSI, Volume 4, Issue 2, pp1-9, September 2009

Journal ref: M.K Singh, "Password Based A Generalize Robust Security System Design Using Neural Network", International Journal of Computer Science Issues, IJCSI, Volume 4, Issue 2, pp1-9, September 2009

Showing 1–27 of 27 results for author: Singh, M K