Search | arXiv e-print repository

UAV Immersive Video Streaming: A Comprehensive Survey, Benchmarking, and Open Challenges

Authors: Mohit K. Sharma, Chen-Feng Liu, Ibrahim Farhat, Nassim Sehad, Wassim Hamidouche, Merouane Debbah

Abstract: Over the past decade, the utilization of UAVs has witnessed significant growth, owing to their agility, rapid deployment, and maneuverability. In particular, the use of UAV-mounted 360-degree cameras to capture omnidirectional videos has enabled truly immersive viewing experiences with up to 6DoF. However, achieving this immersive experience necessitates encoding omnidirectional videos in high res… ▽ More Over the past decade, the utilization of UAVs has witnessed significant growth, owing to their agility, rapid deployment, and maneuverability. In particular, the use of UAV-mounted 360-degree cameras to capture omnidirectional videos has enabled truly immersive viewing experiences with up to 6DoF. However, achieving this immersive experience necessitates encoding omnidirectional videos in high resolution, leading to increased bitrates. Consequently, new challenges arise in terms of latency, throughput, perceived quality, and energy consumption for real-time streaming of such content. This paper presents a comprehensive survey of research efforts in UAV-based immersive video streaming, benchmarks popular video encoding schemes, and identifies open research challenges. Initially, we review the literature on 360-degree video coding, packaging, and streaming, with a particular focus on standardization efforts to ensure interoperability of immersive video streaming devices and services. Subsequently, we provide a comprehensive review of research efforts focused on optimizing video streaming for timevarying UAV wireless channels. Additionally, we introduce a high resolution 360-degree video dataset captured from UAVs under different flying conditions. This dataset facilitates the evaluation of complexity and coding efficiency of software and hardware video encoders based on popular video coding standards and formats, including AVC/H.264, HEVC/H.265, VVC/H.266, VP9, and AV1. Our results demonstrate that HEVC achieves the best trade-off between coding efficiency and complexity through its hardware implementation, while AV1 format excels in coding efficiency through its software implementation, specifically using the libsvt-av1 encoder. Furthermore, we present a real testbed showcasing 360-degree video streaming over a UAV, enabling remote control of the drone via a 5G cellular network. △ Less

Submitted 31 October, 2023; originally announced November 2023.

arXiv:2210.01447 [pdf, other]

A Novel Light Field Coding Scheme Based on Deep Belief Network & Weighted Binary Images for Additive Layered Displays

Authors: Sally Khaidem, Mansi Sharma

Abstract: Light-field displays create an immersive experience by providing binocular depth sensation and motion parallax. Stacking light attenuating layers is one approach to implement a light field display with a broader depth of field, wide viewing angles and high resolution. Due to the transparent holographic optical element (HOE) layers, additive layered displays can be integrated into augmented reality… ▽ More Light-field displays create an immersive experience by providing binocular depth sensation and motion parallax. Stacking light attenuating layers is one approach to implement a light field display with a broader depth of field, wide viewing angles and high resolution. Due to the transparent holographic optical element (HOE) layers, additive layered displays can be integrated into augmented reality (AR) wearables to overlay virtual objects onto the real world, creating a seamless mixed reality (XR) experience. This paper proposes a novel framework for light field representation and coding that utilizes Deep Belief Network (DBN) and weighted binary images suitable for additive layered displays. The weighted binary representation of layers makes the framework more flexible for adaptive bitrate encoding. The framework effectively captures intrinsic redundancies in the light field data, and thus provides a scalable solution for light field coding suitable for XR display applications. The latent code is encoded by H.265 codec generating a rate-scalable bit-stream. We achieve adaptive bitrate decoding by varying the number of weighted binary images and the H.265 quantization parameter, while maintaining an optimal reconstruction quality. The framework is tested on real and synthetic benchmark datasets, and the results validate the rate-scalable property of the proposed scheme. △ Less

Submitted 21 April, 2023; v1 submitted 4 October, 2022; originally announced October 2022.

Comments: The paper is under consideration at Pattern Recognition Letters

arXiv:2206.11095 [pdf, other]

A High Resolution Multi-exposure Stereoscopic Image & Video Database of Natural Scenes

Authors: Rohit Choudhary, Mansi Sharma, Aditya Wadaskar

Abstract: Immersive displays such as VR headsets, AR glasses, Multiview displays, Free point televisions have emerged as a new class of display technologies in recent years, offering a better visual experience and viewer engagement as compared to conventional displays. With the evolution of 3D video and display technologies, the consumer market for High Dynamic Range (HDR) cameras and displays is quickly gr… ▽ More Immersive displays such as VR headsets, AR glasses, Multiview displays, Free point televisions have emerged as a new class of display technologies in recent years, offering a better visual experience and viewer engagement as compared to conventional displays. With the evolution of 3D video and display technologies, the consumer market for High Dynamic Range (HDR) cameras and displays is quickly growing. The lack of appropriate experimental data is a critical hindrance for the development of primary research efforts in the field of 3D HDR video technology. Also, the unavailability of sufficient real world multi-exposure experimental dataset is a major bottleneck for HDR imaging research, thereby limiting the quality of experience (QoE) for the viewers. In this paper, we introduce a diversified stereoscopic multi-exposure dataset captured within the campus of Indian Institute of Technology Madras, which is home to a diverse flora and fauna. The dataset is captured using ZED stereoscopic camera and provides intricate scenes of outdoor locations such as gardens, roadside views, festival venues, buildings and indoor locations such as academic and residential areas. The proposed dataset accommodates wide depth range, complex depth structure, complicate object movement, illumination variations, rich color dynamics, texture discrepancy in addition to significant randomness introduced by moving camera and background motion. The proposed dataset is made publicly available to the research community. Furthermore, the procedure for capturing, aligning and calibrating multi-exposure stereo videos and images is described in detail. Finally, we have discussed the progress, challenges, potential use cases and future research opportunities with respect to HDR imaging, depth estimation, consistent tone map** and 3D HDR coding. △ Less

Submitted 22 June, 2022; originally announced June 2022.

arXiv:2206.11048 [pdf, other]

Automated GI tract segmentation using deep learning

Authors: Manhar Sharma

Abstract: The job of Radiation oncologists is to deliver x-ray beams pointed toward the tumor and at the same time avoid the stomach and intestines. With MR-Linacs (magnetic resonance imaging and linear accelerator systems), oncologists can visualize the position of the tumor and allow for precise dose according to tumor cell presence which can vary from day to day. The current job of outlining the position… ▽ More The job of Radiation oncologists is to deliver x-ray beams pointed toward the tumor and at the same time avoid the stomach and intestines. With MR-Linacs (magnetic resonance imaging and linear accelerator systems), oncologists can visualize the position of the tumor and allow for precise dose according to tumor cell presence which can vary from day to day. The current job of outlining the position of the stomach and intestines to adjust the X-ray beams direction for the dose delivery to the tumor while avoiding the organs. This is a time-consuming and labor-intensive process that can easily prolong treatments from 15 minutes to an hour a day unless deep learning methods can automate the segmentation process. This paper discusses an automated segmentation process using deep learning to make this process faster and allow more patients to get effective treatment. △ Less

Submitted 5 September, 2023; v1 submitted 22 June, 2022; originally announced June 2022.

Comments: 8 pages, 9 figures

arXiv:2206.10131 [pdf, other]

An Integrated Representation & Compression Scheme Based on Convolutional Autoencoders with 4D DCT Perceptual Encoding for High Dynamic Range Light Fields

Authors: Sally Khaidem, Mansi Sharma

Abstract: The emerging and existing light field displays are highly capable of realistic presentation of 3D scenes on auto-stereoscopic glasses-free platforms. The light field size is a major drawback while utilising 3D displays and streaming purposes. When a light field is of high dynamic range, the size increases drastically. In this paper, we propose a novel compression algorithm for a high dynamic range… ▽ More The emerging and existing light field displays are highly capable of realistic presentation of 3D scenes on auto-stereoscopic glasses-free platforms. The light field size is a major drawback while utilising 3D displays and streaming purposes. When a light field is of high dynamic range, the size increases drastically. In this paper, we propose a novel compression algorithm for a high dynamic range light field which yields a perceptually lossless compression. The algorithm exploits the inter and intra view correlations of the HDR light field by interpreting it to be a four-dimension volume. The HDR light field compression is based on a novel 4DDCT-UCS (4D-DCT Uniform Colour Space) algorithm. Additional encoding of 4DDCT-UCS acquired images by HEVC eliminates intra-frame, inter-frame and intrinsic redundancies in HDR light field data. Comparison with state-of-the-art coders like JPEG-XL and HDR video coding algorithm exhibits superior compression performance of the proposed scheme for real-world light fields. △ Less

Submitted 21 June, 2022; originally announced June 2022.

arXiv:2202.06166 [pdf, other]

doi 10.1063/5.0088264

Do cities have a unique magnetic pulse?

Authors: Vincent Dumont, Trevor A. Bowen, Roger Roglans, Gregory Dobler, Mohit S. Sharma, Andy Karpf, Stuart D. Bale, Arne Wickenbrock, Elena Zhivun, Tom Kornack, Jonathan S. Wurtele, Dmitry Budker

Abstract: We present a comparative analysis of urban magnetic fields between two American cities: Berkeley (California) and Brooklyn Borough of New York City (New York). Our analysis uses data taken over a four-week period during which magnetic field data were continuously recorded using a fluxgate magnetometer of 70 pT/$\sqrt{\mathrm{Hz}}$ sensitivity. We identified significant differences in the magnetic… ▽ More We present a comparative analysis of urban magnetic fields between two American cities: Berkeley (California) and Brooklyn Borough of New York City (New York). Our analysis uses data taken over a four-week period during which magnetic field data were continuously recorded using a fluxgate magnetometer of 70 pT/$\sqrt{\mathrm{Hz}}$ sensitivity. We identified significant differences in the magnetic signatures. In particular, we noticed that Berkeley reaches a near-zero magnetic field activity at night whereas magnetic activity in Brooklyn continues during nighttime. We also present auxiliary measurements acquired using magnetoresistive vector magnetometers (VMR), with sensitivity of 300 pT/$\sqrt{\mathrm{Hz}}$, and demonstrate how cross-correlation, and frequency-domain analysis, combined with data filtering can be used to extract urban magnetometry signals and study local anthropogenic activities. Finally, we discuss the potential of using magnetometer networks to characterize the global magnetic field of cities and give directions for future development. △ Less

Submitted 12 February, 2022; originally announced February 2022.

Comments: 8 pages, 7 figures

Journal ref: Journal of Applied Physics 131, 204902 (2022)

arXiv:2108.12399 [pdf, other]

A Novel Hierarchical Light Field Coding Scheme Based on Hybrid Stacked Multiplicative Layers and Fourier Disparity Layers for Glasses-Free 3D Displays

Authors: Joshitha Ravishankar, Mansi Sharma

Abstract: This paper presents a novel hierarchical coding scheme for light fields based on transmittance patterns of low-rank multiplicative layers and Fourier disparity layers. The proposed scheme identifies multiplicative layers of light field view subsets optimized using a convolutional neural network for different scanning orders. Our approach exploits the hidden low-rank structure in the multiplicative… ▽ More This paper presents a novel hierarchical coding scheme for light fields based on transmittance patterns of low-rank multiplicative layers and Fourier disparity layers. The proposed scheme identifies multiplicative layers of light field view subsets optimized using a convolutional neural network for different scanning orders. Our approach exploits the hidden low-rank structure in the multiplicative layers obtained from the subsets of different scanning patterns. The spatial redundancies in the multiplicative layers can be efficiently removed by performing low-rank approximation at different ranks on the Krylov subspace. The intra-view and inter-view redundancies between approximated layers are further removed by HEVC encoding. Next, a Fourier disparity layer representation is constructed from the first subset of the approximated light field based on the chosen hierarchical order. Subsequent view subsets are synthesized by modeling the Fourier disparity layers that iteratively refine the representation with improved accuracy. The critical advantage of the proposed hybrid layered representation and coding scheme is that it utilizes not just spatial and temporal redundancies in light fields but efficiently exploits intrinsic similarities among neighboring sub-aperture images in both horizontal and vertical directions as specified by different predication orders. In addition, the scheme is flexible to realize a range of multiple bitrates at the decoder within a single integrated system. The compression performance of the proposed scheme is analyzed on real light fields. We achieved substantial bitrate savings and maintained good light field reconstruction quality. △ Less

Submitted 27 August, 2021; originally announced August 2021.

arXiv:2106.03851 [pdf, other]

Impact of data-splits on generalization: Identifying COVID-19 from cough and context

Authors: Makkunda Sharma, Nikhil Shenoy, Jigar Doshi, Piyush Bagad, Aman Dalmia, Parag Bhamare, Amrita Mahale, Saurabh Rane, Neeraj Agrawal, Rahul Panicker

Abstract: Rapidly scaling screening, testing and quarantine has shown to be an effective strategy to combat the COVID-19 pandemic. We consider the application of deep learning techniques to distinguish individuals with COVID from non-COVID by using data acquirable from a phone. Using cough and context (symptoms and meta-data) represent such a promising approach. Several independent works in this direction h… ▽ More Rapidly scaling screening, testing and quarantine has shown to be an effective strategy to combat the COVID-19 pandemic. We consider the application of deep learning techniques to distinguish individuals with COVID from non-COVID by using data acquirable from a phone. Using cough and context (symptoms and meta-data) represent such a promising approach. Several independent works in this direction have shown promising results. However, none of them report performance across clinically relevant data splits. Specifically, the performance where the development and test sets are split in time (retrospective validation) and across sites (broad validation). Although there is meaningful generalization across these splits the performance significantly varies (up to 0.1 AUC score). In addition, we study the performance of symptomatic and asymptomatic individuals across these three splits. Finally, we show that our model focuses on meaningful features of the input, cough bouts for cough and relevant symptoms for context. The code and checkpoints are available at https://github.com/WadhwaniAI/cough-against-covid △ Less

Submitted 5 June, 2021; originally announced June 2021.

Comments: Published as a workshop paper at ICLR 2021 AI for Public Health Workshop and ICLR 20201 Machine Learning for Preventing and Combating Pandemics Workshop

arXiv:2104.04678 [pdf, other]

A Flexible Lossy Depth Video Coding Scheme Based on Low-rank Tensor Modelling and HEVC Intra Prediction for Free Viewpoint Video

Authors: Mansi Sharma, Santosh Kumar

Abstract: The compression quality losses of depth sequences determine quality of view synthesis in free-viewpoint video. The depth map intra prediction in 3D extensions of the HEVC applies intra modes with auxiliary depth modeling modes (DMMs) to better preserve depth edges and handle motion discontinuities. Although such modes enable high efficiency compression, but at the cost of very high encoding comple… ▽ More The compression quality losses of depth sequences determine quality of view synthesis in free-viewpoint video. The depth map intra prediction in 3D extensions of the HEVC applies intra modes with auxiliary depth modeling modes (DMMs) to better preserve depth edges and handle motion discontinuities. Although such modes enable high efficiency compression, but at the cost of very high encoding complexity. Skip** conventional intra coding modes and DMMs in depth coding limits practical applicability of the HEVC for 3D display applications. In this paper, we introduce a novel low-complexity scheme for depth video compression based on low-rank tensor decomposition and HEVC intra coding. The proposed scheme leverages spatial and temporal redundancy by compactly representing the depth sequence as a high-order tensor. Tensor factorization into a set of factor matrices following CANDECOMP PARAFAC (CP) decomposition via alternating least squares give a low-rank approximation of the scene geometry. Further, compression of factor matrices with HEVC intra prediction support arbitrary target accuracy by flexible adjustment of bitrate, varying tensor decomposition ranks and quantization parameters. The results demonstrate proposed approach achieves significant rate gains by efficiently compressing depth planes in low-rank approximated representation. The proposed algorithm is applied to encode depth maps of benchmark Ballet and Breakdancing sequences. The decoded depth sequences are used for view synthesis in a multi-view video system, maintaining appropriate rendering quality. △ Less

Submitted 10 April, 2021; originally announced April 2021.

arXiv:2101.10039 [pdf, other]

Latent Factor Modeling of Users Subjective Perception for Stereoscopic 3D Video Recommendation

Authors: Balasubramanyam Appina, Mansi Sharma, Santosh Kumar

Abstract: Numerous stereoscopic 3D movies are released every year to theaters and created large revenues. Despite the improvement in stereo capturing and 3D video post-production technology, stereoscopic artifacts which cause viewer discomfort continue to appear even in high-budget films. Existing automatic 3D video quality measurement tools can detect distortions in stereoscopic images or videos, but they… ▽ More Numerous stereoscopic 3D movies are released every year to theaters and created large revenues. Despite the improvement in stereo capturing and 3D video post-production technology, stereoscopic artifacts which cause viewer discomfort continue to appear even in high-budget films. Existing automatic 3D video quality measurement tools can detect distortions in stereoscopic images or videos, but they fail to consider the viewer's subjective perception of those artifacts, and how these distortions affect their choices. In this paper, we introduce a novel recommendation system for stereoscopic 3D movies based on a latent factor model that meticulously analyse the viewer's subjective ratings and influence of 3D video distortions on their preferences. To the best of our knowledge, this is a first-of-its-kind model that recommends 3D movies based on stereo-film quality ratings accounting correlation between the viewer's visual discomfort and stereoscopic-artifact perception. The proposed model is trained and tested on benchmark Nama3ds1-cospad1 and LFOVIAS3DPh2 S3D video quality assessment datasets. The experiments revealed that resulting matrix-factorization based recommendation system is able to generalize considerably better for the viewer's subjective ratings. △ Less

Submitted 25 January, 2021; originally announced January 2021.

arXiv:2008.08243 [pdf, other]

Enabling Remote Whole-Body Control with 5G Edge Computing

Authors: Huaijiang Zhu, Manali Sharma, Kai Pfeiffer, Marco Mezzavilla, Jia Shen, Sundeep Rangan, Ludovic Righetti

Abstract: Real-world applications require light-weight, energy-efficient, fully autonomous robots. Yet, increasing autonomy is oftentimes synonymous with escalating computational requirements. It might thus be desirable to offload intensive computation--not only sensing and planning, but also low-level whole-body control--to remote servers in order to reduce on-board computational needs. Fifth Generation (5… ▽ More Real-world applications require light-weight, energy-efficient, fully autonomous robots. Yet, increasing autonomy is oftentimes synonymous with escalating computational requirements. It might thus be desirable to offload intensive computation--not only sensing and planning, but also low-level whole-body control--to remote servers in order to reduce on-board computational needs. Fifth Generation (5G) wireless cellular technology, with its low latency and high bandwidth capabilities, has the potential to unlock cloud-based high performance control of complex robots. However, state-of-the-art control algorithms for legged robots can only tolerate very low control delays, which even ultra-low latency 5G edge computing can sometimes fail to achieve. In this work, we investigate the problem of cloud-based whole-body control of legged robots over a 5G link. We propose a novel approach that consists of a standard optimization-based controller on the network edge and a local linear, approximately optimal controller that significantly reduces on-board computational needs while increasing robustness to delay and possible loss of communication. Simulation experiments on humanoid balancing and walking tasks that includes a realistic 5G communication model demonstrate significant improvement of the reliability of robot locomotion under jitter and delays likely to experienced in 5G wireless links. △ Less

Submitted 18 August, 2020; originally announced August 2020.

arXiv:2005.00698 [pdf]

doi 10.1109/JSEN.2020.3045135

Deep ConvLSTM with self-attention for human activity decoding using wearables

Authors: Satya P. Singh, Aimé Lay-Ekuakille, Deepak Gangwar, Madan Kumar Sharma, Sukrit Gupta

Abstract: Decoding human activity accurately from wearable sensors can aid in applications related to healthcare and context awareness. The present approaches in this domain use recurrent and/or convolutional models to capture the spatio-temporal features from time-series data from multiple sensors. We propose a deep neural network architecture that not only captures the spatio-temporal features of multiple… ▽ More Decoding human activity accurately from wearable sensors can aid in applications related to healthcare and context awareness. The present approaches in this domain use recurrent and/or convolutional models to capture the spatio-temporal features from time-series data from multiple sensors. We propose a deep neural network architecture that not only captures the spatio-temporal features of multiple sensor time-series data but also selects, learns important time points by utilizing a self-attention mechanism. We show the validity of the proposed approach across different data sampling strategies on six public datasets and demonstrate that the self-attention mechanism gave a significant improvement in performance over deep networks using a combination of recurrent and convolution networks. We also show that the proposed approach gave a statistically significant performance enhancement over previous state-of-the-art methods for the tested datasets. The proposed methods open avenues for better decoding of human activity from multiple body sensors over extended periods of time. The code implementation for the proposed model is available at https://github.com/isukrit/encodingHumanActivity. △ Less

Submitted 17 December, 2020; v1 submitted 2 May, 2020; originally announced May 2020.

Comments: 8 pages, 2 figures, 3 tables. IEEE Sensors Journal, 2020

arXiv:2004.09347 [pdf, other]

End-to-End Whisper to Natural Speech Conversion using Modified Transformer Network

Authors: Abhishek Niranjan, Mukesh Sharma, Sai Bharath Chandra Gutha, M Ali Basha Shaik

Abstract: Machine recognition of an atypical speech like whispered speech, is a challenging task. We introduce whisper-to-natural-speech conversion using sequence-to-sequence approach by proposing enhanced transformer architecture, which uses both parallel and non-parallel data. We investigate different features like Mel frequency cepstral coefficients and smoothed spectral features. The proposed networks a… ▽ More Machine recognition of an atypical speech like whispered speech, is a challenging task. We introduce whisper-to-natural-speech conversion using sequence-to-sequence approach by proposing enhanced transformer architecture, which uses both parallel and non-parallel data. We investigate different features like Mel frequency cepstral coefficients and smoothed spectral features. The proposed networks are trained end-to-end using supervised approach for feature-to-feature transformation. Further, we also investigate the effectiveness of embedded auxillary decoder used after N encoder sub-layers, trained with the frame-level objective function for identifying source phoneme labels. We show results on opensource wTIMIT and CHAINS datasets by measuring word error rate using end-to-end ASR and also BLEU scores for the generated speech. Alternatively, we also propose a novel method to measure spectral shape of it by measuring formant distributions w.r.t. reference speech, as formant divergence metric. We have found whisper-to-natural converted speech formants probability distribution is similar to the groundtruth distribution. To the authors' best knowledge, this is the first time enhanced transformer has been proposed, both with and without auxiliary decoder for whisper-to-natural-speech conversion and vice versa. △ Less

Submitted 5 April, 2021; v1 submitted 20 April, 2020; originally announced April 2020.

arXiv:2001.01469 [pdf, other]

TableNet: Deep Learning model for end-to-end Table detection and Tabular data extraction from Scanned Document Images

Authors: Shubham Paliwal, Vishwanath D, Rohit Rahul, Monika Sharma, Lovekesh Vig

Abstract: With the widespread use of mobile phones and scanners to photograph and upload documents, the need for extracting the information trapped in unstructured document images such as retail receipts, insurance claim forms and financial invoices is becoming more acute. A major hurdle to this objective is that these images often contain information in the form of tables and extracting data from tabular s… ▽ More With the widespread use of mobile phones and scanners to photograph and upload documents, the need for extracting the information trapped in unstructured document images such as retail receipts, insurance claim forms and financial invoices is becoming more acute. A major hurdle to this objective is that these images often contain information in the form of tables and extracting data from tabular sub-images presents a unique set of challenges. This includes accurate detection of the tabular region within an image, and subsequently detecting and extracting information from the rows and columns of the detected table. While some progress has been made in table detection, extracting the table contents is still a challenge since this involves more fine grained table structure(rows & columns) recognition. Prior approaches have attempted to solve the table detection and structure recognition problems independently using two separate models. In this paper, we propose TableNet: a novel end-to-end deep learning model for both table detection and structure recognition. The model exploits the interdependence between the twin tasks of table detection and table structure recognition to segment out the table and column regions. This is followed by semantic rule-based row extraction from the identified tabular sub-regions. The proposed model and extraction approach was evaluated on the publicly available ICDAR 2013 and Marmot Table datasets obtaining state of the art results. Additionally, we demonstrate that feeding additional semantic features further improves model performance and that the model exhibits transfer learning across datasets. Another contribution of this paper is to provide additional table structure annotations for the Marmot data, which currently only has annotations for table detection. △ Less

Submitted 6 January, 2020; originally announced January 2020.

arXiv:1907.04294 [pdf, other]

An Attention Mechanism for Musical Instrument Recognition

Authors: Siddharth Gururani, Mohit Sharma, Alexander Lerch

Abstract: While the automatic recognition of musical instruments has seen significant progress, the task is still considered hard for music featuring multiple instruments as opposed to single instrument recordings. Datasets for polyphonic instrument recognition can be categorized into roughly two categories. Some, such as MedleyDB, have strong per-frame instrument activity annotations but are usually small… ▽ More While the automatic recognition of musical instruments has seen significant progress, the task is still considered hard for music featuring multiple instruments as opposed to single instrument recordings. Datasets for polyphonic instrument recognition can be categorized into roughly two categories. Some, such as MedleyDB, have strong per-frame instrument activity annotations but are usually small in size. Other, larger datasets such as OpenMIC only have weak labels, i.e., instrument presence or absence is annotated only for long snippets of a song. We explore an attention mechanism for handling weakly labeled data for multi-label instrument recognition. Attention has been found to perform well for other tasks with weakly labeled data. We compare the proposed attention model to multiple models which include a baseline binary relevance random forest, recurrent neural network, and fully connected neural networks. Our results show that incorporating attention leads to an overall improvement in classification accuracy metrics across all 20 instruments in the OpenMIC dataset. We find that attention enables models to focus on (or `attend to') specific time segments in the audio relevant to each instrument label leading to interpretable results. △ Less

Submitted 9 July, 2019; originally announced July 2019.

Comments: To appear in: Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Delft, 2019

arXiv:1903.03652 [pdf, ps, other]

Deep Learning Based Online Power Control for Large Energy Harvesting Networks

Authors: Mohit K Sharma, Alessio Zappone, Merouane Debbah, Mohamad Assaad

Abstract: In this paper, we propose a deep learning based approach to design online power control policies for large EH networks, which are often intractable stochastic control problems. In the proposed approach, for a given EH network, the optimal online power control rule is learned by training a deep neural network (DNN), using the solution of offline policy design problem. Under the proposed scheme, in… ▽ More In this paper, we propose a deep learning based approach to design online power control policies for large EH networks, which are often intractable stochastic control problems. In the proposed approach, for a given EH network, the optimal online power control rule is learned by training a deep neural network (DNN), using the solution of offline policy design problem. Under the proposed scheme, in a given time slot, the transmit power is obtained by feeding the current system state to the trained DNN. Our results illustrate that the DNN based online power control scheme outperforms a Markov decision process based policy. In general, the proposed deep learning based approach can be used to find solutions to large intractable stochastic control problems. △ Less

Submitted 8 March, 2019; originally announced March 2019.

Comments: 5 pages, to appear at ICASSP 2019

arXiv:1903.03195 [pdf, other]

doi 10.3390/s19061415

The life of a New York City noise sensor network

Authors: Charlie Mydlarz, Mohit Sharma, Yitzchak Lockerman, Ben Steers, Claudio Silva, Juan Pablo Bello

Abstract: Noise pollution is one of the topmost quality of life issues for urban residents in the United States. Continued exposure to high levels of noise has proven effects on health, including acute effects such as sleep disruption, and long-term effects such as hypertension, heart disease, and hearing loss. To investigate and ultimately aid in the mitigation of urban noise, a network of 55 sensor nodes… ▽ More Noise pollution is one of the topmost quality of life issues for urban residents in the United States. Continued exposure to high levels of noise has proven effects on health, including acute effects such as sleep disruption, and long-term effects such as hypertension, heart disease, and hearing loss. To investigate and ultimately aid in the mitigation of urban noise, a network of 55 sensor nodes has been deployed across New York City for over two years, collecting sound pressure level (SPL) and audio data. This network has cumulatively amassed over 75 years of calibrated, high-resolution SPL measurements and 35 years of audio data. In addition, high frequency telemetry data has been collected that provides an indication of a sensors' health. This telemetry data was analyzed over an 18 month period across 31 of the sensors. It has been used to develop a prototype model for pre-failure detection which has the ability to identify sensors in a prefail state 69.1% of the time. The entire network infrastructure is outlined, including the operation of the sensors, followed by an analysis of its data yield and the development of the fault detection approach and the future system integration plans for this. △ Less

Submitted 26 March, 2019; v1 submitted 7 March, 2019; originally announced March 2019.

Comments: This article belongs to the Section Intelligent Sensors, 24 pages, 15 figures, 3 tables, 45 references

Journal ref: Sensors 2019, 19, 1415

Showing 1–17 of 17 results for author: Sharma, M