Reweighting Strategy based on Synthetic Data Identification for Sentence Similarity

Authors: Taehee Kim, ChaeHun Park, Jimin Hong, Radhika Dua, Edward Choi, Jaegul Choo

Abstract: Semantically meaningful sentence embeddings are important for numerous tasks in natural language processing. To obtain such embeddings, recent studies explored the idea of utilizing synthetically generated data from pretrained language models (PLMs) as a training corpus. However, PLMs often generate sentences much different from the ones written by humans. We hypothesize that treating all these synthetic examples equally for training deep neural networks can have an adverse effect on learning semantically meaningful embeddings. To analyze this, we first train a classifier that identifies machine-written sentences, and observe that the linguistic features of the sentences identified as written by a machine are significantly different from those of human-written sentences. Based on this, we propose a novel approach that first trains the classifier to measure the importance of each sentence. The distilled information from the classifier is then used to train a reliable sentence embedding model. Through extensive evaluation on four real-world datasets, we demonstrate that our model trained on synthetic data generalizes well and outperforms the existing baselines. Our implementation is publicly available at https://github.com/ddehun/coling2022_reweighting_sts.

Submitted 30 August, 2022; v1 submitted 29 August, 2022; originally announced August 2022.

Comments: COLING2022

arXiv:2208.08853 [pdf, other]

Automatic Detection of Noisy Electrocardiogram Signals without Explicit Noise Labels

Authors: Radhika Dua, Jiyoung Lee, Joon-myoung Kwon, Edward Choi

Abstract: Electrocardiogram (ECG) signals are beneficial in diagnosing cardiovascular diseases, which are one of the leading causes of death. However, they are often contaminated by noise artifacts, which affect both automatic and manual diagnosis. Automatic deep learning-based examination of ECG signals can lead to inaccurate diagnosis, and manual analysis involves rejection of noisy ECG samples by clinicians, which might cost extra time. To address this limitation, we present a two-stage deep learning-based framework to automatically detect the noisy ECG samples. Through extensive experiments and analysis on two different datasets, we observe that the deep learning-based framework can detect slightly and highly noisy ECG samples effectively. We also study the transfer of the model learned on one dataset to another dataset and observe that the framework effectively detects noisy ECG samples.

Submitted 8 August, 2022; originally announced August 2022.

Comments: PRHA Workshop, ICPR 2022

arXiv:2207.13083 [pdf, other]

Task Agnostic and Post-hoc Unseen Distribution Detection

Authors: Radhika Dua, Seongjun Yang, Yixuan Li, Edward Choi

Abstract: Despite the recent advances in out-of-distribution (OOD) detection, anomaly detection, and uncertainty estimation tasks, there does not exist a task-agnostic and post-hoc approach. To address this limitation, we design a novel clustering-based ensembling method, called Task Agnostic and Post-hoc Unseen Distribution Detection (TAPUDD), that utilizes the features extracted from the model trained on a specific task. Explicitly, it comprises TAP-Mahalanobis, which clusters the training datasets' features and determines the minimum Mahalanobis distance of the test sample from all clusters. Further, we propose the Ensembling module that aggregates the computation of iterative TAP-Mahalanobis for a different number of clusters to provide reliable and efficient cluster computation. Through extensive experiments on synthetic and real-world datasets, we observe that our approach can detect unseen samples effectively across diverse tasks and performs better than or on par with the existing baselines. To this end, we eliminate the necessity of determining the optimal value of the number of clusters and demonstrate that our method is more viable for large-scale classification tasks.

Submitted 26 July, 2022; originally announced July 2022.

arXiv:2207.03075 [pdf, other]

Towards the Practical Utility of Federated Learning in the Medical Domain

Authors: Seongjun Yang, Hyeonji Hwang, Daeyoung Kim, Radhika Dua, Jong-Yeup Kim, Eunho Yang, Edward Choi

Abstract: Federated learning (FL) is an active area of research. One of the most suitable areas for adopting FL is the medical domain, where patient privacy must be respected. Previous research, however, does not provide a practical guide to applying FL in the medical domain. We propose empirical benchmarks and experimental settings for three representative medical datasets with different modalities: longitudinal electronic health records, skin cancer images, and electrocardiogram signals. The likely users of FL such as medical institutions and IT companies can take these benchmarks as guides for adopting FL and minimize their trial and error. For each dataset, each client's data comes from a different source to preserve real-world heterogeneity. We evaluate six FL algorithms designed for addressing data heterogeneity among clients, and a hybrid algorithm combining the strengths of two representative FL algorithms. Based on experiment results from three modalities, we discover that simple FL algorithms tend to outperform more sophisticated ones, while the hybrid algorithm consistently shows good, if not the best, performance. We also find that a frequent global model update leads to better performance under a fixed training iteration budget. As the number of participating clients increases, higher cost is incurred due to increased IT administrators and GPUs, but the performance consistently increases. We expect future users will refer to these empirical benchmarks to design the FL experiments in the medical domain considering their clinical tasks and obtain stronger performance with lower costs.

Submitted 19 May, 2023; v1 submitted 7 July, 2022; originally announced July 2022.

Comments: Accepted to the Main conference of CHIL2023

arXiv:2201.08995 [pdf]

Fuel consumption elasticities, rebound effect and feebate effectiveness in the Indian and Chinese new car markets

Authors: Prateek Bansal, Rubal Dua

Abstract: China and India, the world's two most populous developing economies, are also among the world's largest automotive markets and carbon emitters. To reduce carbon emissions from the passenger car sector, both countries have considered various policy levers affecting fuel prices, car prices and fuel economy. This study estimates the responsiveness of new car buyers in China and India to such policy levers and drivers including income. Furthermore, we estimate the potential for rebound effect and the effectiveness of a feebate policy. To accomplish this, we developed a joint discrete-continuous model of car choice and usage based on revealed preference survey data from approximately 8000 new car buyers from India and China who purchased cars in 2016-17. Conditional on buying a new car, the fuel consumption in both markets is found to be relatively unresponsive to fuel price and income, with magnitudes of elasticity estimates ranging from 0.12 to 0.15. For both markets, the mean segment-level direct elasticities of fuel consumption relative to car price and fuel economy range from 0.57 to 0.65. The rebound effect on fuel savings due to cost-free fuel economy improvement is found to be 17.1% for India and 18.8% for China. A revenue-neutral feebate policy, with average rebates and fees of up to around 15% of the retail price, resulted in fuel savings of around 0.7% for both markets. While the feebate policy's rebound effect is low - 7.3% for India and 1.6% for China - it does not appear to be an effective fuel conservation policy.

Submitted 22 January, 2022; originally announced January 2022.

arXiv:2201.07788 [pdf, other]

ConDor: Self-Supervised Canonicalization of 3D Pose for Partial Shapes

Authors: Rahul Sajnani, Adrien Poulenard, Jivitesh Jain, Radhika Dua, Leonidas J. Guibas, Srinath Sridhar

Abstract: Progress in 3D object understanding has relied on manually canonicalized shape datasets that contain instances with consistent position and orientation (3D pose). This has made it hard to generalize these methods to in-the-wild shapes, e.g., from internet model collections or depth sensors. ConDor is a self-supervised method that learns to Canonicalize the 3D orientation and position for full and partial 3D point clouds. We build on top of Tensor Field Networks (TFNs), a class of permutation- and rotation-equivariant, and translation-invariant 3D networks. During inference, our method takes an unseen full or partial 3D point cloud at an arbitrary pose and outputs an equivariant canonical pose. During training, this network uses self-supervision losses to learn the canonical pose from an un-canonicalized collection of full and partial 3D point clouds. ConDor can also learn to consistently co-segment object parts without any supervision. Extensive quantitative results on four new metrics show that our approach outperforms existing methods while enabling new applications such as operation on depth images and annotation transfer.

Submitted 14 April, 2022; v1 submitted 19 January, 2022; originally announced January 2022.

Comments: Accepted to CVPR 2022, New Orleans, Louisiana. For project page and code, see https://ivl.cs.brown.edu/ConDor/

arXiv:2110.09276 [pdf, other]

Natural Attribute-based Shift Detection

Authors: Jeonghoon Park, Jimin Hong, Radhika Dua, Daehoon Gwak, Yixuan Li, Jaegul Choo, Edward Choi

Abstract: Despite the impressive performance of deep networks in vision, language, and healthcare, unpredictable behaviors on samples from a distribution different from the training distribution cause severe problems in deployment. For better reliability of neural-network-based classifiers, we define a new task, natural attribute-based shift (NAS) detection, to detect the samples shifted from the training distribution by some natural attribute such as age of subjects or brightness of images. Using the natural attributes present in existing datasets, we introduce benchmark datasets in vision, language, and medical domains for NAS detection. Further, we conduct an extensive evaluation of prior representative out-of-distribution (OOD) detection methods on NAS datasets and observe an inconsistency in their performance. To understand this, we provide an analysis of the relationship between the location of NAS samples in the feature space and the performance of distance- and confidence-based OOD detection methods. Based on the analysis, we split NAS samples into three categories and further suggest a simple modification to the training objective to obtain an improved OOD detection method that is capable of detecting samples from all NAS categories.

Submitted 18 October, 2021; originally announced October 2021.

arXiv:2011.11737 [pdf, other]

doi 10.3847/1538-4357/abccd9

The Radio Luminosity-Risetime Function of Core-Collapse Supernovae

Authors: Michael F. Bietenholz, N. Bartel, M. Argo, R. Dua, S. Ryder, A. Soderberg

Abstract: We assemble a large set of 2-10 GHz radio flux density measurements and upper limits of 294 different supernovae (SNe), from the literature and our own and archival data. Only 31% of the SNe were detected. We characterize the SN lightcurves near the peak using a two-parameter model, with $t_{\rm pk}$ being the time to rise to a peak and $L_{\rm pk}$ the spectral luminosity at that peak. Over all SNe in our sample at $D<100$ Mpc, we find that $t_{\rm pk} = 10^{1.7\pm0.9}$ d, and that $L_{\rm pk} = 10^{25.5\pm1.6}$ erg s$^{-1}$ Hz$^{-1}$, and therefore that generally, 50% of SNe will have $L_{\rm pk} < 10^{25.5}$ erg s$^{-1}$ Hz$^{-1}$. These $L_{\rm pk}$ values are ~30 times lower than those for only detected SNe. Types I b/c and II (excluding IIn's) have similar mean values of $L_{\rm pk}$ but the former have a wider range, whereas Type IIn SNe have ~10 times higher values with $L_{\rm pk} = 10^{26.5\pm1.1}$ erg s$^{-1}$ Hz$^{-1}$. As for $t_{\rm pk}$, Type I b/c have $t_{\rm pk}$ of only $10^{1.1\pm0.5}$ d while Type II have $t_{\rm pk} = 10^{1.6\pm1.0}$ d and Type IIn the longest timescales with $t_{\rm pk} = 10^{3.1\pm0.7}$ d. We also estimate the distribution of progenitor mass-loss rates, $\dot M$, and find the mean and standard deviation of log$_{10}(\dot M/$Msol) yr$^{-1}$ are $-5.4\pm1.2$ (assuming $v_{\rm wind}=1000$ km s$^{-1}$) for Type I b/c SNe, and $-6.9\pm1.4$ (assuming $v_{\rm wind} = 10$ km s$^{-1}$) for Type II SNe excluding Type IIn.

Submitted 14 January, 2021; v1 submitted 23 November, 2020; originally announced November 2020.

Comments: Accepted for publication in the Astrophysical Journal. 15 Figures, 4 Tables; full version of Table 1 in ancillary files. Minor revisions only from version 1

arXiv:2010.12852 [pdf, other]

Beyond VQA: Generating Multi-word Answer and Rationale to Visual Questions

Authors: Radhika Dua, Sai Srinivas Kancheti, Vineeth N Balasubramanian

Abstract: Visual Question Answering is a multi-modal task that aims to measure high-level visual understanding. Contemporary VQA models are restrictive in the sense that answers are obtained via classification over a limited vocabulary (in the case of open-ended VQA), or via classification over a set of multiple-choice-type answers. In this work, we present a completely generative formulation where a multi-word answer is generated for a visual query. To take this a step forward, we introduce a new task: ViQAR (Visual Question Answering and Reasoning), wherein a model must generate the complete answer and a rationale that seeks to justify the generated answer. We propose an end-to-end architecture to solve this task and describe how to evaluate it. We show that our model generates strong answers and rationales through qualitative and quantitative evaluation, as well as through a human Turing Test.

Submitted 17 June, 2021; v1 submitted 24 October, 2020; originally announced October 2020.

Comments: MULA Workshop, CVPR 2021

arXiv:1904.06695 [pdf]

Eliciting Preferences of Ridehailing Users and Drivers: Evidence from the United States

Authors: Prateek Bansal, Akanksha Sinha, Rubal Dua, Ricardo Daziano

Abstract: Transportation Network Companies (TNCs) are changing the transportation ecosystem, but micro-decisions of drivers and users need to be better understood to assess the system-level impacts of TNCs. In this regard, we contribute to the literature by estimating a) individuals' preferences of being a rider, a driver, or a non-user of TNC services; b) preferences of ridehailing users for ridepooling; c) TNC drivers' choice to switch to vehicles with better fuel economy, and also d) the drivers' decision to buy, rent or lease new vehicles with driving for TNCs being a major consideration. Elicitation of drivers' preferences using a unique sample (N=11,902) of the U.S. population residing in TNC-served areas is the key feature of this study. The statistical analysis indicates that ridehailing services are mainly attracting personal vehicle users as riders, without substantially affecting demand for transit. Moreover, around 10% of ridehailing users reported postponing the purchase of a new car due to the availability of TNC services. The model estimation results indicate that the likelihood of being a TNC user increases with the increase in age for someone younger than 44 years, but the pattern is reversed post 44 years. This change in direction of the marginal effect of age is insightful as the previous studies have reported a negative association. We also find that postgraduate drivers who live in metropolitan regions are more likely to switch to fuel-efficient vehicles. These findings would inform transportation planners and TNCs in developing policies to improve the fuel economy of the fleet.

Submitted 14 April, 2019; originally announced April 2019.

arXiv:1904.03977 [pdf, other]

VayuAnukulani: Adaptive Memory Networks for Air Pollution Forecasting

Authors: Divyam Madaan, Radhika Dua, Prerana Mukherjee, Brejesh Lall

Abstract: Air pollution is the leading environmental health hazard globally due to various sources which include factory emissions, car exhaust and cooking stoves. As a precautionary measure, air pollution forecast serves as the basis for taking effective pollution control measures, and accurate air pollution forecasting has become an important task. In this paper, we forecast fine-grained ambient air quality information for 5 prominent locations in Delhi based on the historical and real-time ambient air quality and meteorological data reported by the Central Pollution Control Board. We present the VayuAnukulani system, a novel end-to-end solution to predict air quality for the next 24 hours by estimating the concentration and level of different air pollutants including nitrogen dioxide ($NO_2$), particulate matter ($PM_{2.5}$ and $PM_{10}$) for Delhi. Extensive experiments on data sources obtained in Delhi demonstrate that the proposed adaptive attention based Bidirectional LSTM Network outperforms several baselines for classification and regression models. The accuracy of the proposed adaptive system is $\sim 15 - 20\%$ better than the same offline trained model. We compare the proposed methodology against several competing baselines, and show that the network outperforms conventional methods by $\sim 3 - 5 \%$.

Submitted 8 April, 2019; originally announced April 2019.

arXiv:0908.1515 [pdf]

doi 10.1074/jbc.M109.055863

Integration of a Phosphatase Cascade with the MAP Kinase Pathway provides for a Novel Signal Processing Function

Authors: Virendra K. Chaudhri, Dhiraj Kumar, Manjari Misra, Raina Dua, Kanury V. S. Rao

Abstract: We mathematically modeled the receptor-activated MAP kinase signaling by incorporating the regulation through cellular phosphatases. Activation induced the alignment of a phosphatase cascade in parallel with the MAP kinase pathway. A novel regulatory motif was thus generated, providing for the combinatorial control of each MAPK intermediate. This ensured a non-linear mode of signal transmission with the output being shaped by the balance between the strength of input signal, and the activity gradient along the phosphatase axis. Shifts in this balance yielded modulations in topology of the motif, thereby expanding the repertoire of output responses. Thus we identify an added dimension to signal processing, wherein the output response to an external stimulus is additionally filtered through indicators that define the phenotypic status of the cell.

Submitted 11 August, 2009; originally announced August 2009.

Comments: Whole Manuscript 33 pages including Main text, 7 Figures and Supporting Information

Journal ref: J Biol Chem 285,(2), 2010

Showing 1–12 of 12 results for author: Dua, R