-
A Labelled Dataset for Sentiment Analysis of Videos on YouTube, TikTok, and Other Sources about the 2024 Outbreak of Measles
Authors:
Nirmalya Thakur,
Vanessa Su,
Mingchen Shao,
Kesha A. Patel,
Hongseok Jeong,
Victoria Knieling,
Andrew Bian
Abstract:
The work of this paper presents a dataset that contains the data of 4011 videos about the ongoing outbreak of measles published on 264 websites on the internet between January 1, 2024, and May 31, 2024. The dataset is available at https://dx.doi.org/10.21227/40s8-xf63. These websites primarily include YouTube and TikTok, which account for 48.6% and 15.2% of the videos, respectively. The remainder of the websites include Instagram and Facebook as well as the websites of various global and local news organizations. For each of these videos, the URL of the video, title of the post, description of the post, and the date of publication of the video are presented as separate attributes in the dataset. After developing this dataset, sentiment analysis (using VADER), subjectivity analysis (using TextBlob), and fine-grain sentiment analysis (using DistilRoBERTa-base) of the video titles and video descriptions were performed. This included classifying each video title and video description into (i) one of the sentiment classes i.e. positive, negative, or neutral, (ii) one of the subjectivity classes i.e. highly opinionated, neutral opinionated, or least opinionated, and (iii) one of the fine-grain sentiment classes i.e. fear, surprise, joy, sadness, anger, disgust, or neutral. These results are presented as separate attributes in the dataset for the training and testing of machine learning algorithms for performing sentiment analysis or subjectivity analysis in this field as well as for other applications. Finally, this paper also presents a list of open research questions that may be investigated using this dataset.
Submitted 16 June, 2024; v1 submitted 11 June, 2024;
originally announced June 2024.
-
Make Continual Learning Stronger via C-Flat
Authors:
Ang Bian,
Wei Li,
Hangjie Yuan,
Chengrong Yu,
Zixiang Zhao,
Mang Wang,
Aojun Lu,
Tao Feng
Abstract:
Model generalization ability upon incrementally acquiring dynamically updating knowledge from sequentially arriving tasks is crucial to tackle the sensitivity-stability dilemma in Continual Learning (CL). Weight loss landscape sharpness minimization seeking for flat minima lying in neighborhoods with uniform low loss or smooth gradient is proven to be a strong training regime improving model generalization compared with loss minimization based optimizer like SGD. Yet only a few works have discussed this training regime for CL, proving that dedicated designed zeroth-order sharpness optimizer can improve CL performance. In this work, we propose a Continual Flatness (C-Flat) method featuring a flatter loss landscape tailored for CL. C-Flat could be easily called with only one line of code and is plug-and-play to any CL methods. A general framework of C-Flat applied to all CL categories and a thorough comparison with loss minima optimizer and flat minima based CL approaches is presented in this paper, showing that our method can boost CL performance in almost all cases. Code will be publicly available upon publication.
Submitted 1 April, 2024;
originally announced April 2024.
-
Progressive Learning without Forgetting
Authors:
Tao Feng,
Hangjie Yuan,
Mang Wang,
Ziyuan Huang,
Ang Bian,
Jianzhou Zhang
Abstract:
Learning from changing tasks and sequential experience without forgetting the obtained knowledge is a challenging problem for artificial neural networks. In this work, we focus on two challenging problems in the paradigm of Continual Learning (CL) without involving any old data: (i) the accumulation of catastrophic forgetting caused by the gradually fading knowledge space from which the model learns the previous knowledge; (ii) the uncontrolled tug-of-war dynamics to balance the stability and plasticity during the learning of new tasks. In order to tackle these problems, we present Progressive Learning without Forgetting (PLwF) and a credit assignment regime in the optimizer. PLwF densely introduces model functions from previous tasks to construct a knowledge space such that it contains the most reliable knowledge on each task and the distribution information of different tasks, while credit assignment controls the tug-of-war dynamics by removing gradient conflict through projection. Extensive ablative experiments demonstrate the effectiveness of PLwF and credit assignment. In comparison with other CL methods, we report notably better results even without relying on any raw data.
Submitted 28 November, 2022;
originally announced November 2022.
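The abstract above mentions removing gradient conflict through projection. As a rough illustration only, the following sketch shows one generic way to project a gradient so that it no longer opposes a reference gradient. It is not the paper's credit assignment regime, whose details are not given in this listing; the function name `project_out_conflict` and the plain-list vector representation are assumptions made for the example.

```python
from typing import List


def project_out_conflict(g: List[float], h: List[float]) -> List[float]:
    """Return g with its component that opposes h removed.

    Hypothetical helper: if g and h conflict (negative dot product),
    g is projected onto the plane orthogonal to h; otherwise g is
    returned unchanged.
    """
    dot = sum(a * b for a, b in zip(g, h))
    if dot >= 0:
        # No conflict: the two gradients do not point against each other.
        return list(g)
    # dot < 0 implies h is non-zero, so the squared norm is positive.
    norm_sq = sum(b * b for b in h)
    scale = dot / norm_sq
    return [a - scale * b for a, b in zip(g, h)]


if __name__ == "__main__":
    new_task_grad = [1.0, -1.0]
    old_task_grad = [0.0, 1.0]
    print(project_out_conflict(new_task_grad, old_task_grad))
```

After projection the returned vector has a non-negative dot product with `h`, so a step along it does not, to first order, move against the reference direction.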
-
A Bhattacharyya Coefficient-Based Framework for Noise Model-Aware Random Walker Image Segmentation
Authors:
Dominik Drees,
Florian Eilers,
Ang Bian,
Xiaoyi Jiang
Abstract:
One well established method of interactive image segmentation is the random walker algorithm. Considerable research on this family of segmentation methods has been continuously conducted in recent years with numerous applications. These methods are common in using a simple Gaussian weight function which depends on a parameter that strongly influences the segmentation performance. In this work we propose a general framework of deriving weight functions based on probabilistic modeling. This framework can be concretized to cope with virtually any well-defined noise model. It eliminates the critical parameter and thus avoids time-consuming parameter search. We derive the specific weight functions for common noise types and show their superior performance on synthetic data as well as different biomedical image data (MRI images from the NYU fastMRI dataset, larvae images acquired with the FIM technique). Our framework can also be used in multiple other applications, e.g., the graph cut algorithm and its extensions.
Submitted 2 June, 2022;
originally announced June 2022.
-
Analysis of Error-prone Electronic Health Records with Multi-wave Validation Sampling: Association of Maternal Weight Gain during Pregnancy with Childhood Outcomes
Authors:
Bryan E. Shepherd,
Kyunghee Han,
Tong Chen,
Aihua Bian,
Shannon Pugh,
Stephany N. Duda,
Thomas Lumley,
William J. Heerman,
Pamela A. Shaw
Abstract:
Electronic health record (EHR) data are increasingly used for biomedical research, but these data have recognized data quality challenges. Data validation is necessary to use EHR data with confidence, but limited resources typically make complete data validation impossible. Using EHR data, we illustrate prospective, multi-wave, two-phase validation sampling to estimate the association between maternal weight gain during pregnancy and the risks of her child developing obesity or asthma. The optimal validation sampling design depends on the unknown efficient influence functions of regression coefficients of interest. In the first wave of our multi-wave validation design, we estimate the influence function using the unvalidated (phase 1) data to determine our validation sample; then in subsequent waves, we re-estimate the influence function using validated (phase 2) data and update our sampling. For efficiency, estimation combines obesity and asthma sampling frames while calibrating sampling weights using generalized raking. We validated 996 of 10,335 mother-child EHR dyads in 6 sampling waves. Estimated associations between childhood obesity/asthma and maternal weight gain, as well as other covariates, are compared to naive estimates that only use unvalidated data. In some cases, estimates markedly differ, underscoring the importance of efficient validation sampling to obtain accurate estimates incorporating validated data.
Submitted 28 September, 2021;
originally announced September 2021.
-
User Acceptance of Gender Stereotypes in Automated Career Recommendations
Authors:
Clarice Wang,
Kathryn Wang,
Andrew Bian,
Rashidul Islam,
Kamrun Naher Keya,
James Foulds,
Shimei Pan
Abstract:
Currently, there is a surge of interest in fair Artificial Intelligence (AI) and Machine Learning (ML) research which aims to mitigate discriminatory bias in AI algorithms, e.g. along lines of gender, age, and race. While most research in this domain focuses on developing fair AI algorithms, in this work, we show that a fair AI algorithm on its own may be insufficient to achieve its intended results in the real world. Using career recommendation as a case study, we build a fair AI career recommender by employing gender debiasing machine learning techniques. Our offline evaluation showed that the debiased recommender makes fairer career recommendations without sacrificing its accuracy. Nevertheless, an online user study of more than 200 college students revealed that participants on average prefer the original biased system over the debiased system. Specifically, we found that perceived gender disparity is a determining factor for the acceptance of a recommendation. In other words, our results demonstrate we cannot fully address the gender bias issue in AI recommendations without addressing the gender bias in humans.
Submitted 28 July, 2021; v1 submitted 13 June, 2021;
originally announced June 2021.
-
Pressure-enhanced interlayer exciton in WS2/MoSe2 van der Waals heterostructure
Authors:
Xiaoli Ma,
Shaohua Fu,
Jianwei Ding,
Meng Liu,
Ang Bian,
Fang Hong,
Jia-Tao Sun,
Xiaoxian Zhang,
Xiaohui Yu,
Dawei He
Abstract:
The atomic-level vdW heterostructures have been one of the most interesting quantum material systems, due to their exotic physical properties. The interlayer coupling in these systems plays a critical role to realize novel physical observation and enrich interface functionality. However, there is still lack of investigation on the tuning of interlayer coupling in a quantitative way. A prospective strategy to tune the interlayer coupling is to change the electronic structure and interlayer distance by high pressure, which is a well-established method to tune the physical properties. Here, we construct a high-quality WS2/MoSe2 heterostructure in a DAC and successfully tuned the interlayer coupling through hydrostatic pressure. Typical photoluminescence spectra of the monolayer MoSe2 (ML-MoSe2), monolayer WS2 (ML-WS2) and WS2/MoSe2 heterostructure have been observed and it's intriguing that their photoluminescence peaks shift with respect to applied pressure in a quite different way. The intralayer exciton of ML-MoSe2 and ML-WS2 show blue shift under high pressure with a coefficient of 19.8 meV/GPa and 9.3 meV/GPa, respectively, while their interlayer exciton shows relative weak pressure dependence with a coefficient of 3.4 meV/GPa. Meanwhile, external pressure helps to drive stronger interlayer interaction and results in a higher ratio of interlayer/intralayer exciton intensity, indicating the enhanced interlayer exciton behavior. The first-principles calculation reveals the stronger interlayer interaction which leads to enhanced interlayer exciton behavior in WS2/MoSe2 heterostructure under external pressure and reveals the robust peak of interlayer exciton. This work provides an effective strategy to study the interlayer interaction in vdW heterostructures, which could be of great importance for the material and device design in various similar quantum systems.
Submitted 15 March, 2021;
originally announced March 2021.
-
Provable Non-Convex Optimization and Algorithm Validation via Submodularity
Authors:
Yatao An Bian
Abstract:
Submodularity is one of the most well-studied properties of problem classes in combinatorial optimization and many applications of machine learning and data mining, with strong implications for guaranteed optimization. In this thesis, we investigate the role of submodularity in provable non-convex optimization and validation of algorithms. A profound understanding which classes of functions can be tractably optimized remains a central challenge for non-convex optimization. By advancing the notion of submodularity to continuous domains (termed "continuous submodularity"), we characterize a class of generally non-convex and non-concave functions -- continuous submodular functions, and derive algorithms for approximately maximizing them with strong approximation guarantees. Meanwhile, continuous submodularity captures a wide spectrum of applications, ranging from revenue maximization with general marketing strategies, MAP inference for DPPs to mean field inference for probabilistic log-submodular models, which renders it as a valuable domain knowledge in optimizing this class of objectives. Validation of algorithms is an information-theoretic framework to investigate the robustness of algorithms to fluctuations in the input/observations and their generalization ability. We investigate various algorithms for one of the paradigmatic unconstrained submodular maximization problem: MaxCut. Due to submodularity of the MaxCut objective, we are able to present efficient approaches to calculate the algorithmic information content of MaxCut algorithms. The results provide insights into the robustness of different algorithmic techniques for MaxCut.
Submitted 18 December, 2019;
originally announced December 2019.
-
COLA: Decentralized Linear Learning
Authors:
Lie He,
An Bian,
Martin Jaggi
Abstract:
Decentralized machine learning is a promising emerging paradigm in view of global challenges of data ownership and privacy. We consider learning of linear classification and regression models, in the setting where the training data is decentralized over many user devices, and the learning algorithm must run on-device, on an arbitrary communication network, without a central coordinator. We propose COLA, a new decentralized training algorithm with strong theoretical guarantees and superior practical performance. Our framework overcomes many limitations of existing methods, and achieves communication efficiency, scalability, elasticity as well as resilience to changes in data and participating devices.
Submitted 18 June, 2019; v1 submitted 13 August, 2018;
originally announced August 2018.
-
A Distributed Second-Order Algorithm You Can Trust
Authors:
Celestine Dünner,
Aurelien Lucchi,
Matilde Gargiani,
An Bian,
Thomas Hofmann,
Martin Jaggi
Abstract:
Due to the rapid growth of data and computational resources, distributed optimization has become an active research area in recent years. While first-order methods seem to dominate the field, second-order methods are nevertheless attractive as they potentially require fewer communication rounds to converge. However, there are significant drawbacks that impede their wide adoption, such as the computation and the communication of a large Hessian matrix. In this paper we present a new algorithm for distributed training of generalized linear models that only requires the computation of diagonal blocks of the Hessian matrix on the individual workers. To deal with this approximate information we propose an adaptive approach that - akin to trust-region methods - dynamically adapts the auxiliary model to compensate for modeling errors. We provide theoretical rates of convergence for a wide class of problems including L1-regularized objectives. We also demonstrate that our approach achieves state-of-the-art results on multiple large benchmark datasets.
Submitted 20 June, 2018;
originally announced June 2018.
-
Optimal DR-Submodular Maximization and Applications to Provable Mean Field Inference
Authors:
An Bian,
Joachim M. Buhmann,
Andreas Krause
Abstract:
Mean field inference in probabilistic models is generally a highly nonconvex problem. Existing optimization methods, e.g., coordinate ascent algorithms, can only generate local optima.
In this work we propose provable mean field methods for probabilistic log-submodular models and their posterior agreement (PA) with strong approximation guarantees. The main algorithmic technique is a new Double Greedy scheme, termed DR-DoubleGreedy, for continuous DR-submodular maximization with box-constraints. It is a one-pass algorithm with linear time complexity, reaching the optimal 1/2 approximation ratio, which may be of independent interest. We validate the superior performance of our algorithms against baseline algorithms on both synthetic and real-world datasets.
Submitted 29 November, 2018; v1 submitted 18 May, 2018;
originally announced May 2018.
-
Continuous DR-submodular Maximization: Structure and Algorithms
Authors:
An Bian,
Kfir Y. Levy,
Andreas Krause,
Joachim M. Buhmann
Abstract:
DR-submodular continuous functions are important objectives with wide real-world applications spanning MAP inference in determinantal point processes (DPPs), and mean-field inference for probabilistic submodular models, amongst others. DR-submodularity captures a subclass of non-convex functions that enables both exact minimization and approximate maximization in polynomial time.
In this work we study the problem of maximizing non-monotone DR-submodular continuous functions under general down-closed convex constraints. We start by investigating geometric properties that underlie such objectives, e.g., a strong relation between (approximately) stationary points and global optimum is proved. These properties are then used to devise two optimization algorithms with provable guarantees. Concretely, we first devise a "two-phase" algorithm with $1/4$ approximation guarantee. This algorithm allows the use of existing methods for finding (approximately) stationary points as a subroutine, thus, harnessing recent progress in non-convex optimization. Then we present a non-monotone Frank-Wolfe variant with $1/e$ approximation guarantee and sublinear convergence rate. Finally, we extend our approach to a broader class of generalized DR-submodular continuous functions, which captures a wider spectrum of applications. Our theoretical findings are validated on synthetic and real-world problem instances.
Submitted 24 May, 2019; v1 submitted 3 November, 2017;
originally announced November 2017.
-
Guarantees for Greedy Maximization of Non-submodular Functions with Applications
Authors:
Andrew An Bian,
Joachim M. Buhmann,
Andreas Krause,
Sebastian Tschiatschek
Abstract:
We investigate the performance of the standard Greedy algorithm for cardinality constrained maximization of non-submodular nondecreasing set functions. While there are strong theoretical guarantees on the performance of Greedy for maximizing submodular functions, there are few guarantees for non-submodular ones. However, Greedy enjoys strong empirical performance for many important non-submodular functions, e.g., the Bayesian A-optimality objective in experimental design. We prove theoretical guarantees supporting the empirical performance. Our guarantees are characterized by a combination of the (generalized) curvature $α$ and the submodularity ratio $γ$. In particular, we prove that Greedy enjoys a tight approximation guarantee of $\frac{1}α(1- e^{-γα})$ for cardinality constrained maximization. In addition, we bound the submodularity ratio and curvature for several important real-world objectives, including the Bayesian A-optimality objective, the determinantal function of a square submatrix and certain linear programs with combinatorial constraints. We experimentally validate our theoretical findings for both synthetic and real-world applications.
Submitted 14 May, 2019; v1 submitted 6 March, 2017;
originally announced March 2017.
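To make the guarantee quoted in the abstract above concrete, here is a minimal sketch of the standard Greedy algorithm for cardinality-constrained maximization, together with a helper that evaluates the bound $\frac{1}{α}(1 - e^{-γα})$. The function names `greedy_maximize` and `greedy_guarantee`, and the toy coverage objective in the usage example, are illustrative assumptions rather than code from the paper. Two sanity checks follow directly from the formula: with $α = γ = 1$ the bound is the familiar $1 - 1/e$, and as $α \to 0$ it tends to $γ$.

```python
import math
from typing import Callable, FrozenSet, Hashable, Iterable, Set


def greedy_maximize(
    f: Callable[[FrozenSet[Hashable]], float],
    ground_set: Iterable[Hashable],
    k: int,
) -> Set[Hashable]:
    """Standard Greedy: add the item with the largest marginal gain, k times."""
    items = list(ground_set)
    selected: Set[Hashable] = set()
    for _ in range(k):
        base = f(frozenset(selected))
        best_item, best_gain = None, -math.inf
        for item in items:
            if item in selected:
                continue
            gain = f(frozenset(selected | {item})) - base
            if gain > best_gain:  # ties keep the first item encountered
                best_item, best_gain = item, gain
        if best_item is None:  # ground set exhausted before k picks
            break
        selected.add(best_item)
    return selected


def greedy_guarantee(alpha: float, gamma: float) -> float:
    """Evaluate (1/alpha) * (1 - exp(-gamma * alpha)) from the abstract."""
    if alpha == 0:
        return gamma  # limit of the expression as alpha -> 0
    return (1.0 - math.exp(-gamma * alpha)) / alpha


if __name__ == "__main__":
    # Toy coverage objective (assumed for illustration only).
    sets = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}

    def cover(chosen: FrozenSet[Hashable]) -> float:
        covered = set()
        for name in chosen:
            covered |= sets[name]
        return float(len(covered))

    print(sorted(greedy_maximize(cover, list(sets), 2)))
    print(greedy_guarantee(1.0, 1.0))
```

The coverage objective here happens to be submodular, so it only exercises the mechanics of Greedy; the bound itself is aimed at non-submodular objectives, where the submodularity ratio $γ$ can be less than 1.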
-
Guaranteed Non-convex Optimization: Submodular Maximization over Continuous Domains
Authors:
Andrew An Bian,
Baharan Mirzasoleiman,
Joachim M. Buhmann,
Andreas Krause
Abstract:
Submodular continuous functions are a category of (generally) non-convex/non-concave functions with a wide spectrum of applications. We characterize these functions and demonstrate that they can be maximized efficiently with approximation guarantees. Specifically, i) We introduce the weak DR property that gives a unified characterization of submodularity for all set, integer-lattice and continuous functions; ii) for maximizing monotone DR-submodular continuous functions under general down-closed convex constraints, we propose a Frank-Wolfe variant with $(1-1/e)$ approximation guarantee, and sub-linear convergence rate; iii) for maximizing general non-monotone submodular continuous functions subject to box constraints, we propose a DoubleGreedy algorithm with $1/3$ approximation guarantee. Submodular continuous functions naturally find applications in various real-world settings, including influence and revenue maximization with continuous assignments, sensor energy management, multi-resolution data summarization, facility location, etc. Experimental results show that the proposed algorithms efficiently generate superior solutions compared to baseline algorithms.
Submitted 6 May, 2019; v1 submitted 17 June, 2016;
originally announced June 2016.
-
Parallel Coordinate Descent Newton Method for Efficient $\ell_1$-Regularized Minimization
Authors:
An Bian,
Xiong Li,
Yuncai Liu,
Ming-Hsuan Yang
Abstract:
The recent years have witnessed advances in parallel algorithms for large scale optimization problems. Notwithstanding demonstrated success, existing algorithms that parallelize over features are usually limited by divergence issues under high parallelism or require data preprocessing to alleviate these problems. In this work, we propose a Parallel Coordinate Descent Newton algorithm using multidimensional approximate Newton steps (PCDN), where the off-diagonal elements of the Hessian are set to zero to enable parallelization. It randomly partitions the feature set into $b$ bundles/subsets with size of $P$, and sequentially processes each bundle by first computing the descent directions for each feature in parallel and then conducting $P$-dimensional line search to obtain the step size. We show that: (1) PCDN is guaranteed to converge globally despite increasing parallelism; (2) PCDN converges to the specified accuracy $ε$ within the limited iteration number of $T_ε$, and $T_ε$ decreases with increasing parallelism (bundle size $P$). Using the implementation technique of maintaining intermediate quantities, we minimize the data transfer and synchronization cost of the $P$-dimensional line search. For concreteness, the proposed PCDN algorithm is applied to $\ell_1$-regularized logistic regression and $\ell_2$-loss SVM. Experimental evaluations on six benchmark datasets show that the proposed PCDN algorithm exploits parallelism well and outperforms the state-of-the-art methods in speed without losing accuracy.
Submitted 7 December, 2017; v1 submitted 18 June, 2013;
originally announced June 2013.