Search | arXiv e-print repository

Balancing Efficiency vs. Effectiveness and Providing Missing Label Robustness in Multi-Label Stream Classification

Abstract: Available works addressing multi-label classification in a data stream environment focus on proposing accurate models; however, these models often exhibit inefficiency and cannot balance effectiveness and efficiency. In this work, we propose a neural network-based approach that tackles this issue and is suitable for high-dimensional multi-label classification. Our model uses a selective concept drift adaptation mechanism that makes it suitable for a non-stationary environment. Additionally, we adapt our model to an environment with missing labels using a simple yet effective imputation strategy and demonstrate that it outperforms a vast majority of the state-of-the-art supervised models. To achieve our purposes, we introduce a weighted binary relevance-based approach named ML-BELS using the Broad Ensemble Learning System (BELS) as its base classifier. Instead of a chain of stacked classifiers, our model employs independent weighted ensembles, with the weights generated by the predictions of a BELS classifier. We show that using the weighting strategy on datasets with low label cardinality negatively impacts the accuracy of the model; with this in mind, we use the label cardinality as a trigger for applying the weights. We present an extensive assessment of our model using 11 state-of-the-art baselines, five synthetic datasets, and 13 real-world datasets, all with different characteristics. Our results demonstrate that the proposed approach ML-BELS is successful in balancing effectiveness and efficiency, and is robust to missing labels and concept drift.

Submitted 1 October, 2023; originally announced October 2023.

arXiv:2308.14175 [pdf, other]

Leveraging Linear Independence of Component Classifiers: Optimizing Size and Prediction Accuracy for Online Ensembles

Authors: Enes Bektas, Fazli Can

Abstract: Ensembles, which employ a set of classifiers to enhance classification accuracy collectively, are crucial in the era of big data. However, although there is general agreement that ensemble size and prediction accuracy are related, the exact nature of this relationship is still unknown. We introduce a novel perspective, rooted in the linear independence of classifiers' votes, to analyze the interplay between ensemble size and prediction accuracy. This framework reveals a theoretical link, consequently proposing an ensemble size based on this relationship. Our study builds upon a geometric framework and develops a series of theorems. These theorems clarify the role of linear dependency in crafting ensembles. We present a method to determine the minimum ensemble size required to ensure a target probability of linearly independent votes among component classifiers. Incorporating real and synthetic datasets, our empirical results demonstrate a trend: increasing the number of classifiers enhances accuracy, as predicted by our theoretical insights. However, we also identify a point of diminishing returns, beyond which additional classifiers provide diminishing improvements in accuracy. Surprisingly, the calculated ideal ensemble size deviates from empirical results for certain datasets, emphasizing the influence of other factors. This study opens avenues for deeper investigations into the complex dynamics governing ensemble design and offers guidance for constructing efficient and effective ensembles in practical scenarios.

Submitted 27 August, 2023; originally announced August 2023.

arXiv:2308.10807 [pdf, ps, other]

doi 10.1145/3583780.3615266

DynED: Dynamic Ensemble Diversification in Data Stream Classification

Authors: Soheil Abadifard, Sepehr Bakhshi, Sanaz Gheibuni, Fazli Can

Abstract: Ensemble methods are commonly used in classification due to their remarkable performance. Achieving high accuracy in a data stream environment is a challenging task considering disruptive changes in the data distribution, also known as concept drift. A greater diversity of ensemble components is known to enhance prediction accuracy in such settings. Despite the diversity of components within an ensemble, not all contribute as expected to its overall performance. This necessitates a method for selecting components that exhibit high performance and diversity. We present a novel ensemble construction and maintenance approach based on MMR (Maximal Marginal Relevance) that dynamically combines the diversity and prediction accuracy of components during the process of structuring an ensemble. The experimental results on four real and 11 synthetic datasets demonstrate that the proposed approach (DynED) provides a higher average mean accuracy compared to the five state-of-the-art baselines.

Submitted 6 September, 2023; v1 submitted 21 August, 2023; originally announced August 2023.

Comments: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (CIKM '23), October 21--25, 2023, Birmingham, United Kingdom

arXiv:2210.12383 [pdf, other]

Stance Detection and Open Research Avenues

Authors: Dilek Küçük, Fazli Can

Abstract: This tutorial aims to cover the state-of-the-art on stance detection and address open research avenues for interested researchers and practitioners. Stance detection is a recent research topic where the stance towards a given target or target set is determined based on the given content and there are significant application opportunities of stance detection in various domains. The tutorial comprises two parts where the first part outlines the fundamental concepts, problems, approaches, and resources of stance detection, while the second part covers open research avenues and application areas of stance detection. The tutorial will be a useful guide for researchers and practitioners of stance detection, social media analysis, information retrieval, and natural language processing.

Submitted 22 October, 2022; originally announced October 2022.

arXiv:2210.05401 [pdf, other]

MiDe22: An Annotated Multi-Event Tweet Dataset for Misinformation Detection

Authors: Cagri Toraman, Oguzhan Ozcelik, Furkan Şahinuç, Fazli Can

Abstract: The rapid dissemination of misinformation through online social networks poses a pressing issue with harmful consequences jeopardizing human health, public safety, democracy, and the economy; therefore, urgent action is required to address this problem. In this study, we construct a new human-annotated dataset, called MiDe22, having 5,284 English and 5,064 Turkish tweets with their misinformation labels for several recent events between 2020 and 2022, including the Russia-Ukraine war, COVID-19 pandemic, and Refugees. The dataset includes user engagements with the tweets in terms of likes, replies, retweets, and quotes. We also provide a detailed data analysis with descriptive statistics and the experimental results of a benchmark evaluation for misinformation detection.

Submitted 11 July, 2024; v1 submitted 11 October, 2022; originally announced October 2022.

Comments: Published at LREC-COLING 2024

arXiv:2202.00070 [pdf, other]

Implicit Concept Drift Detection for Multi-label Data Streams

Authors: Ege Berkay Gulcan, Fazli Can

Abstract: Many real-world applications adopt multi-label data streams as the need for algorithms to deal with rapidly changing data increases. Changes in data distribution, also known as concept drift, cause the existing classification models to rapidly lose their effectiveness. To assist the classifiers, we propose a novel algorithm called Label Dependency Drift Detector (LD3), an implicit (unsupervised) concept drift detector using label dependencies within the data for multi-label data streams. Our study exploits the dynamic temporal dependencies between labels using a label influence ranking method, which leverages a data fusion algorithm and uses the produced ranking to detect concept drift. LD3 is the first unsupervised concept drift detection algorithm in the multi-label classification problem area. In this study, we perform an extensive evaluation of LD3 by comparing it with 14 prevalent supervised concept drift detection algorithms that we adapt to the problem area using 12 datasets and a baseline classifier. The results show that LD3 provides between 19.8% and 68.6% better predictive performance than comparable detectors on both real-world and synthetic data streams.

Submitted 31 January, 2022; originally announced February 2022.

Comments: 18 pages, 7 figures, submitted to Artificial Intelligence Review

arXiv:2110.03540 [pdf, other]

A Broad Ensemble Learning System for Drifting Stream Classification

Authors: Sepehr Bakhshi, Pouya Ghahramanian, Hamed Bonab, Fazli Can

Abstract: In a data stream environment, classification models must handle concept drift efficiently and effectively. Ensemble methods are widely used for this purpose; however, the ones available in the literature either use a large data chunk to update the model or learn the data one by one. In the former, the model may miss the changes in the data distribution, and in the latter, the model may suffer from inefficiency and instability. To address these issues, we introduce a novel ensemble approach based on the Broad Learning System (BLS), where mini chunks are used at each update. BLS is an effective lightweight neural architecture recently developed for incremental learning. Although it is fast, it requires huge data chunks for effective updates, and is unable to handle dynamic changes observed in data streams. Our proposed approach named Broad Ensemble Learning System (BELS) uses a novel updating method that significantly improves best-in-class model accuracy. It employs an ensemble of output layers to address the limitations of BLS and handle drifts. Our model tracks the changes in the accuracy of the ensemble components and reacts to these changes. We present the mathematical derivation of BELS, perform comprehensive experiments with 20 datasets that demonstrate the adaptability of our model to various drift types, and provide hyperparameter and ablation analysis of our proposed model. Our experiments show that the proposed approach outperforms nine state-of-the-art baselines and supplies an overall improvement of 13.28% in terms of average prequential accuracy.

Submitted 14 March, 2023; v1 submitted 7 October, 2021; originally announced October 2021.

Comments: Submitted to IEEE Access

arXiv:2109.07611 [pdf, other]

On-the-Fly Ensemble Pruning in Evolving Data Streams

Authors: Sanem Elbasi, Alican Büyükçakır, Hamed Bonab, Fazli Can

Abstract: Ensemble pruning is the process of selecting a subset of component classifiers from an ensemble which performs at least as well as the original ensemble while reducing storage and computational costs. Ensemble pruning in data streams is a largely unexplored area of research. It requires analysis of ensemble components as they are running on the stream, and differentiation of useful classifiers from redundant ones. We present CCRP, an on-the-fly ensemble pruning method for multi-class data stream classification empowered by an imbalance-aware fusion of class-wise component rankings. CCRP aims that the resulting pruned ensemble contains the best performing classifier for each target class and hence, reduces the effects of class imbalance. The conducted experiments on real-world and synthetic data streams demonstrate that different types of ensembles that integrate CCRP as their pruning scheme consistently yield on par or superior performance with 20% to 90% less average memory consumption. Lastly, we validate the proposed pruning scheme by comparing our approach against pruning schemes based on ensemble weights and basic rank fusion methods.

Submitted 15 September, 2021; originally announced September 2021.

Comments: 5 pages, 2 figures

arXiv:2001.11639 [pdf, other]

ParkingSticker: A Real-World Object Detection Dataset

Authors: Caroline Potts, Ethem F. Can, Aysu Ezen-Can, Xiangqian Hu

Abstract: We present a new and challenging object detection dataset, ParkingSticker, which mimics the type of data available in industry problems more closely than popular existing datasets like PASCAL VOC. ParkingSticker contains 1,871 images that come from a security camera's video footage. The objective is to identify parking stickers on cars approaching a gate that the security camera faces. Bounding boxes are drawn around parking stickers in the images. The parking stickers are much smaller on average than the objects in other popular object detection datasets; this makes ParkingSticker a challenging test for object detection methods. This dataset also very realistically represents the data available in many industry problems where a customer presents a few video frames and asks for a solution to a very difficult problem. The performance of various object detection pipelines using a YOLOv2 architecture is presented and indicates that identifying the parking stickers in ParkingSticker is challenging yet feasible. We believe that this dataset will challenge researchers to solve a real-world problem with real-world constraints such as non-ideal camera positioning and small object-size-to-image-size ratios.

Submitted 12 February, 2020; v1 submitted 30 January, 2020; originally announced January 2020.

Comments: 8 pages, 8 figures; Updated authors

arXiv:2001.05857 [pdf, other]

The Effect of Data Ordering in Image Classification

Authors: Ethem F. Can, Aysu Ezen-Can

Abstract: The success stories from deep learning models increase every day spanning different tasks from image classification to natural language understanding. With the increasing popularity of these models, scientists spend more and more time finding the optimal parameters and best model architectures for their tasks. In this paper, we focus on the ingredient that feeds these machines: the data. We hypothesize that the data ordering affects how well a model performs. To that end, we conduct experiments on an image classification task using the ImageNet dataset and show that some data orderings are better than others in terms of obtaining higher classification accuracies. Experimental results show that independent of model architecture, learning rate and batch size, ordering of the data significantly affects the outcome. We show these findings using different metrics: NDCG, accuracy @ 1 and accuracy @ 5. Our goal here is to show that not only parameters and model architectures but also the data ordering has a say in obtaining better results.

Submitted 8 January, 2020; originally announced January 2020.

Journal ref: Under consideration at Pattern Recognition Letters 2020

arXiv:1901.04787 [pdf, ps, other]

A Tweet Dataset Annotated for Named Entity Recognition and Stance Detection

Authors: Dilek Küçük, Fazli Can

Abstract: Annotated datasets in different domains are critical for many supervised learning-based solutions to related problems and for the evaluation of the proposed solutions. Topics in natural language processing (NLP) similarly require annotated datasets to be used for such purposes. In this paper, we target two NLP problems, named entity recognition and stance detection, and present the details of a tweet dataset in Turkish annotated for named entity and stance information. Within the course of the current study, both the named entity and stance annotations of the included tweets are made publicly available, although previously the dataset has been publicly shared with stance annotations only. We believe that this dataset will be useful for uncovering the possible relationships between named entity recognition and stance detection in tweets.

Submitted 16 January, 2019; v1 submitted 15 January, 2019; originally announced January 2019.

Comments: 4 pages; resource URLs are made properly accessible (by clicking them)

arXiv:1809.09994 [pdf, other]

doi 10.1145/3269206.3271774

A Novel Online Stacked Ensemble for Multi-Label Stream Classification

Authors: Alican Büyükçakır, Hamed Bonab, Fazli Can

Abstract: As data streams become more prevalent, the necessity for online algorithms that mine this transient and dynamic data becomes clearer. Multi-label data stream classification is a supervised learning problem where each instance in the data stream is classified into one or more pre-defined sets of labels. Many methods have been proposed to tackle this problem, including but not limited to ensemble-based methods. Some of these ensemble-based methods are specifically designed to work with certain multi-label base classifiers; some others employ online bagging schemes to build their ensembles. In this study, we introduce a novel online and dynamically-weighted stacked ensemble for multi-label classification, called GOOWE-ML, that utilizes spatial modeling to assign optimal weights to its component classifiers. Our model can be used with any existing incremental multi-label classification algorithm as its base classifier. We conduct experiments with 4 GOOWE-ML-based multi-label ensembles and 7 baseline models on 7 real-world datasets from diverse areas of interest. Our experiments show that GOOWE-ML ensembles yield consistently better results in terms of predictive performance in almost all of the datasets, with respect to the other prominent ensemble models.

Submitted 26 September, 2018; originally announced September 2018.

Comments: 10 pages, 4 figures. To appear in ACM CIKM 2018, in Torino, Italy

arXiv:1806.04511 [pdf, other]

Multilingual Sentiment Analysis: An RNN-Based Framework for Limited Data

Authors: Ethem F. Can, Aysu Ezen-Can, Fazli Can

Abstract: Sentiment analysis is a widely studied NLP task where the goal is to determine opinions, emotions, and evaluations of users towards a product, an entity or a service that they are reviewing. One of the biggest challenges for sentiment analysis is that it is highly language dependent. Word embeddings, sentiment lexicons, and even annotated data are language specific. Further, optimizing models for each language is very time consuming and labor intensive especially for recurrent neural network models. From a resource perspective, it is very challenging to collect data for different languages. In this paper, we look for an answer to the following research question: can a sentiment analysis model trained on a language be reused for sentiment analysis in other languages, Russian, Spanish, Turkish, and Dutch, where the data is more limited? Our goal is to build a single model in the language with the largest dataset available for the task, and reuse it for languages that have limited resources. For this purpose, we train a sentiment analysis model using recurrent neural networks with reviews in English. We then translate reviews in other languages and reuse this model to evaluate the sentiments. Experimental results show that our robust approach of a single model trained on English reviews statistically significantly outperforms the baselines in several different languages.

Submitted 8 June, 2018; originally announced June 2018.

Comments: ACM SIGIR 2018 Workshop on Learning from Limited or Noisy Data (LND4IR'18)

arXiv:1803.08910 [pdf, ps, other]

Stance Detection on Tweets: An SVM-based Approach

Authors: Dilek Küçük, Fazli Can

Abstract: Stance detection is a subproblem of sentiment analysis where the stance of the author of a piece of natural language text for a particular target (either explicitly stated in the text or not) is explored. The stance output is usually given as Favor, Against, or Neither. In this paper, we target stance detection on sports-related tweets and present the performance results of our SVM-based stance classifiers on such tweets. First, we describe three versions of our proprietary tweet data set annotated with stance information, all of which are made publicly available for research purposes. Next, we evaluate SVM classifiers using different feature sets for stance detection on this data set. The employed features are based on unigrams, bigrams, hashtags, external links, emoticons, and lastly, named entities. The results indicate that joint use of the features based on unigrams, hashtags, and named entities by SVM classifiers is a plausible approach for the stance detection problem on sports-related tweets.

Submitted 23 March, 2018; originally announced March 2018.

Comments: 13 pages

arXiv:1709.02925 [pdf, other]

Less Is More: A Comprehensive Framework for the Number of Components of Ensemble Classifiers

Authors: Hamed Bonab, Fazli Can

Abstract: The number of component classifiers chosen for an ensemble greatly impacts the prediction ability. In this paper, we use a geometric framework for a priori determining the ensemble size, which is applicable to most existing batch and online ensemble classifiers. There are only a limited number of studies on the ensemble size examining Majority Voting (MV) and Weighted Majority Voting (WMV). Almost all of them are designed for batch-mode, hardly addressing online environments. Big data dimensions and resource limitations, in terms of time and memory, make determination of ensemble size crucial, especially for online environments. For the MV aggregation rule, our framework proves that the more strong components we add to the ensemble, the more accurate predictions we can achieve. For the WMV aggregation rule, our framework proves the existence of an ideal number of components, which is equal to the number of class labels, with the premise that components are completely independent of each other and strong enough. While giving the exact definition for a strong and independent classifier in the context of an ensemble is a challenging task, our proposed geometric framework provides a theoretical explanation of diversity and its impact on the accuracy of predictions. We conduct a series of experimental evaluations to show the practical value of our theorems and existing challenges.

Submitted 29 September, 2018; v1 submitted 9 September, 2017; originally announced September 2017.

Comments: This is an extended version of the work presented as a short paper at the Conference on Information and Knowledge Management (CIKM), 2016

arXiv:1709.02800 [pdf, other]

GOOWE: Geometrically Optimum and Online-Weighted Ensemble Classifier for Evolving Data Streams

Authors: Hamed R. Bonab, Fazli Can

Abstract: Designing adaptive classifiers for an evolving data stream is a challenging task due to the data size and its dynamically changing nature. Combining individual classifiers in an online setting, the ensemble approach, is a well-known solution. It is possible that a subset of classifiers in the ensemble outperforms others in a time-varying fashion. However, optimum weight assignment for component classifiers is a problem which is not yet fully addressed in online evolving environments. We propose a novel data stream ensemble classifier, called Geometrically Optimum and Online-Weighted Ensemble (GOOWE), which assigns optimum weights to the component classifiers using a sliding window containing the most recent data instances. We map vote scores of individual classifiers and true class labels into a spatial environment. Based on the Euclidean distance between vote scores and ideal-points, and using the linear least squares (LSQ) solution, we present a novel, dynamic, and online weighting approach. While LSQ is used for batch mode ensemble classifiers, it is the first time that we adapt and use it for online environments by providing a spatial modeling of online ensembles. In order to show the robustness of the proposed algorithm, we use real-world datasets and synthetic data generators using the MOA libraries. First, we analyze the impact of our weighting system on prediction accuracy through two scenarios. Second, we compare GOOWE with 8 state-of-the-art ensemble classifiers in a comprehensive experimental environment. Our experiments show that GOOWE provides improved reactions to different types of concept drift compared to our baselines. The statistical tests indicate a significant improvement in accuracy, with conservative time and memory requirements.

Submitted 7 September, 2017; originally announced September 2017.

Comments: 33 Pages, Accepted for publication in The ACM Transactions on Knowledge Discovery from Data (TKDD) in August 2017

Showing 1–16 of 16 results for author: Can, F