-
Exploring the Influence of Dimensionality Reduction on Anomaly Detection Performance in Multivariate Time Series
Authors:
Mahsun Altin,
Altan Cakir
Abstract:
This paper presents an extensive empirical study on the integration of dimensionality reduction techniques with advanced unsupervised time series anomaly detection models, focusing on the MUTANT and Anomaly-Transformer models. The study involves a comprehensive evaluation across three different datasets: MSL, SMAP, and SWaT. Each dataset poses unique challenges, allowing for a robust assessment of the models' capabilities in varied contexts. The dimensionality reduction techniques examined include PCA, UMAP, Random Projection, and t-SNE, each offering distinct advantages in simplifying high-dimensional data. Our findings reveal that dimensionality reduction not only aids in reducing computational complexity but also significantly enhances anomaly detection performance in certain scenarios. Moreover, a remarkable reduction in training times was observed, with reductions by approximately 300% and 650% when dimensionality was halved and minimized to the lowest dimensions, respectively. This efficiency gain underscores the dual benefit of dimensionality reduction in both performance enhancement and operational efficiency. The MUTANT model exhibits notable adaptability, especially with UMAP reduction, while the Anomaly-Transformer demonstrates versatility across various reduction techniques. These insights provide a deeper understanding of the synergistic effects of dimensionality reduction and anomaly detection, contributing valuable perspectives to the field of time series analysis. The study underscores the importance of selecting appropriate dimensionality reduction strategies based on specific model requirements and dataset characteristics, paving the way for more efficient, accurate, and scalable solutions in anomaly detection.
Submitted 7 March, 2024;
originally announced March 2024.
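As an illustration of the "halved dimensionality" setting mentioned in the abstract above, the following is a minimal sketch of Gaussian random projection applied to a multivariate time series before it would be passed to a detector. It is not the authors' pipeline: the function name `gaussian_random_projection`, the toy 8-channel series, and the seeds are all assumptions made for this example, and only the Python standard library is used.

```python
import math
import random


def gaussian_random_projection(rows, target_dim, seed=0):
    """Project each row (a list of floats) down to target_dim features.

    Hypothetical helper for illustration only. A single Gaussian matrix,
    scaled by 1/sqrt(target_dim), is drawn once and reused for every row.
    """
    rng = random.Random(seed)
    source_dim = len(rows[0])
    scale = 1.0 / math.sqrt(target_dim)
    matrix = [
        [rng.gauss(0.0, 1.0) * scale for _ in range(target_dim)]
        for _ in range(source_dim)
    ]
    return [
        [sum(x * matrix[i][j] for i, x in enumerate(row)) for j in range(target_dim)]
        for row in rows
    ]


# Toy data (assumed, not from the paper): 200 time steps, 8 channels.
data_rng = random.Random(42)
series = [[data_rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(200)]

# Halve the number of channels from 8 to 4.
reduced = gaussian_random_projection(series, target_dim=4)
```

The reduced series keeps one row per time step, so a downstream windowed detector would see the same sequence length with fewer input channels.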
-
Modified Query Expansion Through Generative Adversarial Networks for Information Extraction in E-Commerce
Authors:
Altan Cakir,
Mert Gurkan
Abstract:
This work addresses an alternative approach for query expansion (QE) using a generative adversarial network (GAN) to enhance the effectiveness of information search in e-commerce. We propose a modified QE conditional GAN (mQE-CGAN) framework, which resolves keywords by expanding the query with a synthetically generated query that proposes semantic information from text input. We train a sequence-to-sequence transformer model as the generator to produce keywords and use a recurrent neural network model as the discriminator to classify an adversarial output with the generator. With the modified CGAN framework, various forms of semantic insights gathered from the query document corpus are introduced to the generation process. We leverage these insights as conditions for the generator model and discuss their effectiveness for the query expansion task. Our experiments demonstrate that the utilization of condition structures within the mQE-CGAN framework can increase the semantic similarity between generated sequences and reference documents by up to nearly 10% compared to baseline models.
Submitted 30 December, 2022;
originally announced January 2023.
-
Unsupervised Behaviour Analysis of News Consumption in Turkish Media
Authors:
Didem Makaroglu,
Altan Cakir,
Behcet Ugur Toreyin
Abstract:
Clickstream data, which come with a massive volume generated by human activities on websites, have become a prominent feature for identifying readers' characteristics by newsrooms after the digitization of news outlets. Although the nature of clickstream data has a similar logic within websites, it has inherent limitations in recognizing human behaviours when looking from a broad perspective, which brings the need to limit the problem to niche areas. This study investigates anonymized readers' click activities on the organizations' websites to identify news consumption patterns following referrals from Twitter, focusing on readers who reach the content incidentally but whose propensity is mainly toward routed news content. Methodologies for ensemble cluster analysis with mixed-type embedding strategies are applied and compared to find similar reader groups and interests independent of time. Various internal validation perspectives are used to determine the optimality of the quality of clusters, where the Calinski-Harabasz Index (CHI) is found to give a generalizable result. Our findings demonstrate that clustering a mixed-type dataset approaches the optimal internal validation scores, which we define to discriminate the clusters and algorithms considering applied strategies, when the data are embedded by Uniform Manifold Approximation and Projection (UMAP) and a consensus function is used as a key to access the most applicable hyperparameter configurations in the given ensemble rather than using consensus function results directly. Evaluation of the resulting clusters highlights specific clusters repeatedly present in the separated monthly samples, with Adjusted Mutual Information scores greater than 0.5, which provide insights to the news organizations and overcome the degradation of the modeling behaviours due to the change in interest over time.
Submitted 8 October, 2022; v1 submitted 4 February, 2022;
originally announced February 2022.
-
Cloud Based Big Data DNS Analytics at Turknet
Authors:
Altan Cakir,
Yousef Alkhanafseh,
Esra Karabiyik,
Erhan Kurubas,
Rabia Burcu Bunyak,
Cenk Anil Bahcevan
Abstract:
The Domain Name System (DNS) is a hierarchical distributed naming system for computers, services, or any resource connected to the Internet. DNS resolves queries for URLs into IP addresses for the purpose of locating computer services and devices worldwide. At present, analytical applications over vast amounts of DNS data remain a challenging problem. Clustering the features of domain traffic from DNS data has created a need for more sophisticated analytics platforms and tools because of the sensitivity of the data characterization. In this study, a cloud-based big data application on DNS data, built on Apache Spark, is proposed, together with a periodic trend pattern based on traffic that partitions numerous domain names and regions into separate groups by the characteristics of their query traffic time series. Preliminary experimental results on Turknet DNS data in daily operations are discussed along with business intelligence applications.
Submitted 8 July, 2020;
originally announced July 2020.
-
An Evaluation of Recent Neural Sequence Tagging Models in Turkish Named Entity Recognition
Authors:
Gizem Aras,
Didem Makaroglu,
Seniz Demir,
Altan Cakir
Abstract:
Named entity recognition (NER) is an extensively studied task that extracts and classifies named entities in a text. NER is crucial not only in downstream language processing applications such as relation extraction and question answering but also in large-scale big data operations such as real-time analysis of online digital media content. Recent research efforts on Turkish, a less studied language with a morphologically rich nature, have demonstrated the effectiveness of neural architectures on well-formed texts and yielded state-of-the-art results by formulating the task as a sequence tagging problem. In this work, we empirically investigate the use of recent neural architectures (bidirectional long short-term memory and Transformer-based networks) proposed for Turkish NER tagging in the same setting. Our results demonstrate that transformer-based networks, which can model long-range context, overcome the limitations of BiLSTM networks where different input features at the character, subword, and word levels are utilized. We also propose a transformer-based network with a conditional random field (CRF) layer that leads to the state-of-the-art result (95.95% F-measure) on a common dataset. Our study contributes to the literature that quantifies the impact of transfer learning on processing morphologically rich languages.
Submitted 18 May, 2020; v1 submitted 14 May, 2020;
originally announced May 2020.
-
Enabling Big Data Analytics at Manufacturing Fields of Farplas Automotive
Authors:
Ozgun Akin,
Halil Faruk Deniz,
Dogukan Nefis,
Alp Kiziltan,
Altan Cakir
Abstract:
Digitization and data-driven manufacturing processes are needed in today's industry. The term Industry 4.0 stands for today's industrial digitization, which is defined as a new level of organization and control over the entire value chain of the life cycle of products; it is geared towards increasingly individualized customers' high-quality expectations. However, due to the increase in the number of connected devices and the variety of data, it has become difficult to store and analyze data with conventional systems. The motivation of this paper is to provide an overview of the big data pipeline, covering real-time on-premise data acquisition, data compression, data storage, and processing with an Apache Kafka and Apache Spark implementation on an Apache Hadoop cluster. The paper also identifies the challenges and issues that occurred during implementation at the Farplas manufacturing company, which is one of the biggest Tier 1 automotive suppliers in Turkey, in order to study the new trends and streams related to these topics via Industry 4.0.
Submitted 24 April, 2020;
originally announced April 2020.
-
Artificial Intelligence Assistance Significantly Improves Gleason Grading of Prostate Biopsies by Pathologists
Authors:
Wouter Bulten,
Maschenka Balkenhol,
Jean-Joël Awoumou Belinga,
Américo Brilhante,
Aslı Çakır,
Xavier Farré,
Katerina Geronatsiou,
Vincent Molinié,
Guilherme Pereira,
Paromita Roy,
Günter Saile,
Paulo Salles,
Ewout Schaafsma,
Joëlle Tschui,
Anne-Marie Vos,
Hester van Boven,
Robert Vink,
Jeroen van der Laak,
Christina Hulsbergen-van de Kaa,
Geert Litjens
Abstract:
While the Gleason score is the most important prognostic marker for prostate cancer patients, it suffers from significant observer variability. Artificial Intelligence (AI) systems, based on deep learning, have proven to achieve pathologist-level performance at Gleason grading. However, the performance of such systems can degrade in the presence of artifacts, foreign tissue, or other anomalies. Pathologists integrating their expertise with feedback from an AI system could result in a synergy that outperforms both the individual pathologist and the system. Despite the hype around AI assistance, existing literature on this topic within the pathology domain is limited. We investigated the value of AI assistance for grading prostate biopsies. A panel of fourteen observers graded 160 biopsies with and without AI assistance. Using AI, the agreement of the panel with an expert reference standard significantly increased (quadratically weighted Cohen's kappa, 0.799 vs 0.872; p=0.018). Our results show the added value of AI systems for Gleason grading, but more importantly, show the benefits of pathologist-AI synergy.
Submitted 11 February, 2020;
originally announced February 2020.
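The agreement statistic quoted in the abstract above, quadratically weighted Cohen's kappa, can be sketched as follows. This is a generic textbook formulation under stated assumptions, not the authors' evaluation code: the function name `quadratic_weighted_kappa` and the toy ratings are hypothetical, and categories are assumed to be integer labels from 0 to `num_categories - 1`.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_categories):
    """Cohen's kappa with quadratic disagreement weights.

    Hypothetical helper for illustration. Labels must be integers in
    range(num_categories); both raters must score the same items.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must score the same non-empty set of items")
    if num_categories < 2:
        raise ValueError("at least two categories are required")

    k = num_categories
    n = len(rater_a)

    # Observed contingency table: rows are rater A, columns are rater B.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1

    # Marginal label counts for each rater.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            # Quadratic weight: 0 on the diagonal, 1 at maximal distance.
            weight = (i - j) ** 2 / (k - 1) ** 2
            expected = hist_a[i] * hist_b[j] / n  # count expected by chance
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0.0:
        raise ValueError("kappa is undefined when both raters use one category")
    return 1.0 - weighted_observed / weighted_expected


# Toy ratings (assumed, not from the study).
perfect = quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3], 4)
chance = quadratic_weighted_kappa([0, 0, 1, 1], [0, 1, 0, 1], 2)
```

With quadratic weights, a disagreement of two grade groups is penalized four times as heavily as a disagreement of one, which suits ordinal scales such as grading.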
-
Simulation Of Logic Circuit Tests On Android-Based Mobile Devices
Authors:
Abdülkadir Çakir,
Ümmüşan Çitak
Abstract:
In this study, an application that can run on Android and Windows-based mobile devices was developed to allow students attending such classes as Numerical/Digital Electronics, Logic Circuits, Basic Electronics Measurement and Electronic Systems in Turkey's Vocational and Technical Education Schools to easily carry out the simulation of logic gates, as well as logic circuit tests performed using logic gates. A 2D mobile application that runs on both platforms was developed using the C# language in the Unity3D editor. To assess the usability of the mobile application, a one-hour training session was administered in March of the 2017-2018 academic year to two groups of students from a single class in the sixth grade of an Imam Hatip Secondary School affiliated to the Ministry of National Education. Each of the two groups contained 12 students who were assumed to be equivalent, and who had no prior knowledge of the subject. The training of the first group began with a lecture on basic logic gates using a blackboard, and involved no simulations, while the second group, in addition to being given the same lecture, received additional training involving demonstrations of the developed mobile application and its simulations. Following the lectures, a written exam was administered to both groups. An evaluation of the exam results revealed that 83 percent of the students who had been given demonstrations of the mobile application were able to perform the circuit task completely, whereas only 50 percent of the other group were able to complete the task. It was concluded that the application was both useful and facilitating for the students, and it was also noted that students who were supported by the mobile application had gained a better grasp of the topic by being able to see and practice the simulations firsthand.
Submitted 24 May, 2018;
originally announced May 2018.
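The kind of logic gate simulation described in the abstract above can be sketched in a few lines. The application itself was written in C# on Unity3D; the Python below is only an illustrative stand-in, and the names `GATES`, `half_adder`, and `truth_table` are hypothetical rather than taken from the application.

```python
from itertools import product

# Basic two-input gates over the bits 0 and 1.
GATES = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
    "NAND": lambda a, b: 1 - (a & b),
    "NOR": lambda a, b: 1 - (a | b),
}


def half_adder(a, b):
    """Small example circuit: returns (sum_bit, carry_bit)."""
    return GATES["XOR"](a, b), GATES["AND"](a, b)


def truth_table(circuit, num_inputs):
    """Evaluate a circuit on every combination of input bits."""
    return [
        (bits, circuit(*bits)) for bits in product((0, 1), repeat=num_inputs)
    ]


table = truth_table(half_adder, 2)
for inputs, outputs in table:
    print(inputs, "->", outputs)
```

Enumerating the full truth table mirrors a logic circuit test: each input combination is applied and the outputs are compared with what the student expects.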