-
Fast Classification of Large Time Series Datasets
Authors:
Muhammad Marwan Muhammad Fuad
Abstract:
Time series classification (TSC) is the most important task in time series mining as it has several applications in medicine, meteorology, finance, cyber security, and many others. With the ever increasing size of time series datasets, several traditional TSC methods are no longer efficient enough to perform this task on such very large datasets. Yet, most recent papers on TSC focus mainly on accuracy by using methods that apply deep learning, for instance, which require extensive computational resources that cannot be applied efficiently to very large datasets. The method we introduce in this paper focuses on these very large time series datasets with the main objective being efficiency. We achieve this through a simplified representation of the time series. This in turn is enhanced by a distance measure that considers only some of the values of the represented time series. The result of this combination is a very efficient representation method for TSC. This has been tested experimentally against another time series method that is particularly popular for its efficiency. The experiments show that our method is not only 4 times faster, on average, but it is also superior in terms of classification accuracy, as it gives better results on 24 out of the 29 tested time series datasets.
Submitted 10 December, 2023;
originally announced December 2023.
-
TSAX is Trending
Authors:
Muhammad Marwan Muhammad Fuad
Abstract:
Time series mining is an important branch of data mining, as time series data is ubiquitous and has many applications in several domains. The main task in time series mining is classification. Time series representation methods play an important role in time series classification and other time series mining tasks. One of the most popular representation methods of time series data is the Symbolic Aggregate approXimation (SAX). The secret behind its popularity is its simplicity and efficiency. SAX has however one major drawback, which is its inability to represent trend information. Several methods have been proposed to enable SAX to capture trend information, but this comes at the expense of complex processing, preprocessing, or post-processing procedures. In this paper we present a new modification of SAX that we call Trending SAX (TSAX), which only adds minimal complexity to SAX, but substantially improves its performance in time series classification. This is validated experimentally on 50 datasets. The results show the superior performance of our method, as it gives a smaller classification error on 39 datasets compared with SAX.
Submitted 23 December, 2021;
originally announced December 2021.
-
Extreme-SAX: Extreme Points Based Symbolic Representation for Time Series Classification
Authors:
Muhammad Marwan Muhammad Fuad
Abstract:
Time series classification is an important problem in data mining with several applications in different domains. Because time series data are usually high dimensional, dimensionality reduction techniques have been proposed as an efficient approach to lower their dimensionality. One of the most popular dimensionality reduction techniques of time series data is the Symbolic Aggregate Approximation (SAX), which is inspired by algorithms from text mining and bioinformatics. SAX is simple and efficient because it uses precomputed distances. The disadvantage of SAX is its inability to accurately represent important points in the time series. In this paper we present Extreme-SAX (E-SAX), which uses only the extreme points of each segment to represent the time series. E-SAX has exactly the same simplicity and efficiency of the original SAX, yet it gives better results in time series classification than the original SAX, as we show in extensive experiments on a variety of time series datasets.
Submitted 1 October, 2020;
originally announced October 2020.
-
Modifying the Symbolic Aggregate Approximation Method to Capture Segment Trend Information
Authors:
Muhammad Marwan Muhammad Fuad
Abstract:
The Symbolic Aggregate approXimation (SAX) is a very popular symbolic dimensionality reduction technique of time series data, as it has several advantages over other dimensionality reduction techniques. One of its major advantages is its efficiency, as it uses precomputed distances. The other main advantage is that in SAX the distance measure defined on the reduced space lower bounds the distance measure defined on the original space. This enables SAX to return exact results in query-by-content tasks. Yet SAX has an inherent drawback, which is its inability to capture segment trend information. Several researchers have attempted to enhance SAX by proposing modifications to include trend information. However, this comes at the expense of giving up on one or more of the advantages of SAX. In this paper we investigate three modifications of SAX to add trend capturing ability to it. These modifications retain the same features of SAX in terms of simplicity, efficiency, as well as the exact results it returns. They are simple procedures based on a different segmentation of the time series than that used in classic-SAX. We test the performance of these three modifications on 45 time series datasets of different sizes, dimensions, and nature, on a classification task and we compare it to that of classic-SAX. The results we obtained show that one of these modifications manages to outperform classic-SAX and that another one gives slightly better results than classic-SAX.
Submitted 1 October, 2020;
originally announced October 2020.
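The abstract above relies on two properties of classic SAX: distances between symbols can be precomputed, and the distance on the reduced space lower bounds the Euclidean distance on the original (z-normalized) series. The following is a minimal sketch of classic SAX only, not of the three modifications the paper proposes. It assumes the series length is divisible by the number of segments; the function names (`breakpoints`, `paa`, `sax`, `mindist`) and the integer symbol encoding are illustrative choices, not taken from the paper.

```python
import math
from bisect import bisect_right
from statistics import NormalDist, fmean, pstdev


def breakpoints(alphabet_size):
    # Cut points that split the standard normal into equiprobable regions.
    nd = NormalDist()
    return [nd.inv_cdf(i / alphabet_size) for i in range(1, alphabet_size)]


def znorm(series):
    # Z-normalize; a constant series maps to all zeros (an assumption here).
    mu, sigma = fmean(series), pstdev(series)
    if sigma == 0:
        return [0.0] * len(series)
    return [(x - mu) / sigma for x in series]


def paa(series, n_segments):
    # Piecewise aggregate approximation: mean of each equal-length segment.
    # Assumes len(series) is divisible by n_segments.
    seg = len(series) // n_segments
    return [fmean(series[i * seg:(i + 1) * seg]) for i in range(n_segments)]


def sax(series, n_segments, alphabet_size):
    # Symbols are integers 0..alphabet_size-1 instead of letters.
    cuts = breakpoints(alphabet_size)
    return [bisect_right(cuts, v) for v in paa(znorm(series), n_segments)]


def mindist(word_a, word_b, series_length, alphabet_size):
    # Lower-bounding distance between two SAX words of equal length.
    # Adjacent or equal symbols contribute zero; otherwise the gap between
    # the nearest breakpoints is used (this table could be precomputed).
    cuts = breakpoints(alphabet_size)
    total = 0.0
    for r, c in zip(word_a, word_b):
        if abs(r - c) > 1:
            total += (cuts[max(r, c) - 1] - cuts[min(r, c)]) ** 2
    return math.sqrt(series_length / len(word_a)) * math.sqrt(total)


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


rising = [0, 1, 2, 3, 4, 5, 6, 7]
falling = rising[::-1]
word_r = sax(rising, n_segments=4, alphabet_size=4)
word_f = sax(falling, n_segments=4, alphabet_size=4)
lower = mindist(word_r, word_f, len(rising), alphabet_size=4)
exact = euclidean(znorm(rising), znorm(falling))
```

In this example `lower` never exceeds `exact`, which is what allows candidates to be discarded on the reduced representation without missing true matches.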
-
Impact on the Productivity of Remotely Working IT Professionals of Bangladesh during the Coronavirus Disease 2019
Authors:
Kishan Kumar Ganguly,
Noshin Tahsin,
Mridha Md. Nafis Fuad,
Toukir Ahammed,
Moumita Asad,
Syed Fatiul Huq,
A. T. M. Fazlay Rabbi,
Kazi Sakib
Abstract:
Similar to the rest of the world, the recent pandemic situation has forced the IT professionals of Bangladesh to adopt remote work. The aim of this study is to find out whether remote work can be continued even after the lockdown is lifted. As work from home (WFH) may change various productivity related aspects of the employees, i.e., team dynamics and company dynamics, it is necessary to understand the nature of the change during WFH. Conducting a survey, we asked the IT professionals of Bangladesh how they perceive their level of productivity during WFH and how the factors related to productivity have changed. We analyzed the change and identified the areas affected by WFH. We discovered that resource and workspace related issues and the emotional well-being of the employees have been hampered the most during WFH. We believe that the findings from this study will help to decide how to resolve those issues and will help to understand whether WFH can be continued even after the lockdown is lifted.
Submitted 11 September, 2020; v1 submitted 26 August, 2020;
originally announced August 2020.
-
Applying Nature-Inspired Optimization Algorithms for Selecting Important Timestamps to Reduce Time Series Dimensionality
Authors:
Muhammad Marwan Muhammad Fuad
Abstract:
Time series data account for a major part of data supply available today. Time series mining handles several tasks such as classification, clustering, query-by-content, prediction, and others. Performing data mining tasks on raw time series is inefficient as these data are high-dimensional by nature. Instead, time series are first pre-processed using several techniques before different data mining tasks can be performed on them. In general, there are two main approaches to reduce time series dimensionality, the first is what we call landmark methods. These methods are based on finding characteristic features in the target time series. The second is based on data transformations. These methods transform the time series from the original space into a reduced space, where they can be managed more efficiently. The method we present in this paper applies a third approach, as it projects a time series onto a lower-dimensional space by selecting important points in the time series. The novelty of our method is that these points are not chosen according to a geometric criterion, which is subjective in most cases, but through an optimization process. The other important characteristic of our method is that these important points are selected on a dataset-level and not on a single time series-level. The direct advantage of this strategy is that the distance defined on the low-dimensional space lower bounds the original distance applied to raw data. This enables us to apply the popular GEMINI algorithm. The promising results of our experiments on a wide variety of time series datasets, using different optimizers, and applied to the two major data mining tasks, validate our new method.
Submitted 9 December, 2018;
originally announced December 2018.
-
One-Step or Two-Step Optimization and the Overfitting Phenomenon: A Case Study on Time Series Classification
Authors:
Muhammad Marwan Muhammad Fuad
Abstract:
For the last few decades, optimization has been developing at a fast rate. Bio-inspired optimization algorithms are metaheuristics inspired by nature. These algorithms have been applied to solve different problems in engineering, economics, and other domains. Bio-inspired algorithms have also been applied in different branches of information technology such as networking and software engineering. Time series data mining is a field of information technology that has its share of these applications too. In previous works we showed how bio-inspired algorithms such as the genetic algorithms and differential evolution can be used to find the locations of the breakpoints used in the symbolic aggregate approximation of time series representation, and in another work we showed how we can utilize the particle swarm optimization, one of the famous bio-inspired algorithms, to set weights to the different segments in the symbolic aggregate approximation representation. In this paper we present, in two different approaches, a new meta optimization process that produces optimal locations of the breakpoints in addition to optimal weights of the segments. The experiments of time series classification task that we conducted show an interesting example of how the overfitting phenomenon, a frequently encountered problem in data mining which happens when the model overfits the training set, can interfere in the optimization process and hide the superior performance of an optimization algorithm.
Submitted 16 July, 2014;
originally announced July 2014.
-
Towards Normalizing the Edit Distance Using a Genetic Algorithms Based Scheme
Authors:
Muhammad Marwan Muhammad Fuad
Abstract:
The normalized edit distance is one of the distances derived from the edit distance. It is useful in some applications because it takes into account the lengths of the two strings compared. The normalized edit distance is not defined in terms of edit operations but rather in terms of the edit path. In this paper we propose a new derivative of the edit distance that also takes into consideration the lengths of the two strings, but the new distance is related directly to the edit distance. The particularity of the new distance is that it uses the genetic algorithms to set the values of the parameters it uses. We conduct experiments to test the new distance and we obtain promising results.
Submitted 5 December, 2013;
originally announced December 2013.
-
Particle Swarm Optimization of Information-Content Weighting of Symbolic Aggregate Approximation
Authors:
Muhammad Marwan Muhammad Fuad
Abstract:
Bio-inspired optimization algorithms have been gaining more popularity recently. One of the most important of these algorithms is particle swarm optimization (PSO). PSO is based on the collective intelligence of a swarm of particles. Each particle explores a part of the search space looking for the optimal position and adjusts its position according to two factors: the first is its own experience and the second is the collective experience of the whole swarm. PSO has been successfully used to solve many optimization problems. In this work we use PSO to improve the performance of a well-known representation method of time series data which is the symbolic aggregate approximation (SAX). As with other time series representation methods, SAX results in loss of information when applied to represent time series. In this paper we use PSO to propose a new minimum distance WMD for SAX to remedy this problem. Unlike the original minimum distance, the new distance sets different weights to different segments of the time series according to their information content. This weighted minimum distance enhances the performance of SAX as we show through experiments using different time series datasets.
Submitted 5 December, 2013;
originally announced December 2013.
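The abstract above describes how each particle adjusts its position using its own experience and the collective experience of the swarm. The sketch below illustrates that generic update rule on a toy objective; it is not the paper's WMD weighting scheme. The function name `pso`, the inertia and acceleration coefficients, the clamping of positions to the search bounds, and the sphere objective are all assumptions made for illustration.

```python
import random


def pso(objective, dim, bounds, n_particles=20, n_iters=100,
        inertia=0.7, c_self=1.5, c_swarm=1.5, seed=0):
    # Generic particle swarm minimizer (illustrative parameter values).
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]

    # Each particle remembers its own best position (its "experience").
    best_pos = [p[:] for p in pos]
    best_val = [objective(p) for p in pos]

    # The swarm shares the best position found by any particle so far.
    g = min(range(n_particles), key=best_val.__getitem__)
    swarm_pos, swarm_val = best_pos[g][:], best_val[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c_self * r1 * (best_pos[i][d] - pos[i][d])
                             + c_swarm * r2 * (swarm_pos[d] - pos[i][d]))
                # Keep the particle inside the search bounds.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < best_val[i]:
                best_val[i], best_pos[i] = val, pos[i][:]
                if val < swarm_val:
                    swarm_val, swarm_pos = val, pos[i][:]
    return swarm_pos, swarm_val


def sphere(x):
    # Toy objective with its minimum of 0 at the origin.
    return sum(v * v for v in x)


found_pos, found_val = pso(sphere, dim=2, bounds=(-5.0, 5.0))
```

To weight SAX segments in the spirit of the abstract, the objective would instead score a candidate weight vector, for example by the classification error it produces on a training set; that wiring is left out here because the abstract does not specify it.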
-
ABC-SG: A New Artificial Bee Colony Algorithm-Based Distance of Sequential Data Using Sigma Grams
Authors:
Muhammad Marwan Muhammad Fuad
Abstract:
The problem of similarity search is one of the main problems in computer science. This problem has many applications in text-retrieval, web search, computational biology, bioinformatics and others. Similarity between two data objects can be depicted using a similarity measure or a distance metric. There are numerous distance metrics in the literature, some are used for a particular data type, and others are more general. In this paper we present a new distance metric for sequential data which is based on the sum of n-grams. The novelty of our distance is that these n-grams are weighted using artificial bee colony; a recent optimization algorithm based on the collective intelligence of a swarm of bees on their search for nectar. This algorithm has been used in optimizing a large number of numerical problems. We validate the new distance experimentally.
Submitted 4 December, 2013;
originally announced December 2013.
-
Towards a faster symbolic aggregate approximation method
Authors:
Muhammad Marwan Muhammad Fuad,
Pierre-François Marteau
Abstract:
The similarity search problem is one of the main problems in time series data mining. Traditionally, this problem was tackled by sequentially comparing the given query against all the time series in the database, and returning all the time series that are within a predetermined threshold of that query. But the large size and the high dimensionality of time series databases that are in use nowadays make that scenario inefficient. There are many representation techniques that aim at reducing the dimensionality of time series so that the search can be handled faster at a lower-dimensional space level. The symbolic aggregate approximation (SAX) is one of the most competitive methods in the literature. In this paper we present a new method that improves the performance of SAX by adding to it another exclusion condition that increases the exclusion power. This method is based on using two representations of the time series: one of SAX and the other is based on an optimal approximation of the time series. Pre-computed distances are calculated and stored offline to be used online to exclude a wide range of the search space using two exclusion conditions. We conduct experiments which show that the new method is faster than SAX.
Submitted 24 January, 2013;
originally announced January 2013.
-
Self-Healing by Means of Runtime Execution Profiling
Authors:
Mohammad Muztaba Fuad,
Debzani Deb,
**suk Baek
Abstract:
A self-healing application brings itself into a stable state after a failure puts the software into an unstable state. For such a self-healing software application, finding a fix for a previously unseen fault is a grand challenge. Asking the user to provide fixes for every fault is bad for productivity, especially when the users are not savvy in the technical aspects of computing. If failure scenarios come into existence, the user wants the runtime environment to handle those situations autonomically. This paper presents a new technique of finding self-healing actions by matching a fault scenario to already established fault models. By profiling and capturing runtime parameters and execution pathways, stable execution models are established and later are used to match with an unstable execution scenario. Experimentation and results are presented that show that, even with additional overheads, this technique can prove beneficial for autonomically healing faults and relieving system administrators from mundane troubleshooting situations.
Submitted 26 March, 2012;
originally announced March 2012.
-
Agent Based Processing of Global Evaluation Function
Authors:
M. Shahriar Hossain,
M. Muztaba Fuad,
Md. Mahbubul Alam Joarder
Abstract:
Load balancing across a networked environment is a monotonous job. Moreover, if the job to be distributed is a constraint satisfying one, the distribution of load demands core intelligence. This paper proposes parallel processing through Global Evaluation Function by means of randomly initialized agents for solving Constraint Satisfaction Problems. A potential issue about the number of agents in a machine under the invocation of distribution is discussed here for securing the maximum benefit from Global Evaluation and parallel processing. The proposed system is compared with typical solution that shows an exclusive outcome supporting the nobility of parallel implementation of Global Evaluation Function with certain number of agents in each invoked machine.
Submitted 29 March, 2011;
originally announced March 2011.
-
Triangular Dynamic Architecture for Distributed Computing in a LAN Environment
Authors:
M. Shahriar Hossain,
Kazi Muhammad Najmul Hasan Khan,
M. Muztaba Fuad,
Debzani Deb
Abstract:
A computationally intensive large job, granulized to concurrent pieces and operating in a dynamic environment should reduce the total processing time. However, distributing jobs across a networked environment is a tedious and difficult task. Job distribution in a Local Area Network based on Triangular Dynamic Architecture (TDA) is a mechanism that establishes a dynamic environment for job distribution, load balancing and distributed processing with minimum interaction from the user. This paper introduces TDA and discusses its architecture and shows the benefits gained by utilizing such architecture in a distributed computing environment.
Submitted 29 March, 2011;
originally announced March 2011.
-
Load Balancing in a Networked Environment through Homogenization
Authors:
M. Shahriar Hossain,
M. Muztaba Fuad,
Debzani Deb,
Kazi Muhammad Najmul Hasan Khan,
Md. Mahbubul Alam Joarder
Abstract:
Distributed processing across a networked environment suffers from unpredictable speedup behavior due to the heterogeneous nature of the hardware and software in the remote machines. It is challenging to get better performance from a distributed system by distributing tasks in an intelligent manner such that the heterogeneous nature of the system does not have any effect on the speedup ratio. This paper introduces homogenization, a technique that distributes and balances the workload in such a manner that the user gets the highest speedup possible from a distributed environment. Along with providing better performance, homogenization is totally transparent to the user and requires no interaction with the system.
Submitted 29 March, 2011;
originally announced March 2011.
-
The Extended Edit Distance Metric
Authors:
Muhammad Marwan Muhammad Fuad,
Pierre-François Marteau
Abstract:
Similarity search is an important problem in information retrieval. This similarity is based on a distance. Symbolic representation of time series has attracted many researchers recently, since it reduces the dimensionality of these high dimensional data objects. We propose a new distance metric that is applied to symbolic data objects and we test it on time series databases in a classification task. We compare it to other distances that are well known in the literature for symbolic data objects. We also prove, mathematically, that our distance is a metric.
Submitted 28 September, 2007;
originally announced September 2007.