-
Self-Labeling the Job Shop Scheduling Problem
Authors:
Andrea Corsini,
Angelo Porrello,
Simone Calderara,
Mauro Dell'Amico
Abstract:
In this work, we propose a Self-Supervised training strategy specifically designed for combinatorial problems. One of the main obstacles in applying supervised paradigms to such problems is the requirement of expensive target solutions as ground-truth, often produced with costly exact solvers. Inspired by Semi- and Self-Supervised learning, we show that it is possible to easily train generative models by sampling multiple solutions and using the best one according to the problem objective as a pseudo-label. In this way, we iteratively improve the model generation capability by relying only on its self-supervision, completely removing the need for optimality information. We prove the effectiveness of this Self-Labeling strategy on the Job Shop Scheduling (JSP), a complex combinatorial problem that is receiving much attention from the Reinforcement Learning community. We propose a generative model based on the well-known Pointer Network and train it with our strategy. Experiments on two popular benchmarks demonstrate the potential of this approach as the resulting models outperform constructive heuristics and current state-of-the-art Reinforcement Learning proposals.
Submitted 22 January, 2024;
originally announced January 2024.
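The pseudo-labeling step described in the abstract above (sample several solutions, keep the best one according to the objective) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy instance `JOBS`, the job-repetition encoding of a solution, and especially `sample_solution`, which is a uniformly random stand-in for the generative model (the paper uses a Pointer Network-based model, not reproduced here).

```python
import random

# Hypothetical toy JSP instance: each job is an ordered list of (machine, duration).
JOBS = [
    [(0, 3), (1, 2), (2, 2)],
    [(0, 2), (2, 1), (1, 4)],
    [(1, 4), (2, 3), (0, 1)],
]

def makespan(sequence, jobs):
    """Decode a job-repetition sequence into a schedule and return its makespan."""
    next_op = [0] * len(jobs)      # index of the next operation of each job
    job_ready = [0] * len(jobs)    # time at which each job is free again
    machine_ready = {}             # time at which each machine is free again
    for j in sequence:
        machine, duration = jobs[j][next_op[j]]
        start = max(job_ready[j], machine_ready.get(machine, 0))
        job_ready[j] = machine_ready[machine] = start + duration
        next_op[j] += 1
    return max(job_ready)

def sample_solution(jobs, rng):
    """Stand-in for the generative model: a uniformly random feasible sequence."""
    sequence = [j for j, ops in enumerate(jobs) for _ in ops]
    rng.shuffle(sequence)
    return sequence

def self_label(jobs, num_samples, rng):
    """Sample several solutions and keep the best one as the pseudo-label."""
    candidates = [sample_solution(jobs, rng) for _ in range(num_samples)]
    return min(candidates, key=lambda s: makespan(s, jobs))

rng = random.Random(0)
pseudo_label = self_label(JOBS, num_samples=32, rng=rng)
print(pseudo_label, makespan(pseudo_label, JOBS))
```

In a training loop, the selected sequence would serve as the supervised target for the next model update; that update step is omitted here because it depends on the model architecture.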
-
Learning the Quality of Machine Permutations in Job Shop Scheduling
Authors:
Andrea Corsini,
Simone Calderara,
Mauro Dell'Amico
Abstract:
In recent years, the power demonstrated by Machine Learning (ML) has increasingly attracted the interest of the optimization community that is starting to leverage ML for enhancing and automating the design of algorithms. One combinatorial optimization problem recently tackled with ML is the Job Shop scheduling Problem (JSP). Most of the works on the JSP using ML focus on Deep Reinforcement Learning (DRL), and only a few of them leverage supervised learning techniques. The recurrent reasons for avoiding supervised learning seem to be the difficulty in casting the right learning task, i.e., what is meaningful to predict, and how to obtain labels. Therefore, we first propose a novel supervised learning task that aims at predicting the quality of machine permutations. Then, we design an original methodology to estimate this quality, and we use these estimations to create an accurate sequential deep learning model (binary accuracy above 95%). Finally, we empirically demonstrate the value of predicting the quality of machine permutations by enhancing the performance of a simple Tabu Search algorithm inspired by the works in the literature.
Submitted 16 September, 2022; v1 submitted 7 July, 2022;
originally announced July 2022.
-
Unsupervised Detection and Clustering of Malicious TLS Flows
Authors:
Gibran Gomez,
Platon Kotzias,
Matteo Dell'Amico,
Leyla Bilge,
Juan Caballero
Abstract:
Malware abuses TLS to encrypt its malicious traffic, preventing examination by content signatures and deep packet inspection. Network detection of malicious TLS flows is an important, but challenging, problem. Prior works have proposed supervised machine learning detectors using TLS features. However, by trying to represent all malicious traffic, supervised binary detectors produce models that are too loose, thus introducing errors. Furthermore, they do not distinguish flows generated by different malware. On the other hand, supervised multi-class detectors produce tighter models and can classify flows by malware family, but require family labels, which are not available for many samples.
To address these limitations, this work proposes a novel unsupervised approach to detect and cluster malicious TLS flows. Our approach takes as input network traces from sandboxes. It clusters similar TLS flows using 90 features that capture properties of the TLS client, TLS server, certificate, and encrypted payload; and uses the clusters to build an unsupervised detector that can assign a malicious flow to the cluster it belongs to, or determine it is benign. We evaluate our approach using 972K traces from a commercial sandbox and 35M TLS flows from a research network. Our clustering shows very high precision and recall with an F1 score of 0.993. We compare our unsupervised detector with two state-of-the-art approaches, showing that it outperforms both. The false detection rate of our detector is 0.032% measured over four months of traffic.
Submitted 23 December, 2022; v1 submitted 8 September, 2021;
originally announced September 2021.
-
Benchmark Instances and Optimal Solutions for the Traveling Salesman Problem with Drone
Authors:
Mauro Dell'Amico,
Roberto Montemanni,
Stefano Novellani
Abstract:
The use of drones in logistics is attracting growing interest, and drones are becoming a more viable and common way of distributing parcels in an urban environment. As a consequence, there is a flourishing production of articles in the field of operational optimization of the combined use of trucks and drones for fulfilling customers' requests. The aim is to minimize the total time required to service all the customers, since this has an obvious economic impact. However, in the literature there is not yet a widely recognized basic model, and there are no well-assessed sets of instances and optimal solutions that can be considered as a benchmark to prove the effectiveness of new solution methods. The aim of this paper is to fill this gap. On one side, we clearly describe some of the most common components of the truck/drone routing problems and define nine basic problem settings by combining these components. On the other side, we consider some of the instances used by many researchers and provide optimal solutions for all the problem settings previously identified. Instances and detailed solutions are then organized into benchmarks made publicly available as validation tools for future research methods.
Submitted 28 July, 2021;
originally announced July 2021.
-
FISHDBC: Flexible, Incremental, Scalable, Hierarchical Density-Based Clustering for Arbitrary Data and Distance
Authors:
Matteo Dell'Amico
Abstract:
FISHDBC is a flexible, incremental, scalable, and hierarchical density-based clustering algorithm. It is flexible because it empowers users to work on arbitrary data, skipping the feature extraction step that usually transforms raw data into numeric arrays and letting users define an arbitrary distance function instead. It is incremental and scalable: it avoids the $\mathcal O(n^2)$ performance of other approaches in non-metric spaces and requires only lightweight computation to update the clustering when few items are added. It is hierarchical: it produces a "flat" clustering which can be expanded to a tree structure, so that users can group and/or divide clusters in sub- or super-clusters when data exploration requires so. It is density-based and approximates HDBSCAN*, an evolution of DBSCAN.
Submitted 16 October, 2019;
originally announced October 2019.
-
Scheduling With Inexact Job Sizes: The Merits of Shortest Processing Time First
Authors:
Matteo Dell'Amico
Abstract:
It is well known that size-based scheduling policies, which take into account job size (i.e., the time it takes to run them), can perform very desirably in terms of both response time and fairness. Unfortunately, the requirement of knowing a priori the exact job size is a major obstacle which is frequently insurmountable in practice. Often, it is possible to get a coarse estimation of job size, but unfortunately analytical results with inexact job sizes are challenging to obtain, and simulation-based studies show that several size-based algorithms are severely impacted by job estimation errors. For example, Shortest Remaining Processing Time (SRPT), which yields optimal mean sojourn time when job sizes are known exactly, can drastically underperform when it is fed inexact job sizes.
Some algorithms have been proposed to better handle size estimation errors, but they are somewhat complex and this makes their analysis challenging. We consider Shortest Processing Time (SPT), a simplification of SRPT that skips the update of "remaining" job size and results in a preemptive algorithm that simply schedules the job with the shortest estimated processing time. When job size is inexact, SPT performs comparably to the best known algorithms in the presence of errors, while being definitely simpler. In this work, SPT is evaluated through simulation, showing near-optimal performance in many cases, with the hope that its simplicity can open the way to analytical evaluation even when inexact inputs are considered.
Submitted 10 July, 2019;
originally announced July 2019.
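As an illustration of the policy described in the abstract above, here is a minimal single-server sketch of preemptive SPT: the server always runs the pending job with the smallest estimated size, and the estimate is never updated as the job progresses. The function name `mean_sojourn_spt`, the `(arrival, true_size, estimated_size)` job format, and the tie-breaking rule (earlier arrival first) are assumptions of this sketch, not details taken from the paper.

```python
import heapq

def mean_sojourn_spt(jobs):
    """Mean sojourn time under preemptive SPT on a single server.

    jobs: list of (arrival, true_size, estimated_size) tuples.
    """
    jobs = sorted(jobs)
    pending = []      # heap of (estimate, arrival, job_id); the top is the running job
    remaining = {}    # true remaining work per pending job
    now, total, i, n = 0.0, 0.0, 0, len(jobs)
    while i < n or pending:
        if not pending:
            now = max(now, jobs[i][0])    # idle until the next arrival
        while i < n and jobs[i][0] <= now:
            arrival, size, estimate = jobs[i]
            heapq.heappush(pending, (estimate, arrival, i))
            remaining[i] = size
            i += 1
        _, arrival, job_id = pending[0]
        next_arrival = jobs[i][0] if i < n else float("inf")
        if remaining[job_id] <= next_arrival - now:
            # The running job finishes before anything else arrives.
            now += remaining.pop(job_id)
            heapq.heappop(pending)
            total += now - arrival
        else:
            # Serve until the next arrival, then re-evaluate priorities.
            remaining[job_id] -= next_arrival - now
            now = next_arrival
    return total / n

exact = [(0, 10, 10), (1, 2, 2), (2, 1, 1)]
underestimated = [(0, 10, 1), (1, 2, 2), (2, 1, 1)]   # the large job is estimated as size 1
print(mean_sojourn_spt(exact), mean_sojourn_spt(underestimated))
```

In the second workload the large job is underestimated, so it keeps its high priority and delays both small jobs, which is the kind of estimation error the abstract refers to.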
-
The Supermarket Model with Known and Predicted Service Times
Authors:
Michael Mitzenmacher,
Matteo Dell'Amico
Abstract:
The supermarket model refers to a system with a large number of queues, where new customers choose d queues at random and join the one with the fewest customers. This model demonstrates the power of even small amounts of choice, as compared to simply joining a queue chosen uniformly at random, for load balancing systems. In this work we perform simulation-based studies to consider variations where service times for a customer are predicted, as might be done in modern settings using machine learning techniques or related mechanisms. Our primary takeaway is that using even seemingly weak predictions of service times can yield significant benefits over blind First In First Out queueing in this context. However, some care must be taken when using predicted service time information to both choose a queue and order elements for service within a queue; while in many cases using the information for both choosing and ordering is beneficial, in many of our simulation settings we find that simply using the number of jobs to choose a queue is better when using predicted service times to order jobs in a queue. In our simulations, we evaluate both synthetic and real-world workloads--in the latter, service times are predicted by machine learning. Our results provide practical guidance for the design of real-world systems; moreover, we leave many natural theoretical open questions for future work, validating their relevance to real-world situations.
Submitted 17 February, 2022; v1 submitted 23 May, 2019;
originally announced May 2019.
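The baseline supermarket model described in the abstract above (each customer samples d queues and joins the shortest) can be simulated in a few lines. This is a minimal sketch under assumptions not stated in the abstract: Poisson arrivals, exponential service times with mean 1, FIFO service within each queue, and the d queues sampled without replacement. The function name and parameters are hypothetical, and predicted service times are not modeled.

```python
import heapq
import random

def simulate_supermarket(n_queues, d, load, n_arrivals, seed=0):
    """Mean sojourn time when each customer joins the shortest of d random queues."""
    rng = random.Random(seed)
    queue_len = [0] * n_queues     # customers currently in each queue
    free_at = [0.0] * n_queues     # time at which each queue finishes its backlog
    departures = []                # heap of (departure_time, queue)
    now, total_sojourn = 0.0, 0.0
    for _ in range(n_arrivals):
        now += rng.expovariate(load * n_queues)    # next arrival in the whole system
        # Remove customers that departed before this arrival.
        while departures and departures[0][0] <= now:
            _, q = heapq.heappop(departures)
            queue_len[q] -= 1
        choices = rng.sample(range(n_queues), d)
        q = min(choices, key=lambda c: queue_len[c])
        # FIFO: service starts once the chosen queue's backlog is cleared.
        free_at[q] = max(now, free_at[q]) + rng.expovariate(1.0)
        queue_len[q] += 1
        heapq.heappush(departures, (free_at[q], q))
        total_sojourn += free_at[q] - now
    return total_sojourn / n_arrivals

one_choice = simulate_supermarket(n_queues=100, d=1, load=0.9, n_arrivals=50_000)
two_choices = simulate_supermarket(n_queues=100, d=2, load=0.9, n_arrivals=50_000)
print(one_choice, two_choices)
```

Comparing `d=1` with `d=2` at high load shows the "power of two choices" effect that motivates the model.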
-
Enhanced arc-flow formulations to minimize weighted completion time on identical parallel machines
Authors:
Arthur Kramer,
Mauro Dell'Amico,
Manuel Iori
Abstract:
We consider the problem of scheduling a set of jobs on a set of identical parallel machines, with the aim of minimizing the total weighted completion time. The problem has been solved in the literature with a number of mathematical formulations, some of which require the implementation of tailored branch-and-price methods. In our work, we solve the problem instead by means of new arc-flow formulations, by first representing it on a capacitated network and then invoking a mixed integer linear model with a pseudo-polynomial number of variables and constraints. According to our computational tests, existing formulations from the literature can solve to proven optimality benchmark instances with up to 100 jobs, whereas our best-performing arc-flow formulation solves all instances with up to 400 jobs and provides very low gaps for larger instances with up to 1000 jobs.
Submitted 31 August, 2018;
originally announced August 2018.
-
On Fair Size-Based Scheduling
Authors:
Matteo Dell'Amico,
Damiano Carra,
Pietro Michiardi
Abstract:
By executing jobs serially rather than in parallel, size-based scheduling policies can shorten the time needed to complete jobs; however, major obstacles to their applicability are fairness guarantees and the fact that job sizes are rarely known exactly a priori. Here, we introduce the Pri family of size-based scheduling policies; Pri simulates any reference scheduler and executes jobs in the order of their simulated completion: we show that these schedulers give strong fairness guarantees, since no job completes later in Pri than in the reference policy. In addition, we introduce PSBS, a practical implementation of such a scheduler: it works online (i.e., without needing knowledge of jobs submitted in the future), it has an efficient O(log n) implementation and it allows setting priorities to jobs. Most importantly, unlike earlier size-based policies, the performance of PSBS degrades gracefully with errors, leading to performance that is close to optimal in a variety of realistic use cases.
Submitted 30 June, 2015;
originally announced June 2015.
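The idea behind Pri described in the abstract above (simulate a reference scheduler, then run jobs serially in the order of their simulated completion) can be shown on a deliberately simplified case. This sketch assumes that all jobs arrive at time 0, that sizes are known exactly, and that the reference policy is processor sharing; the online policy with arbitrary arrivals described in the paper is not reproduced. Both function names are hypothetical.

```python
def ps_completion_times(sizes):
    """Completion times under processor sharing when all jobs arrive at time 0."""
    order = sorted(range(len(sizes)), key=lambda j: sizes[j])
    completion = [0.0] * len(sizes)
    now, served = 0.0, 0.0    # served = work each still-active job has received
    active = len(sizes)
    for j in order:
        # All active jobs progress at rate 1/active until job j finishes.
        now += (sizes[j] - served) * active
        served = sizes[j]
        completion[j] = now
        active -= 1
    return completion

def pri_completion_times(sizes, reference):
    """Run jobs serially, in the order in which the reference policy completes them."""
    order = sorted(range(len(sizes)), key=lambda j: reference[j])
    completion = [0.0] * len(sizes)
    now = 0.0
    for j in order:
        now += sizes[j]
        completion[j] = now
    return completion

sizes = [4, 1, 2]
ps = ps_completion_times(sizes)
pri = pri_completion_times(sizes, ps)
print(ps, pri)
```

On this instance every job finishes under the serial schedule no later than under processor sharing, which matches the fairness guarantee stated in the abstract.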
-
PSBS: Practical Size-Based Scheduling
Authors:
Matteo Dell'Amico,
Damiano Carra,
Pietro Michiardi
Abstract:
Size-based schedulers have very desirable performance properties: optimal or near-optimal response time can be coupled with strong fairness guarantees. Despite this, such systems are very rarely implemented in practical settings, because they require knowing a priori the amount of work needed to complete jobs: this assumption is very difficult to satisfy in concrete systems. It is definitely more realistic to provide the system with an estimate of the job sizes, but existing studies point to somewhat pessimistic results if existing scheduler policies are used based on imprecise job size estimations. We set the goal of designing scheduling policies that explicitly deal with inexact job sizes: first, we show that existing size-based schedulers can have bad performance with inexact job size information when job sizes are heavily skewed; we show that this issue, and the pessimistic results shown in the literature, are due to problematic behavior when large jobs are underestimated. Once the problem is identified, it is possible to amend existing size-based schedulers to solve the issue. We generalize FSP -- a fair and efficient size-based scheduling policy -- in order to solve the problem highlighted above; in addition, our solution deals with different job weights (that can be assigned to a job independently from its size). We provide an efficient implementation of the resulting protocol, which we call Practical Size-Based Scheduler (PSBS). Through simulations evaluated on synthetic and real workloads, we show that PSBS has near-optimal performance in a large variety of cases with inaccurate size information, that it performs fairly and that it correctly handles job weights. We believe that this work shows that PSBS is indeed practical, and we maintain that it could inspire the design of schedulers in a wide array of real-world use cases.
Submitted 6 August, 2015; v1 submitted 22 October, 2014;
originally announced October 2014.
-
On User Availability Prediction and Network Applications
Authors:
Matteo Dell'Amico,
Maurizio Filippone,
Pietro Michiardi,
Yves Roudier
Abstract:
User connectivity patterns in network applications are known to be heterogeneous, and to follow periodic (daily and weekly) patterns. In many cases, the regularity and the correlation of those patterns are problematic: for network applications, many connected users create peaks of demand; in contrast, in peer-to-peer scenarios, having few users online results in a scarcity of available resources. On the other hand, since connectivity patterns exhibit a periodic behavior, they are to some extent predictable. This work shows how this can be exploited to anticipate future user connectivity and to have applications proactively respond to it. We evaluate the probability that any given user will be online at any given time, and assess the prediction on six-month availability traces from three different Internet applications. Building upon this, we show how our probabilistic approach makes it easy to evaluate and optimize the performance in a number of diverse network application models, and to use them to optimize systems. In particular, we show how this approach can be used in distributed hash tables, friend-to-friend storage, and cache pre-loading for social networks, resulting in substantial gains in data availability and system efficiency at negligible costs.
Submitted 30 April, 2014;
originally announced April 2014.
-
Revisiting Size-Based Scheduling with Estimated Job Sizes
Authors:
Matteo Dell'Amico,
Damiano Carra,
Mario Pastorelli,
Pietro Michiardi
Abstract:
We study size-based schedulers, and focus on the impact of inaccurate job size information on response time and fairness. Our intent is to revisit previous results, which allude to performance degradation for even small errors on job size estimates, thus limiting the applicability of size-based schedulers.
We show that scheduling performance is tightly connected to workload characteristics: in the absence of large skew in the job size distribution, even extremely imprecise estimates suffice to outperform size-oblivious disciplines. Instead, when job sizes are heavily skewed, known size-based disciplines suffer.
In this context, we show -- for the first time -- the dichotomy of over-estimation versus under-estimation. The former is, in general, less problematic than the latter, as its effects are localized to individual jobs. Instead, under-estimation leads to severe problems that may affect a large number of jobs.
We present an approach to mitigate these problems: our technique requires no complex modifications to original scheduling policies and performs very well. To support our claim, we proceed with a simulation-based evaluation that covers an unprecedentedly large parameter space, which takes into account a variety of synthetic and real workloads.
As a consequence, we show that size-based scheduling is practical and outperforms alternatives in a wide array of use cases, even in the presence of inaccurate size information.
Submitted 25 July, 2014; v1 submitted 24 March, 2014;
originally announced March 2014.
-
OS-Assisted Task Preemption for Hadoop
Authors:
Mario Pastorelli,
Matteo Dell'Amico,
Pietro Michiardi
Abstract:
This work introduces a new task preemption primitive for Hadoop that allows tasks to be suspended and resumed by exploiting existing memory management mechanisms readily available in modern operating systems. Our technique fills the gap that exists between the two extreme cases of killing tasks (which wastes work) or waiting for their completion (which introduces latency): experimental results indicate superior performance and very small overheads when compared to existing alternatives.
Submitted 10 February, 2014;
originally announced February 2014.
-
A Simulator for Data-Intensive Job Scheduling
Authors:
Matteo Dell'Amico
Abstract:
Despite the fact that size-based schedulers can give excellent results in terms of both average response times and fairness, data-intensive computing execution engines generally do not employ size-based schedulers, mainly because of the fact that job size is not known a priori.
In this work, we perform a simulation-based analysis of the performance of size-based schedulers when they are employed with the workload of typical data-intensive schedules and with approximated size estimations. We show results that are very promising: even when size estimation is very imprecise, response times of size-based schedulers can be definitely smaller than those of simple scheduling techniques such as processor sharing or FIFO.
Submitted 21 August, 2013; v1 submitted 25 June, 2013;
originally announced June 2013.
-
Practical Size-based Scheduling for MapReduce Workloads
Authors:
Mario Pastorelli,
Antonio Barbuzzi,
Damiano Carra,
Matteo Dell'Amico,
Pietro Michiardi
Abstract:
We present the Hadoop Fair Sojourn Protocol (HFSP) scheduler, which implements a size-based scheduling discipline for Hadoop. The benefits of size-based scheduling disciplines are well recognized in a variety of contexts (computer networks, operating systems, etc.), yet their practical implementation for a system such as Hadoop raises a number of important challenges. With HFSP, which is available as an open-source project, we address issues related to job size estimation and resource management, and study the effects of a variety of preemption strategies. Although the architecture underlying HFSP is suitable for any size-based scheduling discipline, in this work we revisit and extend the Fair Sojourn Protocol, which solves problems related to job starvation that affect FIFO, Processor Sharing and a range of size-based disciplines. Our experiments, in which we compare HFSP to standard Hadoop schedulers, point to a significant decrease in average job sojourn times - a metric that accounts for the total time a job spends in the system, including waiting and serving times - for realistic workloads that we generate according to production traces available in the literature.
Submitted 3 May, 2013; v1 submitted 12 February, 2013;
originally announced February 2013.
-
Adaptive Redundancy Management for Durable P2P Backup
Authors:
Matteo Dell'Amico,
Pietro Michiardi,
Laszlo Toka,
Pasquale Cataldi
Abstract:
We design and analyze the performance of a redundancy management mechanism for Peer-to-Peer backup applications. Armed with the realization that a backup system has peculiar requirements -- namely, data is read over the network only during restore processes caused by data loss -- redundancy management targets data durability rather than attempting to make each piece of information available at any time.
In our approach each peer determines, in an on-line manner, an amount of redundancy sufficient to counter the effects of peer deaths, while preserving acceptable data restore times. Our experiments, based on trace-driven simulations, indicate that our mechanism can reduce the redundancy by a factor between two and three with respect to redundancy policies aiming for data availability. These results imply a corresponding increase in storage capacity and decrease in time to complete backups, at the expense of longer times required to restore data. We believe this is a very reasonable price to pay, given the nature of the application.
We complete our work with a discussion on practical issues, and their solutions, related to which encoding technique is more suitable to support our scheme.
Submitted 17 January, 2014; v1 submitted 11 January, 2012;
originally announced January 2012.
-
Back To The Future: On Predicting User Uptime
Authors:
Matteo Dell'Amico,
Pietro Michiardi,
Yves Roudier
Abstract:
Correlation in user connectivity patterns is generally considered a problem for system designers, since it results in peaks of demand and also in the scarcity of resources for peer-to-peer applications. The other side of the coin is that these connectivity patterns are often predictable and that, to some extent, they can be dealt with proactively.
In this work, we build predictors aiming to determine the probability that any given user will be online at any given time in the future. We evaluate the quality of these predictors on various large traces from instant messaging and file sharing applications.
We also illustrate how availability prediction can be applied to enhance the behavior of peer-to-peer applications: we show through simulation how data availability is substantially increased in a distributed hash table simply by adjusting data placement policies according to peer availability prediction and without requiring any additional storage from any peer.
Submitted 4 October, 2010;
originally announced October 2010.
-
On Scheduling and Redundancy for P2P Backup
Authors:
Laszlo Toka,
Matteo Dell'Amico,
Pietro Michiardi
Abstract:
An online backup system should be quick and reliable in both saving and restoring users' data. To do so in a peer-to-peer implementation, data transfer scheduling and the amount of redundancy must be chosen wisely. We formalize the problem of exchanging multiple pieces of data with intermittently available peers, and we show that random scheduling completes transfers nearly optimally in terms of duration as long as the system is sufficiently large. Moreover, we propose an adaptive redundancy scheme that improves performance and decreases resource usage while keeping the risks of data loss low. Extensive simulations show that our techniques are effective in a realistic trace-driven scenario with heterogeneous bandwidth.
Submitted 7 September, 2010;
originally announced September 2010.
-
Measuring Password Strength: An Empirical Analysis
Authors:
Matteo Dell'Amico,
Pietro Michiardi,
Yves Roudier
Abstract:
We present an in-depth analysis of the strength of almost 10,000 passwords from users of an instant messaging server in Italy. We estimate the strength of those passwords, and compare the effectiveness of state-of-the-art attack methods such as dictionaries and Markov chain-based techniques.
We show that the strength of passwords chosen by users varies enormously, and that the cost of attacks based on password strength grows very quickly when the attacker wants to obtain a higher success percentage. In accordance with existing studies we observe that, in the absence of measures for enforcing password strength, weak passwords are common. On the other hand we discover that there will always be a subset of users with extremely strong passwords that are very unlikely to be broken.
The results of our study will help in evaluating the security of password-based authentication means, and they provide important insights for inspiring new and better proactive password checkers and password recovery tools.
Submitted 20 July, 2009;
originally announced July 2009.