-
Warm-starting Push-Relabel
Authors:
Sami Davies,
Sergei Vassilvitskii,
Yuyan Wang
Abstract:
Push-Relabel is one of the most celebrated network flow algorithms. Maintaining a pre-flow that saturates a cut, it enjoys better theoretical and empirical running time than other flow algorithms, such as Ford-Fulkerson. In practice, Push-Relabel is even faster than what theoretical guarantees can promise, in part because of the use of good heuristics for seeding and updating the iterative algorithm. However, it remains unclear how to run Push-Relabel on an arbitrary initialization that is not necessarily a pre-flow or cut-saturating. We provide the first theoretical guarantees for warm-starting Push-Relabel with a predicted flow, where our learning-augmented version benefits from fast running time when the predicted flow is close to an optimal flow, while maintaining robust worst-case guarantees. Interestingly, our algorithm uses the gap relabeling heuristic, which has long been employed in practice, even though prior to our work there was no rigorous theoretical justification for why it can lead to run-time improvements. We then provide experiments that show our warm-started Push-Relabel also works well in practice.
Submitted 28 May, 2024;
originally announced May 2024.
-
Online Flexible Busy Time Scheduling on Heterogeneous Machines
Authors:
Gruia Calinescu,
Sami Davies,
Samir Khuller,
Shirley Zhang
Abstract:
We study the online busy time scheduling model on heterogeneous machines. In our setting, unit-length jobs arrive online with a deadline that is known to the algorithm at the job's arrival time. An algorithm has access to machines, each with different associated capacities and costs. The goal is to schedule jobs on machines before their deadline, so that the total cost incurred by the scheduling algorithm is minimized. Relatively little is known about online busy time scheduling when machines are heterogeneous (i.e., have different costs and capacities), despite this being the most practical model for clients using cloud computing services. We make significant progress in understanding this model by designing an 8-competitive algorithm for the problem on unit-length jobs, and providing a lower bound on the competitive ratio of 2. We further prove that our lower bound is tight in the natural setting when jobs have agreeable deadlines.
Submitted 16 February, 2024;
originally announced February 2024.
-
Supervised learning of spatial features with STDP and homeostasis using Spiking Neural Networks on SpiNNaker
Authors:
Sergio Davies,
Andrew Gait,
Andrew Rowley,
Alessandro Di Nuovo
Abstract:
Artificial Neural Networks (ANNs) have gained significant popularity thanks to their ability to learn using the well-known backpropagation algorithm. Conversely, Spiking Neural Networks (SNNs), despite having broader capabilities than ANNs, have always posed challenges in the training phase. This paper shows a new method to perform supervised learning on SNNs, using Spike Timing Dependent Plasticity (STDP) and homeostasis, aiming at training the network to identify spatial patterns. Spatial patterns refer to spike patterns without a time component, where all spike events occur simultaneously. The method is tested using the SpiNNaker digital architecture. An SNN is trained to recognise one or multiple patterns, and performance metrics are extracted to measure the performance of the network. Some considerations are drawn from the results, showing that, in the case of a single trained pattern, the network behaves as the ideal detector, with 100% accuracy in detecting the trained pattern. However, as the number of trained patterns on a single network increases, the accuracy of identification is linked to the similarities between these patterns. This method of training an SNN to detect spatial patterns may be applied to pattern recognition in static images or traffic analysis in computer networks, where each network packet represents a spatial pattern. It will be stipulated that the homeostatic factor may enable the network to detect patterns with some degree of similarity, rather than only perfectly matching patterns. The principles outlined in this article serve as the fundamental building blocks for more complex systems that utilise both spatial and temporal patterns by converting specific features of input signals into spikes. One example of such a system is a computer network packet classifier, tasked with real-time identification of packet streams based on features within the packet content.
Submitted 24 June, 2024; v1 submitted 5 December, 2023;
originally announced December 2023.
-
Data journeys in popular science: Producing climate change and COVID-19 data visualizations at Scientific American
Authors:
Kathleen Gregory,
Laura Koesten,
Regina Schuster,
Torsten Möller,
Sarah Davies
Abstract:
Vast amounts of (open) data are increasingly used to make arguments about crisis topics such as climate change and global pandemics. Data visualizations are central to bringing these viewpoints to broader publics. However, visualizations often conceal the many contexts involved in their production, ranging from decisions made in research labs about collecting and sharing data to choices made in editorial rooms about which data stories to tell. In this paper, we examine how data visualizations about climate change and COVID-19 are produced in popular science magazines, using Scientific American, an established English-language popular science magazine, as a case study. To do this, we apply the analytical concept of data journeys (Leonelli, 2020) in a mixed methods study that centers on interviews with Scientific American staff and is supplemented by a visualization analysis of selected charts. In particular, we discuss the affordances of working with open data, the role of collaborative data practices, and how the magazine works to counter misinformation and increase transparency. This work provides an empirical contribution by providing insight into the data (visualization) practices of science communicators and demonstrating how the concept of data journeys can be used as an analytical framework.
Submitted 27 March, 2024; v1 submitted 27 October, 2023;
originally announced October 2023.
-
Simultaneously Approximating All $\ell_p$-norms in Correlation Clustering
Authors:
Sami Davies,
Benjamin Moseley,
Heather Newman
Abstract:
This paper considers correlation clustering on unweighted complete graphs. We give a combinatorial algorithm that returns a single clustering solution that is simultaneously $O(1)$-approximate for all $\ell_p$-norms of the disagreement vector; in other words, a combinatorial $O(1)$-approximation of the all-norms objective for correlation clustering. This is the first proof that minimal sacrifice is needed in order to optimize different norms of the disagreement vector. In addition, our algorithm is the first combinatorial approximation algorithm for the $\ell_2$-norm objective, and more generally the first combinatorial algorithm for the $\ell_p$-norm objective when $1 < p < \infty$. It is also faster than all previous algorithms that minimize the $\ell_p$-norm of the disagreement vector, with run-time $O(n^ω)$, where $O(n^ω)$ is the time for matrix multiplication on $n \times n$ matrices. When the maximum positive degree in the graph is at most $Δ$, this can be improved to a run-time of $O(nΔ^2 \log n)$.
Submitted 9 March, 2024; v1 submitted 3 August, 2023;
originally announced August 2023.
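To make the objective in the abstract above concrete, here is a minimal Python sketch that computes the disagreement vector of a given clustering on a complete signed graph and evaluates a few of its $\ell_p$-norms. It is not the paper's algorithm; the function names, the input encoding (a list of positive edges, with every other pair treated as negative), and the 4-vertex instance are assumptions made for illustration.

```python
from itertools import combinations


def disagreement_vector(n, positive_edges, cluster_of):
    """Return y where y[v] counts the edges at v that the clustering gets wrong.

    A positive edge is a disagreement when its endpoints lie in different
    clusters; a negative edge (any pair not listed in positive_edges) is a
    disagreement when its endpoints share a cluster.
    """
    positive = {frozenset(e) for e in positive_edges}
    y = [0] * n
    for u, v in combinations(range(n), 2):
        same_cluster = cluster_of[u] == cluster_of[v]
        is_positive = frozenset((u, v)) in positive
        if same_cluster != is_positive:
            y[u] += 1
            y[v] += 1
    return y


def lp_norm(y, p):
    """l_p-norm of y; p = float('inf') gives the maximum entry (Min Max objective)."""
    if p == float("inf"):
        return max(y)
    return sum(abs(x) ** p for x in y) ** (1.0 / p)


# Hypothetical instance: the positive edges form the path 0-1-2-3.
positive_edges = [(0, 1), (1, 2), (2, 3)]
cluster_of = [0, 0, 1, 1]  # clusters {0, 1} and {2, 3}
y = disagreement_vector(4, positive_edges, cluster_of)
norms = {p: lp_norm(y, p) for p in (1, 2, float("inf"))}
```

In this instance only the positive edge (1, 2) is cut, so vertices 1 and 2 each carry one disagreement. An all-norms guarantee asks for a single clustering whose vector is near-optimal under every choice of `p` at once.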
-
Majority Voting Approach to Ransomware Detection
Authors:
Simon R Davies,
Richard Macfarlane,
William J Buchanan
Abstract:
Crypto-ransomware remains a significant threat to governments and companies alike, with high-profile cyber security incidents regularly making headlines. Many different detection systems have been proposed as solutions to the ever-changing dynamic landscape of ransomware detection. In the majority of cases, these described systems propose a method based on the result of a single test performed on either the executable code, the process under investigation, its behaviour, or its output. In a small subset of ransomware detection systems, the concept of a scorecard is employed where multiple tests are performed on various aspects of a process under investigation and their results are then analysed using machine learning. The purpose of this paper is to propose a new majority voting approach to ransomware detection by developing a method that uses a cumulative score derived from discrete tests based on calculations using algorithmic rather than heuristic techniques. The paper describes 23 candidate tests, as well as 9 Windows API tests which are validated to determine both their accuracy and viability for use within a ransomware detection system. Using a cumulative score calculation approach to ransomware detection has several benefits, such as the immunity to the occasional inaccuracy of individual tests when making its final classification. The system can also leverage multiple tests that can be both comprehensive and complementary in an attempt to achieve a broader, deeper, and more robust analysis of the program under investigation. Additionally, the use of multiple collaborative tests also significantly hinders ransomware from masking or modifying its behaviour in an attempt to bypass detection.
Submitted 30 May, 2023;
originally announced May 2023.
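The idea of a cumulative score over discrete tests, as described in the abstract above, can be illustrated with a small Python sketch. It is only a sketch under stated assumptions: the simple majority threshold, the `majority_vote` function, and the three placeholder tests are all hypothetical and are not the paper's 23 candidate tests or its scoring rule.

```python
from typing import Callable, Dict


def majority_vote(sample: bytes, tests: Dict[str, Callable[[bytes], bool]]) -> bool:
    """Flag the sample when more than half of the tests vote 'suspicious'."""
    score = sum(1 for test in tests.values() if test(sample))
    return score * 2 > len(tests)


# Hypothetical placeholder tests; real detectors would be far more involved.
tests = {
    "many_distinct_bytes": lambda d: len(set(d)) > 200,
    "no_long_zero_runs": lambda d: b"\x00" * 16 not in d,
    "not_ascii_text": lambda d: any(b > 127 for b in d),
}

flagged = majority_vote(bytes(range(256)) * 4, tests)    # all three tests vote yes
not_flagged = majority_vote(b"hello world" * 10, tests)  # only one test votes yes
```

Because the final decision depends on the combined score, a single test that misfires on a given sample does not by itself change the classification.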
-
What is the message? Perspectives on Visual Data Communication
Authors:
Laura Koesten,
Kathleen Gregory,
Regina Schuster,
Christian Knoll,
Sarah Davies,
Torsten Möller
Abstract:
Data visualizations are used to communicate messages to diverse audiences. It is unclear whether interpretations of these visualizations match the messages their creators aim to convey. In a mixed-methods study, we investigate how data in the popular science magazine Scientific American are visually communicated and understood. We first analyze visualizations about climate change and pandemics published in the magazine over a fifty-year period. Acting as chart readers, we then interpret visualizations with and without textual elements, identifying takeaway messages and creating field notes. Finally, we compare a sample of our interpreted messages to the intended messages of chart producers, drawing on interviews conducted with magazine staff. These data allow us to explore understanding visualizations through three perspectives: that of the charts, visualization readers, and visualization producers. Building on our findings from a thematic analysis, we present in-depth insights into data visualization sensemaking, particularly regarding the role of messages and textual elements; we propose a message typology, and we consider more broadly how messages can be conceptualized and understood.
Submitted 12 April, 2023;
originally announced April 2023.
-
Predictive Flows for Faster Ford-Fulkerson
Authors:
Sami Davies,
Benjamin Moseley,
Sergei Vassilvitskii,
Yuyan Wang
Abstract:
Recent work has shown that leveraging learned predictions can improve the running time of algorithms for bipartite matching and similar combinatorial problems. In this work, we build on this idea to improve the performance of the widely used Ford-Fulkerson algorithm for computing maximum flows by seeding Ford-Fulkerson with predicted flows. Our proposed method offers strong theoretical performance in terms of the quality of the prediction. We then consider image segmentation, a common use-case of flows in computer vision, and complement our theoretical analysis with strong empirical results.
Submitted 1 March, 2023;
originally announced March 2023.
-
Fast Combinatorial Algorithms for Min Max Correlation Clustering
Authors:
Sami Davies,
Benjamin Moseley,
Heather Newman
Abstract:
We introduce fast algorithms for correlation clustering with respect to the Min Max objective that provide constant factor approximations on complete graphs. Our algorithms are the first purely combinatorial approximation algorithms for this problem. We construct a novel semi-metric on the set of vertices, which we call the correlation metric, that indicates to our clustering algorithms whether pairs of nodes should be in the same cluster. The paper demonstrates empirically that, compared to prior work, our algorithms sacrifice little in the objective quality to obtain significantly better run-time. Moreover, our algorithms scale to larger networks that are effectively intractable for known algorithms.
Submitted 30 January, 2023;
originally announced January 2023.
-
Comparison of Entropy Calculation Methods for Ransomware Encrypted File Identification
Authors:
Simon R Davies,
Richard Macfarlane,
William J. Buchanan
Abstract:
Ransomware is a malicious class of software that utilises encryption to implement an attack on system availability. The target's data remains encrypted and is held captive by the attacker until a ransom demand is met. A common approach used by many crypto-ransomware detection techniques is to monitor file system activity and attempt to identify encrypted files being written to disk, often using a file's entropy as an indicator of encryption. However, often in the description of these techniques, little or no discussion is made as to why a particular entropy calculation technique is selected or any justification given as to why one technique is selected over the alternatives. The Shannon method of entropy calculation is the most commonly-used technique when it comes to file encryption identification in crypto-ransomware detection techniques. Overall, correctly encrypted data should be indistinguishable from random data, so apart from the standard mathematical entropy calculations such as Chi-Square, Shannon Entropy and Serial Correlation, the test suites used to validate the output from pseudo-random number generators would also be suited to perform this analysis. The hypothesis is that there is a fundamental difference between different entropy methods and that the best methods may be used to better detect ransomware encrypted files. The paper compares the accuracy of 53 distinct tests in being able to differentiate between encrypted data and other file types. The testing is broken down into two phases: the first phase is used to identify potential candidate tests, and the second phase is where these candidates are thoroughly evaluated. To ensure that the tests were sufficiently robust, the NapierOne dataset is used. This dataset contains thousands of examples of the most commonly used file types, as well as examples of files that have been encrypted by crypto-ransomware.
Submitted 24 October, 2022;
originally announced October 2022.
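As a concrete reference for the Shannon method mentioned in the abstract above, the following minimal Python sketch computes the Shannon entropy of a byte string in bits per byte. The function name and the three sample inputs are assumptions for illustration, with `os.urandom` used as a stand-in for encrypted data; this is not the paper's test harness.

```python
import math
import os
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    # Sum p * log2(1/p) over the byte values that actually occur.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


low = shannon_entropy(b"A" * 4096)                 # one repeated byte value
uniform = shannon_entropy(bytes(range(256)) * 16)  # every byte value equally often
random_like = shannon_entropy(os.urandom(4096))    # stand-in for encrypted data
```

A single repeated byte gives entropy 0, a perfectly uniform byte histogram gives 8 bits per byte, and random bytes land close to 8. Well-compressed data also scores close to 8, which is one reason a single entropy threshold can be insufficient for separating encrypted files from other high-entropy file types.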
-
Robust Factorizations and Colorings of Tensor Graphs
Authors:
Joshua Brakensiek,
Sami Davies
Abstract:
Since the seminal result of Karger, Motwani, and Sudan, algorithms for approximate 3-coloring have primarily centered around SDP-based rounding. However, it is likely that important combinatorial or algebraic insights are needed in order to break the $n^{o(1)}$ threshold. One way to develop new understanding in graph coloring is to study special subclasses of graphs. For instance, Blum studied the 3-coloring of random graphs, and Arora and Ge studied the 3-coloring of graphs with low threshold-rank.
In this work, we study graphs which arise from a tensor product, which appear to be novel instances of the 3-coloring problem. We consider graphs of the form $H = (V,E)$ with $V =V( K_3 \times G)$ and $E = E(K_3 \times G) \setminus E'$, where $E' \subseteq E(K_3 \times G)$ is any edge set such that no vertex has more than an $ε$ fraction of its edges in $E'$. We show that one can construct $\widetilde{H} = K_3 \times \widetilde{G}$ with $V(\widetilde{H}) = V(H)$ that is close to $H$. For arbitrary $G$, $\widetilde{H}$ satisfies $|E(H) ΔE(\widetilde{H})| \leq O(ε|E(H)|)$. Additionally when $G$ is a mild expander, we provide a 3-coloring for $H$ in polynomial time. These results partially generalize an exact tensor factorization algorithm of Imrich. On the other hand, without any assumptions on $G$, we show that it is NP-hard to 3-color $H$.
Submitted 27 November, 2023; v1 submitted 18 July, 2022;
originally announced July 2022.
-
Balancing Flow Time and Energy Consumption
Authors:
Sami Davies,
Samir Khuller,
Shirley Zhang
Abstract:
In this paper, we study the following batch scheduling model: find a schedule that minimizes total flow time for $n$ uniform length jobs, with release times and deadlines, where the machine is only actively processing jobs in at most $k$ synchronized batches of size at most $B$. Prior work on such batch scheduling models has considered only feasibility with no regard to the flow time of the schedule. However, algorithms that minimize the cost from the scheduler's perspective -- such as ones that minimize the active time of the processor -- can result in schedules where the total flow time is arbitrarily high \cite{ChangGabowKhuller}. Such schedules are not valuable from the perspective of the client. In response, our work provides dynamic programs which minimize flow time subject to active time constraints. Our main contribution focuses on jobs with agreeable deadlines; for such job instances, we introduce dynamic programs that achieve runtimes of $O(B \cdot k \cdot n)$ for unit jobs and $O(B \cdot k \cdot n^5)$ for uniform length jobs. These results improve upon our modification of a different, classical dynamic programming approach by Baptiste. While the modified DP works when deadlines are non-agreeable, this solution is more expensive, with runtime $O(B \cdot k^2 \cdot n^7)$ \cite{Baptiste00}.
Submitted 2 June, 2022;
originally announced June 2022.
-
NapierOne: A modern mixed file data set alternative to Govdocs1
Authors:
Simon R Davies,
Richard Macfarlane,
William J Buchanan
Abstract:
It was found when reviewing the ransomware detection research literature that almost no proposal provided enough detail on how the test data set was created, or sufficient description of its actual content, to allow it to be recreated by other researchers interested in reconstructing their environment and validating the research results. A modern cybersecurity mixed file data set called NapierOne is presented, primarily aimed at, but not limited to, ransomware detection and forensic analysis research. NapierOne was designed to address this deficiency in reproducibility and improve consistency by facilitating research replication and repeatability. The methodology used in the creation of this data set is also described in detail. The data set was inspired by the Govdocs1 data set and it is intended that NapierOne be used as a complement to this original data set.
An investigation was performed with the goal of determining the common file types currently in use. No specific research was found that explicitly provided this information, so an alternative consensus approach was employed. This involved combining the findings from multiple sources of file type usage into an overall ranked list. After this, 5000 real-world example files were gathered, and a specific data subset created, for each of the common file types identified. In some circumstances, multiple data subsets were created for a specific file type, each subset representing a specific characteristic for that file type. For example, there are multiple data subsets for the ZIP file type with each subset containing examples of a specific compression method. Ransomware execution tends to produce files that have high entropy, so examples of file types that naturally have this attribute are also present.
Submitted 20 January, 2022;
originally announced January 2022.
-
Lower Bounds on the Total Variation Distance Between Mixtures of Two Gaussians
Authors:
Sami Davies,
Arya Mazumdar,
Soumyabrata Pal,
Cyrus Rashtchian
Abstract:
Mixtures of high dimensional Gaussian distributions have been studied extensively in statistics and learning theory. While the total variation distance appears naturally in the sample complexity of distribution learning, it is analytically difficult to obtain tight lower bounds for mixtures. Exploiting a connection between total variation distance and the characteristic function of the mixture, we provide fairly tight functional approximations. This enables us to derive new lower bounds on the total variation distance between pairs of two-component Gaussian mixtures that have a shared covariance matrix.
Submitted 9 March, 2022; v1 submitted 2 September, 2021;
originally announced September 2021.
-
Differential Area Analysis for Ransomware Attack Detection within Mixed File Datasets
Authors:
Simon R Davies,
Richard Macfarlane,
William J Buchanan
Abstract:
The threat from ransomware continues to grow both in the number of affected victims as well as the cost incurred by the people and organisations impacted in a successful attack. In the majority of cases, once a victim has been attacked there remain only two courses of action open to them: either pay the ransom or lose their data. One common behaviour shared between all crypto ransomware strains is that at some point during their execution they will attempt to encrypt the users' files. Previous research (Penrose et al., 2013; Zhao et al., 2011) has highlighted the difficulty in differentiating between compressed and encrypted files using Shannon entropy, as both file types exhibit similar values. One of the experiments described in this paper shows a unique characteristic for the Shannon entropy of encrypted file header fragments. This characteristic was used to differentiate between encrypted files and other high entropy files such as archives. This discovery was leveraged in the development of a file classification model that used the differential area between the entropy curve of a file under analysis and one generated from random data. When comparing the entropy plot values of a file under analysis against one generated by a file containing purely random numbers, the greater the correlation of the plots is, the higher the confidence that the file under analysis contains encrypted data.
Submitted 28 June, 2021;
originally announced June 2021.
-
On the Hardness of Scheduling With Non-Uniform Communication Delays
Authors:
Sami Davies,
Janardhan Kulkarni,
Thomas Rothvoss,
Sai Sandeep,
Jakub Tarnawski,
Yihao Zhang
Abstract:
In the scheduling with non-uniform communication delay problem, the input is a set of jobs with precedence constraints. Associated with every precedence constraint between a pair of jobs is a communication delay, the time duration the scheduler has to wait between the two jobs if they are scheduled on different machines. The objective is to assign the jobs to machines to minimize the makespan of the schedule. Despite being a fundamental problem in theory and a consequential problem in practice, the approximability of scheduling problems with communication delays is not very well understood. One of the top ten open problems in scheduling theory, in the influential list by Schuurman and Woeginger and its latest update by Bansal, asks if the problem admits a constant factor approximation algorithm. In this paper, we answer the question in the negative by proving that there is a logarithmic hardness for the problem under the standard complexity theory assumption that NP-complete problems do not admit quasi-polynomial time algorithms.
Our hardness result is obtained using a surprisingly simple reduction from a problem that we call Unique Machine Precedence constraints Scheduling (UMPS). We believe that this problem is of central importance in understanding the hardness of many scheduling problems and conjecture that it is very hard to approximate. Among other things, our conjecture implies a logarithmic hardness of related machine scheduling with precedences, a long-standing open problem in scheduling theory and approximation algorithms.
Submitted 30 April, 2021;
originally announced May 2021.
-
Evaluation of Live Forensic Techniques in Ransomware Attack Mitigation
Authors:
Simon R. Davies,
Richard Macfarlane,
William J. Buchanan
Abstract:
Memory was captured from a system infected by ransomware and its contents were examined using live forensic tools, with the intent of identifying the symmetric encryption keys being used. NotPetya, Bad Rabbit and Phobos hybrid ransomware samples were tested during the investigation. If keys were discovered, the following two steps were also performed. Firstly, a timeline was manually created by combining data from multiple sources to illustrate the ransomware's behaviour as well as showing when the encryption keys were present in memory and how long they remained there. Secondly, an attempt was made to decrypt the files encrypted by the ransomware using the found keys. In all cases, the investigation was able to confirm that it was possible to identify the encryption keys used. A description of how these found keys were then used to successfully decrypt files that had been encrypted during the execution of the ransomware is also given. The resulting generated timelines provided an excellent way to visualise the behaviour of the ransomware and the encryption key management practices it employed, and from a forensic investigation and possible mitigation point of view, when the encryption keys are in memory.
Submitted 19 December, 2020; v1 submitted 15 December, 2020;
originally announced December 2020.
-
Approximate Trace Reconstruction
Authors:
Sami Davies,
Miklos Z. Racz,
Cyrus Rashtchian,
Benjamin G. Schiffer
Abstract:
In the usual trace reconstruction problem, the goal is to exactly reconstruct an unknown string of length $n$ after it passes through a deletion channel many times independently, producing a set of traces (i.e., random subsequences of the string). We consider the relaxed problem of approximate reconstruction. Here, the goal is to output a string that is close to the original one in edit distance while using far fewer traces than are needed for exact reconstruction. We present several algorithms that can approximately reconstruct strings that belong to certain classes, where the estimate is within $n/\mathrm{polylog}(n)$ edit distance, and where we only use $\mathrm{polylog}(n)$ traces (or sometimes just a single trace). These classes contain strings that require a linear number of traces for exact reconstruction and which are quite different from a typical random string. From a technical point of view, our algorithms approximately reconstruct consecutive substrings of the unknown string by aligning dense regions of traces and using a run of a suitable length to approximate each region. To complement our algorithms, we present a general black-box lower bound for approximate reconstruction, building on a lower bound for distinguishing between two candidate input strings in the worst case. In particular, this shows that approximating to within $n^{1/3 - δ}$ edit distance requires $n^{1 + 3δ/2}/\mathrm{polylog}(n)$ traces for $0< δ< 1/3$ in the worst case.
Submitted 16 December, 2020; v1 submitted 11 December, 2020;
originally announced December 2020.
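The deletion channel and the edit-distance criterion mentioned in the abstract above can be illustrated with a minimal sketch. It is not the authors' reconstruction algorithm: it only simulates traces and measures how far a candidate string is from the original. The function names, the deletion probability `q`, and the example string are assumptions chosen for illustration.

```python
import random

def deletion_channel(s, q, rng):
    """Return one trace: each character of s is deleted independently with probability q."""
    return "".join(ch for ch in s if rng.random() >= q)

def sample_traces(s, q, num_traces, seed=0):
    """Pass s through the deletion channel num_traces times (seeded for repeatability)."""
    rng = random.Random(seed)
    return [deletion_channel(s, q, rng) for _ in range(num_traces)]

def is_subsequence(t, s):
    """True if t can be obtained from s by deleting characters."""
    it = iter(s)
    return all(ch in it for ch in t)

def edit_distance(a, b):
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]

# Hypothetical example string and deletion rate.
ORIGINAL = "0110100110"
TRACES = sample_traces(ORIGINAL, q=0.3, num_traces=5)
```

Every trace is a subsequence of the original, so its edit distance to the original equals the number of deleted characters; an approximate reconstruction would aim for a much smaller distance than any single trace typically achieves.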
-
Scheduling with Communication Delays via LP Hierarchies and Clustering
Authors:
Sami Davies,
Janardhan Kulkarni,
Thomas Rothvoss,
Jakub Tarnawski,
Yihao Zhang
Abstract:
We consider the classic problem of scheduling jobs with precedence constraints on identical machines to minimize makespan, in the presence of communication delays. In this setting, denoted by $\mathsf{P} \mid \mathsf{prec}, c \mid C_{\mathsf{max}}$, if two dependent jobs are scheduled on different machines, then at least $c$ units of time must pass between their executions. Despite its relevance to many applications, this model remains one of the most poorly understood in scheduling theory. Even for a special case where an unlimited number of machines is available, the best known approximation ratio is $2/3 \cdot (c+1)$, whereas Graham's greedy list scheduling algorithm already gives a $(c+1)$-approximation in that setting. An outstanding open problem in the top-10 list by Schuurman and Woeginger and its recent update by Bansal asks whether there exists a constant-factor approximation algorithm.
In this work we give a polynomial-time $O(\log c \cdot \log m)$-approximation algorithm for this problem, where $m$ is the number of machines and $c$ is the communication delay. Our approach is based on a Sherali-Adams lift of a linear programming relaxation and a randomized clustering of the semimetric space induced by this lift.
Submitted 20 April, 2020;
originally announced April 2020.
-
Reconstructing Trees from Traces
Authors:
Sami Davies,
Miklos Z. Racz,
Cyrus Rashtchian
Abstract:
We study the problem of learning a node-labeled tree given independent traces from an appropriately defined deletion channel. This problem, tree trace reconstruction, generalizes string trace reconstruction, which corresponds to the tree being a path. For many classes of trees, including complete trees and spiders, we provide algorithms that reconstruct the labels using only a polynomial number of traces. This exhibits a stark contrast to known results on string trace reconstruction, which require exponentially many traces, and where a central open problem is to determine whether a polynomial number of traces suffice. Our techniques combine novel combinatorial and complex analytic methods.
Submitted 18 September, 2020; v1 submitted 13 February, 2019;
originally announced February 2019.
-
Supervised learning of an opto-magnetic neural network with ultrashort laser pulses
Authors:
A. Chakravarty,
J. H. Mentink,
C. S. Davies,
K. T. Yamada,
A. V. Kimel,
Th. Rasing
Abstract:
The explosive growth of data and its related energy consumption is pushing the need to develop energy-efficient brain-inspired schemes and materials for data processing and storage. Here, we demonstrate experimentally that Co/Pt films can be used as artificial synapses by manipulating their magnetization state using circularly-polarized ultrashort optical pulses at room temperature. We also show an efficient implementation of supervised perceptron learning on an opto-magnetic neural network, built from such magnetic synapses. Importantly, we demonstrate that the optimization of synaptic weights can be achieved using a global feedback mechanism, such that the learning does not rely on external storage or additional optimization schemes. These results suggest there is high potential for realizing artificial neural networks that learn not only quickly but also energy-efficiently, using optically-controlled magnetization in technologically relevant materials.
Submitted 28 May, 2019; v1 submitted 4 November, 2018;
originally announced November 2018.
-
A Tale of Santa Claus, Hypergraphs and Matroids
Authors:
Sami Davies,
Thomas Rothvoss,
Yihao Zhang
Abstract:
A well-known problem in scheduling and approximation algorithms is the Santa Claus problem. Suppose that Santa Claus has a set of gifts, and he wants to distribute them among a set of children so that the least happy child is made as happy as possible. Here, the value that a child $i$ has for a present $j$ is of the form $p_{ij} \in \{ 0,p_j\}$. A polynomial time algorithm by Annamalai et al. gives a $12.33$-approximation and is based on a modification of Haxell's hypergraph matching argument.
In this paper, we introduce a matroid version of the Santa Claus problem. Our algorithm is also based on Haxell's augmenting tree, but with the introduction of the matroid structure we solve a more general problem with cleaner methods. Our result can then be used as a blackbox to obtain a $(4+\varepsilon)$-approximation for Santa Claus. This factor also compares against a natural, compact LP for Santa Claus.
Submitted 7 May, 2019; v1 submitted 18 July, 2018;
originally announced July 2018.
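The objective in the abstract above (make the least happy child as happy as possible, with $p_{ij} \in \{0, p_j\}$) can be made concrete with a small brute-force sketch. This is an exponential-time illustration of the problem statement only, not the matroid-based approximation algorithm of the paper; the function name and the example instance are hypothetical.

```python
from itertools import product

def santa_claus_brute_force(gift_values, wants):
    """Exact max-min value by trying every assignment of gifts to children.

    gift_values[j] is p_j; wants[i] is the set of gifts child i values,
    so child i gets p_j from gift j if j is in wants[i] and 0 otherwise.
    """
    n_children = len(wants)
    if n_children == 0:
        return 0
    best = 0
    # Each gift goes to exactly one child; a gift given to a child who
    # does not want it simply contributes nothing.
    for assignment in product(range(n_children), repeat=len(gift_values)):
        happiness = [0] * n_children
        for gift, child in enumerate(assignment):
            if gift in wants[child]:
                happiness[child] += gift_values[gift]
        best = max(best, min(happiness))
    return best

# Hypothetical instance: four gifts, two children.
GIFT_VALUES = [3, 2, 2, 1]
WANTS = [{0, 1}, {1, 2, 3}]
OPTIMUM = santa_claus_brute_force(GIFT_VALUES, WANTS)
```

The search space has size (number of children)^(number of gifts), so this is only usable on toy instances; it is useful mainly as a reference value when checking an approximation algorithm on small inputs.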
-
A General Approach to State Complexity of Operations: Formalization and Limitations
Authors:
Sylvie Davies
Abstract:
The state complexity of the result of a regular operation is often positively correlated with the number of distinct transformations induced by letters in the minimal deterministic finite automaton of the input languages. That is, more transformations in the inputs means higher state complexity in the output. When this correlation holds, the state complexity of a unary operation can be maximized using languages in which there is one letter corresponding to each possible transformation; for operations of higher arity, we can use $m$-tuples of languages in which there is one letter corresponding to each possible $m$-tuple of transformations. In this way, a small set of languages can be used as witnesses for many common regular operations, eliminating the need to search for witnesses -- though at the expense of using very large alphabets. We formalize this approach and examine its limitations. We define a class of "uniform" operations for which this approach works; the class is closed under composition and includes common operations such as star, concatenation, reversal, union, and complement. Our main result is that the worst-case state complexity of a uniform operation can be determined by considering a finite set of witnesses, and this set depends only on the arity of the operation and the state complexities of the inputs.
Submitted 5 September, 2018; v1 submitted 21 June, 2018;
originally announced June 2018.
-
State Complexity of Pattern Matching in Regular Languages
Authors:
Janusz A. Brzozowski,
Sylvie Davies,
Abhishek Madan
Abstract:
In a simple pattern matching problem one has a pattern $w$ and a text $t$, which are words over a finite alphabet $Σ$. One may ask whether $w$ occurs in $t$, and if so, where? More generally, we may have a set $P$ of patterns and a set $T$ of texts, where $P$ and $T$ are regular languages. We are interested whether any word of $T$ begins with a word of $P$, ends with a word of $P$, has a word of $P$ as a factor, or has a word of $P$ as a subsequence. Thus we are interested in the languages $(PΣ^*)\cap T$, $(Σ^*P)\cap T$, $(Σ^* PΣ^*)\cap T$, and $(Σ^* \mathbin{\operatorname{shu}} P)\cap T$, where $\operatorname{shu}$ is the shuffle operation. The state complexity $κ(L)$ of a regular language $L$ is the number of states in the minimal deterministic finite automaton recognizing $L$. We derive the following upper bounds on the state complexities of our pattern-matching languages, where $κ(P)\le m$, and $κ(T)\le n$: $κ((PΣ^*)\cap T) \le mn$; $κ((Σ^*P)\cap T) \le 2^{m-1}n$; $κ((Σ^*PΣ^*)\cap T) \le (2^{m-2}+1)n$; and $κ((Σ^*\mathbin{\operatorname{shu}} P)\cap T) \le (2^{m-2}+1)n$. We prove that these bounds are tight, and that to meet them, the alphabet must have at least two letters in the first three cases, and at least $m-1$ letters in the last case. We also consider the special case where $P$ is a single word $w$, and obtain the following tight upper bounds: $κ((wΣ^*)\cap T_n) \le m+n-1$; $κ((Σ^*w)\cap T_n) \le (m-1)n-(m-2)$; $κ((Σ^*wΣ^*)\cap T_n) \le (m-1)n$; and $κ((Σ^*\mathbin{\operatorname{shu}} w)\cap T_n) \le (m-1)n$. For unary languages, we have a tight upper bound of $m+n-2$ in all eight of the aforementioned cases.
Submitted 4 November, 2018; v1 submitted 12 June, 2018;
originally announced June 2018.
-
Most Complex Deterministic Union-Free Regular Languages
Authors:
Janusz A. Brzozowski,
Sylvie Davies
Abstract:
A regular language $L$ is union-free if it can be represented by a regular expression without the union operation. A union-free language is deterministic if it can be accepted by a deterministic one-cycle-free-path finite automaton; this is an automaton which has one final state and exactly one cycle-free path from any state to the final state. Jirásková and Masopust proved that the state complexities of the basic operations reversal, star, product, and boolean operations in deterministic union-free languages are exactly the same as those in the class of all regular languages. To prove that the bounds are met they used five types of automata, involving eight types of transformations of the set of states of the automata. We show that for each $n\ge 3$ there exists one ternary witness of state complexity $n$ that meets the bound for reversal and product. Moreover, the restrictions of this witness to binary alphabets meet the bounds for star and boolean operations. We also show that the tight upper bounds on the state complexity of binary operations that take arguments over different alphabets are the same as those for arbitrary regular languages. Furthermore, we prove that the maximal syntactic semigroup of a union-free language has $n^n$ elements, as in the case of regular languages, and that the maximal state complexities of atoms of union-free languages are the same as those for regular languages. Finally, we prove that there exists a most complex union-free language that meets the bounds for all these complexity measures. Altogether this proves that the complexity measures above cannot distinguish union-free languages from regular languages.
Submitted 2 January, 2018; v1 submitted 24 November, 2017;
originally announced November 2017.
-
A New Technique for Reachability of States in Concatenation Automata
Authors:
Sylvie Davies
Abstract:
We present a new technique for demonstrating the reachability of states in deterministic finite automata representing the concatenation of two languages. Such demonstrations are a necessary step in establishing the state complexity of the concatenation of two languages, and thus in establishing the state complexity of concatenation as an operation. Typically, ad-hoc induction arguments are used to show particular states are reachable in concatenation automata. We prove some results that seem to capture the essence of many of these induction arguments. Using these results, reachability proofs in concatenation automata can often be done more simply and without using induction directly.
Submitted 17 October, 2017; v1 submitted 13 October, 2017;
originally announced October 2017.
-
State Complexity of Reversals of Deterministic Finite Automata with Output
Authors:
Sylvie Davies
Abstract:
We investigate the worst-case state complexity of reversals of deterministic finite automata with output (DFAOs). In these automata, each state is assigned some output value, rather than simply being labelled final or non-final. This directly generalizes the well-studied problem of determining the worst-case state complexity of reversals of ordinary deterministic finite automata. If a DFAO has $n$ states and $k$ possible output values, there is a known upper bound of $k^n$ for the state complexity of reversal. We show this bound can be reached with a ternary input alphabet. We conjecture it cannot be reached with a binary input alphabet except when $k = 2$, and give a lower bound for the case $3 \le k < n$. We prove that the state complexity of reversal depends solely on the transition monoid of the DFAO and the mapping that assigns output values to states.
Submitted 17 October, 2017; v1 submitted 19 May, 2017;
originally announced May 2017.
-
Primitivity, Uniform Minimality and State Complexity of Boolean Operations
Authors:
Sylvie Davies
Abstract:
A minimal deterministic finite automaton (DFA) is uniformly minimal if it always remains minimal when the final state set is replaced by a non-empty proper subset of the state set. We prove that a permutation DFA is uniformly minimal if and only if its transition monoid is a primitive group. We use this to study boolean operations on group languages, which are recognized by direct products of permutation DFAs. A direct product cannot be uniformly minimal, except in the trivial case where one of the DFAs in the product is a one-state DFA. However, non-trivial direct products can satisfy a weaker condition we call uniform boolean minimality, where only final state sets used to recognize boolean operations are considered. We give sufficient conditions for a direct product of two DFAs to be uniformly boolean minimal, which in turn gives sufficient conditions for pairs of group languages to have maximal state complexity under all binary boolean operations ("maximal boolean complexity"). In the case of permutation DFAs with one final state, we give necessary and sufficient conditions for pairs of group languages to have maximal boolean complexity. Our results demonstrate a connection between primitive groups and automata with strong minimality properties.
Submitted 26 March, 2018; v1 submitted 2 February, 2017;
originally announced February 2017.
-
Most Complex Non-Returning Regular Languages
Authors:
Janusz A. Brzozowski,
Sylvie Davies
Abstract:
A regular language $L$ is non-returning if in the minimal deterministic finite automaton accepting it there are no transitions into the initial state. Eom, Han and Jirásková derived upper bounds on the state complexity of boolean operations and Kleene star, and proved that these bounds are tight using two different binary witnesses. They derived upper bounds for concatenation and reversal using three different ternary witnesses. These five witnesses use a total of six different transformations. We show that for each $n\ge 4$ there exists a ternary witness of state complexity $n$ that meets the bound for reversal and that at least three letters are needed to meet this bound. Moreover, the restrictions of this witness to binary alphabets meet the bounds for product, star, and boolean operations. We also derive tight upper bounds on the state complexity of binary operations that take arguments with different alphabets. We prove that the maximal syntactic semigroup of a non-returning language has $(n-1)^n$ elements and requires at least $\binom{n}{2}$ generators. We find the maximal state complexities of atoms of non-returning languages. Finally, we show that there exists a most complex non-returning language that meets the bounds for all these complexity measures.
Submitted 14 January, 2017;
originally announced January 2017.
-
Most Complex Regular Ideal Languages
Authors:
Janusz Brzozowski,
Sylvie Davies,
Bo Yang Victor Liu
Abstract:
A right ideal (left ideal, two-sided ideal) is a non-empty language $L$ over an alphabet $Σ$ such that $L=LΣ^*$ ($L=Σ^*L$, $L=Σ^*LΣ^*$). Let $k=3$ for right ideals, 4 for left ideals and 5 for two-sided ideals. We show that there exist sequences ($L_n \mid n \ge k $) of right, left, and two-sided regular ideals, where $L_n$ has quotient complexity (state complexity) $n$, such that $L_n$ is most complex in its class under the following measures of complexity: the size of the syntactic semigroup, the quotient complexities of the left quotients of $L_n$, the number of atoms (intersections of complemented and uncomplemented left quotients), the quotient complexities of the atoms, and the quotient complexities of reversal, star, product (concatenation), and all binary boolean operations. In that sense, these ideals are "most complex" languages in their classes, or "universal witnesses" to the complexity of the various operations.
Submitted 13 October, 2016; v1 submitted 31 October, 2015;
originally announced November 2015.
-
Quotient Complexities of Atoms in Regular Ideal Languages
Authors:
Janusz Brzozowski,
Sylvie Davies
Abstract:
A (left) quotient of a language $L$ by a word $w$ is the language $w^{-1}L=\{x\mid wx\in L\}$. The quotient complexity of a regular language $L$ is the number of quotients of $L$; it is equal to the state complexity of $L$, which is the number of states in a minimal deterministic finite automaton accepting $L$. An atom of $L$ is an equivalence class of the relation in which two words are equivalent if for each quotient, they either are both in the quotient or both not in it; hence it is a non-empty intersection of complemented and uncomplemented quotients of $L$. A right (respectively, left and two-sided) ideal is a language $L$ over an alphabet $Σ$ that satisfies $L=LΣ^*$ (respectively, $L=Σ^*L$ and $L=Σ^*LΣ^*$). We compute the maximal number of atoms and the maximal quotient complexities of atoms of right, left and two-sided regular ideals.
Submitted 23 May, 2015; v1 submitted 7 March, 2015;
originally announced March 2015.
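The equality between quotient complexity and state complexity stated in the abstract above suggests a simple way to compute it for a given DFA: discard unreachable states, then count the classes of distinguishable states. The following is a minimal sketch using Moore-style partition refinement; the function name, the dictionary encoding of the transition function, and the example automaton (a deliberately non-minimal DFA for the left ideal $Σ^*ab$ over $\{a,b\}$) are assumptions made for illustration.

```python
def quotient_complexity(alphabet, delta, start, finals):
    """Number of distinct quotients of the language of a complete DFA.

    delta maps (state, letter) -> state. The result equals the number of
    states of the minimal DFA for the same language.
    """
    # Step 1: keep only states reachable from the start state.
    reachable = {start}
    stack = [start]
    while stack:
        q = stack.pop()
        for a in alphabet:
            r = delta[(q, a)]
            if r not in reachable:
                reachable.add(r)
                stack.append(r)

    # Step 2: refine the final / non-final partition until it stabilises.
    block = {q: int(q in finals) for q in reachable}
    while True:
        # A state's signature is its own block plus the blocks of its successors.
        sig = {q: (block[q],) + tuple(block[delta[(q, a)]] for a in alphabet)
               for q in reachable}
        ids = {}
        new_block = {q: ids.setdefault(sig[q], len(ids)) for q in reachable}
        if len(ids) == len(set(block.values())):
            return len(ids)  # no block was split, so the partition is final
        block = new_block

# Hypothetical DFA for words ending in "ab"; state 3 duplicates state 0
# and state 4 is unreachable, so neither adds a quotient.
ALPHABET = ("a", "b")
DELTA = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 3,
    (3, "a"): 1, (3, "b"): 3,
    (4, "a"): 4, (4, "b"): 4,
}
COMPLEXITY = quotient_complexity(ALPHABET, DELTA, start=0, finals={2})
```

Each surviving block corresponds to one quotient $w^{-1}L$, namely the language accepted when the automaton is started in any state of that block. Atoms, which the paper studies, are intersections of complemented and uncomplemented quotients and are not computed by this sketch.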
-
Mix-nets: Factored Mixtures of Gaussians in Bayesian Networks With Mixed Continuous And Discrete Variables
Authors:
Scott Davies,
Andrew Moore
Abstract:
Recently developed techniques have made it possible to quickly learn accurate probability density functions from data in low-dimensional continuous space. In particular, mixtures of Gaussians can be fitted to data very quickly using an accelerated EM algorithm that employs multiresolution kd-trees (Moore, 1999). In this paper, we propose a kind of Bayesian network in which low-dimensional mixtures of Gaussians over different subsets of the domain's variables are combined into a coherent joint probability model over the entire domain. The network is also capable of modeling complex dependencies between discrete variables and continuous variables without requiring discretization of the continuous variables. We present efficient heuristic algorithms for automatically learning these networks from data, and perform comparative experiments illustrating how well these networks model real scientific data and synthetic data. We also briefly discuss some possible improvements to the networks, as well as possible applications.
Submitted 16 January, 2013;
originally announced January 2013.
-
Interpolating Conditional Density Trees
Authors:
Scott Davies,
Andrew Moore
Abstract:
Joint distributions over many variables are frequently modeled by decomposing them into products of simpler, lower-dimensional conditional distributions, such as in sparsely connected Bayesian networks. However, automatically learning such models can be very computationally expensive when there are many datapoints and many continuous variables with complex nonlinear relationships, particularly when no good ways of decomposing the joint distribution are known a priori. In such situations, previous research has generally focused on the use of discretization techniques in which each continuous variable has a single discretization that is used throughout the entire network. In this paper, we present and compare a wide variety of tree-based algorithms for learning and evaluating conditional density estimates over continuous variables. These trees can be thought of as discretizations that vary according to the particular interactions being modeled; however, the density within a given leaf of the tree need not be assumed constant, and we show that such nonuniform leaf densities lead to more accurate density estimation. We have developed Bayesian network structure-learning algorithms that employ these tree-based conditional density representations, and we show that they can be used to practically learn complex joint probability models over dozens of continuous variables from thousands of datapoints. We focus on finding models that are simultaneously accurate, fast to learn, and fast to evaluate once they are learned.
Submitted 12 December, 2012;
originally announced January 2013.