-
Leveraging Ontologies to Document Bias in Data
Authors:
Mayra Russo,
Maria-Esther Vidal
Abstract:
Machine Learning (ML) systems are capable of reproducing and often amplifying undesired biases. This puts emphasis on the importance of operating under practices that enable the study and understanding of the intrinsic characteristics of ML pipelines, prompting the emergence of documentation frameworks with the idea that ``any remedy for bias starts with awareness of its existence''. However, a resource that can formally describe these pipelines in terms of the biases detected is still missing. To fill this gap, we present the Doc-BiasO ontology, a resource that aims to create an integrated vocabulary of biases defined in the \textit{fair-ML} literature and their measures, as well as to incorporate relevant terminology and the relationships between them. Following ontology engineering best practices, we reuse existing vocabulary on machine learning and AI to foster knowledge sharing and interoperability between the actors concerned with its research, development, and regulation, among others. Overall, our main objective is to contribute towards clarifying existing terminology on bias research as it rapidly expands to all areas of AI and to improve the interpretation of bias in data and its downstream impact.
Submitted 29 June, 2024;
originally announced July 2024.
-
A Simple and Optimal Sublinear Algorithm for Mean Estimation
Authors:
Beatrice Bertolotti,
Matteo Russo,
Chris Schwiegelshohn
Abstract:
We study the sublinear mean estimation problem. Specifically, we aim to output a point minimizing the sum of squared Euclidean distances. We show that a multiplicative $(1+\varepsilon)$ approximation can be found with probability $1-δ$ using $O(\varepsilon^{-1}\log δ^{-1})$ many independent random samples. We also provide a matching lower bound.
Submitted 7 June, 2024;
originally announced June 2024.
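To make the objective in the mean estimation abstract above concrete, here is a minimal sketch that compares the sum-of-squared-distances cost of an estimate built from a small uniform sample with the cost of the exact centroid. It rests on loud assumptions: the estimator is the plain empirical mean of the sample, which is not necessarily the algorithm of the paper, and the helper names (`centroid`, `cost`, `sample_size`, `sample_mean_estimate`) and the constant `c` in the sample size are hypothetical choices made for illustration.

```python
import math
import random


def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def cost(points, center):
    """Sum of squared Euclidean distances from every point to `center`."""
    return sum(sum((x - c) ** 2 for x, c in zip(p, center)) for p in points)


def sample_size(eps, delta, c=1.0):
    """Sample count of the form c * eps^-1 * log(delta^-1).

    The O(eps^-1 log delta^-1) shape comes from the abstract; the constant
    c = 1 is an assumption made only for this sketch.
    """
    return max(1, math.ceil(c / eps * math.log(1.0 / delta)))


def sample_mean_estimate(points, m, rng):
    """Empirical mean of m points drawn uniformly with replacement.

    This is the plain sample mean, used here as a stand-in estimator.
    """
    sample = [rng.choice(points) for _ in range(m)]
    return centroid(sample)


rng = random.Random(0)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(5000)]

m = sample_size(eps=0.05, delta=0.01)
estimate = sample_mean_estimate(points, m, rng)

# Ratio of the estimate's cost to the optimal cost (the exact centroid).
ratio = cost(points, estimate) / cost(points, centroid(points))
print(m, ratio)
```

For the plain sample mean with `m` draws with replacement, the expected cost is exactly `(1 + 1/m)` times the optimal cost, which is why a sample whose size is independent of the dataset size can already give a multiplicative approximation on average. Achieving the stated success probability $1-δ$ with this few samples is the contribution of the paper and is not reproduced here.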
-
Learning-Based Link Anomaly Detection in Continuous-Time Dynamic Graphs
Authors:
Tim Poštuvan,
Claas Grohnfeldt,
Michele Russo,
Giulio Lovisotto
Abstract:
Anomaly detection in continuous-time dynamic graphs is an emerging field, yet it remains under-explored in the context of learning-based approaches. In this paper, we pioneer structured analyses of link-level anomalies and graph representation learning for identifying anomalous links in these graphs. First, we introduce a fine-grained taxonomy for edge-level anomalies leveraging structural, temporal, and contextual graph properties. We present a method for generating and injecting such typed anomalies into graphs. Next, we introduce a novel method to generate continuous-time dynamic graphs with consistent patterns across time, structure, and context. To allow temporal graph methods to learn the link anomaly detection task, we extend the generic link prediction setting by: (1) conditioning link existence on contextual edge attributes; and (2) refining the training regime to accommodate diverse perturbations in the negative edge sampler. Building on this, we benchmark methods for anomaly detection. Comprehensive experiments on synthetic and real-world datasets -- featuring synthetic and labeled organic anomalies and employing six state-of-the-art learning methods -- validate our taxonomy and generation processes for anomalies and benign graphs, as well as our approach to adapting link prediction methods for anomaly detection. Our results further reveal that different learning methods excel in capturing different aspects of graph normality and detecting different types of anomalies. We conclude with a comprehensive list of findings highlighting opportunities for future research.
Submitted 28 May, 2024;
originally announced May 2024.
-
A Declarative System for Optimizing AI Workloads
Authors:
Chunwei Liu,
Matthew Russo,
Michael Cafarella,
Lei Cao,
Peter Baille Chen,
Zui Chen,
Michael Franklin,
Tim Kraska,
Samuel Madden,
Gerardo Vitagliano
Abstract:
A long-standing goal of data management systems has been to build systems which can compute quantitative insights over large corpora of unstructured data in a cost-effective manner. Until recently, it was difficult and expensive to extract facts from company documents, data from scientific papers, or metrics from image and video corpora. Today's models can accomplish these tasks with high accuracy. However, a programmer who wants to answer a substantive AI-powered query must orchestrate large numbers of models, prompts, and data operations. For even a single query, the programmer has to make a vast number of decisions such as the choice of model, the right inference method, the most cost-effective inference hardware, the ideal prompt design, and so on. The optimal set of decisions can change as the query changes and as the rapidly-evolving technical landscape shifts. In this paper we present Palimpzest, a system that enables anyone to process AI-powered analytical queries simply by defining them in a declarative language. The system uses its cost optimization framework to implement the query plan with the best trade-offs between runtime, financial cost, and output data quality. We describe the workload of AI-powered analytics tasks, the optimization methods that Palimpzest uses, and the prototype system itself. We evaluate Palimpzest on tasks in Legal Discovery, Real Estate Search, and Medical Schema Matching. We show that even our simple prototype offers a range of appealing plans, including one that is 3.3x faster and 2.9x cheaper than the baseline method, while also offering better data quality. With parallelism enabled, Palimpzest can produce plans with up to a 90.3x speedup at 9.1x lower cost relative to a single-threaded GPT-4 baseline, while obtaining an F1-score within 83.5% of the baseline. These require no additional work by the user.
Submitted 29 May, 2024; v1 submitted 23 May, 2024;
originally announced May 2024.
-
Contracts with Inspections
Authors:
Tomer Ezra,
Stefano Leonardi,
Matteo Russo
Abstract:
In the classical principal-agent hidden-action model, a principal delegates the execution of a costly task to an agent, who can choose among actions with different costs and different success probabilities to accomplish the task. To incentivize the agent to exert effort, the principal can commit to a contract, which specifies the amount of payment based on the task's success. A crucial assumption of this model is that the principal can only base the payment on the outcome but not on the agent's chosen action.
In this work, we relax the hidden-action assumption and introduce a new model where the principal is allowed to inspect subsets of actions at some cost that depends on the inspected subset. If the principal discovers that the agent did not select the agreed-upon action through the inspection, the principal can withhold payment. This relaxation of the model introduces a broader strategy space for the principal, who now faces a tradeoff between positive incentives (increasing payment) and negative incentives (increasing inspection).
We show how to find the best deterministic incentive-compatible inspection scheme for all monotone inspection cost functions. We then turn to randomized inspection schemes and show that one can efficiently find the best randomized incentive-compatible inspection scheme when the inspection cost function is submodular. We complement this result by showing that it is impossible to efficiently find the optimal randomized inspection scheme for the more general case of XOS inspection cost functions.
Submitted 26 February, 2024;
originally announced February 2024.
-
Empowering machine learning models with contextual knowledge for enhancing the detection of eating disorders in social media posts
Authors:
José Alberto Benítez-Andrades,
María Teresa García-Ordás,
Mayra Russo,
Ahmad Sakor,
Luis Daniel Fernandes Rotger,
Maria-Esther Vidal
Abstract:
Social networks are vital for information sharing, especially in the health sector for discussing diseases and treatments. These platforms, however, often feature posts as brief texts, posing challenges for Artificial Intelligence (AI) in understanding context. We introduce a novel hybrid approach combining community-maintained knowledge graphs (like Wikidata) with deep learning to enhance the categorization of social media posts. This method uses advanced entity recognizers and linkers (like Falcon 2.0) to connect short post entities to knowledge graphs. Knowledge graph embeddings (KGEs) and contextualized word embeddings (like BERT) are then employed to create rich, context-based representations of these posts.
Our focus is on the health domain, particularly in identifying posts related to eating disorders (e.g., anorexia, bulimia) to aid healthcare providers in early diagnosis. We tested our approach on a dataset of 2,000 tweets about eating disorders, finding that merging word embeddings with knowledge graph information enhances the predictive models' reliability. This methodology aims to assist health experts in spotting patterns indicative of mental disorders, thereby improving early detection and accurate diagnosis for personalized medicine.
Submitted 8 February, 2024;
originally announced February 2024.
-
Low-Distortion Clustering with Ordinal and Limited Cardinal Information
Authors:
Jakob Burkhardt,
Ioannis Caragiannis,
Karl Fehrs,
Matteo Russo,
Chris Schwiegelshohn,
Sudarshan Shyam
Abstract:
Motivated by recent work in computational social choice, we extend the metric distortion framework to clustering problems. Given a set of $n$ agents located in an underlying metric space, our goal is to partition them into $k$ clusters, optimizing some social cost objective. The metric space is defined by a distance function $d$ between the agent locations. Information about $d$ is available only implicitly via $n$ rankings, through which each agent ranks all other agents in terms of their distance from her. Still, we would like to evaluate clustering algorithms in terms of social cost objectives that are defined using $d$. This is done using the notion of distortion, which measures how far from optimality a clustering can be, taking into account all underlying metrics that are consistent with the ordinal information available. Unfortunately, the most important clustering objectives do not admit algorithms with finite distortion. To sidestep this disappointing fact, we follow two alternative approaches: We first explore whether resource augmentation can be beneficial. We consider algorithms that use more than $k$ clusters but compare their social cost to that of the optimal $k$-clusterings. We show that using exponentially (in terms of $k$) many clusters, we can get low (constant or logarithmic) distortion for the $k$-center and $k$-median objectives. Interestingly, such an exponential blowup is shown to be necessary. More importantly, we explore whether limited cardinal information can be used to obtain better results. Somewhat surprisingly, for $k$-median and $k$-center, we show that a number of queries that is polynomial in $k$ and only logarithmic in $n$ (i.e., only sublinear in the number of agents for the most relevant scenarios in practice) is enough to get constant distortion.
Submitted 6 February, 2024;
originally announced February 2024.
-
On Finding Optimal (Dynamic) Arborescences
Authors:
Joaquim Espada,
Alexandre P. Francisco,
Tatiana Rocher,
Luís M. S. Russo,
Cátia Vaz
Abstract:
Let G = (V, E) be a directed and weighted graph with vertex set V of size n and edge set E of size m, such that each edge (u, v) \in E has a real-valued weight w(u, v). An arborescence in G is a subgraph T = (V, E') such that for a vertex u \in V, the root, there is a unique path in T from u to any other vertex v \in V. The weight of T is the sum of the weights of its edges. In this paper, given G, we are interested in finding an arborescence in G with minimum weight, i.e., an optimal arborescence. Furthermore, when G is subject to changes, namely edge insertions and deletions, we are interested in efficiently maintaining a dynamic arborescence in G. This is a well-known problem with applications in several domains such as network design optimization and phylogenetic inference. In this paper, we revisit algorithmic ideas proposed by several authors for this problem, provide detailed pseudo-code as well as implementation details, and present experimental results on large scale-free networks and on phylogenetic inference. Our implementation is publicly available at \url{https://gitlab.com/espadas/optimal-arborescences}.
Submitted 6 November, 2023;
originally announced November 2023.
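The optimal arborescence problem defined in the abstract above can be illustrated with the classical contraction approach usually attributed to Chu and Liu and to Edmonds, in its simple O(nm) form. This is a minimal sketch under stated assumptions: it is not the authors' implementation (which is in the repository linked in the abstract), it handles only the static problem, it returns only the optimal weight, and the function name `min_arborescence_weight` and the `(u, v, w)` edge-triple format are hypothetical.

```python
import math


def min_arborescence_weight(n, edges, root):
    """Weight of a minimum-weight arborescence rooted at `root`.

    Vertices are 0..n-1 and `edges` is a list of (u, v, w) triples for
    directed edges u -> v. Returns None if some vertex is unreachable.
    """
    total = 0
    while True:
        # Step 1: pick the cheapest incoming edge of every non-root vertex.
        best = [math.inf] * n
        parent = [-1] * n
        for u, v, w in edges:
            if u != v and w < best[v]:
                best[v], parent[v] = w, u
        if any(best[v] == math.inf for v in range(n) if v != root):
            return None
        best[root] = 0

        # Step 2: add the chosen weights and look for cycles among the
        # chosen edges by walking parent pointers from each vertex.
        comp = [-1] * n   # id of the contracted component, -1 if unassigned
        seen = [-1] * n   # start vertex of the walk that last visited x
        count = 0
        for v in range(n):
            total += best[v]
            x = v
            while x != root and seen[x] != v and comp[x] == -1:
                seen[x] = v
                x = parent[x]
            if x != root and comp[x] == -1:
                # The walk returned to x, so x lies on a new cycle.
                y = parent[x]
                while y != x:
                    comp[y] = count
                    y = parent[y]
                comp[x] = count
                count += 1
        if count == 0:
            return total  # the chosen edges already form an arborescence

        # Step 3: contract each cycle into one vertex and reduce the weight
        # of every surviving edge by the chosen weight of its head.
        for v in range(n):
            if comp[v] == -1:
                comp[v] = count
                count += 1
        edges = [(comp[u], comp[v], w - best[v])
                 for u, v, w in edges if comp[u] != comp[v]]
        root, n = comp[root], count


example = [(0, 1, 5), (1, 2, 1), (2, 1, 1), (0, 2, 10), (2, 3, 2), (1, 3, 8)]
print(min_arborescence_weight(4, example, root=0))
```

In the example, the cheapest incoming edges first form the cycle between vertices 1 and 2; after contraction the algorithm enters the cycle through the edge 0 -> 1, which yields the arborescence 0 -> 1 -> 2 -> 3.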
-
Submodular Norms with Applications To Online Facility Location and Stochastic Probing
Authors:
Kalen Patton,
Matteo Russo,
Sahil Singla
Abstract:
Optimization problems often involve vector norms, which has led to extensive research on developing algorithms that can handle objectives beyond the $\ell_p$ norms. Our work introduces the concept of submodular norms, which are a versatile type of norms that possess marginal properties similar to submodular set functions. We show that submodular norms can accurately represent or approximate well-known classes of norms, such as $\ell_p$ norms, ordered norms, and symmetric norms. Furthermore, we establish that submodular norms can be applied to optimization problems such as online facility location, stochastic probing, and generalized load balancing. This allows us to develop a logarithmic-competitive algorithm for online facility location with symmetric norms, to prove a logarithmic adaptivity gap for stochastic probing with symmetric norms, and to give an alternative poly-logarithmic approximation algorithm for generalized load balancing with outer $\ell_1$ norm and inner symmetric norms.
Submitted 6 October, 2023;
originally announced October 2023.
-
Accelerating Aggregation Queries on Unstructured Streams of Data
Authors:
Matthew Russo,
Tatsunori Hashimoto,
Daniel Kang,
Yi Sun,
Matei Zaharia
Abstract:
Analysts and scientists are interested in querying streams of video, audio, and text to extract quantitative insights. For example, an urban planner may wish to measure congestion by querying the live feed from a traffic camera. Prior work has used deep neural networks (DNNs) to answer such queries in the batch setting. However, much of this work is not suited for the streaming setting because it requires access to the entire dataset before a query can be submitted or is specific to video. Thus, to the best of our knowledge, no prior work addresses the problem of efficiently answering queries over multiple modalities of streams.
In this work we propose InQuest, a system for accelerating aggregation queries on unstructured streams of data with statistical guarantees on query accuracy. InQuest leverages inexpensive approximation models ("proxies") and sampling techniques to limit the execution of an expensive high-precision model (an "oracle") to a subset of the stream. It then uses the oracle predictions to compute an approximate query answer in real-time. We theoretically analyze InQuest and show that the expected error of its query estimates converges on stationary streams at a rate inversely proportional to the oracle budget. We evaluate our algorithm on six real-world video and text datasets and show that InQuest achieves the same root mean squared error (RMSE) as two streaming baselines with up to 5.0x fewer oracle invocations. We further show that InQuest can achieve up to 1.9x lower RMSE at a fixed number of oracle invocations than a state-of-the-art batch setting algorithm.
Submitted 17 August, 2023;
originally announced August 2023.
-
Quantum Kernel Estimation With Neutral Atoms For Supervised Classification: A Gate-Based Approach
Authors:
Marco Russo,
Edoardo Giusto,
Bartolomeo Montrucchio
Abstract:
Quantum Kernel Estimation (QKE) is a technique based on leveraging a quantum computer to estimate a kernel function that is classically difficult to calculate, which is then used by a classical computer for training a Support Vector Machine (SVM). Given the high number of 2-local operators necessary for realizing a feature mapping that is hard to simulate classically, a high qubit connectivity is needed, which is not currently possible on superconducting devices. For this reason, neutral atom quantum computers can be used, since they allow the atoms to be arranged with more freedom. Examples of neutral-atom-based QKE can be found in the literature, but they are focused on graph learning and use the analogue approach. In this paper, a general method based on the gate model is presented. After deriving 1-qubit and 2-qubit gates starting from laser pulses, a parameterized sequence for feature mapping on 3 qubits is realized. This sequence is then used to empirically compute the kernel matrix starting from a dataset, which is finally used to train the SVM. It is also shown that this process can be generalized up to N qubits, taking advantage of the more flexible arrangement of atoms that this technology allows. The accuracy is shown to be high despite the small dataset and the low separation. This is the first paper that not only proposes an algorithm for explicitly deriving a universal set of gates but also presents a method of estimating quantum kernels on neutral atom devices for general problems using the gate model.
Submitted 28 July, 2023;
originally announced July 2023.
-
Bound by the Bounty: Collaboratively Shaping Evaluation Processes for Queer AI Harms
Authors:
Organizers of QueerInAI,
Nathan Dennler,
Anaelia Ovalle,
Ashwin Singh,
Luca Soldaini,
Arjun Subramonian,
Huy Tu,
William Agnew,
Avijit Ghosh,
Kyra Yee,
Irene Font Peradejordi,
Zeerak Talat,
Mayra Russo,
Jess de Jesus de Pinho Pinhal
Abstract:
Bias evaluation benchmarks and dataset and model documentation have emerged as central processes for assessing the biases and harms of artificial intelligence (AI) systems. However, these auditing processes have been criticized for their failure to integrate the knowledge of marginalized communities and consider the power dynamics between auditors and the communities. Consequently, modes of bias evaluation have been proposed that engage impacted communities in identifying and assessing the harms of AI systems (e.g., bias bounties). Even so, asking what marginalized communities want from such auditing processes has been neglected. In this paper, we ask queer communities for their positions on, and desires from, auditing processes. To this end, we organized a participatory workshop to critique and redesign bias bounties from queer perspectives. We found that when given space, the scope of feedback from workshop participants goes far beyond what bias bounties afford, with participants questioning the ownership, incentives, and efficacy of bounties. We conclude by advocating for community ownership of bounties and complementing bounties with participatory processes (e.g., co-creation).
Submitted 25 July, 2023; v1 submitted 14 July, 2023;
originally announced July 2023.
-
Fair Division with Interdependent Values
Authors:
Georgios Birmpas,
Tomer Ezra,
Stefano Leonardi,
Matteo Russo
Abstract:
We introduce the study of designing allocation mechanisms for fairly allocating indivisible goods in settings with interdependent valuation functions. In our setting, there is a set of goods that needs to be allocated to a set of agents (without disposal). Each agent is given a private signal, and his valuation function depends on the signals of all agents. Without the use of payments, there are strong impossibility results for designing strategyproof allocation mechanisms even in settings without interdependent values. Therefore, we turn to designing mechanisms that always admit equilibria that are fair with respect to the agents' true signals, despite their potentially distorted perception. To do so, we first extend the definitions of pure Nash equilibrium and of well-studied fairness notions in the literature to the interdependent setting. We devise simple allocation mechanisms that always admit a fair equilibrium with respect to the true signals. We complement this result by showing that, even for very simple cases with binary additive interdependent valuation functions, no allocation mechanism that always admits an equilibrium can guarantee that all equilibria are fair with respect to the true signals.
Submitted 23 May, 2023;
originally announced May 2023.
-
GCNH: A Simple Method For Representation Learning On Heterophilous Graphs
Authors:
Andrea Cavallo,
Claas Grohnfeldt,
Michele Russo,
Giulio Lovisotto,
Luca Vassio
Abstract:
Graph Neural Networks (GNNs) are well-suited for learning on homophilous graphs, i.e., graphs in which edges tend to connect nodes of the same type. Yet, achievement of consistent GNN performance on heterophilous graphs remains an open research problem. Recent works have proposed extensions to standard GNN architectures to improve performance on heterophilous graphs, trading off model simplicity for prediction accuracy. However, these models fail to capture basic graph properties, such as neighborhood label distribution, which are fundamental for learning. In this work, we propose GCN for Heterophily (GCNH), a simple yet effective GNN architecture applicable to both heterophilous and homophilous scenarios. GCNH learns and combines separate representations for a node and its neighbors, using one learned importance coefficient per layer to balance the contributions of center nodes and neighborhoods. We conduct extensive experiments on eight real-world graphs and a set of synthetic graphs with varying degrees of heterophily to demonstrate how the design choices for GCNH lead to a sizable improvement over a vanilla GCN. Moreover, GCNH outperforms state-of-the-art models of much higher complexity on four out of eight benchmarks, while producing comparable results on the remaining datasets. Finally, we discuss and analyze the lower complexity of GCNH, which results in fewer trainable parameters and faster training times than other methods, and show how GCNH mitigates the oversmoothing problem.
Submitted 21 April, 2023;
originally announced April 2023.
-
Fully Dynamic Online Selection through Online Contention Resolution Schemes
Authors:
Vashist Avadhanula,
Andrea Celli,
Riccardo Colini-Baldeschi,
Stefano Leonardi,
Matteo Russo
Abstract:
We study fully dynamic online selection problems in an adversarial/stochastic setting that includes Bayesian online selection, prophet inequalities, posted price mechanisms, and stochastic probing problems subject to combinatorial constraints. In the classical ``incremental'' version of the problem, selected elements remain active until the end of the input sequence. On the other hand, in the fully dynamic version of the problem, elements stay active for a limited time interval, and then leave. This models, for example, the online matching of tasks to workers with task/worker-dependent working times, and sequential posted pricing of perishable goods. A successful approach to online selection problems in the adversarial setting is given by the notion of Online Contention Resolution Scheme (OCRS), that uses a priori information to formulate a linear relaxation of the underlying optimization problem, whose optimal fractional solution is rounded online for any adversarial order of the input sequence. Our main contribution is providing a general method for constructing an OCRS for fully dynamic online selection problems. Then, we show how to employ such OCRS to construct no-regret algorithms in a partial information model with semi-bandit feedback and adversarial inputs.
Submitted 8 January, 2023;
originally announced January 2023.
-
2-hop Neighbor Class Similarity (2NCS): A graph structural metric indicative of graph neural network performance
Authors:
Andrea Cavallo,
Claas Grohnfeldt,
Michele Russo,
Giulio Lovisotto,
Luca Vassio
Abstract:
Graph Neural Networks (GNNs) achieve state-of-the-art performance on graph-structured data across numerous domains. Their underlying ability to represent nodes as summaries of their vicinities has proven effective for homophilous graphs in particular, in which same-type nodes tend to connect. On heterophilous graphs, in which different-type nodes are likely connected, GNNs perform less consistently, as neighborhood information might be less representative or even misleading. On the other hand, GNN performance is not inferior on all heterophilous graphs, and there is a lack of understanding of what other graph properties affect GNN performance.
In this work, we highlight the limitations of the widely used homophily ratio and the recent Cross-Class Neighborhood Similarity (CCNS) metric in estimating GNN performance. To overcome these limitations, we introduce 2-hop Neighbor Class Similarity (2NCS), a new quantitative graph structural property that correlates with GNN performance more strongly and consistently than alternative metrics. 2NCS considers two-hop neighborhoods as a theoretically derived consequence of the two-step label propagation process governing GCN's training-inference process. Experiments on one synthetic and eight real-world graph datasets confirm consistent improvements over existing metrics in estimating the accuracy of GCN- and GAT-based architectures on the node classification task.
Submitted 26 December, 2022;
originally announced December 2022.
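The homophily ratio that the abstract above uses as a baseline metric is commonly defined at the edge level as the fraction of edges whose two endpoints carry the same class label. The sketch below computes that quantity only; it assumes this common edge-level definition, does not reproduce CCNS or 2NCS, and uses a hypothetical function name and input format (a list of node-index pairs plus a list of labels).

```python
def edge_homophily_ratio(edges, labels):
    """Fraction of edges that connect two nodes with the same label.

    `edges` is a list of (u, v) node-index pairs and `labels[i]` is the
    class of node i. Values near 1 indicate a homophilous graph, values
    near 0 a heterophilous one.
    """
    if not edges:
        raise ValueError("the homophily ratio is undefined for a graph without edges")
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)


# A 4-cycle in which two of the four edges join nodes of the same class.
cycle_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
cycle_labels = [0, 0, 1, 1]
print(edge_homophily_ratio(cycle_edges, cycle_labels))
```

Because this ratio only looks at direct neighbors, two graphs with the same value can still differ in their two-hop label structure, which is the kind of information the abstract says 2NCS is designed to capture.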
-
Prophet Inequalities via the Expected Competitive Ratio
Authors:
Tomer Ezra,
Stefano Leonardi,
Rebecca Reiffenhäuser,
Matteo Russo,
Alexandros Tsigonias-Dimitriadis
Abstract:
We consider prophet inequalities under downward-closed constraints. In this problem, a decision-maker makes immediate and irrevocable choices on arriving elements, subject to constraints. Traditionally, performance is compared to the expected offline optimum, called the \textit{Ratio of Expectations} (RoE). However, RoE has limitations as it only guarantees the average performance compared to the optimum, and might perform poorly against the realized ex-post optimal value. We study an alternative performance measure, the \textit{Expected Ratio} (EoR), namely the expectation of the ratio between the algorithm's and the prophet's values. EoR offers robust guarantees, e.g., a constant EoR implies achieving a constant fraction of the offline optimum with constant probability. For the special case of single-choice problems the EoR coincides with the well-studied notion of probability of selecting the maximum. However, the EoR naturally generalizes the probability of selecting the maximum for combinatorial constraints, which are the main focus of this paper. Specifically, we establish two reductions: for every constraint, the RoE and the EoR are at most a constant factor apart. Additionally, we show that the EoR is a stronger benchmark than the RoE in that, for every instance (constraint and distribution), the RoE is at least a constant fraction of the EoR, but not vice versa. Both these reductions imply a wealth of EoR results in multiple settings where RoE results are known.
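The difference between the two measures can be made concrete on a tiny single-choice instance. The sketch below enumerates two independent discrete values exactly and evaluates a simple threshold rule; the rule, the distributions, and the function name are illustrative assumptions, not the algorithms analysed in the paper.

```python
from fractions import Fraction
from itertools import product


def roe_and_eor(dist1, dist2, threshold):
    """Exact RoE and EoR for a two-element, single-choice instance.

    Each distribution is a list of (positive integer value, probability) pairs.
    Assumed rule: accept the first value if it is >= threshold, otherwise
    take the second value.
    """
    e_alg = e_opt = e_ratio = Fraction(0)
    for (x1, p1), (x2, p2) in product(dist1, dist2):
        p = p1 * p2
        alg = x1 if x1 >= threshold else x2
        opt = max(x1, x2)                 # the prophet's value
        e_alg += p * alg
        e_opt += p * opt
        e_ratio += p * Fraction(alg, opt)
    # RoE = E[ALG] / E[OPT];  EoR = E[ALG / OPT]
    return e_alg / e_opt, e_ratio


half = Fraction(1, 2)
roe, eor = roe_and_eor([(1, half), (3, half)], [(2, half), (4, half)], threshold=3)
```

On this instance the two measures come out as 12/13 and 15/16 respectively, which shows that they are distinct quantities even when both are close to 1.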
Submitted 6 October, 2023; v1 submitted 7 July, 2022;
originally announced July 2022.
-
Active TLS Stack Fingerprinting: Characterizing TLS Server Deployments at Scale
Authors:
Markus Sosnowski,
Johannes Zirngibl,
Patrick Sattler,
Georg Carle,
Claas Grohnfeldt,
Michele Russo,
Daniele Sgandurra
Abstract:
Active measurements can be used to collect server characteristics on a large scale. This kind of metadata can help discover hidden relations and commonalities among server deployments, offering new possibilities to cluster and classify them. As an example, identifying previously unknown cybercriminal infrastructure can be a valuable source for cyber-threat intelligence. We propose herein an active measurement-based methodology for acquiring Transport Layer Security (TLS) metadata from servers and leverage it for their fingerprinting. Our fingerprints capture the characteristic behavior of the TLS stack primarily caused by the implementation, configuration, and hardware support of the underlying server. Using an empirical optimization strategy that maximizes information gain from every handshake to minimize measurement costs, we generated 10 general-purpose Client Hellos used as scanning probes to create a large database of TLS configurations used for classifying servers. We fingerprinted 28 million servers from the Alexa and Majestic toplists and two Command and Control (C2) blocklists over a period of 30 weeks with weekly snapshots as the foundation for two long-term case studies: classification of Content Delivery Network and C2 servers. The proposed methodology shows a precision of more than 99% and enables a stable identification of new servers over time. This study describes a new opportunity for active measurements to provide valuable insights into the Internet that can be used in security-relevant use cases.
Submitted 30 August, 2023; v1 submitted 27 June, 2022;
originally announced June 2022.
-
A novel multi-layer modular approach for real-time fuzzy-identification of gravitational-wave signals
Authors:
Francesco Pio Barone,
Daniele Dell'Aquila,
Marco Russo
Abstract:
Advanced LIGO and Advanced Virgo ground-based interferometers are instruments capable of detecting gravitational wave signals by exploiting advanced laser interferometry techniques. The underlying data analysis task consists in identifying specific patterns in noisy time series, but it is made extremely complex by the incredibly small amplitude of the target signals. In this scenario, the development of effective gravitational wave detection algorithms is crucial. We propose a novel layered framework for real-time detection of gravitational waves inspired by speech processing techniques and, in the present implementation, based on a state-of-the-art machine learning approach involving a hybridization of genetic programming and neural networks. The key aspects of the newly proposed framework are the well-structured, layered approach and the low computational complexity. The paper describes the basic concepts of the framework and the derivation of the first three layers. Even if the layers are based on models derived using a machine learning approach, the proposed layered structure has a universal nature. Compared to more complex approaches, such as convolutional neural networks, which comprise a parameter set of several tens of MB and were tested exclusively for fixed length data samples, our framework has lower accuracy (e.g., it identifies 45% of low signal-to-noise-ratio gravitational wave signals, against 65% of the state-of-the-art, at a false alarm probability of $10^{-2}$), but has a much lower computational complexity and a higher degree of modularity. Furthermore, the exploitation of short-term features makes the results of the new framework virtually independent of the time position of gravitational wave signals, simplifying its future exploitation in real-time multi-layer pipelines for gravitational-wave detection with new generation interferometers.
Submitted 16 December, 2023; v1 submitted 13 June, 2022;
originally announced June 2022.
-
Range Minimum Queries in Minimal Space
Authors:
Luís M. S. Russo
Abstract:
We consider the problem of computing a sequence of range minimum queries. We assume a sequence of commands that contains values and queries. Our goal is to quickly determine the minimum value that exists between the current position and a previous position $i$. Range minimum queries are used as a sub-routine of several algorithms, notably in string processing. We propose a data structure that can process these command sequences. We obtain efficient results for several variations of the problem; in particular, we obtain $O(1)$ time per command for the offline version and $O(α(n))$ amortized time for the online version, where $α(n)$ is the inverse Ackermann function and $n$ the number of values in the sequence. This data structure also has very small space requirements, namely $O(\ell)$ where $\ell$ is the maximum number of active $i$ positions. We implemented our data structure and show that it is competitive against existing alternatives. We obtain comparable command processing time, in the nanosecond range, and much smaller space requirements.
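To illustrate the command model (values arrive one at a time, and a query asks for the minimum between an earlier position $i$ and the current position), here is a minimal baseline sketch using a monotone stack with binary search. It is a textbook technique with $O(\log n)$ query time, not the data structure proposed in the paper; the class and method names are assumptions made for this example.

```python
from bisect import bisect_left


class SuffixMinStack:
    """Answers min(values[i:]) for the values pushed so far.

    Keeps only positions that are the minimum of some suffix; their values
    are strictly increasing from the bottom to the top of the stack.
    """

    def __init__(self):
        self._count = 0   # number of values pushed so far
        self._idx = []    # retained positions, in increasing order
        self._val = []    # values at those positions, strictly increasing

    def push(self, value):
        # Older entries that are >= value can never again be a suffix minimum.
        while self._val and self._val[-1] >= value:
            self._val.pop()
            self._idx.pop()
        self._idx.append(self._count)
        self._val.append(value)
        self._count += 1

    def query(self, i):
        """Minimum of the values at positions i .. current (0-based)."""
        if not 0 <= i < self._count:
            raise IndexError("position out of range")
        # The first retained position >= i holds the minimum of that suffix.
        k = bisect_left(self._idx, i)
        return self._val[k]


rmq = SuffixMinStack()
for v in [5, 3, 4]:
    rmq.push(v)
first = rmq.query(0)    # minimum of 5, 3, 4
second = rmq.query(2)   # minimum of 4
```

The stack stores at most one entry per pushed value, so this baseline uses $O(n)$ space in the worst case, in contrast to the $O(\ell)$ bound stated in the abstract.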
Submitted 18 February, 2021;
originally announced February 2021.
-
Small Longest Tandem Scattered Subsequences
Authors:
Luís M. S. Russo,
Alexandre P. Francisco
Abstract:
We consider the problem of identifying tandem scattered subsequences within a string. Our algorithm identifies a longest subsequence which occurs twice without overlap in a string. This algorithm is based on the Hunt-Szymanski algorithm; therefore, its performance improves if the string is not self-similar. This occurs naturally on strings over large alphabets. Our algorithm relies on new results for data structures that support dynamic longest increasing sub-sequences. In the process we also obtain improved algorithms for the decremental string comparison problem.
△ Less
Submitted 24 June, 2020;
originally announced June 2020.
-
Hardness of Modern Games
Authors:
Diogo M. Costa,
Alexandre P. Francisco,
Luís M. S. Russo
Abstract:
We consider the complexity properties of modern puzzle games, Hexiom, Cut the Rope and Back to Bed. The complexity of games plays an important role in the type of experience they provide to players. Back to Bed is shown to be PSPACE-Hard and the first two are shown to be NP-Hard. These results give further insight into the structure of these games and the resulting constructions may be useful in further complexity studies.
△ Less
Submitted 21 May, 2020;
originally announced May 2020.
-
Sparsifying Parity-Check Matrices
Authors:
Luís M. S. Russo,
Tobias Dietz,
José Rui Figueira,
Alexandre P. Francisco,
Stefan Ruzika
Abstract:
Parity check matrices (PCMs) are used to define linear error correcting codes and ensure reliable information transmission over noisy channels. The set of codewords of such a code is the null space of this binary matrix.
We consider the problem of minimizing the number of one-entries in parity-check matrices. In the maximum-likelihood (ML) decoding method, the number of ones in PCMs is directly related to the time required to decode messages. We propose a simple matrix row manipulation heuristic which alters the PCM, but not the code itself. We apply simulated annealing and greedy local searches to obtain PCMs with a small number of one-entries quickly, i.e. in a couple of minutes or hours when using mainstream hardware. The resulting matrices provide faster ML decoding procedures, especially for large codes.
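The claim that row manipulation alters the PCM but not the code can be checked on a small example. The sketch below adds one row to another over GF(2) and verifies by brute force that the null space is unchanged while the number of ones drops. The matrix and helper names are illustrative assumptions; this shows only the elementary row operation, not the simulated annealing or local search from the paper.

```python
from itertools import product


def codewords(pcm):
    """All binary vectors x with pcm * x = 0 over GF(2), found by brute force."""
    n = len(pcm[0])
    return {
        x
        for x in product((0, 1), repeat=n)
        if all(sum(h * b for h, b in zip(row, x)) % 2 == 0 for row in pcm)
    }


def add_row(pcm, target, source):
    """Return a copy of pcm with row `source` added (XOR) into row `target`."""
    if target == source:
        raise ValueError("adding a row to itself would zero it out")
    new = [list(row) for row in pcm]
    new[target] = [a ^ b for a, b in zip(pcm[target], pcm[source])]
    return new


def count_ones(pcm):
    return sum(sum(row) for row in pcm)


# Hypothetical 2 x 4 parity-check matrix.
H = [[1, 1, 1, 0],
     [1, 1, 0, 1]]
H_sparser = add_row(H, target=0, source=1)   # row 0 becomes 0 0 1 1
```

Here `H` has six ones and `H_sparser` has five, yet both define the same set of codewords.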
Submitted 8 May, 2020;
originally announced May 2020.
-
Incremental Multiple Longest Common Sub-Sequences
Authors:
Luís M. S. Russo,
Alexandre P. Francisco,
Tatiana Rocher
Abstract:
We consider the problem of updating the information about multiple longest common sub-sequences. Sub-sequences of this kind highlight information that is shared across several sequences, and are therefore used extensively, notably in bioinformatics and computational genomics. In this paper we propose a way to maintain this information when the underlying sequences are subject to modifications, namely when letters are added to or removed from the extremes of the sequence. Experimentally, our data structure obtains significant improvements over the state of the art.
△ Less
Submitted 6 May, 2020;
originally announced May 2020.
-
Approximating Optimal Bidirectional Macro Schemes
Authors:
Luís M. S. Russo,
Ana D. Correia,
Gonzalo Navarro,
Alexandre P. Francisco
Abstract:
Lempel-Ziv is an easy-to-compute member of a wide family of so-called macro schemes; it restricts pointers to go in one direction only. Optimal bidirectional macro schemes are NP-complete to find, but they may provide much better compression on highly repetitive sequences. We consider the problem of approximating optimal bidirectional macro schemes. We describe a simulated annealing algorithm that usually converges quickly. Moreover, in some cases, we obtain bidirectional macro schemes that are provably a 2-approximation of the optimal. We test our algorithm on a number of artificial repetitive texts and verify that it is efficient in practice and outperforms Lempel-Ziv, sometimes by a wide margin.
△ Less
Submitted 4 March, 2020;
originally announced March 2020.
-
On dynamic succinct graph representations
Authors:
Miguel E. Coimbra,
Alexandre P. Francisco,
Luís M. S. Russo,
Guillermo de Bernardo,
Susana Ladra,
Gonzalo Navarro
Abstract:
We address the problem of representing dynamic graphs using $k^2$-trees. The $k^2$-tree data structure is one of the succinct data structures proposed for representing static graphs, and binary relations in general. It relies on compact representations of bit vectors. Hence, by relying on compact representations of dynamic bit vectors, we can also represent dynamic graphs. In this paper we follow instead the ideas by Munro {\em et al.}, and we present an alternative implementation for representing dynamic graphs using $k^2$-trees. Our experimental results show that this new implementation is competitive in practice.
△ Less
Submitted 6 December, 2019; v1 submitted 8 November, 2019;
originally announced November 2019.
-
Poisoning Attacks with Generative Adversarial Nets
Authors:
Luis Muñoz-González,
Bjarne Pfitzner,
Matteo Russo,
Javier Carnerero-Cano,
Emil C. Lupu
Abstract:
Machine learning algorithms are vulnerable to poisoning attacks: An adversary can inject malicious points in the training dataset to influence the learning process and degrade the algorithm's performance. Optimal poisoning attacks have already been proposed to evaluate worst-case scenarios, modelling attacks as a bi-level optimization problem. Solving these problems is computationally demanding and has limited applicability for some models such as deep networks. In this paper we introduce a novel generative model to craft systematic poisoning attacks against machine learning classifiers generating adversarial training examples, i.e. samples that look like genuine data points but that degrade the classifier's accuracy when used for training. We propose a Generative Adversarial Net with three components: generator, discriminator, and the target classifier. This approach allows us to model naturally the detectability constraints that can be expected in realistic attacks and to identify the regions of the underlying data distribution that can be more vulnerable to data poisoning. Our experimental evaluation shows the effectiveness of our attack to compromise machine learning classifiers, including deep networks.
△ Less
Submitted 25 September, 2019; v1 submitted 18 June, 2019;
originally announced June 2019.
-
Order-Preserving Pattern Matching Indeterminate Strings
Authors:
Diogo Costa,
Luís M. S. Russo,
Rui Henriques,
Hideo Bannai,
Alexandre P. Francisco
Abstract:
Given an indeterminate string pattern $p$ and an indeterminate string text $t$, the problem of order-preserving pattern matching with character uncertainties ($μ$OPPM) is to find all substrings of $t$ that satisfy one of the possible orderings defined by $p$. When the text and pattern are determinate strings, we are in the presence of the well-studied exact order-preserving pattern matching (OPPM) problem with diverse applications on time series analysis. Despite its relevance, the exact OPPM problem suffers from two major drawbacks: 1) the inability to deal with indetermination in the text, thus preventing the analysis of noisy time series; and 2) the inability to deal with indetermination in the pattern, thus imposing the strict satisfaction of the orders among all pattern positions. This paper provides the first polynomial algorithm to answer the $μ$OPPM problem when indetermination is observed on the pattern or text. Given two strings with length $m$ and $O(r)$ uncertain characters per string position, we show that the $μ$OPPM problem can be solved in $O(mr\lg r)$ time when one string is indeterminate and $r\in\mathbb{N}^+$. Mappings into satisfiability problems are provided when indetermination is observed on both the pattern and the text, and results concerning the general problem complexity are presented as well, with the $μ$OPPM problem proved to be NP-hard in general.
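For orientation, the exact (determinate) OPPM problem mentioned above can be sketched with a naive matcher: a window of the text matches when its relative order of values is the same as the pattern's. This is a simple baseline for determinate strings only, not the $μ$OPPM algorithm from the paper; the function names are assumptions made for this example.

```python
def order_signature(seq):
    """Replace each value by its rank among the distinct values of seq."""
    ranks = {v: r for r, v in enumerate(sorted(set(seq)))}
    return tuple(ranks[v] for v in seq)


def oppm_naive(text, pattern):
    """Start positions of all windows of text that are order-isomorphic to pattern."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    target = order_signature(pattern)
    return [
        start
        for start in range(len(text) - m + 1)
        if order_signature(text[start:start + m]) == target
    ]


# Hypothetical time series and an "up then partly down" pattern.
matches = oppm_naive([10, 20, 15, 30, 25, 40], [1, 3, 2])
```

Each window is re-ranked from scratch, so this baseline takes $O(nm\lg m)$ time for a text of length $n$ and a pattern of length $m$.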
△ Less
Submitted 7 May, 2019;
originally announced May 2019.
-
Linking and Cutting Spanning Trees
Authors:
Luís M. S. Russo,
Andreia Sofia Teixeira,
Alexandre P Francisco
Abstract:
We consider the problem of uniformly generating a spanning tree of a connected undirected graph. This process is useful to compute statistics, namely for phylogenetic trees. We describe a Markov chain for producing these trees. For cycle graphs we prove that this approach significantly outperforms existing algorithms. For general graphs we obtain no analytical bounds, but experimental results show that the chain still converges quickly. This yields an efficient algorithm, also due to the use of proper fast data structures. To bound the mixing time of the chain we describe a coupling, which we analyse for cycle graphs and simulate for other graphs.
△ Less
Submitted 7 July, 2020; v1 submitted 21 January, 2018;
originally announced January 2018.
-
Cartesian trees and Lyndon trees
Authors:
Maxime Crochemore,
Luis M. S. Russo
Abstract:
The article describes the structural and algorithmic relations between Cartesian trees and Lyndon Trees. This leads to a uniform presentation of the Lyndon table of a word corresponding to the Next Nearest Smaller table of a sequence of numbers. It shows how to efficiently compute runs, that is, maximal periodicities occurring in a word.
Submitted 23 December, 2017;
originally announced December 2017.
-
A Study on Splay Trees
Authors:
Luís M. S. Russo
Abstract:
We study the dynamic optimality conjecture, which predicts that splay trees are a form of universally efficient binary search tree, for any access sequence. We reduce this claim to a regular access bound, which seems plausible and might be easier to prove. This approach may be useful to establish dynamic optimality.
Submitted 7 April, 2020; v1 submitted 10 November, 2015;
originally announced November 2015.
-
Quick HyperVolume
Authors:
Luís M. S. Russo,
Alexandre P. Francisco
Abstract:
We present a new algorithm to calculate exact hypervolumes. Given a set of $d$-dimensional points, it computes the hypervolume of the dominated space. Determining this value is an important subroutine of Multiobjective Evolutionary Algorithms (MOEAs). We analyze the "Quick Hypervolume" (QHV) algorithm theoretically and experimentally. The theoretical results are a significant contribution to the current state of the art. Moreover, the experimental performance is also very competitive, compared with existing exact hypervolume algorithms.
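To make "hypervolume of the dominated space" concrete, here is a minimal sketch for the two-dimensional case only, using a standard sort-and-sweep. It assumes maximization, strictly positive coordinates, and the origin as reference point; it is not the QHV algorithm, and the function name is an assumption made for this example.

```python
def hypervolume_2d(points):
    """Area of the union of boxes [0, x] x [0, y] over all given (x, y) points.

    Assumes every coordinate is positive and the reference point is (0, 0).
    """
    area = 0.0
    best_y = 0.0
    # Sweep from the largest x to the smallest; a point adds area only if it
    # is higher than every point already seen (which all have x >= its x).
    for x, y in sorted(points, key=lambda p: p[0], reverse=True):
        if y > best_y:
            area += x * (y - best_y)
            best_y = y
    return area


front = [(1, 3), (2, 2), (3, 1)]
volume = hypervolume_2d(front)
```

Dominated points contribute nothing to the result, so adding `(1, 1)` to `front` leaves the value unchanged.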
A full description of the algorithm is currently submitted to IEEE Transactions on Evolutionary Computation.
△ Less
Submitted 29 November, 2012; v1 submitted 19 July, 2012;
originally announced July 2012.
-
Space-Efficient Data-Analysis Queries on Grids
Authors:
Gonzalo Navarro,
Yakov Nekrich,
Luís M. S. Russo
Abstract:
We consider various data-analysis queries on two-dimensional points. We give new space/time tradeoffs over previous work on geometric queries such as dominance and rectangle visibility, and on semigroup and group queries such as sum, average, variance, minimum and maximum. We also introduce new solutions to queries less frequently considered in the literature such as two-dimensional quantiles, majorities, successor/predecessor, mode, and various top-$k$ queries, considering static and dynamic scenarios.
Submitted 31 March, 2012; v1 submitted 23 June, 2011;
originally announced June 2011.