-
Evaluating LLP Methods: Challenges and Approaches
Authors:
Gabriel Franco,
Giovanni Comarela,
Mark Crovella
Abstract:
Learning from Label Proportions (LLP) is an established machine learning problem with numerous real-world applications. In this setting, data items are grouped into bags, and the goal is to learn individual item labels, knowing only the features of the data and the proportions of labels in each bag. Although LLP is a well-established problem, it has several unusual aspects that create challenges for benchmarking learning methods. Fundamental complications arise because of the existence of different LLP variants, i.e., dependence structures that can exist between items, labels, and bags. Accordingly, the first algorithmic challenge is the generation of variant-specific datasets capturing the diversity of dependence structures and bag characteristics. The second methodological challenge is model selection, i.e., hyperparameter tuning; due to the nature of LLP, model selection cannot easily use the standard machine learning paradigm. The final benchmarking challenge consists of properly evaluating LLP solution methods across various LLP variants. We note that there is very little consideration of these issues in prior work, and there are no general solutions for these challenges proposed to date. To address these challenges, we develop methods capable of generating LLP datasets meeting the requirements of different variants. We use these methods to generate a collection of datasets encompassing the spectrum of LLP problem characteristics, which can be used in future evaluation studies. Additionally, we develop guidelines for benchmarking LLP algorithms, including the model selection and evaluation steps. Finally, we illustrate the new methods and guidelines by performing an extensive benchmark of a set of well-known LLP algorithms. We show that choosing the best algorithm depends critically on the LLP variant and model selection method, demonstrating the need for our proposed approach.
Submitted 29 October, 2023;
originally announced October 2023.
-
Machine Learning-based Early Attack Detection Using Open RAN Intelligent Controller
Authors:
Bruno Missi Xavier,
Merim Dzaferagic,
Diarmuid Collins,
Giovanni Comarela,
Magnos Martinello,
Marco Ruffini
Abstract:
We design and demonstrate a method for early detection of Denial-of-Service attacks. The proposed approach takes advantage of the OpenRAN framework to collect measurements from the air interface (for attack detection) and to dynamically control the operation of the Radio Access Network (RAN). For that purpose, we developed our near-Real Time (RT) RAN Intelligent Controller (RIC) interface. We apply and analyze a wide range of Machine Learning algorithms to data traffic analysis that satisfy the accuracy and latency requirements set by the near-RT RIC. Our results show that the proposed framework is able to correctly classify genuine vs. malicious traffic with high accuracy (i.e., 95%) in a realistic testbed environment, allowing us to detect attacks already at the Distributed Unit (DU), before malicious traffic even enters the Centralized Unit (CU).
Submitted 3 February, 2023;
originally announced February 2023.
-
Tracking Knowledge Propagation Across Wikipedia Languages
Authors:
Roldolfo Valentim,
Giovanni Comarela,
Souneil Park,
Diego Saez-Trumper
Abstract:
In this paper, we present a dataset of inter-language knowledge propagation in Wikipedia. Covering all 309 language editions and 33M articles, the dataset aims to track the full propagation history of Wikipedia concepts and to enable follow-up research on building predictive models of that propagation. For this purpose, we align all the Wikipedia articles in a language-agnostic manner according to the concept they cover, which results in 13M propagation instances. To the best of our knowledge, this dataset is the first to explore the full inter-language propagation at a large scale. Together with the dataset, a holistic overview of the propagation and key insights about the underlying structural factors are provided to aid future research. For example, we find that although long cascades are unusual, the propagation tends to continue further once it reaches more than four language editions. We also find that the size of language editions is associated with the speed of propagation. We believe the dataset not only contributes to the prior literature on Wikipedia growth but also enables new use cases such as edit recommendation for addressing knowledge gaps, detection of disinformation, and cultural relationship analysis.
Submitted 30 March, 2021;
originally announced March 2021.