-
Time-dependent Personalized PageRank for temporal networks: discrete and continuous scales
Authors:
David Aleja,
Julio Flores,
Eva Primo,
Miguel Romance
Abstract:
In this paper we explore the PageRank of temporal networks on both discrete and continuous time scales in the presence of personalization vectors that vary over time. Also the underlying interplay between the discrete and continuous settings arising from discretization is highlighted. Additionally, localization results that set bounds to the estimated influence of the personalization vector on the ranking of a particular node are given. The theoretical results are illustrated by means of some real and synthetic examples.
Submitted 20 June, 2024;
originally announced July 2024.
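As a rough illustration of the discrete-time setting described in the abstract above, the sketch below computes a personalized PageRank vector by power iteration for each snapshot of a temporal network, with a different personalization vector at each time step. It is not the authors' construction: the snapshot-by-snapshot treatment, the damping factor `alpha = 0.85`, the dangling-node rule, and the function names are all assumptions made for illustration.

```python
def personalized_pagerank(adj, v, alpha=0.85, tol=1e-12, max_iter=1000):
    """Power iteration for personalized PageRank on one static snapshot.

    adj[i][j] > 0 is the weight of edge i -> j; v is the personalization
    vector (any non-negative weights, normalized here to sum to 1).
    """
    n = len(adj)
    total = float(sum(v))
    v = [x / total for x in v]
    out_weight = [sum(row) for row in adj]
    rank = v[:]
    for _ in range(max_iter):
        # Teleportation term: jump to a node chosen according to v.
        new = [(1.0 - alpha) * v[j] for j in range(n)]
        for i in range(n):
            if out_weight[i] == 0:
                # Assumption: a dangling node spreads its mass according to v.
                for j in range(n):
                    new[j] += alpha * rank[i] * v[j]
            else:
                for j in range(n):
                    if adj[i][j]:
                        new[j] += alpha * rank[i] * adj[i][j] / out_weight[i]
        if sum(abs(a - b) for a, b in zip(new, rank)) < tol:
            return new
        rank = new
    return rank


def time_dependent_pagerank(snapshots, vectors, alpha=0.85):
    """One PageRank vector per (snapshot, personalization vector) pair."""
    return [personalized_pagerank(a, v, alpha) for a, v in zip(snapshots, vectors)]


if __name__ == "__main__":
    # Hypothetical 3-node directed cycle observed at two time steps.
    cycle = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
    for step, rank in enumerate(
        time_dependent_pagerank([cycle, cycle], [[1, 1, 1], [1, 0, 0]])
    ):
        print(step, [round(x, 4) for x in rank])
```

With the same graph at both steps, only the personalization vector changes, so any difference between the two rankings is attributable to personalization alone.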
-
Applying ranking techniques for estimating influence of Earth variables on temperature forecast error
Authors:
M. Julia Flores,
Melissa Ruiz-Vásquez,
Ana Bastos,
René Orth
Abstract:
This paper describes how to analyze the influence of Earth system variables on the errors when providing temperature forecasts. The initial framework to get the data has been based on previous research work, which resulted in a very interesting discovery. However, the aforementioned study only worked on individual correlations of the variables with respect to the error. This research work is going to re-use the main ideas but introduce three main novelties: (1) applying a data science approach by a few representative locations; (2) taking advantage of the rankings created by Spearman correlation but enriching them with other metrics looking for a more robust ranking of the variables; (3) evaluation of the methodology by learning random forest models for regression with the distinct experimental variations. The main contribution is the framework that shows how to convert correlations into rankings and combine them into an aggregate ranking. We have carried out experiments on five chosen locations to analyze the behavior of this ranking-based methodology. The results show that the specific performance is dependent on the location and season, which is expected, and that this selection technique works properly with Random Forest models but can also improve simpler regression models such as Bayesian Ridge. This work also contributes with an extensive analysis of the results. We can conclude that this selection based on the top-k ranked variables seems promising for this real problem, and it could also be applied in other domains.
Submitted 12 March, 2024;
originally announced March 2024.
-
On the Benefits of Fine-Grained Loss Truncation: A Case Study on Factuality in Summarization
Authors:
Lorenzo Jaime Yu Flores,
Arman Cohan
Abstract:
Text summarization and simplification are among the most widely used applications of AI. However, models developed for such tasks are often prone to hallucination, which can result from training on unaligned data. One efficient approach to address this issue is Loss Truncation (LT) (Kang and Hashimoto, 2020), an approach to modify the standard log loss to adaptively remove noisy examples during training. However, we find that LT alone yields a considerable number of hallucinated entities on various datasets. We study the behavior of the underlying losses between factual and non-factual examples, to understand and refine the performance of LT. We demonstrate that LT's performance is limited when the underlying assumption that noisy targets have higher NLL loss is not satisfied, and find that word-level NLL among entities provides a better signal for distinguishing factuality. We then leverage this to propose a fine-grained NLL loss and fine-grained data cleaning strategies, and observe improvements in hallucination reduction across some datasets. Our work is available at https://github.com/yale-nlp/fine-grained-lt.
Submitted 8 March, 2024;
originally announced March 2024.
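The core idea of Loss Truncation mentioned in the abstract above, dropping the highest-loss examples as presumed noise, can be sketched in a few lines. This is a simplified batch-level illustration, not the implementation from Kang and Hashimoto (2020) or from this paper; the function name, the fixed `drop_fraction`, and the sample losses are hypothetical.

```python
import math


def truncated_mean_loss(losses, drop_fraction=0.25):
    """Average the per-example losses after discarding the largest ones.

    Assumption: the highest-loss examples in the batch are the noisy ones.
    The abstract above reports that this assumption does not always hold.
    """
    if not losses:
        raise ValueError("losses must be non-empty")
    if not 0.0 <= drop_fraction < 1.0:
        raise ValueError("drop_fraction must be in [0, 1)")
    n_drop = math.floor(drop_fraction * len(losses))
    kept = sorted(losses)[: len(losses) - n_drop]
    return sum(kept) / len(kept)


if __name__ == "__main__":
    # Hypothetical per-example NLL values; the last one is an outlier.
    batch = [0.2, 0.4, 0.3, 5.0]
    print(truncated_mean_loss(batch, drop_fraction=0.25))
    print(truncated_mean_loss(batch, drop_fraction=0.0))
```

A fine-grained variant in the spirit of the abstract would compute the loss over entity tokens only before ranking examples, which requires token-level losses and an entity tagger and is not shown here.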
-
Predicting Machine Translation Performance on Low-Resource Languages: The Role of Domain Similarity
Authors:
Eric Khiu,
Hasti Toossi,
David Anugraha,
**yu Liu,
Jiaxu Li,
Juan Armando Parra Flores,
Leandro Acros Roman,
A. Seza Doğruöz,
En-Shiun Annie Lee
Abstract:
Fine-tuning and testing a multilingual large language model is expensive and challenging for low-resource languages (LRLs). While previous studies have predicted the performance of natural language processing (NLP) tasks using machine learning methods, they primarily focus on high-resource languages, overlooking LRLs and shifts across domains. Focusing on LRLs, we investigate three factors: the size of the fine-tuning corpus, the domain similarity between fine-tuning and testing corpora, and the language similarity between source and target languages. We employ classical regression models to assess how these factors impact the model's performance. Our results indicate that domain similarity has the most critical impact on predicting the performance of Machine Translation models.
Submitted 4 February, 2024;
originally announced February 2024.
-
Medical Text Simplification: Optimizing for Readability with Unlikelihood Training and Reranked Beam Search Decoding
Authors:
Lorenzo Jaime Yu Flores,
Heyuan Huang,
Kejian Shi,
Sophie Chheang,
Arman Cohan
Abstract:
Text simplification has emerged as an increasingly useful application of AI for bridging the communication gap in specialized fields such as medicine, where the lexicon is often dominated by technical jargon and complex constructs. Despite notable progress, methods in medical simplification sometimes result in the generated text having lower quality and diversity. In this work, we explore ways to further improve the readability of text simplification in the medical domain. We propose (1) a new unlikelihood loss that encourages generation of simpler terms and (2) a reranked beam search decoding method that optimizes for simplicity, which achieve better performance on readability metrics on three datasets. This study's findings offer promising avenues for improving text simplification in the medical field.
Submitted 25 October, 2023; v1 submitted 17 October, 2023;
originally announced October 2023.
-
Gotta Catch 'em All: Aggregating CVSS Scores
Authors:
Angel Longueira-Romero,
Jose Luis Flores,
Rosa Iglesias,
Iñaki Garitano
Abstract:
Security metrics are not standardized, but international proposals such as the Common Vulnerability Scoring System (CVSS) for quantifying the severity of known vulnerabilities are widely used. Many CVSS aggregation mechanisms have been proposed in the literature. Nevertheless, factors related to the context of the System Under Test (SUT) are not taken into account in the aggregation process; vulnerabilities that in theory affect the SUT, but are not exploitable in reality. We propose a CVSS aggregation algorithm that integrates information about the functionality disruption of the SUT, exploitation difficulty, existence of exploits, and the context where the SUT operates. The aggregation algorithm was applied to OpenPLC V3, showing that it is capable of filtering out vulnerabilities that cannot be exploited in the real conditions of deployment of the particular system. Finally, because of the nature of the proposed algorithm, the result can be interpreted in the same way as a normal CVSS.
Submitted 3 October, 2023;
originally announced October 2023.
-
Measuring and Predicting the Quality of a Join for Data Discovery
Authors:
Sergi Nadal,
Raquel Panadero,
Javier Flores,
Oscar Romero
Abstract:
We study the problem of discovering joinable datasets at scale. We approach the problem from a learning perspective relying on profiles. These are succinct representations that capture the underlying characteristics of the schemata and data values of datasets, which can be efficiently extracted in a distributed and parallel fashion. Profiles are then compared to predict the quality of a join operation among a pair of attributes from different datasets. In contrast to the state of the art, we define a novel notion of join quality that relies on a metric considering both the containment and cardinality proportion between join candidate attributes. We implement our approach in a system called NextiaJD, and present experiments to show the predictive performance and computational efficiency of our method. Our experiments show that NextiaJD obtains greater predictive performance than hash-based methods while scaling up to larger volumes of data.
Submitted 31 May, 2023;
originally announced May 2023.
-
Clustered Federated Learning Architecture for Network Anomaly Detection in Large Scale Heterogeneous IoT Networks
Authors:
Xabier Sáez-de-Cámara,
Jose Luis Flores,
Cristóbal Arellano,
Aitor Urbieta,
Urko Zurutuza
Abstract:
There is a growing trend of cyberattacks against Internet of Things (IoT) devices; moreover, the sophistication and motivation of those attacks are increasing. The vast scale of IoT, its diverse hardware and software, and the typical placement of devices in uncontrolled environments make traditional IT security mechanisms such as signature-based intrusion detection and prevention systems challenging to integrate. They also struggle to cope with the rapidly evolving IoT threat landscape due to long delays between the analysis and publication of the detection rules. Machine learning methods have shown faster response to emerging threats; however, model training architectures like cloud or edge computing face multiple drawbacks in IoT settings, including network overhead and data isolation arising from the large scale and heterogeneity that characterize these networks.
This work presents an architecture for training unsupervised models for network intrusion detection in large, distributed IoT and Industrial IoT (IIoT) deployments. We leverage Federated Learning (FL) to collaboratively train between peers and reduce isolation and network overhead problems. We build upon it to include an unsupervised device clustering algorithm fully integrated into the FL pipeline to address the heterogeneity issues that arise in FL settings. The architecture is implemented and evaluated using a testbed that includes various emulated IoT/IIoT devices and attackers interacting in a complex network topology comprising 100 emulated devices, 30 switches and 10 routers. The anomaly detection models are evaluated on real attacks performed by the testbed's threat actors, including the entire Mirai malware lifecycle, an additional botnet based on the Merlin command and control server and other red-teaming tools performing scanning activities and multiple attacks targeting the emulated devices.
Submitted 27 July, 2023; v1 submitted 28 March, 2023;
originally announced March 2023.
-
LoFT: Enhancing Faithfulness and Diversity for Table-to-Text Generation via Logic Form Control
Authors:
Yilun Zhao,
Zhenting Qi,
Linyong Nan,
Lorenzo Jaime Yu Flores,
Dragomir Radev
Abstract:
Logical Table-to-Text (LT2T) generation is tasked with generating logically faithful sentences from tables. There currently exist two challenges in the field: 1) Faithfulness: how to generate sentences that are factually correct given the table content; 2) Diversity: how to generate multiple sentences that offer different perspectives on the table. This work proposes LoFT, which utilizes logic forms as fact verifiers and content planners to control LT2T generation. Experimental results on the LogicNLG dataset demonstrate that LoFT is the first model that addresses unfaithfulness and lack of diversity issues simultaneously. Our code is publicly available at https://github.com/Yale-LILY/LoFT.
Submitted 6 February, 2023;
originally announced February 2023.
-
Statistical analysis of word flow among five Indo-European languages
Authors:
Josué Ely Molina,
Jorge Flores,
Carlos Gershenson,
Carlos Pineda
Abstract:
A recent increase in data availability has made it possible to perform different statistical linguistic studies. Here we use the Google Books Ngram dataset to analyze word flow among English, French, German, Italian, and Spanish. We study what we define as ``migrant words'', a type of loanword that does not change its spelling. We quantify migrant words from one language to another for different decades, and notice that most migrant words can be aggregated in semantic fields and associated with historic events. We also study the statistical properties of accumulated migrant words and their rank dynamics. We propose a measure of use of migrant words that could be used as a proxy of cultural influence. Our methodology is not exempt from caveats, but our results encourage further studies in this direction.
Submitted 17 January, 2023;
originally announced January 2023.
-
AquaFeL-PSO: A Monitoring System for Water Resources using Autonomous Surface Vehicles based on Multimodal PSO and Federated Learning
Authors:
Micaela Jara Ten Kathen,
Princy Johnson,
Isabel Jurado Flores,
Daniel Gutiérrez Reina
Abstract:
The preservation, monitoring, and control of water resources has been a major challenge in recent decades. Water resources must be constantly monitored to know the contamination levels of water. To meet this objective, this paper proposes a water monitoring system using autonomous surface vehicles, equipped with water quality sensors, based on a multimodal particle swarm optimization and the federated learning technique, with a Gaussian process as a surrogate model: the AquaFeL-PSO algorithm. The proposed monitoring system has two phases, the exploration phase and the exploitation phase. In the exploration phase, the vehicles examine the surface of the water resource, and with the data acquired by the water quality sensors, a first water quality model is estimated in the central server. In the exploitation phase, the area is divided into action zones using the model estimated in the exploration phase for a better exploitation of the contamination zones. To obtain the final water quality model of the water resource, the models obtained in both phases are combined. The results demonstrate the efficiency of the proposed path planner in obtaining water quality models of the pollution zones, with a 14% improvement over the other path planners compared, and of the entire water resource, with a 400% better model, as well as in detecting pollution peaks, where the improvement in this case study is 4,000%. It was also shown that the results obtained by applying the federated learning technique are very similar to the results of a centralized system.
Submitted 28 November, 2022;
originally announced November 2022.
-
Look Ma, Only 400 Samples! Revisiting the Effectiveness of Automatic N-Gram Rule Generation for Spelling Normalization in Filipino
Authors:
Lorenzo Jaime Yu Flores,
Dragomir Radev
Abstract:
With 84.75 million Filipinos online, the ability for models to process online text is crucial for developing Filipino NLP applications. To this end, spelling correction is a crucial preprocessing step for downstream processing. However, the lack of data prevents the use of language models for this task. In this paper, we propose an N-Gram + Damerau Levenshtein distance model with automatic rule extraction. We train the model on 300 samples, and show that despite limited training data, it achieves good performance and outperforms other deep learning approaches in terms of accuracy and edit distance. Moreover, the model (1) requires little compute power, (2) trains in little time, thus allowing for retraining, and (3) is easily interpretable, allowing for direct troubleshooting, highlighting the success of traditional approaches over more complex deep learning models in settings where data is unavailable.
Submitted 5 November, 2022; v1 submitted 6 October, 2022;
originally announced October 2022.
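For readers unfamiliar with the distance named in the abstract above, here is a minimal sketch of the restricted (optimal string alignment) variant of Damerau-Levenshtein distance, followed by a nearest-candidate lookup. Which variant the authors use is not stated in the abstract, so the choice of variant is an assumption; the helper `closest` and the tiny vocabulary are hypothetical and do not reproduce the paper's N-gram rule extraction.

```python
def osa_distance(a, b):
    """Edit distance with insertion, deletion, substitution, and
    transposition of adjacent characters (optimal string alignment)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "ab" -> "ba".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def closest(word, vocabulary):
    """Return the vocabulary entry with the smallest distance to word."""
    return min(vocabulary, key=lambda w: osa_distance(word, w))


if __name__ == "__main__":
    vocab = ["salamat", "kumusta"]  # hypothetical target vocabulary
    print(closest("salmat", vocab), osa_distance("salmat", "salamat"))
```

The restricted variant never edits the same substring twice, so it can differ from the unrestricted Damerau-Levenshtein distance on some inputs; for short social-media tokens the difference rarely matters.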
-
Gotham Testbed: a Reproducible IoT Testbed for Security Experiments and Dataset Generation
Authors:
Xabier Sáez-de-Cámara,
Jose Luis Flores,
Cristóbal Arellano,
Aitor Urbieta,
Urko Zurutuza
Abstract:
The growing adoption of the Internet of Things (IoT) has brought a significant increase in attacks targeting those devices. Machine learning (ML) methods have shown promising results for intrusion detection; however, the scarcity of IoT datasets remains a limiting factor in developing ML-based security systems for IoT scenarios. Static datasets get outdated due to evolving IoT architectures and threat landscape; meanwhile, the testbeds used to generate them are rarely published. This paper presents the Gotham testbed, a reproducible and flexible security testbed extendable to accommodate new emulated devices, services or attackers. Gotham is used to build an IoT scenario composed of 100 emulated devices communicating via MQTT, CoAP and RTSP protocols, among others, in a topology composed of 30 switches and 10 routers. The scenario presents three threat actors, including the entire Mirai botnet lifecycle and additional red-teaming tools performing DoS, scanning, and attacks targeting IoT protocols. The testbed has many purposes, including a cyber range, testing security solutions, and capturing network and application data to generate datasets. We hope that researchers can leverage and adapt Gotham to include other devices, state-of-the-art attacks and topologies to share scenarios and datasets that reflect the current IoT settings and threat landscape.
Submitted 27 July, 2023; v1 submitted 28 July, 2022;
originally announced July 2022.
-
Open Set Classification of Untranscribed Handwritten Documents
Authors:
José Ramón Prieto,
Juan José Flores,
Enrique Vidal,
Alejandro H. Toselli,
David Garrido,
Carlos Alonso
Abstract:
Huge amounts of digital page images of important manuscripts are preserved in archives worldwide. The amounts are so large that it is generally unfeasible for archivists to adequately tag most of the documents with the required metadata so as to allow proper organization of the archives and effective exploration by scholars and the general public. The class or ``typology'' of a document is perhaps the most important tag to be included in the metadata. The technical problem is one of automatic classification of documents, each consisting of a set of untranscribed handwritten text images, by the textual contents of the images. The approach considered is based on ``probabilistic indexing'', a relatively novel technology which makes it possible to effectively represent the intrinsic word-level uncertainty exhibited by handwritten text images. We assess the performance of this approach on a large collection of complex notarial manuscripts from the Spanish Archivo Histórico Provincial de Cádiz, with promising results.
Submitted 20 June, 2022;
originally announced June 2022.
-
R2D2: Robust Data-to-Text with Replacement Detection
Authors:
Linyong Nan,
Lorenzo Jaime Yu Flores,
Yilun Zhao,
Yixin Liu,
Luke Benson,
Wei** Zou,
Dragomir Radev
Abstract:
Unfaithful text generation is a common problem for text generation systems. In the case of Data-to-Text (D2T) systems, the factuality of the generated text is particularly crucial for any real-world applications. We introduce R2D2, a training framework that addresses unfaithful Data-to-Text generation by training a system both as a generator and a faithfulness discriminator with additional replacement detection and unlikelihood learning tasks. To facilitate such training, we propose two methods for sampling unfaithful sentences. We argue that the poor entity retrieval capability of D2T systems is one of the primary sources of unfaithfulness, so in addition to the existing metrics, we further propose NER-based metrics to evaluate the fidelity of D2T generations. Our experimental results show that R2D2 systems could effectively mitigate the unfaithful text generation, and they achieve new state-of-the-art results on FeTaQA, LogicNLG, and ToTTo, all with significant improvements.
Submitted 24 May, 2022;
originally announced May 2022.
-
An Adversarial Benchmark for Fake News Detection Models
Authors:
Lorenzo Jaime Yu Flores,
Yiding Hao
Abstract:
With the proliferation of online misinformation, fake news detection has gained importance in the artificial intelligence community. In this paper, we propose an adversarial benchmark that tests the ability of fake news detectors to reason about real-world facts. We formulate adversarial attacks that target three aspects of "understanding": compositional semantics, lexical relations, and sensitivity to modifiers. We test our benchmark using BERT classifiers fine-tuned on the LIAR arXiv:arch-ive/1705648 and Kaggle Fake-News datasets, and show that both models fail to respond to changes in compositional and lexical meaning. Our results strengthen the need for such models to be used in conjunction with other fact checking methods.
Submitted 3 January, 2022;
originally announced January 2022.
-
A Novel Model for Vulnerability Analysis through Enhanced Directed Graphs and Quantitative Metrics
Authors:
Ángel Longueira-Romero,
Rosa Iglesias,
Jose Luis Flores,
Iñaki Garitano
Abstract:
Industrial components are of high importance because they control critical infrastructures that form the lifeline of modern societies. However, the rapid evolution of industrial components, together with the new paradigm of Industry 4.0, and the new connectivity features that will be introduced by the 5G technology, all increase the likelihood of security incidents. These incidents are caused by the vulnerabilities present in these devices. In addition, although international standards define tasks to assess vulnerabilities, they do not specify any particular method. Having a secure design is important, but is also complex, costly, and an extra factor to manage during the lifespan of the device. This paper presents a model to analyze the known vulnerabilities of industrial components over time. The proposed model is based on two main elements: a directed graph representation of the internal structure of the component, and a set of quantitative metrics that are based on international security standards, such as the Common Vulnerability Scoring System (CVSS). This model is applied throughout the entire lifespan of a device to track vulnerabilities, identify new requirements, root causes, and test cases. The proposed model also helps to prioritize patching activities. To test its potential, the proposed model is applied to the OpenPLC project. The results show that most of the root causes of these vulnerabilities are related to memory buffer operations and are concentrated in the libssl library. Consequently, new requirements and test cases were generated from the obtained data.
Submitted 13 December, 2021;
originally announced December 2021.
-
Keep It Unbiased: A Comparison Between Estimation of Distribution Algorithms and Deep Learning for Human Interaction-Free Side-Channel Analysis
Authors:
Unai Rioja,
Lejla Batina,
Igor Armendariz,
Jose Luis Flores
Abstract:
Evaluating side-channel analysis (SCA) security is a complex process, involving applying several techniques whose success depends on human engineering. Therefore, it is crucial to avoid a false sense of confidence provided by non-optimal (failing) attacks. Different alternatives have emerged lately trying to mitigate human dependency, among which deep learning (DL) attacks are the most studied today. DL promises to simplify the procedure by, e.g., removing the need for point-of-interest selection or bypassing noise and desynchronization, among other shortcuts. However, including DL in the equation comes at a price, since working with neural networks is not straightforward in this context. Recently, an alternative has appeared with the potential to mitigate this dependence without adding extra complexity: Estimation of Distribution Algorithm-based SCA. In this paper, we compare these two relevant methods, supporting our findings by experiments on various datasets.
Submitted 26 November, 2021;
originally announced November 2021.
-
Statistical Properties of Rankings in Sports and Games
Authors:
José Antonio Morales,
Jorge Flores,
Carlos Gershenson,
Carlos Pineda
Abstract:
Any collection can be ranked. Sports and games are common examples of ranked systems: players and teams are constantly ranked using different methods. The statistical properties of rankings have been studied for almost a century in a variety of fields. More recently, data availability has allowed us to study rank dynamics: how elements of a ranking change in time. Here, we study the rank distributions and rank dynamics of twelve datasets from different sports and games. To study rank dynamics, we consider measures we have defined previously: rank diversity, change probability, rank entropy, and rank complexity. We also introduce a new measure that we call "system closure" that reflects how many elements enter or leave the rankings in time. We use a random walk model to reproduce the observed rank dynamics, showing that a simple mechanism can generate statistical properties similar to those observed in the datasets. Our results show that, while rank distributions vary considerably for different rankings, rank dynamics have similar behaviors, independently of the nature and competitiveness of the sport or game and its ranking method. Our results also suggest that our measures of rank dynamics are general and applicable for complex systems of different natures.
Submitted 5 November, 2021;
originally announced November 2021.
-
GeSERA: General-domain Summary Evaluation by Relevance Analysis
Authors:
Jessica López Espejel,
Gaël de Chalendar,
Jorge Garcia Flores,
Thierry Charnois,
Ivan Vladimir Meza Ruiz
Abstract:
We present GeSERA, an open-source improved version of SERA for evaluating automatic extractive and abstractive summaries from the general domain. SERA is based on a search engine that compares candidate and reference summaries (called queries) against an information retrieval document base (called index). SERA was originally designed for the biomedical domain only, where it showed a better correlation with manual methods than the widely used lexical-based ROUGE method. In this paper, we take SERA out of the biomedical domain and into the general one by adapting its content-based method to successfully evaluate summaries from the general domain. First, we improve the query reformulation strategy with POS Tags analysis of general-domain corpora. Second, we replace the biomedical index used in SERA with two article collections from AQUAINT-2 and Wikipedia. We conduct experiments with TAC2008, TAC2009, and CNNDM datasets. Results show that, in most cases, GeSERA achieves higher correlations with manual evaluation methods than SERA, while it reduces its gap with ROUGE for general-domain summary evaluation. GeSERA even surpasses ROUGE in two cases of TAC2009. Finally, we conduct extensive experiments and provide a comprehensive study of the impact of human annotators and the index size on summary evaluation with SERA and GeSERA.
Submitted 7 October, 2021;
originally announced October 2021.
-
Auto-tune POIs: Estimation of distribution algorithms for efficient side-channel analysis
Authors:
Unai Rioja,
Lejla Batina,
Jose Luis Flores,
Igor Armendariz
Abstract:
Due to the constant increase and versatility of IoT devices that should keep sensitive information private, Side-Channel Analysis (SCA) attacks on embedded devices are gaining visibility in the industrial field. The integration and validation of countermeasures against SCA can be an expensive and cumbersome process, especially for less experienced evaluators, and current certification procedures require attacking the devices under test using multiple SCA techniques and attack vectors, often implying a high degree of complexity. The goal of this paper is to ease one of the most crucial and tedious steps of profiling attacks, i.e., the point of interest (POI) selection, and hence assist the SCA evaluation process. To this end, we introduce the usage of Estimation of Distribution Algorithms (EDAs) in the SCA field in order to automatically tune the point of interest selection. We showcase our approach on several experimental use cases, including attacks on unprotected and protected AES implementations over distinct copies of the same device, dismissing in this way the portability issue.
Submitted 20 January, 2021; v1 submitted 24 December, 2020;
originally announced December 2020.
-
Scalable Data Discovery Using Profiles
Authors:
Javier Flores,
Sergi Nadal,
Oscar Romero
Abstract:
We study the problem of discovering joinable datasets at scale. That is, how to automatically discover pairs of attributes in a massive collection of independent, heterogeneous datasets that can be joined. Exact (e.g., based on distinct values) and hash-based (e.g., based on locality-sensitive hashing) techniques require indexing the entire dataset, which is unattainable at scale. To overcome this issue, we approach the problem from a learning perspective relying on profiles. These are succinct representations that capture the underlying characteristics of the schemata and data values of datasets, which can be efficiently extracted in a distributed and parallel fashion. Profiles are then compared to predict the quality of a join operation among a pair of attributes from different datasets. In contrast to the state-of-the-art, we define a novel notion of join quality that relies on a metric considering both the containment and cardinality proportions between candidate attributes. We implement our approach in a system called NextiaJD, and present extensive experiments to show the predictive performance and computational efficiency of our method. Our experiments show that NextiaJD obtains similar predictive performance to that of hash-based methods, yet we are able to scale up to larger volumes of data. Also, NextiaJD generates considerably fewer false positives, which is a desirable feature at scale.
Submitted 3 December, 2020; v1 submitted 1 December, 2020;
originally announced December 2020.
-
Audience and Streamer Participation at Scale on Twitch
Authors:
Claudia Flores-Saviaga,
Jessica Hammer,
Juan Pablo Flores,
Joseph Seering,
Stuart Reeves,
Saiph Savage
Abstract:
Large-scale streaming platforms such as Twitch are becoming increasingly popular, but detailed audience-streamer interaction dynamics remain unexplored at scale. In this paper, we perform a mixed-methods study on a dataset with over 12 million audience chat messages and 45 hours of streaming video to understand audience participation and streamer performance on Twitch. We uncover five types of streams based on size and audience participation styles: Clique Streams, small streams with close streamer-audience interactions; Rising Streamers, mid-range streams using custom technology and moderators to formalize their communities; Chatter-boxes, mid-range streams with established conversational dynamics; Spotlight Streamers, large streams that engage large numbers of viewers while still retaining a sense of community; and Professionals, massive streams with stadium-style audiences. We discuss challenges and opportunities emerging for streamers and audiences from each style and conclude by providing data-backed design implications that empower streamers, audiences, live streaming platforms, and game designers.
Submitted 30 November, 2020;
originally announced December 2020.
-
Interpretable Poverty Mapping using Social Media Data, Satellite Images, and Geospatial Information
Authors:
Chiara Ledesma,
Oshean Lee Garonita,
Lorenzo Jaime Flores,
Isabelle Tingzon,
Danielle Dalisay
Abstract:
Access to accurate, granular, and up-to-date poverty data is essential for humanitarian organizations to identify vulnerable areas for poverty alleviation efforts. Recent works have shown success in combining computer vision and satellite imagery for poverty estimation; however, the cost of acquiring high-resolution images coupled with black box models can be a barrier to adoption for many development organizations. In this study, we present an interpretable and cost-efficient approach to poverty estimation using machine learning and readily accessible data sources including social media data, low-resolution satellite images, and volunteered geographic information. Using our method, we achieve an $R^2$ of 0.66 for wealth estimation in the Philippines, compared to 0.63 using satellite imagery. Finally, we use feature importance analysis to identify the highest contributing features both globally and locally to help decision makers gain deeper insights into poverty.
Submitted 27 November, 2020;
originally announced November 2020.
-
Machine learning for music genre: multifaceted review and experimentation with audioset
Authors:
Jaime Ramírez,
M. Julia Flores
Abstract:
Music genre classification is one of the sub-disciplines of music information retrieval (MIR) with growing popularity among researchers, mainly due to its still-open challenges. Although research has been prolific in terms of number of published works, the topic still suffers from a problem in its foundations: there is no clear and formal definition of what genre is. Music categorizations are vague and unclear, suffering from human subjectivity and lack of agreement. In its first part, this paper offers a survey trying to cover the many different aspects of the matter. Its main goal is to give the reader an overview of the history and the current state-of-the-art, exploring techniques and datasets used to date, as well as identifying current challenges, such as this ambiguity of genre definitions or the introduction of human-centric approaches. The paper pays special attention to new trends in machine learning applied to the music annotation problem. Finally, we also include a music genre classification experiment that compares different machine learning models using Audioset.
Submitted 28 November, 2019;
originally announced November 2019.
-
Fast Convolutional Dictionary Learning off the Grid
Authors:
Andrew H. Song,
Francisco J. Flores,
Demba Ba
Abstract:
Given a continuous-time signal that can be modeled as the superposition of localized, time-shifted events from multiple sources, the goal of Convolutional Dictionary Learning (CDL) is to identify the location of the events--by Convolutional Sparse Coding (CSC)--and learn the template for each source--by Convolutional Dictionary Update (CDU). In practice, because we observe samples of the continuous-time signal on a uniformly-sampled grid in discrete time, classical CSC methods can only produce estimates of the times when the events occur on this grid, which degrades the performance of the CDU. We introduce a CDL framework that significantly reduces the errors arising from performing the estimation in discrete time. Specifically, we construct an expanded dictionary that comprises not only discrete-time shifts of the templates, but also interpolated variants, obtained by bandlimited interpolation, that account for continuous-time shifts. For CSC, we develop a novel computationally efficient CSC algorithm, termed Convolutional Orthogonal Matching Pursuit with interpolated dictionary (COMP-INTERP). We benchmark COMP-INTERP against Continuous Basis Pursuit (CBP), the state-of-the-art CSC algorithm for estimating off-the-grid events, and demonstrate, on simulated data, that 1) COMP-INTERP achieves a similar level of accuracy, and 2) is two orders of magnitude faster. For CDU, we derive a novel procedure to update the templates given sparse codes that can occur both on and off the discrete-time grid. We also show that 3) dictionary update with the overcomplete dictionary yields more accurate templates. Finally, we apply the algorithms to the spike sorting problem on electrophysiology recording and show their competitive performance.
Submitted 21 July, 2019;
originally announced July 2019.
-
Cellular morphogenesis of three-dimensional tensegrity structures
Authors:
Omar Aloui,
Jessica Flores,
David Orden,
Landolf Rhode-Barbarigos
Abstract:
The topology and form finding of tensegrity structures have been studied extensively since the introduction of the tensegrity concept. However, most of these studies address topology and form separately, where the former represented a research focus of rigidity theory and graph theory, while the latter attracted the attention of structural engineers. In this paper, a biomimetic approach for the combined topology and form finding of spatial tensegrity systems is introduced. Tensegrity cells, elementary infinitesimally rigid self-stressed structures that have been proven to compose any tensegrity, are used to generate more complex tensegrity structures through the morphogenesis mechanisms of adhesion and fusion. A methodology for constructing a basis to describe the self-stress space is also provided. Through the definition of self-stress, the cellular morphogenesis method can integrate design considerations, such as a desired shape or number of nodes and members, providing great flexibility and control over the tensegrity structure generated.
Submitted 15 February, 2019;
originally announced February 2019.
-
Fair Algorithms for Clustering
Authors:
Suman K. Bera,
Deeparnab Chakrabarty,
Nicolas J. Flores,
Maryam Negahbani
Abstract:
We study the problem of finding low-cost Fair Clusterings in data where each data point may belong to many protected groups. Our work significantly generalizes the seminal work of Chierichetti et al. (NIPS 2017) as follows.
- We allow the user to specify the parameters that define fair representation. More precisely, these parameters define the maximum over- and minimum under-representation of any group in any cluster.
- Our clustering algorithm works on any $\ell_p$-norm objective (e.g. $k$-means, $k$-median, and $k$-center). Indeed, our algorithm transforms any vanilla clustering solution into a fair one incurring only a slight loss in quality.
- Our algorithm also allows individuals to lie in multiple protected groups. In other words, we do not need the protected groups to partition the data and we can maintain fairness across different groups simultaneously.
Our experiments show that on established data sets, our algorithm performs much better in practice than what our theoretical results suggest.
Submitted 17 June, 2019; v1 submitted 8 January, 2019;
originally announced January 2019.
-
Diversity, Topology, and the Risk of Node Re-identification in Labeled Social Graphs
Authors:
Sameera Horawalavithana,
Clayton Gandy,
Juan Arroyo Flores,
John Skvoretz,
Adriana Iamnitchi
Abstract:
Real network datasets provide significant benefits for understanding phenomena such as information diffusion or network evolution. Yet the privacy risks raised by sharing real graph datasets, even when stripped of user identity information, are significant. When nodes have associated attributes, the privacy risks increase. In this paper we quantitatively study the impact of binary node attributes on node privacy by employing machine-learning-based re-identification attacks and exploring the interplay between graph topology and attribute placement. Our experiments show that the population's diversity on the binary attribute consistently degrades anonymity.
Submitted 31 August, 2018;
originally announced August 2018.
-
A Fuzzy Control System for Inductive Video Games
Authors:
Carlos Lara-Alvarez,
Hugo Mitre-Hernandez,
Juan Flores,
Maria Fuentes
Abstract:
It has been shown that the emotional state of students has an important relationship with learning; for instance, engaged concentration is positively correlated with learning. This paper proposes the Inductive Control (IC) for educational games. Unlike conventional approaches that only modify the game level, the proposed technique also induces emotions in the player for supporting the learning process. This paper explores a fuzzy system that analyzes the players' performance and their emotional state for controlling the level and aesthetic content of an educational video game. The emotional state of the player is recognized through voice analysis. A total of 20 subjects played a video game designed to practice basic math skills; for each trial, a student played the same game twice in a row, each time controlled by one of the two approaches, Dynamic Difficulty Adjustment (DDA) or IC, with the playing order assigned randomly. Results show that, when the proposed approach is used, the participants moved faster from unpleasant-low emotions to pleasant or high emotions, and smoothly reached and stayed in the flow zone. These experiments demonstrate that the inductive control technique improves the learning effectiveness through detection and stimulation of positive emotions.
Submitted 15 April, 2018; v1 submitted 4 September, 2017;
originally announced September 2017.
-
Rank diversity of languages: Generic behavior in computational linguistics
Authors:
Germinal Cocho,
Jorge Flores,
Carlos Gershenson,
Carlos Pineda,
Sergio Sánchez
Abstract:
Statistical studies of languages have focused on the rank-frequency distribution of words. Instead, we introduce here a measure of how word ranks change in time and call this distribution rank diversity. We calculate this diversity for books published in six European languages since 1800, and find that it follows a universal lognormal distribution. Based on the mean and standard deviation associated with the lognormal distribution, we define three different word regimes of languages: "heads" consist of words which almost do not change their rank in time, "bodies" are words of general use, while "tails" are composed of context-specific words and vary their rank considerably in time. The heads and bodies reflect the size of language cores identified by linguists for basic communication. We propose a Gaussian random walk model which reproduces the rank variation of words in time and thus the diversity. Rank diversity of words can be understood as the result of random variations in rank, where the size of the variation depends on the rank itself. We find that the core size is similar for all languages studied.
Submitted 14 May, 2015;
originally announced May 2015.
-
Eigenvector centrality of nodes in multiplex networks
Authors:
Luis Sola,
Miguel Romance,
Regino Criado,
Julio Flores,
Alejandro Garcia del Amo,
Stefano Boccaletti
Abstract:
We extend the concept of eigenvector centrality to multiplex networks, and introduce several alternative parameters that quantify the importance of nodes in a multi-layered networked system, including the definition of vectorial-type centralities. In addition, we rigorously show that, under reasonable conditions, such centrality measures exist and are unique. Computer experiments and simulations demonstrate that the proposed measures provide substantially different results when applied to the same multiplex structure, and highlight the non-trivial relationships between the different measures of centrality introduced.
Submitted 4 September, 2013; v1 submitted 31 May, 2013;
originally announced May 2013.
-
Incremental Compilation of Bayesian networks
Authors:
Julia M. Flores,
Jose A. Gamez,
Kristian G. Olesen
Abstract:
Most methods of exact probability propagation in Bayesian networks do not carry out the inference directly over the network, but over a secondary structure known as a junction tree or a join tree (JT). The process of obtaining a JT is usually termed compilation. As compilation is usually viewed as a whole process, each time the network is modified, a new compilation process has to be carried out. The possibility of reusing an already existing JT, in order to obtain the new one regarding only the modifications in the network, has received little attention in the literature. In this paper we present a method for incremental compilation of a Bayesian network, following the classical scheme in which triangulation plays the key role. In order to perform incremental compilation we propose to recompile only those parts of the JT which may have been affected by the network's modifications. To do so, we exploit the technique of maximal prime subgraph decomposition in determining the minimal subgraph(s) that have to be recompiled, and thereby the minimal subtree(s) of the JT that should be replaced by new subtree(s). We focus on structural modifications: addition and deletion of links and variables.
Submitted 19 October, 2012;
originally announced December 2012.
-
A mathematical model for networks with structures in the mesoscale
Authors:
Regino Criado,
Julio Flores,
Alejandro García del Amo,
Jesús Gómez-Gardeñes,
Miguel Romance
Abstract:
The new concept of multilevel network is introduced in order to embody some topological properties of complex systems with structures in the mesoscale which are not completely captured by the classical models. This new model, which generalizes the hyper-network and hyper-structure models, fits perfectly with several real-life complex systems, including social and public transportation networks. We present an analysis of the structural properties of the multilevel network, including the clustering and the metric structures. Some analytical relationships between the efficiency and clustering coefficient of this new model and the corresponding parameters of the underlying network are obtained. Finally, some random models for multilevel networks are given to illustrate how different multilevel structures can produce similar underlying networks, and therefore that the mesoscale structure should be taken into account in many applications.
Submitted 15 December, 2010;
originally announced December 2010.