
TextGrad: Automatic "Differentiation" via Text

Authors: Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, James Zou

Abstract: AI is undergoing a paradigm shift, with breakthroughs achieved by systems orchestrating multiple large language models (LLMs) and other complex components. As a result, developing principled and automated optimization methods for compound AI systems is one of the most important new challenges. Neural networks faced a similar challenge in their early days until backpropagation and automatic differentiation transformed the field by making optimization turn-key. Inspired by this, we introduce TextGrad, a powerful framework performing automatic "differentiation" via text. TextGrad backpropagates textual feedback provided by LLMs to improve individual components of a compound AI system. In our framework, LLMs provide rich, general, natural language suggestions to optimize variables in computation graphs, ranging from code snippets to molecular structures. TextGrad follows PyTorch's syntax and abstraction and is flexible and easy to use. It works out of the box for a variety of tasks, where users only provide the objective function without tuning components or prompts of the framework. We showcase TextGrad's effectiveness and generality across a diverse range of applications, from question answering and molecule optimization to radiotherapy treatment planning. Without modifying the framework, TextGrad improves the zero-shot accuracy of GPT-4o in Google-Proof Question Answering from 51% to 55%, yields a 20% relative performance gain in optimizing LeetCode-Hard coding problem solutions, improves prompts for reasoning, designs new druglike small molecules with desirable in silico binding, and designs radiation oncology treatment plans with high specificity. TextGrad lays a foundation to accelerate the development of the next generation of AI systems.

Submitted 11 June, 2024; originally announced June 2024.

Comments: 41 pages, 6 figures

arXiv:2402.13926

Large Language Models are Vulnerable to Bait-and-Switch Attacks for Generating Harmful Content

Authors: Federico Bianchi, James Zou

Abstract: The risks derived from large language models (LLMs) generating deceptive and damaging content have been the subject of considerable research, but even safe generations can lead to problematic downstream impacts. In our study, we shift the focus to how even safe text coming from LLMs can be easily turned into potentially dangerous content through Bait-and-Switch attacks. In such attacks, the user first prompts LLMs with safe questions and then employs a simple find-and-replace post-hoc technique to manipulate the outputs into harmful narratives. The alarming efficacy of this approach in generating toxic content highlights a significant challenge in developing reliable safety guardrails for LLMs. In particular, we stress that focusing on the safety of the verbatim LLM outputs is insufficient and that we also need to consider post-hoc transformations.

Submitted 21 February, 2024; originally announced February 2024.

arXiv:2402.10634

Graph-based Forecasting with Missing Data through Spatiotemporal Downsampling

Authors: Ivan Marisca, Cesare Alippi, Filippo Maria Bianchi

Abstract: Given a set of synchronous time series, each associated with a sensor-point in space and characterized by inter-series relationships, the problem of spatiotemporal forecasting consists of predicting future observations for each point. Spatiotemporal graph neural networks achieve striking results by representing the relationships across time series as a graph. Nonetheless, most existing methods rely on the often unrealistic assumption that inputs are always available and fail to capture hidden spatiotemporal dynamics when part of the data is missing. In this work, we tackle this problem through hierarchical spatiotemporal downsampling. The input time series are progressively coarsened over time and space, obtaining a pool of representations that capture heterogeneous temporal and spatial dynamics. Conditioned on observations and missing data patterns, such representations are combined by an interpretable attention mechanism to generate the forecasts. Our approach outperforms state-of-the-art methods on synthetic and real-world benchmarks under different missing data distributions, particularly in the presence of contiguous blocks of missing values.

Submitted 8 June, 2024; v1 submitted 16 February, 2024; originally announced February 2024.

Comments: Accepted at ICML 2024

arXiv:2402.05863

How Well Can LLMs Negotiate? NegotiationArena Platform and Analysis

Authors: Federico Bianchi, Patrick John Chia, Mert Yuksekgonul, Jacopo Tagliabue, Dan Jurafsky, James Zou

Abstract: Negotiation is the basis of social interactions; humans negotiate everything from the price of cars to how to share common resources. With rapidly growing interest in using large language models (LLMs) to act as agents on behalf of human users, such LLM agents would also need to be able to negotiate. In this paper, we study how well LLMs can negotiate with each other. We develop NegotiationArena: a flexible framework for evaluating and probing the negotiation abilities of LLM agents. We implemented three types of scenarios in NegotiationArena to assess LLMs' behaviors in allocating shared resources (ultimatum games), aggregating resources (trading games), and buying and selling goods (price negotiations). Each scenario allows for multiple turns of flexible dialogue between LLM agents to allow for more complex negotiations. Interestingly, LLM agents can significantly boost their negotiation outcomes by employing certain behavioral tactics. For example, by pretending to be desolate and desperate, LLMs can improve their payoffs by 20% when negotiating against the standard GPT-4. We also quantify irrational negotiation behaviors exhibited by the LLM agents, many of which also appear in humans. Together, NegotiationArena offers a new environment to investigate LLM interactions, enabling new insights into LLMs' theory of mind, irrationality, and reasoning abilities.

Submitted 8 February, 2024; originally announced February 2024.

arXiv:2312.17221

doi 10.1145/3605098.3636154

Scalable and automated Evaluation of Blue Team cyber posture in Cyber Ranges

Authors: Federica Bianchi, Enrico Bassetti, Angelo Spognardi

Abstract: Cyber ranges are virtual training ranges that have emerged as indispensable environments for conducting secure exercises and simulating real or hypothetical scenarios. These complex computational infrastructures enable the simulation of attacks, facilitating the evaluation of defense tools and methodologies and the development of novel countermeasures against threats. One of the main challenges of cyber range scalability is exercise evaluation, which often requires the manual intervention of human operators, the White Team. This paper proposes a novel approach that uses Blue and Red Team reports and well-known databases to automate the evaluation and assessment of exercise outcomes, overcoming the limitations of existing assessment models. Our proposal encompasses evaluating various aspects and metrics, explicitly emphasizing Blue Teams' actions and strategies and allowing the automated generation of their cyber posture.

Submitted 28 December, 2023; originally announced December 2023.

arXiv:2309.11118

Vehicle-to-Grid and ancillary services: a profitability analysis under uncertainty

Authors: Federico Bianchi, Alessandro Falsone, Riccardo Vignali

Abstract: The rapid and massive diffusion of electric vehicles poses new challenges to the electric system, which must be able to supply these new loads, but at the same time opens up new opportunities thanks to the possible provision of ancillary services. Indeed, in the so-called Vehicle-to-Grid (V2G) set-up, the charging power can be modulated throughout the day so that a fleet of vehicles can absorb an excess of power from the grid or provide extra power during a shortage. To this end, many works in the literature focus on optimizing each vehicle's daily charging profile to offer the requested ancillary services while guaranteeing a charged battery for each vehicle at the end of the day. However, the size of the economic benefits related to the provision of ancillary services varies significantly with the modeling approaches, assumptions, and scenarios considered. In this paper, we propose a profitability analysis with reference to a recently proposed framework for V2G optimal operation in the presence of uncertainty. We provide necessary and sufficient conditions for profitability in a simplified case and show via simulation that they also hold in the general case.

Submitted 20 September, 2023; originally announced September 2023.

Comments: Accepted by IFAC for publication under a Creative Commons Licence CC-BY-NC-ND

arXiv:2309.07875

Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions

Authors: Federico Bianchi, Mirac Suzgun, Giuseppe Attanasio, Paul Röttger, Dan Jurafsky, Tatsunori Hashimoto, James Zou

Abstract: Training large language models to follow instructions makes them perform better on a wide range of tasks and generally become more helpful. However, a perfectly helpful model will follow even the most malicious instructions and readily generate harmful content. In this paper, we raise concerns over the safety of models that only emphasize helpfulness, not harmlessness, in their instruction-tuning. We show that several popular instruction-tuned models are highly unsafe. Moreover, we show that adding just 3% safety examples (a few hundred demonstrations) when fine-tuning a model like LLaMA can substantially improve its safety. Our safety-tuning does not make models significantly less capable or helpful as measured by standard benchmarks. However, we do find exaggerated safety behaviours, where too much safety-tuning makes models refuse perfectly safe prompts if they superficially resemble unsafe ones. As a whole, our results illustrate trade-offs in training LLMs to be helpful and training them to be safe.

Submitted 19 March, 2024; v1 submitted 14 September, 2023; originally announced September 2023.

arXiv:2308.12844

Probabilistic load forecasting with Reservoir Computing

Authors: Michele Guerra, Simone Scardapane, Filippo Maria Bianchi

Abstract: Some applications of deep learning require not only accurate results but also a quantification of the confidence in their predictions. The management of an electric power grid is one such case: to avoid risky scenarios, decision-makers need forecasts of, for example, power loads that are both precise and reliable. For this reason, point forecasts are not enough, and it is necessary to adopt methods that provide uncertainty quantification. This work focuses on reservoir computing (RC) as the core time series forecasting method, due to its computational efficiency and effectiveness in predicting time series. While the RC literature has mostly focused on point forecasting, this work explores the compatibility of some popular uncertainty quantification methods with the reservoir setting. Both Bayesian and deterministic approaches to uncertainty assessment are evaluated and compared in terms of prediction accuracy, computational efficiency, and reliability of the estimated uncertainty, based on a set of carefully chosen performance metrics.

Submitted 24 August, 2023; originally announced August 2023.

arXiv:2308.01263

XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models

Authors: Paul Röttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, Dirk Hovy

Abstract: Without proper safeguards, large language models will readily follow malicious instructions and generate toxic content. This risk motivates safety efforts such as red-teaming and large-scale feedback learning, which aim to make models both helpful and harmless. However, there is a tension between these two objectives, since harmlessness requires models to refuse to comply with unsafe prompts, and thus not be helpful. Recent anecdotal evidence suggests that some models may have struck a poor balance, so that even clearly safe prompts are refused if they use similar language to unsafe prompts or mention sensitive topics. In this paper, we introduce a new test suite called XSTest to identify such eXaggerated Safety behaviours in a systematic way. XSTest comprises 250 safe prompts across ten prompt types that well-calibrated models should not refuse to comply with, and 200 unsafe prompts as contrasts that models, for most applications, should refuse. We describe XSTest's creation and composition, and then use the test suite to highlight systematic failure modes in state-of-the-art language models as well as more general challenges in building safer language models.

Submitted 1 April, 2024; v1 submitted 2 August, 2023; originally announced August 2023.

Comments: Accepted at NAACL 2024 (Main Conference)

arXiv:2304.10621

E Pluribus Unum: Guidelines on Multi-Objective Evaluation of Recommender Systems

Authors: Patrick John Chia, Giuseppe Attanasio, Jacopo Tagliabue, Federico Bianchi, Ciro Greco, Gabriel de Souza P. Moreira, Davide Eynard, Fahd Husain

Abstract: Recommender Systems today are still mostly evaluated in terms of accuracy, with other aspects beyond the immediate relevance of recommendations, such as diversity, long-term user retention and fairness, often taking a back seat. Moreover, reconciling multiple performance perspectives is by definition indeterminate, presenting a stumbling block to those in the pursuit of rounded evaluation of Recommender Systems. EvalRS 2022 -- a data challenge designed around Multi-Objective Evaluation -- was a first practical endeavour, providing many insights into the requirements and challenges of balancing multiple objectives in evaluation. In this work, we reflect on EvalRS 2022 and expound upon crucial learnings to formulate a first-principles approach toward Multi-Objective model selection, and outline a set of guidelines for carrying out a Multi-Objective Evaluation challenge, with potential applicability to the problem of rounded evaluation of competing models in real-world deployments.

Submitted 20 April, 2023; originally announced April 2023.

Comments: 15 pages, under submission

arXiv:2304.07152

Combining Stochastic Explainers and Subgraph Neural Networks can Increase Expressivity and Interpretability

Authors: Indro Spinelli, Michele Guerra, Filippo Maria Bianchi, Simone Scardapane

Abstract: Subgraph-enhanced graph neural networks (SGNN) can increase the expressive power of the standard message-passing framework. This model family represents each graph as a collection of subgraphs, generally extracted by random sampling or with hand-crafted heuristics. Our key observation is that by selecting "meaningful" subgraphs, besides improving the expressivity of a GNN, it is also possible to obtain interpretable results. For this purpose, we introduce a novel framework that jointly predicts the class of the graph and a set of explanatory sparse subgraphs, which can be analyzed to understand the decision process of the classifier. We compare the performance of our framework against standard subgraph extraction policies, like random node/edge deletion strategies. The subgraphs produced by our framework achieve comparable accuracy, with the additional benefit of providing explanations.

Submitted 14 April, 2023; originally announced April 2023.

arXiv:2304.07145

EvalRS 2023. Well-Rounded Recommender Systems For Real-World Deployments

Authors: Federico Bianchi, Patrick John Chia, Ciro Greco, Claudio Pomo, Gabriel Moreira, Davide Eynard, Fahd Husain, Jacopo Tagliabue

Abstract: EvalRS aims to bring together practitioners from industry and academia to foster a debate on rounded evaluation of recommender systems, with a focus on real-world impact across a multitude of deployment scenarios. Recommender systems are often evaluated only through accuracy metrics, which fall short of fully characterizing their generalization capabilities and miss important aspects such as fairness, bias, usefulness, and informativeness. This workshop builds on the success of last year's workshop at CIKM, but with a broader scope and an interactive format.

Submitted 22 July, 2023; v1 submitted 14 April, 2023; originally announced April 2023.

Comments: EvalRS 2023 is a workshop at KDD23. Code and hackathon materials: https://github.com/RecList/evalRS-KDD-2023

arXiv:2304.01575

The expressive power of pooling in Graph Neural Networks

Authors: Filippo Maria Bianchi, Veronica Lachi

Abstract: In Graph Neural Networks (GNNs), hierarchical pooling operators generate local summaries of the data by coarsening the graph structure and the vertex features. While considerable attention has been devoted to analyzing the expressive power of message-passing (MP) layers in GNNs, a study on how graph pooling affects the expressiveness of a GNN is still lacking. Additionally, despite the recent advances in the design of pooling operators, there is no principled criterion to compare them. In this work, we derive sufficient conditions for a pooling operator to fully preserve the expressive power of the MP layers before it. These conditions serve as a universal and theoretically grounded criterion for choosing among existing pooling operators or designing new ones. Based on our theoretical findings, we analyze several existing pooling operators and identify those that fail to satisfy the expressiveness conditions. Finally, we introduce an experimental setup to verify empirically the expressive power of a GNN equipped with pooling layers, in terms of its capability to perform a graph isomorphism test.

Submitted 12 October, 2023; v1 submitted 4 April, 2023; originally announced April 2023.

arXiv:2212.09056

Beyond Digital "Echo Chambers": The Role of Viewpoint Diversity in Political Discussion

Authors: Rishav Hada, Amir Ebrahimi Fard, Sarah Shugars, Federico Bianchi, Patricia Rossini, Dirk Hovy, Rebekah Tromble, Nava Tintarev

Abstract: Increasingly taking place in online spaces, modern political conversations are typically perceived to be unproductively affirming -- siloed in so-called "echo chambers" of exclusively like-minded discussants. Yet, to date, we lack sufficient means to measure viewpoint diversity in conversations. To this end, in this paper, we operationalize two viewpoint metrics proposed for recommender systems and adapt them to the context of social media conversations. This is the first study to apply these two metrics (Representation and Fragmentation) to real-world data and to consider the implications for online conversations specifically. We apply these measures to two topics -- daylight saving time (DST), which serves as a control, and the more politically polarized topic of immigration. We find that the diversity scores for both Fragmentation and Representation are lower for immigration than for DST. Further, we find that while pro-immigrant views receive consistent pushback on the platform, anti-immigrant views largely operate within echo chambers. We observe less severe yet similar patterns for DST. Taken together, Representation and Fragmentation paint a meaningful and important new picture of viewpoint diversity.

Submitted 18 December, 2022; originally announced December 2022.

Comments: Camera-ready version in WSDM 2023

arXiv:2211.06218

Total Variation Graph Neural Networks

Authors: Jonas Berg Hansen, Filippo Maria Bianchi

Abstract: Recently proposed Graph Neural Networks (GNNs) for vertex clustering are trained with an unsupervised minimum cut objective, approximated by a Spectral Clustering (SC) relaxation. However, the SC relaxation is loose and, while it offers a closed-form solution, it also yields overly smooth cluster assignments that poorly separate the vertices. In this paper, we propose a GNN model that computes cluster assignments by optimizing a tighter relaxation of the minimum cut based on graph total variation (GTV). The cluster assignments can be used directly to perform vertex clustering or to implement graph pooling in a graph classification framework. Our model consists of two core components: i) a message-passing layer that minimizes the ℓ1 distance in the features of adjacent vertices, which is key to achieving sharp transitions between clusters; ii) an unsupervised loss function that minimizes the GTV of the cluster assignments while ensuring balanced partitions. Experimental results show that our model outperforms other GNNs for vertex clustering and graph classification.

Submitted 27 April, 2023; v1 submitted 11 November, 2022; originally announced November 2022.

arXiv:2211.04281

SocioProbe: What, When, and Where Language Models Learn about Sociodemographics

Authors: Anne Lauscher, Federico Bianchi, Samuel Bowman, Dirk Hovy

Abstract: Pre-trained language models (PLMs) have outperformed other NLP models on a wide range of tasks. Seeking a more thorough understanding of their capabilities and inner workings, researchers have established the extent to which they capture lower-level knowledge like grammaticality, and mid-level semantic knowledge like factual understanding. However, there is still little understanding of their knowledge of higher-level aspects of language. In particular, despite the importance of sociodemographic aspects in shaping our language, the questions of whether, where, and how PLMs encode these aspects, e.g., gender or age, remain unexplored. We address this research gap by probing the sociodemographic knowledge of different single-GPU PLMs on multiple English data sets via traditional classifier probing and information-theoretic minimum description length probing. Our results show that PLMs do encode these sociodemographics, and that this knowledge is sometimes spread across the layers of some of the tested PLMs. We further conduct a multilingual analysis and investigate the effect of supplementary training to further explore to what extent, where, and with what amount of pre-training data the knowledge is encoded. Our overall results indicate that sociodemographic knowledge is still a major challenge for NLP. PLMs require large amounts of pre-training data to acquire the knowledge, and models that excel in general language understanding do not seem to possess more knowledge about these aspects.

Submitted 8 November, 2022; originally announced November 2022.

Comments: Accepted for publication at EMNLP 2022

arXiv:2211.03759

doi 10.1145/3593013.3594095

Easily Accessible Text-to-Image Generation Amplifies Demographic Stereotypes at Large Scale

Authors: Federico Bianchi, Pratyusha Kalluri, Esin Durmus, Faisal Ladhak, Myra Cheng, Debora Nozza, Tatsunori Hashimoto, Dan Jurafsky, James Zou, Aylin Caliskan

Abstract: Machine learning models that convert user-written text descriptions into images are now widely available online and used by millions of users to generate millions of images a day. We investigate the potential for these models to amplify dangerous and complex stereotypes. We find a broad range of ordinary prompts produce stereotypes, including prompts simply mentioning traits, descriptors, occupations, or objects. For example, we find cases of prompting for basic traits or social roles resulting in images reinforcing whiteness as ideal, prompting for occupations resulting in amplification of racial and gender disparities, and prompting for objects resulting in reification of American norms. Stereotypes are present regardless of whether prompts explicitly mention identity and demographic language or avoid such language. Moreover, stereotypes persist despite mitigation strategies; neither user attempts to counter stereotypes by requesting images with specific counter-stereotypes nor institutional attempts to add system "guardrails" have prevented the perpetuation of stereotypes. Our analysis justifies concerns regarding the impacts of today's models, presenting striking exemplars, and connecting these findings with deep insights into harms drawn from social scientific and humanist disciplines. This work contributes to the effort to shed light on the uniquely complex biases in language-vision models and demonstrates the ways that the mass deployment of text-to-image generation models results in mass dissemination of stereotypes and resulting harms.

Submitted 7 June, 2023; v1 submitted 7 November, 2022; originally announced November 2022.

Comments: FAccT 2023 paper. The published version is available at 10.1145/3593013.3594095

arXiv:2210.15870 [pdf, other]

"It's Not Just Hate": A Multi-Dimensional Perspective on Detecting Harmful Speech Online

Authors: Federico Bianchi, Stefanie Anja Hills, Patricia Rossini, Dirk Hovy, Rebekah Tromble, Nava Tintarev

Abstract: Well-annotated data is a prerequisite for good Natural Language Processing models. Too often, though, annotation decisions are governed by optimizing time or annotator agreement. We make a case for nuanced efforts in an interdisciplinary setting for annotating offensive online speech. Detecting offensive content is rapidly becoming one of the most important real-world NLP tasks. However, most datasets use a single binary label, e.g., for hate or incivility, even though each concept is multi-faceted. This modeling choice severely limits not only nuanced insights but also performance. We show that a more fine-grained multi-label approach to predicting incivility and hateful or intolerant content addresses both conceptual and performance issues. We release a novel dataset of over 40,000 tweets about immigration from the US and UK, annotated with six labels for different aspects of incivility and intolerance. Not only does our dataset allow for a more nuanced understanding of harmful speech online, but models trained on it also outperform or match performance on benchmark datasets.

Submitted 27 October, 2022; originally announced October 2022.

Comments: EMNLP 2022

arXiv:2210.14763 [pdf, other]

ProSiT! Latent Variable Discovery with PROgressive SImilarity Thresholds

Authors: Tommaso Fornaciari, Dirk Hovy, Federico Bianchi

Abstract: The most common ways to explore latent document dimensions are topic models and clustering methods. However, topic models have several drawbacks: e.g., they require us to choose the number of latent dimensions a priori, and the results are stochastic. Most clustering methods have the same issues and lack flexibility in various ways, such as not accounting for the influence of different topics on single documents, forcing word-descriptors to belong to a single topic (hard-clustering) or necessarily relying on word representations. We propose PROgressive SImilarity Thresholds - ProSiT, a deterministic and interpretable method, agnostic to the input format, that finds the optimal number of latent dimensions and only has two hyper-parameters, which can be set efficiently via grid search. We compare this method with a wide range of topic models and clustering methods on four benchmark data sets. In most settings, ProSiT matches or outperforms the other methods in terms of six metrics of topic coherence and distinctiveness, producing replicable, deterministic results.

Submitted 26 October, 2022; originally announced October 2022.

arXiv:2210.11359 [pdf, other]

Data-Efficient Strategies for Expanding Hate Speech Detection into Under-Resourced Languages

Authors: Paul Röttger, Debora Nozza, Federico Bianchi, Dirk Hovy

Abstract: Hate speech is a global phenomenon, but most hate speech datasets so far focus on English-language content. This hinders the development of more effective hate speech detection models in hundreds of languages spoken by billions across the world. More data is needed, but annotating hateful content is expensive, time-consuming and potentially harmful to annotators. To mitigate these issues, we explore data-efficient strategies for expanding hate speech detection into under-resourced languages. In a series of experiments with mono- and multilingual models across five non-English languages, we find that 1) a small amount of target-language fine-tuning data is needed to achieve strong performance, 2) the benefits of using more such data decrease exponentially, and 3) initial fine-tuning on readily-available English data can partially substitute target-language data and improve model generalisability. Based on these findings, we formulate actionable recommendations for hate speech detection in low-resource language settings.

Submitted 20 October, 2022; originally announced October 2022.

Comments: Accepted at EMNLP 2022 (Main Conference)

arXiv:2210.07365 [pdf, other]

Is It Worth the (Environmental) Cost? Limited Evidence for Temporal Adaptation via Continuous Training

Authors: Giuseppe Attanasio, Debora Nozza, Federico Bianchi, Dirk Hovy

Abstract: Language is constantly changing and evolving, causing language models to quickly become outdated. Consequently, we should continuously update our models with new data to expose them to new events and facts. However, that requires additional computing, which means new carbon emissions. Do any measurable benefits justify this cost? This paper looks for empirical evidence to support continuous training. We reproduce existing benchmarks and extend them to include additional time periods, models, and tasks. Our results show that the downstream task performance of temporally adapted English models for social media data does not improve over time. Pretrained models without temporal adaptation are actually significantly more effective and efficient. However, we also note a lack of suitable temporal benchmarks. Our findings invite a critical reflection on when and how to temporally adapt language models, accounting for sustainability.

Submitted 4 May, 2023; v1 submitted 13 October, 2022; originally announced October 2022.

Comments: 8 pages

arXiv:2210.01936 [pdf, other]

When and why vision-language models behave like bags-of-words, and what to do about it?

Authors: Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, James Zou

Abstract: Despite the success of large vision and language models (VLMs) in many downstream applications, it is unclear how well they encode compositional information. Here, we create the Attribution, Relation, and Order (ARO) benchmark to systematically evaluate the ability of VLMs to understand different types of relationships, attributes, and order. ARO consists of Visual Genome Attribution, to test the understanding of objects' properties; Visual Genome Relation, to test for relational understanding; and COCO & Flickr30k-Order, to test for order sensitivity. ARO is orders of magnitude larger than previous benchmarks of compositionality, with more than 50,000 test cases. We show where state-of-the-art VLMs have poor relational understanding, can blunder when linking objects to their attributes, and demonstrate a severe lack of order sensitivity. VLMs are predominantly trained and evaluated on large datasets with rich compositional structure in the images and captions. Yet, training on these datasets has not been enough to address the lack of compositional understanding, and evaluating on these datasets has failed to surface this deficiency. To understand why these limitations emerge and are not represented in the standard tests, we zoom into the evaluation and training procedures. We demonstrate that it is possible to perform well on retrieval over existing datasets without using the composition and order information.
Given that contrastive pretraining optimizes for retrieval on datasets with similar shortcuts, we hypothesize that this can explain why the models do not need to learn to represent compositional information. This finding suggests a natural solution: composition-aware hard negative mining. We show that a simple-to-implement modification of contrastive learning significantly improves the performance on tasks requiring understanding of order and compositionality.

Submitted 23 March, 2023; v1 submitted 4 October, 2022; originally announced October 2022.

Comments: ICLR 2023 Oral (notable-top-5%)
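The claim that retrieval can succeed without order information is easy to see with a toy, order-insensitive encoder. The sketch below assumes whitespace tokenization and a plain word-count representation; `swap_words` is a hypothetical helper that builds a word-order hard negative and is not the ARO generation procedure or the paper's training code.

```python
from collections import Counter


def bag_of_words(caption: str) -> Counter:
    """Order-insensitive caption representation: word counts only."""
    return Counter(caption.lower().split())


def swap_words(caption: str, i: int, j: int) -> str:
    """Hypothetical hard negative: swap the tokens at positions i and j."""
    tokens = caption.split()
    tokens[i], tokens[j] = tokens[j], tokens[i]
    return " ".join(tokens)


caption = "the horse is eating the grass"
negative = swap_words(caption, 1, 5)  # "the grass is eating the horse"

# A bag-of-words encoder maps both captions to the same representation...
same_representation = bag_of_words(caption) == bag_of_words(negative)
# ...even though the two captions describe different scenes.
same_caption = caption == negative
```

Adding such swapped captions as negatives in a contrastive batch forces the model to separate them, which a pure bag-of-words solution cannot do.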

arXiv:2209.07926 [pdf, other]

doi 10.7557/18.6796

Explainability in subgraphs-enhanced Graph Neural Networks

Authors: Michele Guerra, Indro Spinelli, Simone Scardapane, Filippo Maria Bianchi

Abstract: Recently, subgraphs-enhanced Graph Neural Networks (SGNNs) have been introduced to enhance the expressive power of Graph Neural Networks (GNNs), which has been proved to be no higher than that of the 1-dimensional Weisfeiler-Leman isomorphism test. The new paradigm suggests using subgraphs extracted from the input graph to improve the model's expressiveness, but the additional complexity exacerbates an already challenging problem in GNNs: explaining their predictions. In this work, we adapt PGExplainer, one of the most recent explainers for GNNs, to SGNNs. The proposed explainer accounts for the contribution of all the different subgraphs and can produce a meaningful explanation that humans can interpret. The experiments that we performed both on real and synthetic datasets show that our framework is successful in explaining the decision process of an SGNN on graph classification tasks.

Submitted 19 January, 2023; v1 submitted 16 September, 2022; originally announced September 2022.

Comments: The source code implementing our workflow is publicly available online at https://github.com/MicheleUIT/Explaining_SGNN

arXiv:2209.06520 [pdf, other]

Scalable Spatiotemporal Graph Neural Networks

Authors: Andrea Cini, Ivan Marisca, Filippo Maria Bianchi, Cesare Alippi

Abstract: Neural forecasting of spatiotemporal time series drives both research and industrial innovation in several relevant application domains. Graph neural networks (GNNs) are often the core component of the forecasting architecture. However, in most spatiotemporal GNNs, the computational complexity scales up to a quadratic factor with the length of the sequence times the number of links in the graph, hence hindering the application of these models to large graphs and long temporal sequences. While methods to improve scalability have been proposed in the context of static graphs, few research efforts have been devoted to the spatiotemporal case. To fill this gap, we propose a scalable architecture that exploits an efficient encoding of both temporal and spatial dynamics. In particular, we use a randomized recurrent neural network to embed the history of the input time series into high-dimensional state representations encompassing multi-scale temporal dynamics. Such representations are then propagated along the spatial dimension using different powers of the graph adjacency matrix to generate node embeddings characterized by a rich pool of spatiotemporal features. The resulting node embeddings can be efficiently pre-computed in an unsupervised manner, before being fed to a feed-forward decoder that learns to map the multi-scale spatiotemporal representations to predictions. The training procedure can then be parallelized node-wise by sampling the node embeddings without breaking any dependency, thus enabling scalability to large networks.
Empirical results on relevant datasets show that our approach achieves results competitive with the state of the art, while dramatically reducing the computational burden.

Submitted 20 February, 2023; v1 submitted 14 September, 2022; originally announced September 2022.

Comments: Published as a conference paper at AAAI 2023
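The spatial propagation step described in the abstract (state representations propagated with different powers of the adjacency matrix, then concatenated into node embeddings) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the temporal states are already given (the randomized recurrent encoder is not shown) and assumes a row-normalized adjacency matrix.

```python
import numpy as np


def row_normalize(adj: np.ndarray) -> np.ndarray:
    """Divide each row by the node degree; rows with zero degree stay zero."""
    deg = adj.sum(axis=1, keepdims=True)
    return np.divide(adj, deg, out=np.zeros_like(adj, dtype=float), where=deg > 0)


def multi_hop_embeddings(states: np.ndarray, adj: np.ndarray, k: int) -> np.ndarray:
    """Concatenate [H, A H, A^2 H, ..., A^k H] along the feature axis.

    states: (num_nodes, state_dim) temporal encodings (assumed precomputed).
    adj:    (num_nodes, num_nodes) adjacency matrix.
    """
    a_norm = row_normalize(adj)
    blocks = [states]
    current = states
    for _ in range(k):
        current = a_norm @ current  # propagate one more hop
        blocks.append(current)
    return np.concatenate(blocks, axis=1)


# Toy path graph 0-1-2 with 2-dimensional states per node.
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
states = np.arange(6, dtype=float).reshape(3, 2)
embeddings = multi_hop_embeddings(states, adj, k=2)  # shape (3, 6)
```

Because the embeddings involve no trainable parameters, they can be computed once; each row is then an independent training sample for the decoder, which is what allows node-wise mini-batching.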

arXiv:2208.05192 [pdf, other]

doi 10.3390/s22207951

Real-Time Oil Leakage Detection on Aftermarket Motorcycle Damping System with Convolutional Neural Networks

Authors: Federico Bianchi, Stefano Speziali, Andrea Marini, Massimiliano Proietti, Lorenzo Menculini, Alberto Garinei, Gabriele Bellani, Marcello Marconi

Abstract: In this work, we describe in detail how Deep Learning and Computer Vision can help to detect fault events of the AirTender system, an aftermarket motorcycle damping system component. One of the most effective ways to monitor AirTender's functioning is to look for oil stains on its surface. Starting from real-time images, AirTender is first detected in the motorbike suspension system, simulated indoors, and then a binary classifier determines whether AirTender is spilling oil or not. The detection is made with the help of the Yolo5 architecture, whereas the classification is carried out with the help of a suitably designed Convolutional Neural Network, OilNet40. In order to detect oil leaks more clearly, we dilute the oil in AirTender with a fluorescent dye with an excitation wavelength peak of approximately 390 nm. AirTender is then illuminated with suitable UV LEDs. The whole system is an attempt to design a low-cost detection setup. An on-board device, such as a mini-computer, is placed near the suspension system and connected to a full HD camera framing AirTender. The on-board device, through our Neural Network algorithm, is then able to localize and classify AirTender as normally functioning (non-leak image) or anomaly (leak image).

Submitted 23 November, 2022; v1 submitted 10 August, 2022; originally announced August 2022.

Comments: analysis of literature reviewed, n.2 figures added, minor corrections

arXiv:2207.08779 [pdf, other]

doi 10.7557/18.6790

Simplifying Clustering with Graph Neural Networks

Authors: Filippo Maria Bianchi

Abstract: The objective functions used in spectral clustering are usually composed of two terms: i) a term that minimizes the local quadratic variation of the cluster assignments on the graph; and ii) a term that balances the clustering partition and helps avoid degenerate solutions. This paper shows that a graph neural network, equipped with suitable message passing layers, can generate good cluster assignments by optimizing only a balancing term. Results on attributed graph datasets show the effectiveness of the proposed approach in terms of clustering performance and computation time.

Submitted 27 November, 2022; v1 submitted 18 July, 2022; originally announced July 2022.
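The two terms named in the abstract can be made concrete for a cluster assignment matrix S. The quadratic variation is tr(SᵀLS) with L = D − A, which equals the sum over edges of ‖sᵢ − sⱼ‖². The balancing term below is a stand-in: it is the orthogonality regularizer from MinCutPool, assumed here for illustration, and is not necessarily the balancing term used in this paper. The toy graph shows why a balancing term is needed: assigning every node to one cluster gives zero variation.

```python
import numpy as np


def quadratic_variation(adj: np.ndarray, s: np.ndarray) -> float:
    """tr(S^T L S) with the combinatorial Laplacian L = D - A."""
    lap = np.diag(adj.sum(axis=1)) - adj
    return float(np.trace(s.T @ lap @ s))


def edge_sum_variation(adj: np.ndarray, s: np.ndarray) -> float:
    """Same quantity written as a sum over undirected edges of ||s_i - s_j||^2."""
    n = adj.shape[0]
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += adj[i, j] * np.sum((s[i] - s[j]) ** 2)
    return float(total)


def orthogonality_balance(s: np.ndarray) -> float:
    """Stand-in balancing term (MinCutPool-style): zero for equal-size clusters."""
    sts = s.T @ s
    k = s.shape[1]
    return float(np.linalg.norm(sts / np.linalg.norm(sts) - np.eye(k) / np.sqrt(k)))


# Two triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3.
adj = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    adj[i, j] = adj[j, i] = 1.0

s_balanced = np.zeros((6, 2))
s_balanced[:3, 0] = 1.0
s_balanced[3:, 1] = 1.0
s_degenerate = np.zeros((6, 2))
s_degenerate[:, 0] = 1.0  # everything in one cluster: zero variation, poor balance
```

With these inputs, the balanced assignment pays 2.0 in variation (only the cut edge contributes) and roughly 0 in the balancing term, while the degenerate assignment pays 0 in variation but about 0.77 in the balancing term.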

arXiv:2207.05772 [pdf, ps, other]

EvalRS: a Rounded Evaluation of Recommender Systems

Authors: Jacopo Tagliabue, Federico Bianchi, Tobias Schnabel, Giuseppe Attanasio, Ciro Greco, Gabriel de Souza P. Moreira, Patrick John Chia

Abstract: Much of the complexity of Recommender Systems (RSs) comes from the fact that they are used as part of more complex applications and affect user experience through a varied range of user interfaces. However, research has focused almost exclusively on the ability of RSs to produce accurate item rankings while giving little attention to the evaluation of RS behavior in real-world scenarios. Such a narrow focus has limited the capacity of RSs to have a lasting impact in the real world and makes them vulnerable to undesired behavior, such as reinforcing data biases. We propose EvalRS as a new type of challenge, in order to foster this discussion among practitioners and build new methodologies in the open for testing RSs "in the wild".

Submitted 12 August, 2022; v1 submitted 12 July, 2022; originally announced July 2022.

Comments: CIKM 2022 Data Challenge Paper

arXiv:2204.03972 [pdf, other]

Contrastive language and vision learning of general fashion concepts

Authors: Patrick John Chia, Giuseppe Attanasio, Federico Bianchi, Silvia Terragni, Ana Rita Magalhães, Diogo Goncalves, Ciro Greco, Jacopo Tagliabue

Abstract: The steady rise of online shopping goes hand in hand with the development of increasingly complex ML and NLP models. While most use cases are cast as specialized supervised learning problems, we argue that practitioners would greatly benefit from more transferable representations of products. In this work, we build on recent developments in contrastive learning to train FashionCLIP, a CLIP-like model for the fashion industry. We showcase its capabilities for retrieval, classification and grounding, and release our model and code to the community.

Submitted 18 April, 2023; v1 submitted 8 April, 2022; originally announced April 2022.

Comments: Latest version available at https://www.nature.com/articles/s41598-022-23052-9; model available at https://huggingface.co/patrickjohncyh/fashion-clip
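FashionCLIP is described as a CLIP-like model trained with contrastive learning. The sketch below shows the symmetric image-text contrastive objective popularized by CLIP, written in NumPy for illustration only; it is not the FashionCLIP training code, and the temperature value of 0.07 is an assumption.

```python
import numpy as np


def log_softmax(x: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def clip_loss(image_emb: np.ndarray, text_emb: np.ndarray, temperature: float = 0.07) -> float:
    """Symmetric contrastive loss; row i of each matrix is a matching image-text pair."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (batch, batch), positives on the diagonal
    idx = np.arange(len(logits))
    loss_image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return float((loss_image_to_text + loss_text_to_image) / 2)


emb = np.eye(4)
aligned = clip_loss(emb, emb)           # matching pairs: loss close to 0
mismatched = clip_loss(emb, emb[::-1])  # every pair mismatched: much larger loss
```

Each image in the batch serves as a negative for every other caption and vice versa, which is what pushes products and their descriptions into a shared, transferable embedding space.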

arXiv:2204.02473 [pdf, other]

"Does it come in black?" CLIP-like models are zero-shot recommenders

Authors: Patrick John Chia, Jacopo Tagliabue, Federico Bianchi, Ciro Greco, Diogo Goncalves

Abstract: Product discovery is a crucial component of online shopping. However, item-to-item recommendations today do not allow users to explore changes along selected dimensions: given a query item, can a model suggest something similar but in a different color? We consider item recommendations of a comparative nature (e.g. "something darker") and show how CLIP-based models can support this use case in a zero-shot manner. Leveraging a large model built for fashion, we introduce GradREC and its industry potential, and offer a first rounded assessment of its strengths and weaknesses.

Submitted 11 April, 2022; v1 submitted 5 April, 2022; originally announced April 2022.

Comments: Accepted at ACL 2022 (ECNLP)

arXiv:2203.16401 [pdf, other]

doi 10.1109/TGRS.2022.3204886

Recognition of polar lows in Sentinel-1 SAR images with deep learning

Authors: Jakob Grahn, Filippo Maria Bianchi

Abstract: In this paper, we explore the possibility of detecting polar lows in C-band SAR images by means of deep learning. Specifically, we introduce a novel dataset consisting of Sentinel-1 images divided into two classes, representing the presence and absence of a maritime mesocyclone, respectively. The dataset is constructed using the ERA5 dataset as baseline and it consists of 2004 annotated images. To our knowledge, this is the first dataset of its kind to be publicly released. The dataset is used to train a deep learning model to classify the labeled images. Evaluated on an independent test set, the model yields an F-1 score of 0.95, indicating that polar lows can be consistently detected from SAR images. Interpretability techniques applied to the deep learning model reveal that atmospheric fronts and cyclonic eyes are key features in the classification. Moreover, experimental results show that the model is accurate even if: (i) such features are significantly cropped due to the limited swath width of the SAR, (ii) the features are partly covered by sea ice and (iii) land is covering significant parts of the images. By evaluating the model performance on multiple input image resolutions (pixel sizes of 500 m, 1 km and 2 km), it is found that higher resolution yields the best performance. This emphasises the potential of using high resolution sensors like SAR for detecting polar lows, as compared to conventionally used sensors such as scatterometers.

Submitted 5 September, 2022; v1 submitted 30 March, 2022; originally announced March 2022.

Comments: 11 pages (+4 supplementary), 11 figures (+2 supplementary)

arXiv:2203.07080 [pdf, other]

Probabilistic forecasts of wind power generation in regions with complex topography using deep learning methods: An Arctic case

Authors: Odin Foldvik Eikeland, Finn Dag Hovem, Tom Eirik Olsen, Matteo Chiesa, Filippo Maria Bianchi

Abstract: The energy market relies on forecasting capabilities of both demand and power generation that need to be kept in dynamic balance. Today, when it comes to renewable energy generation, such decisions are increasingly made in a liberalized electricity market environment, where future power generation must be offered through contracts and auction mechanisms, hence based on forecasts. The increased share of highly intermittent power generation from renewable energy sources increases the uncertainty about the expected future power generation. Point forecasts do not account for such uncertainties. To account for these uncertainties, it is possible to make probabilistic forecasts. This work first presents important concepts and approaches concerning probabilistic forecasts with deep learning. Then, deep learning models are used to make probabilistic forecasts of day-ahead power generation from a wind power plant located in Northern Norway. The performance in terms of obtained quality of the prediction intervals is compared for different deep learning models and sets of covariates. The findings show that the accuracy of the predictions improves when historical data on measured weather and numerical weather predictions (NWPs) are included as exogenous variables. This allows the model to auto-correct systematic biases in the NWPs using the historical measurement data. Using only NWPs, or only measured weather, as exogenous variables yielded worse prediction performance.

Submitted 10 March, 2022; originally announced March 2022.

Comments: 16 pages, 8 Figures, 4 Tables
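The abstract compares models by the quality of their prediction intervals without naming the metrics here. As an assumption, the sketch below uses three widely used measures: empirical coverage, mean interval width, and the pinball (quantile) loss. The paper's exact evaluation metrics may differ.

```python
def pinball_loss(y_true, y_pred, q):
    """Average quantile loss of predictions y_pred for quantile level q in (0, 1)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        total += max(q * diff, (q - 1) * diff)
    return total / len(y_true)


def interval_coverage(y_true, lower, upper):
    """Fraction of observations that fall inside [lower, upper]."""
    hits = sum(lo <= y <= hi for y, lo, hi in zip(y_true, lower, upper))
    return hits / len(y_true)


def mean_interval_width(lower, upper):
    """Average width of the prediction intervals (sharper is narrower)."""
    return sum(hi - lo for lo, hi in zip(lower, upper)) / len(lower)


# Toy day-ahead generation values and intervals (illustrative numbers only).
y = [1.0, 2.0, 3.0, 4.0]
lower = [0.0, 2.5, 2.0, 3.0]
upper = [2.0, 3.0, 4.0, 5.0]
coverage = interval_coverage(y, lower, upper)  # 3 of 4 inside
width = mean_interval_width(lower, upper)
```

Coverage and width pull in opposite directions: wider intervals trivially raise coverage, so the two are usually reported together.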

arXiv:2202.08756 [pdf, other]

doi 10.1109/TNNLS.2022.3217694

Ensemble Conformalized Quantile Regression for Probabilistic Time Series Forecasting

Authors: Vilde Jensen, Filippo Maria Bianchi, Stian Norman Anfinsen

Abstract: This paper presents a novel probabilistic forecasting method called ensemble conformalized quantile regression (EnCQR). EnCQR constructs distribution-free and approximately marginally valid prediction intervals (PIs), which are suitable for nonstationary and heteroscedastic time series data. EnCQR can be applied on top of a generic forecasting model, including deep learning architectures. EnCQR exploits a bootstrap ensemble estimator, which enables the use of conformal predictors for time series by removing the requirement of data exchangeability. The ensemble learners are implemented as generic machine learning algorithms performing quantile regression, which allow the length of the PIs to adapt to local variability in the data. In the experiments, we predict time series characterized by different amounts of heteroscedasticity. The results demonstrate that EnCQR outperforms models based only on quantile regression or conformal prediction, and it provides sharper, more informative, and valid PIs.

Submitted 6 November, 2022; v1 submitted 17 February, 2022; originally announced February 2022.

Journal ref: IEEE Transactions on Neural Networks and Learning Systems, 2022
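EnCQR builds on conformalized quantile regression (CQR). The sketch below shows only the generic CQR conformalization step: conformity scores measure how far each calibration target falls outside its quantile interval, and the intervals are widened by a finite-sample quantile of those scores. How EnCQR obtains calibration scores from its bootstrap ensemble is not reproduced here; the function names are hypothetical.

```python
import math


def conformity_scores(y, lower, upper):
    """Positive when y falls outside [lower, upper], negative when inside."""
    return [max(lo - yi, yi - hi) for yi, lo, hi in zip(y, lower, upper)]


def conformal_correction(scores, alpha):
    """The ceil((n + 1)(1 - alpha))-th smallest score, or inf if n is too small."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-based rank
    if rank > n:
        return math.inf
    return sorted(scores)[rank - 1]


def conformalize(lower, upper, correction):
    """Widen (or shrink, if correction < 0) each quantile interval."""
    return [lo - correction for lo in lower], [hi + correction for hi in upper]


# Ten calibration scores and a target miscoverage rate of 20 percent.
scores = [0.9, -0.5, 0.3, 0.0, -0.3, 0.6, 0.1, 0.4, -0.1, 0.2]
q_hat = conformal_correction(scores, alpha=0.2)  # 9th smallest score
new_lower, new_upper = conformalize([1.0], [3.0], q_hat)
```

Because the correction is added symmetrically to quantile-regression bounds, the intervals keep the locally adaptive width of quantile regression while gaining the coverage guarantee of conformal prediction.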

arXiv:2201.10986 [pdf, other]

Twitter-Demographer: A Flow-based Tool to Enrich Twitter Data

Authors: Federico Bianchi, Vincenzo Cutrona, Dirk Hovy

Abstract: Twitter data have become essential to Natural Language Processing (NLP) and social science research, driving various scientific discoveries in recent years. However, the textual data alone are often not enough to conduct studies: social scientists in particular need more variables to perform their analysis and control for various factors. How we augment this information, such as users' location, age, or tweet sentiment, has ramifications for anonymity and reproducibility, and requires dedicated effort. This paper describes Twitter-Demographer, a simple, flow-based tool to enrich Twitter data with additional information about tweets and users. Twitter-Demographer is aimed at NLP practitioners and (computational) social scientists who want to enrich their datasets with aggregated information, facilitating reproducibility, and providing algorithmic privacy-by-design measures for pseudo-anonymity. We discuss our design choices, inspired by the flow-based programming paradigm, to use black-box components that can easily be chained together and extended. We also analyze the ethical issues related to the use of this tool, and the built-in measures to facilitate pseudo-anonymity.

Submitted 26 January, 2022; originally announced January 2022.
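The flow-based design (black-box components chained together) and the pseudo-anonymity measures can be illustrated with a small sketch. Everything here is hypothetical: `chain`, `add_text_length`, and `pseudonymize` are illustrative names, not Twitter-Demographer's API, and salted SHA-256 hashing of user IDs is assumed as one possible pseudonymization step.

```python
import hashlib
from typing import Callable, Dict, List

Record = Dict[str, object]
Component = Callable[[List[Record]], List[Record]]


def chain(*components: Component) -> Component:
    """Compose components so each one consumes the previous one's output."""
    def run(records: List[Record]) -> List[Record]:
        for component in components:
            records = component(records)
        return records
    return run


def add_text_length(records: List[Record]) -> List[Record]:
    """Hypothetical enrichment component: adds a derived field to each tweet."""
    return [{**r, "text_length": len(str(r["text"]))} for r in records]


def pseudonymize(salt: str) -> Component:
    """Hypothetical component: replace user_id with a salted SHA-256 digest."""
    def component(records: List[Record]) -> List[Record]:
        out = []
        for r in records:
            digest = hashlib.sha256((salt + str(r["user_id"])).encode("utf-8")).hexdigest()
            kept = {k: v for k, v in r.items() if k != "user_id"}
            out.append({**kept, "user_hash": digest})
        return out
    return component


pipeline = chain(add_text_length, pseudonymize(salt="project-secret"))
enriched = pipeline([{"user_id": 42, "text": "hello world"}])
```

Salted hashing is pseudonymization rather than anonymization: the same user always maps to the same hash, which keeps analyses reproducible, but anyone holding the salt can re-link records.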

arXiv:2111.09963 [pdf, other]

doi 10.1145/3487553.3524215

Beyond NDCG: behavioral testing of recommender systems with RecList

Authors: Patrick John Chia, Jacopo Tagliabue, Federico Bianchi, Chloe He, Brian Ko

Abstract: As with most Machine Learning systems, recommender systems are typically evaluated through performance metrics computed over held-out data points. However, real-world behavior is undoubtedly nuanced: ad hoc error analysis and deployment-specific tests must be employed to ensure the desired quality in actual deployments. In this paper, we propose RecList, a behavioral-based testing methodology. RecList organizes recommender systems by use case and introduces a general plug-and-play procedure to scale up behavioral testing. We demonstrate its capabilities by analyzing known algorithms and black-box commercial systems, and we release RecList as an open source, extensible package for the community.

Submitted 27 March, 2022; v1 submitted 18 November, 2021; originally announced November 2021.

Comments: Paper accepted to the WebConf 2022
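A behavioral test goes beyond one aggregate metric by checking how a recommender behaves on meaningful slices of the data. The sketch below is a hypothetical example in that spirit, not the RecList package API: it computes hit rate at k per user slice and flags the model if the gap between slices exceeds a threshold. Slice names and the threshold are illustrative assumptions.

```python
from collections import defaultdict


def hit_rate_by_slice(predictions, targets, slices, k=10):
    """Hit rate at k computed separately for each slice label."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for preds, target, slice_name in zip(predictions, targets, slices):
        counts[slice_name] += 1
        hits[slice_name] += int(target in preds[:k])
    return {name: hits[name] / counts[name] for name in counts}


def check_slice_gap(rates, max_gap):
    """Pass only if the best and worst slices are within max_gap of each other."""
    return max(rates.values()) - min(rates.values()) <= max_gap


predictions = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 2, 3]]
targets = [2, 4, 7, 8]
slices = ["new_user", "new_user", "returning_user", "returning_user"]
rates = hit_rate_by_slice(predictions, targets, slices, k=2)
```

A model with a good overall hit rate can still fail such a test, which is the kind of deployment-specific behavior aggregate metrics on held-out data tend to hide.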

arXiv:2111.02169 [pdf, other]

doi 10.1109/TPWRS.2022.3195301

Power Flow Balancing with Decentralized Graph Neural Networks

Authors: Jonas Berg Hansen, Stian Normann Anfinsen, Filippo Maria Bianchi

Abstract: We propose an end-to-end framework based on a Graph Neural Network (GNN) to balance the power flows in energy grids. The balancing is framed as a supervised vertex regression task, where the GNN is trained to predict the current and power injections at each grid branch that yield a power flow balance. By representing the power grid as a line graph with branches as vertices, we can train a GNN that is accurate and robust to changes in topology. In addition, by using specialized GNN layers, we are able to build a very deep architecture that accounts for large neighborhoods on the graph, while implementing only localized operations. We perform three different experiments to evaluate: i) the benefits of using localized rather than global operations and the tendency of deep GNN models to oversmooth the quantities on the nodes; ii) the resilience to perturbations in the graph topology; and iii) the capability to train the model simultaneously on multiple grid topologies and the consequent improvement in generalization to new, unseen grids. The proposed framework is efficient and, compared to other solvers based on deep learning, is robust to perturbations not only to the physical quantities on the grid components, but also to the topology.

Submitted 11 August, 2022; v1 submitted 3 November, 2021; originally announced November 2021.
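The key modelling step in this abstract is representing the grid as a line graph, with branches as vertices. A minimal sketch of that construction, assuming the grid is given as a list of (bus, bus) branch tuples; the toy grid and the `line_graph` helper are hypothetical illustrations, not the authors' code. Two branches become neighbours in the line graph when they share a bus.

```python
from itertools import combinations


def line_graph(branches):
    """Build the line graph of a grid given as a list of (bus, bus) branches.

    Each branch becomes a vertex; two branch-vertices are adjacent when the
    corresponding branches share at least one bus.
    """
    adjacency = {i: set() for i in range(len(branches))}
    for i, j in combinations(range(len(branches)), 2):
        if set(branches[i]) & set(branches[j]):
            adjacency[i].add(j)
            adjacency[j].add(i)
    return adjacency


# Hypothetical 4-bus grid: a path 0-1-2-3 plus an extra branch 1-3.
grid = [(0, 1), (1, 2), (2, 3), (1, 3)]
print(line_graph(grid))
```

A GNN operating on this structure predicts per-branch quantities as vertex outputs, which is why the task can be framed as vertex regression.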

arXiv:2110.05292

doi 10.1109/TNNLS.2022.3190922

Understanding Pooling in Graph Neural Networks

Authors: Daniele Grattarola, Daniele Zambon, Filippo Maria Bianchi, Cesare Alippi

Abstract: Inspired by the conventional pooling layers in convolutional neural networks, many recent works in the field of graph machine learning have introduced pooling operators to reduce the size of graphs. The great variety in the literature stems from the many possible strategies for coarsening a graph, which may depend on different assumptions on the graph structure or the specific downstream task. In this paper we propose a formal characterization of graph pooling based on three main operations, called selection, reduction, and connection, with the goal of unifying the literature under a common framework. Following this formalization, we introduce a taxonomy of pooling operators and categorize more than thirty pooling methods proposed in recent literature. We propose criteria to evaluate the performance of a pooling operator and use them to investigate and contrast the behavior of different classes of the taxonomy on a variety of tasks.

Submitted 11 October, 2021; originally announced October 2021.

Comments: 10 pages, 6 figures

Journal ref: IEEE Transactions on Neural Networks and Learning Systems (Volume: 35, Issue: 2, February 2024)
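The three operations named in this abstract can be illustrated with one common dense instantiation: selection yields an N x K assignment matrix S, reduction computes X' = S^T X, and connection computes A' = S^T A S. The sketch assumes S is already given (the selection step, where pooling methods differ most, is omitted); `src_pool` and its helpers are hypothetical names, not an API from the paper.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(row) for row in zip(*m)]


def src_pool(X, A, S):
    """Dense reduce/connect steps given a selection (assignment) matrix S.

    X: N x F node features, A: N x N adjacency, S: N x K assignment.
    """
    St = transpose(S)
    X_pooled = matmul(St, X)             # reduce: aggregate features per supernode
    A_pooled = matmul(matmul(St, A), S)  # connect: count edges between supernodes
    return X_pooled, A_pooled


# Hypothetical path graph 0-1-2-3 with scalar features, pooled into the two
# supernodes {0, 1} and {2, 3} by a hard assignment.
X = [[1.0], [2.0], [3.0], [4.0]]
A = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
S = [[1, 0], [1, 0], [0, 1], [0, 1]]
X_p, A_p = src_pool(X, A, S)
print(X_p, A_p)
```

With a hard assignment, the diagonal of A' counts intra-cluster edges (each twice) and the off-diagonal entries count edges between clusters; soft assignments produce weighted versions of the same quantities.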

arXiv:2109.13037

Language Invariant Properties in Natural Language Processing

Authors: Federico Bianchi, Debora Nozza, Dirk Hovy

Abstract: Meaning is context-dependent, but many properties of language (should) remain the same even if we transform the context. For example, sentiment, entailment, or speaker properties should be the same in a translation and in the original text. We introduce language invariant properties, i.e., properties that should not change when we transform text, and show how they can be used to quantitatively evaluate the robustness of transformation algorithms. We use translation and paraphrasing as transformation examples, but our findings apply more broadly to any transformation. Our results indicate that many NLP transformations change properties like author characteristics, for example making the author sound more male. We believe that studying these properties will allow NLP to address both social factors and pragmatic aspects of language. We also release an application suite that can be used to evaluate the invariance of transformation applications.

Submitted 1 October, 2021; v1 submitted 27 September, 2021; originally announced September 2021.

arXiv:2109.07231

SWEAT: Scoring Polarization of Topics across Different Corpora

Authors: Federico Bianchi, Marco Marelli, Paolo Nicoli, Matteo Palmonari

Abstract: Understanding differences of viewpoints across corpora is a fundamental task for computational social sciences. In this paper, we propose the Sliced Word Embedding Association Test (SWEAT), a novel statistical measure to compute the relative polarization of a topical wordset across two distributional representations. To this end, SWEAT uses two additional wordsets, deemed to have opposite valence, to represent two different poles. We validate our approach and illustrate a case study to show the usefulness of the introduced measure.

Submitted 15 September, 2021; originally announced September 2021.

Comments: Published as a conference paper at EMNLP 2021
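As a rough illustration of scoring a topical wordset against two poles across two embedding spaces, here is a WEAT-style sketch. It assumes that a word's association is its mean cosine similarity to pole A minus its mean similarity to pole B, and that the cross-corpus score is the difference of the summed associations. This is our reading of the abstract only; the paper's exact statistic, normalization, and significance test may differ, and the toy embeddings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def association(word, pole_a, pole_b, emb):
    """Mean similarity of `word` to pole A minus its mean similarity to pole B."""
    sim_a = sum(cosine(emb[word], emb[a]) for a in pole_a) / len(pole_a)
    sim_b = sum(cosine(emb[word], emb[b]) for b in pole_b) / len(pole_b)
    return sim_a - sim_b


def polarization(topic, pole_a, pole_b, emb1, emb2):
    """Positive when the topic leans more toward pole A in corpus 1 than in corpus 2."""
    s1 = sum(association(w, pole_a, pole_b, emb1) for w in topic)
    s2 = sum(association(w, pole_a, pole_b, emb2) for w in topic)
    return s1 - s2


# Hypothetical 2-d embeddings from two corpora that share the pole words.
poles = {"good": [1.0, 0.0], "bad": [0.0, 1.0]}
emb1 = {**poles, "tax": [1.0, 0.0]}
emb2 = {**poles, "tax": [0.0, 1.0]}
print(polarization(["tax"], ["good"], ["bad"], emb1, emb2))
```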

arXiv:2108.08688

Contrastive Language-Image Pre-training for the Italian Language

Authors: Federico Bianchi, Giuseppe Attanasio, Raphael Pisoni, Silvia Terragni, Gabriele Sarti, Sri Lakshmi

Abstract: CLIP (Contrastive Language-Image Pre-training) is a very recent multi-modal model that jointly learns representations of images and texts. The model is trained on a massive amount of English data and shows impressive performance on zero-shot classification tasks. Training the same model on a different language is not trivial, since data in other languages might not be sufficient and the model needs high-quality translations of the texts to guarantee good performance. In this paper, we present the first CLIP model for the Italian language (CLIP-Italian), trained on more than 1.4 million image-text pairs. Results show that CLIP-Italian outperforms the multilingual CLIP model on the tasks of image retrieval and zero-shot classification.

Submitted 19 August, 2021; originally announced August 2021.
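CLIP-style models are trained with a symmetric contrastive objective over a batch of matched image-text pairs. A minimal pure-Python sketch of the standard formulation, assuming a precomputed, temperature-scaled n x n similarity matrix whose diagonal holds the matching pairs; `clip_loss` is a hypothetical helper, not CLIP-Italian's training code, which would operate on batched tensors.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def clip_loss(logits):
    """Symmetric cross-entropy over an n x n image-text similarity matrix.

    logits[i][j] scores image i against text j; the correct pairs are on the
    diagonal, so each row (image->text) and each column (text->image) is a
    classification problem whose target is the diagonal entry.
    """
    n = len(logits)
    image_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    text_to_image = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (image_to_text + text_to_image) / 2


# Hypothetical 3-pair batch where each image is most similar to its own caption.
logits = [[5.0, 1.0, 0.0], [0.5, 4.0, 1.0], [0.0, 1.0, 6.0]]
print(clip_loss(logits))
```

Uniform logits give a loss of log(n), so a trained model should push the loss well below that value.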

arXiv:2108.07060

Detecting and interpreting faults in vulnerable power grids with machine learning

Authors: Odin Foldvik Eikeland, Inga Setså Holmstrand, Sigurd Bakkejord, Matteo Chiesa, Filippo Maria Bianchi

Abstract: Unscheduled power disturbances cause severe consequences both for customers and grid operators. To defend against such events, it is necessary to identify the causes of interruptions in the power distribution network. In this work, we focus on the power grid of a Norwegian community in the Arctic that experiences several faults whose sources are unknown. First, we construct a data set consisting of relevant meteorological data and information about the current power quality logged by power-quality meters. Then, we adopt machine-learning techniques to predict the occurrence of faults. Experimental results show that both linear and non-linear classifiers achieve good classification performance. This indicates that the considered power-quality and weather variables explain the power disturbances well. Interpreting the decision process of the classifiers provides valuable insights into the main causes of disturbances. Traditional feature selection methods can only indicate which variables, on average, best explain the fault occurrences in the dataset. Besides providing such a global interpretation, it is also important to identify the specific set of variables that explain each individual fault. To address this challenge, we adopt a recent technique to interpret the decision process of a deep learning model, called Integrated Gradients. The proposed approach makes it possible to gain detailed insights into the occurrence of a specific fault, which are valuable for distribution system operators to implement strategies to prevent and mitigate power disturbances.

Submitted 16 August, 2021; originally announced August 2021.
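Integrated Gradients attributes a prediction to each input feature by integrating the model's gradient along a straight path from a baseline to the input. A minimal sketch using a midpoint Riemann sum, assuming access to a gradient function; the toy score function and its analytic gradient are hypothetical stand-ins for the fault classifier, which would obtain gradients through automatic differentiation.

```python
def integrated_gradients(grad_f, x, baseline, steps=1000):
    """Approximate Integrated Gradients with a midpoint Riemann sum.

    attribution_i = (x_i - baseline_i) * mean over alpha in (0, 1) of
    dF/dz_i evaluated at z = baseline + alpha * (x - baseline).
    """
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_f(point)):
            totals[i] += g
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]


def f(z):
    """Hypothetical toy score: F(z) = z0 * z1 + z1 ** 2."""
    return z[0] * z[1] + z[1] ** 2


def grad_f(z):
    """Analytic gradient of the toy score."""
    return [z[1], z[0] + 2 * z[1]]


x, baseline = [2.0, 3.0], [0.0, 0.0]
attr = integrated_gradients(grad_f, x, baseline)
print(attr, sum(attr), f(x) - f(baseline))
```

A useful sanity check is the completeness property: the attributions should sum, up to discretization error, to F(x) - F(baseline), which is 15 for this toy input. Per-fault attributions of this kind are what distinguish a local explanation from the global ranking produced by feature selection.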

arXiv:2104.09423

SIGIR 2021 E-Commerce Workshop Data Challenge

Authors: Jacopo Tagliabue, Ciro Greco, Jean-Francis Roy, Bingqing Yu, Patrick John Chia, Federico Bianchi, Giovanni Cassani

Abstract: The 2021 SIGIR workshop on eCommerce is hosting the Coveo Data Challenge for "In-session prediction for purchase intent and recommendations". The challenge addresses the growing need for reliable predictions within the boundaries of a shopping session, as customer intentions can be different depending on the occasion. The need for efficient procedures for personalization is even clearer if we consider the e-commerce landscape more broadly: outside of giant digital retailers, the constraints of the problem are stricter, due to smaller user bases and the realization that most users are not frequently returning customers. We release a new session-based dataset including more than 30M fine-grained browsing events (product detail, add, purchase), enriched by linguistic behavior (queries made by shoppers, with items clicked and items not clicked after the query) and catalog metadata (images, text, pricing information). On this dataset, we ask participants to showcase innovative solutions for two open problems: a recommendation task, where a model is shown some events at the start of a session and is asked to predict future product interactions; and an intent prediction task, where a model is shown a session containing an add-to-cart event and is asked to predict whether the item will be bought before the end of the session.

Submitted 16 July, 2021; v1 submitted 19 April, 2021; originally announced April 2021.

Comments: SIGIR eCOM 2021 Data Challenge

arXiv:2104.08874

Language in a (Search) Box: Grounding Language Learning in Real-World Human-Machine Interaction

Authors: Federico Bianchi, Ciro Greco, Jacopo Tagliabue

Abstract: We investigate grounded language learning through real-world data, by modelling teacher-learner dynamics through the natural interactions occurring between users and search engines; in particular, we explore the emergence of semantic generalization from unsupervised dense representations outside of synthetic environments. A grounding domain, a denotation function and a composition function are learned from user data only. We show how the resulting semantics for noun phrases exhibits compositional properties while being fully learnable without any explicit labelling. We benchmark our grounded semantics on compositionality and zero-shot inference tasks, and we show that it provides better results and better generalizations than SOTA non-grounded models, such as word2vec and BERT.

Submitted 18 April, 2021; originally announced April 2021.

Comments: Published as a conference paper at NAACL 2021

arXiv:2104.04710

Pyramidal Reservoir Graph Neural Network

Authors: Filippo Maria Bianchi, Claudio Gallicchio, Alessio Micheli

Abstract: We propose a deep Graph Neural Network (GNN) model that alternates two types of layers. The first type is inspired by Reservoir Computing (RC) and generates new vertex features by iterating a non-linear map until it converges to a fixed point. The second type of layer implements graph pooling operations that gradually reduce the support graph and the vertex features, and further improve the computational efficiency of the RC-based GNN. The architecture is, therefore, pyramidal. In the last layer, the features of the remaining vertices are combined into a single vector, which represents the graph embedding. Through a mathematical derivation introduced in this paper, we show formally how graph pooling can reduce the computational complexity of the model and speed up the convergence of the dynamical updates of the vertex features. Our proposed approach to the design of RC-based GNNs offers an advantageous and principled trade-off between accuracy and complexity, which we extensively demonstrate in experiments on a large set of graph datasets.

Submitted 10 April, 2021; originally announced April 2021.

Comments: This is a pre-print version of a paper submitted for journal publication
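The reservoir layer's iteration of a non-linear map to a fixed point can be sketched as follows. This is a GraphESN-style update with scalar vertex states and hand-picked weights `w_in` and `w_hat` (both hypothetical); the paper's model uses vector states with reservoir matrices and interleaves pooling layers, none of which is shown. Keeping `w_hat` times the maximum vertex degree below 1 makes the map a contraction, so the iteration is guaranteed to converge.

```python
import math


def reservoir_embed(adjacency, features, w_in=0.5, w_hat=0.2, tol=1e-10, max_iter=1000):
    """Iterate a fixed (untrained) graph reservoir map until it reaches a fixed point.

    Scalar states for brevity:
        h_v <- tanh(w_in * x_v + w_hat * sum of h_u over neighbours u of v)
    """
    h = {v: 0.0 for v in adjacency}
    for _ in range(max_iter):
        new_h = {
            v: math.tanh(w_in * features[v] + w_hat * sum(h[u] for u in adjacency[v]))
            for v in adjacency
        }
        delta = max(abs(new_h[v] - h[v]) for v in adjacency)
        h = new_h
        if delta < tol:  # successive states agree: (numerically) at the fixed point
            break
    return h


# Hypothetical triangle graph with scalar input features.
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
feats = {0: 1.0, 1: -0.5, 2: 0.25}
states = reservoir_embed(triangle, feats)
graph_embedding = sum(states.values()) / len(states)  # mean readout over vertices
print(states, graph_embedding)
```

The contraction rate depends on `w_hat` times the degree, so shrinking the graph through pooling changes both the number of vertices to update and the dynamics of convergence, which is the trade-off the abstract analyses formally.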

arXiv:2104.02061

Query2Prod2Vec: Grounded Word Embeddings for eCommerce

Authors: Federico Bianchi, Jacopo Tagliabue, Bingqing Yu

Abstract: We present Query2Prod2Vec, a model that grounds lexical representations for product search in product embeddings: in our model, meaning is a mapping between words and a latent space of products in a digital shop. We leverage shopping sessions to learn the underlying space and use merchandising annotations to build lexical analogies for evaluation: our experiments show that our model is more accurate than known techniques from the NLP and IR literature. Finally, we stress the importance of data efficiency for product search outside of retail giants, and highlight how Query2Prod2Vec fits with practical constraints faced by most practitioners.

Submitted 2 April, 2021; originally announced April 2021.

Comments: Published as a conference paper at NAACL 2021 (Industry Track)

arXiv:2012.09807

BERT Goes Shopping: Comparing Distributional Models for Product Representations

Authors: Federico Bianchi, Bingqing Yu, Jacopo Tagliabue

Abstract: Word embeddings (e.g., word2vec) have been applied successfully to eCommerce products through prod2vec. Inspired by the recent performance improvements on several NLP tasks brought by contextualized embeddings, we propose to transfer BERT-like architectures to eCommerce: our model, Prod2BERT, is trained to generate representations of products through masked session modeling. Through extensive experiments over multiple shops, different tasks, and a range of design choices, we systematically compare the accuracy of Prod2BERT and prod2vec embeddings: while Prod2BERT is found to be superior in several scenarios, we highlight the importance of resources and hyperparameters in the best performing models. Finally, we provide guidelines to practitioners for training embeddings under a variety of computational and data constraints.

Submitted 23 June, 2021; v1 submitted 17 December, 2020; originally announced December 2020.

Comments: Updated version. Published as a workshop paper at ECNLP 4 at ACL-IJCNLP 2021
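prod2vec-style models treat a browsing session like a sentence and its products like words, so a standard skip-gram model can be trained on them. The data-preparation step can be sketched as below, assuming sessions are lists of product IDs; the IDs and the `skipgram_pairs` helper are hypothetical, and the embedding training itself would be delegated to a word2vec implementation.

```python
def skipgram_pairs(sessions, window=2):
    """Turn browsing sessions into (target, context) product pairs.

    Each session is a list of product IDs treated like a sentence of words;
    every product is paired with the products at most `window` positions away.
    """
    pairs = []
    for session in sessions:
        for i, target in enumerate(session):
            start, stop = max(0, i - window), min(len(session), i + window + 1)
            pairs.extend((target, session[j]) for j in range(start, stop) if j != i)
    return pairs


# Hypothetical sessions made of product SKUs.
sessions = [["sku_1", "sku_2", "sku_3"], ["sku_2", "sku_4"]]
pairs = skipgram_pairs(sessions, window=1)
print(pairs)
```

Products that repeatedly co-occur within the window end up with similar vectors, which is the distributional signal prod2vec relies on; Prod2BERT instead learns from masked positions over the whole session.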

arXiv:2007.14906

Fantastic Embeddings and How to Align Them: Zero-Shot Inference in a Multi-Shop Scenario

Authors: Federico Bianchi, Jacopo Tagliabue, Bingqing Yu, Luca Bigon, Ciro Greco

Abstract: This paper addresses the challenge of leveraging multiple embedding spaces for multi-shop personalization, proving that zero-shot inference is possible by transferring shopping intent from one website to another without manual intervention. We detail a machine learning pipeline to train and optimize embeddings within shops first, and support the quantitative findings with additional qualitative insights. We then turn to the harder task of using learned embeddings across shops: if products from different shops live in the same vector space, user intent, as represented by regions in this space, can then be transferred in a zero-shot fashion across websites. We propose and benchmark unsupervised and supervised methods to "travel" between embedding spaces, each with its own assumptions on data quantity and quality. We show that zero-shot personalization is indeed possible at scale by testing the shared embedding space with two downstream tasks, event prediction and type-ahead suggestions. Finally, we curate a cross-shop anonymized embeddings dataset to foster an inclusive discussion of this important business scenario.

Submitted 20 July, 2020; originally announced July 2020.

Comments: Accepted at the 2020 SIGIR Workshop on eCommerce

arXiv:2006.13575

Large-scale detection and categorization of oil spills from SAR images with deep learning

Authors: Filippo Maria Bianchi, Martine M. Espeseth, Njål Borch

Abstract: We propose a deep learning framework to detect and categorize oil spills in synthetic aperture radar (SAR) images at a large scale. By means of a carefully designed neural network model for image segmentation trained on an extensive dataset, we are able to obtain state-of-the-art performance in oil spill detection, achieving results that are comparable to results produced by human operators. We also introduce a classification task, which is novel in the context of oil spill detection in SAR. Specifically, after being detected, each oil spill is also classified according to different categories pertaining to its shape and texture characteristics. The classification results provide valuable insights for improving the design of oil spill services by world-leading providers. As the last contribution, we present our operational pipeline and a visualization tool for large-scale data, which makes it possible to detect and analyze the historical presence of oil spills worldwide.

Submitted 24 June, 2020; originally announced June 2020.

arXiv:2004.14843

doi 10.3233/SSW200011

Knowledge Graph Embeddings and Explainable AI

Authors: Federico Bianchi, Gaetano Rossiello, Luca Costabello, Matteo Palmonari, Pasquale Minervini

Abstract: Knowledge graph embeddings are now a widely adopted approach to knowledge representation in which entities and relationships are embedded in vector spaces. In this chapter, we introduce the reader to the concept of knowledge graph embeddings by explaining what they are, how they can be generated and how they can be evaluated. We summarize the state-of-the-art in this field by describing the approaches that have been introduced to represent knowledge in the vector space. In relation to knowledge representation, we consider the problem of explainability, and discuss models and methods for explaining predictions obtained via knowledge graph embeddings.

Submitted 30 April, 2020; originally announced April 2020.

Comments: Federico Bianchi, Gaetano Rossiello, Luca Costabello, Matteo Palmonari, Pasquale Minervini, Knowledge Graph Embeddings and Explainable AI. In: Ilaria Tiddi, Freddy Lecue, Pascal Hitzler (eds.), Knowledge Graphs for eXplainable AI: Foundations, Applications and Challenges. Studies on the Semantic Web, IOS Press, Amsterdam, 2020
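To make "entities and relationships embedded in vector spaces" concrete, here is one well-known scoring model, TransE, which treats a relation as a translation so that head + relation lies close to tail for true triples. It is only one of the many approaches such a survey covers, and the 2-d embeddings below are hypothetical values chosen by hand, not learned ones.

```python
import math


def transe_score(head, relation, tail):
    """TransE plausibility: the negative Euclidean distance between head + relation and tail.

    Scores closer to 0 indicate more plausible (head, relation, tail) triples.
    """
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


# Hypothetical hand-picked embeddings (a trained model would learn these).
emb = {"Rome": [1.0, 0.0], "Italy": [1.0, 1.0], "France": [-1.0, 1.0]}
capital_of = [0.0, 1.0]

print(transe_score(emb["Rome"], capital_of, emb["Italy"]))
print(transe_score(emb["Rome"], capital_of, emb["France"]))
```

Link prediction then amounts to ranking candidate tails by this score, and the geometric form of the score is one starting point for explaining why a prediction was made.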

arXiv:2004.07737

Cross-lingual Contextualized Topic Models with Zero-shot Learning

Authors: Federico Bianchi, Silvia Terragni, Dirk Hovy, Debora Nozza, Elisabetta Fersini

Abstract: Many data sets (e.g., reviews, forums, news, etc.) exist in parallel in multiple languages. They all cover the same content, but the linguistic differences make it impossible to use traditional, bag-of-words-based topic models. Models have to be either single-language or suffer from a huge but extremely sparse vocabulary. Both issues can be addressed by transfer learning. In this paper, we introduce a zero-shot cross-lingual topic model. Our model learns topics on one language (here, English), and predicts them for unseen documents in different languages (here, Italian, French, German, and Portuguese). We evaluate the quality of the topic predictions for the same document in different languages. Our results show that the transferred topics are coherent and stable across languages, which suggests exciting future research directions.

Submitted 4 February, 2021; v1 submitted 16 April, 2020; originally announced April 2020.

Comments: Updated version. Published as a conference paper at EACL 2021

arXiv:2004.07011

Code-Aligned Autoencoders for Unsupervised Change Detection in Multimodal Remote Sensing Images

Authors: Luigi T. Luppino, Mads A. Hansen, Michael Kampffmeyer, Filippo M. Bianchi, Gabriele Moser, Robert Jenssen, Stian N. Anfinsen

Abstract: Image translation with convolutional autoencoders has recently been used as an approach to multimodal change detection in bitemporal satellite images. A main challenge is the alignment of the code spaces by reducing the contribution of change pixels to the learning of the translation function. Many existing approaches train the networks by exploiting supervised information of the change areas, which, however, is not always available. We propose to extract relational pixel information captured by domain-specific affinity matrices at the input and use this to enforce alignment of the code spaces and reduce the impact of change pixels on the learning objective. A change prior is derived in an unsupervised fashion from pixel pair affinities that are comparable across domains. To achieve code space alignment, we enforce that pixels with similar affinity relations in the input domains should also be correlated in code space. We demonstrate the utility of this procedure in combination with cycle consistency. The proposed approach is compared with state-of-the-art deep learning algorithms. Experiments conducted on four real datasets show the effectiveness of our methodology.

Submitted 15 April, 2020; originally announced April 2020.
