Search | arXiv e-print repository

How Well Can LLMs Negotiate? NegotiationArena Platform and Analysis

Authors: Federico Bianchi, Patrick John Chia, Mert Yuksekgonul, Jacopo Tagliabue, Dan Jurafsky, James Zou

Abstract: Negotiation is the basis of social interactions; humans negotiate everything from the price of cars to how to share common resources. With rapidly growing interest in using large language models (LLMs) to act as agents on behalf of human users, such LLM agents would also need to be able to negotiate. In this paper, we study how well LLMs can negotiate with each other. We develop NegotiationArena: a flexible framework for evaluating and probing the negotiation abilities of LLM agents. We implemented three types of scenarios in NegotiationArena to assess LLM's behaviors in allocating shared resources (ultimatum games), aggregate resources (trading games) and buy/sell goods (price negotiations). Each scenario allows for multiple turns of flexible dialogues between LLM agents to allow for more complex negotiations. Interestingly, LLM agents can significantly boost their negotiation outcomes by employing certain behavioral tactics. For example, by pretending to be desolate and desperate, LLMs can improve their payoffs by 20% when negotiating against the standard GPT-4. We also quantify irrational negotiation behaviors exhibited by the LLM agents, many of which also appear in humans. Together, NegotiationArena offers a new environment to investigate LLM interactions, enabling new insights into LLM's theory of mind, irrationality, and reasoning abilities.

Submitted 8 February, 2024; originally announced February 2024.

arXiv:2304.10621 [pdf, other]

E Pluribus Unum: Guidelines on Multi-Objective Evaluation of Recommender Systems

Authors: Patrick John Chia, Giuseppe Attanasio, Jacopo Tagliabue, Federico Bianchi, Ciro Greco, Gabriel de Souza P. Moreira, Davide Eynard, Fahd Husain

Abstract: Recommender Systems today are still mostly evaluated in terms of accuracy, with other aspects beyond the immediate relevance of recommendations, such as diversity, long-term user retention and fairness, often taking a back seat. Moreover, reconciling multiple performance perspectives is by definition indeterminate, presenting a stumbling block to those in the pursuit of rounded evaluation of Recommender Systems. EvalRS 2022 -- a data challenge designed around Multi-Objective Evaluation -- was a first practical endeavour, providing many insights into the requirements and challenges of balancing multiple objectives in evaluation. In this work, we reflect on EvalRS 2022 and expound upon crucial learnings to formulate a first-principles approach toward Multi-Objective model selection, and outline a set of guidelines for carrying out a Multi-Objective Evaluation challenge, with potential applicability to the problem of rounded evaluation of competing models in real-world deployments.

Submitted 20 April, 2023; originally announced April 2023.

Comments: 15 pages, under submission

arXiv:2304.07145 [pdf, ps, other]

EvalRS 2023. Well-Rounded Recommender Systems For Real-World Deployments

Authors: Federico Bianchi, Patrick John Chia, Ciro Greco, Claudio Pomo, Gabriel Moreira, Davide Eynard, Fahd Husain, Jacopo Tagliabue

Abstract: EvalRS aims to bring together practitioners from industry and academia to foster a debate on rounded evaluation of recommender systems, with a focus on real-world impact across a multitude of deployment scenarios. Recommender systems are often evaluated only through accuracy metrics, which fall short of fully characterizing their generalization capabilities and miss important aspects, such as fairness, bias, usefulness, informativeness. This workshop builds on the success of last year's workshop at CIKM, but with a broader scope and an interactive format.

Submitted 22 July, 2023; v1 submitted 14 April, 2023; originally announced April 2023.

Comments: EvalRS 2023 is a workshop at KDD23. Code and hackathon materials: https://github.com/RecList/evalRS-KDD-2023

arXiv:2207.05772 [pdf, ps, other]

EvalRS: a Rounded Evaluation of Recommender Systems

Authors: Jacopo Tagliabue, Federico Bianchi, Tobias Schnabel, Giuseppe Attanasio, Ciro Greco, Gabriel de Souza P. Moreira, Patrick John Chia

Abstract: Much of the complexity of Recommender Systems (RSs) comes from the fact that they are used as part of more complex applications and affect user experience through a varied range of user interfaces. However, research focused almost exclusively on the ability of RSs to produce accurate item rankings while giving little attention to the evaluation of RS behavior in real-world scenarios. Such narrow focus has limited the capacity of RSs to have a lasting impact in the real world and makes them vulnerable to undesired behavior, such as reinforcing data biases. We propose EvalRS as a new type of challenge, in order to foster this discussion among practitioners and build in the open new methodologies for testing RSs "in the wild".

Submitted 12 August, 2022; v1 submitted 12 July, 2022; originally announced July 2022.

Comments: CIKM 2022 Data Challenge Paper

arXiv:2204.03972 [pdf, other]

Contrastive language and vision learning of general fashion concepts

Authors: Patrick John Chia, Giuseppe Attanasio, Federico Bianchi, Silvia Terragni, Ana Rita Magalhães, Diogo Goncalves, Ciro Greco, Jacopo Tagliabue

Abstract: The steady rise of online shopping goes hand in hand with the development of increasingly complex ML and NLP models. While most use cases are cast as specialized supervised learning problems, we argue that practitioners would greatly benefit from more transferable representations of products. In this work, we build on recent developments in contrastive learning to train FashionCLIP, a CLIP-like model for the fashion industry. We showcase its capabilities for retrieval, classification and grounding, and release our model and code to the community.

Submitted 18 April, 2023; v1 submitted 8 April, 2022; originally announced April 2022.

Comments: Latest version available at https://www.nature.com/articles/s41598-022-23052-9; model available at https://huggingface.co/patrickjohncyh/fashion-clip

arXiv:2204.02473 [pdf, other]

"Does it come in black?" CLIP-like models are zero-shot recommenders

Authors: Patrick John Chia, Jacopo Tagliabue, Federico Bianchi, Ciro Greco, Diogo Goncalves

Abstract: Product discovery is a crucial component for online shopping. However, item-to-item recommendations today do not allow users to explore changes along selected dimensions: given a query item, can a model suggest something similar but in a different color? We consider item recommendations of the comparative nature (e.g. "something darker") and show how CLIP-based models can support this use case in a zero-shot manner. Leveraging a large model built for fashion, we introduce GradREC and its industry potential, and offer a first rounded assessment of its strength and weaknesses.

Submitted 11 April, 2022; v1 submitted 5 April, 2022; originally announced April 2022.

Comments: Accepted at ACL 2022 (ECNLP)

arXiv:2112.00219 [pdf, other]

Scalable Primitives for Generalized Sensor Fusion in Autonomous Vehicles

Authors: Sammy Sidhu, Linda Wang, Tayyab Naseer, Ashish Malhotra, Jay Chia, Aayush Ahuja, Ella Rasmussen, Qiangui Huang, Ray Gao

Abstract: In autonomous driving, there has been an explosion in the use of deep neural networks for perception, prediction and planning tasks. As autonomous vehicles (AVs) move closer to production, multi-modal sensor inputs and heterogeneous vehicle fleets with different sets of sensor platforms are becoming increasingly common in the industry. However, neural network architectures typically target specific sensor platforms and are not robust to changes in input, making the problem of scaling and model deployment particularly difficult. Furthermore, most players still treat the problem of optimizing software and hardware as entirely independent problems. We propose a new end to end architecture, Generalized Sensor Fusion (GSF), which is designed in such a way that both sensor inputs and target tasks are modular and modifiable. This enables AV system designers to easily experiment with different sensor configurations and methods and opens up the ability to deploy on heterogeneous fleets using the same models that are shared across a large engineering organization. Using this system, we report experimental results where we demonstrate near-parity of an expensive high-density (HD) LiDAR sensor with a cheap low-density (LD) LiDAR plus camera setup in the 3D object detection task. This paves the way for the industry to jointly design hardware and software architectures as well as large fleets with heterogeneous configurations.

Submitted 30 November, 2021; originally announced December 2021.

Comments: Presented in Machine Learning for Autonomous Driving Workshop at the 35th Conference on Neural Information Processing Systems (NeurIPS 2021), Sydney, Australia. 11 pages, 8 figures

arXiv:2111.09963 [pdf, other]

doi 10.1145/3487553.3524215

Beyond NDCG: behavioral testing of recommender systems with RecList

Authors: Patrick John Chia, Jacopo Tagliabue, Federico Bianchi, Chloe He, Brian Ko

Abstract: As with most Machine Learning systems, recommender systems are typically evaluated through performance metrics computed over held-out data points. However, real-world behavior is undoubtedly nuanced: ad hoc error analysis and deployment-specific tests must be employed to ensure the desired quality in actual deployments. In this paper, we propose RecList, a behavioral-based testing methodology. RecList organizes recommender systems by use case and introduces a general plug-and-play procedure to scale up behavioral testing. We demonstrate its capabilities by analyzing known algorithms and black-box commercial systems, and we release RecList as an open source, extensible package for the community.

Submitted 27 March, 2022; v1 submitted 18 November, 2021; originally announced November 2021.

Comments: Paper accepted to the WebConf 2022

arXiv:2107.03256 [pdf, other]

"Are you sure?": Preliminary Insights from Scaling Product Comparisons to Multiple Shops

Authors: Patrick John Chia, Bingqing Yu, Jacopo Tagliabue

Abstract: Large eCommerce players introduced comparison tables as a new type of recommendations. However, building comparisons at scale without pre-existing training/taxonomy data remains an open challenge, especially within the operational constraints of shops in the long tail. We present preliminary results from building a comparison pipeline designed to scale in a multi-shop scenario: we describe our design choices and run extensive benchmarks on multiple shops to stress-test it. Finally, we run a small user study on property selection and conclude by discussing potential improvements and highlighting the questions that remain to be addressed.

Submitted 8 July, 2021; v1 submitted 7 July, 2021; originally announced July 2021.

Comments: Accepted for publication at SIGIR eCom 2021

arXiv:2104.09423 [pdf, ps, other]

SIGIR 2021 E-Commerce Workshop Data Challenge

Authors: Jacopo Tagliabue, Ciro Greco, Jean-Francis Roy, Bingqing Yu, Patrick John Chia, Federico Bianchi, Giovanni Cassani

Abstract: The 2021 SIGIR workshop on eCommerce is hosting the Coveo Data Challenge for "In-session prediction for purchase intent and recommendations". The challenge addresses the growing need for reliable predictions within the boundaries of a shopping session, as customer intentions can be different depending on the occasion. The need for efficient procedures for personalization is even clearer if we consider the e-commerce landscape more broadly: outside of giant digital retailers, the constraints of the problem are stricter, due to smaller user bases and the realization that most users are not frequently returning customers. We release a new session-based dataset including more than 30M fine-grained browsing events (product detail, add, purchase), enriched by linguistic behavior (queries made by shoppers, with items clicked and items not clicked after the query) and catalog meta-data (images, text, pricing information). On this dataset, we ask participants to showcase innovative solutions for two open problems: a recommendation task (where a model is shown some events at the start of a session, and it is asked to predict future product interactions); an intent prediction task, where a model is shown a session containing an add-to-cart event, and it is asked to predict whether the item will be bought before the end of the session.

Submitted 16 July, 2021; v1 submitted 19 April, 2021; originally announced April 2021.

Comments: SIGIR eCOM 2021 Data Challenge

Showing 1–10 of 10 results for author: Chia, J