How Well Can LLMs Negotiate? NegotiationArena Platform and Analysis

Authors: Federico Bianchi, Patrick John Chia, Mert Yuksekgonul, Jacopo Tagliabue, Dan Jurafsky, James Zou

Abstract: Negotiation is the basis of social interactions; humans negotiate everything from the price of cars to how to share common resources. With rapidly growing interest in using large language models (LLMs) to act as agents on behalf of human users, such LLM agents would also need to be able to negotiate. In this paper, we study how well LLMs can negotiate with each other. We develop NegotiationArena: a flexible framework for evaluating and probing the negotiation abilities of LLM agents. We implemented three types of scenarios in NegotiationArena to assess LLMs' behaviors in allocating shared resources (ultimatum games), aggregate resources (trading games) and buy/sell goods (price negotiations). Each scenario allows for multiple turns of flexible dialogues between LLM agents to allow for more complex negotiations. Interestingly, LLM agents can significantly boost their negotiation outcomes by employing certain behavioral tactics. For example, by pretending to be desolate and desperate, LLMs can improve their payoffs by 20% when negotiating against the standard GPT-4. We also quantify irrational negotiation behaviors exhibited by the LLM agents, many of which also appear in humans. Together, NegotiationArena offers a new environment to investigate LLM interactions, enabling new insights into LLMs' theory of mind, irrationality, and reasoning abilities.

Submitted 8 February, 2024; originally announced February 2024.

arXiv:2304.10621 [pdf, other]

E Pluribus Unum: Guidelines on Multi-Objective Evaluation of Recommender Systems

Authors: Patrick John Chia, Giuseppe Attanasio, Jacopo Tagliabue, Federico Bianchi, Ciro Greco, Gabriel de Souza P. Moreira, Davide Eynard, Fahd Husain

Abstract: Recommender Systems today are still mostly evaluated in terms of accuracy, with other aspects beyond the immediate relevance of recommendations, such as diversity, long-term user retention and fairness, often taking a back seat. Moreover, reconciling multiple performance perspectives is by definition indeterminate, presenting a stumbling block to those in the pursuit of rounded evaluation of Recommender Systems. EvalRS 2022 -- a data challenge designed around Multi-Objective Evaluation -- was a first practical endeavour, providing many insights into the requirements and challenges of balancing multiple objectives in evaluation. In this work, we reflect on EvalRS 2022 and expound upon crucial learnings to formulate a first-principles approach toward Multi-Objective model selection, and outline a set of guidelines for carrying out a Multi-Objective Evaluation challenge, with potential applicability to the problem of rounded evaluation of competing models in real-world deployments.

Submitted 20 April, 2023; originally announced April 2023.

Comments: 15 pages, under submission

arXiv:2304.07145 [pdf, ps, other]

EvalRS 2023. Well-Rounded Recommender Systems For Real-World Deployments

Authors: Federico Bianchi, Patrick John Chia, Ciro Greco, Claudio Pomo, Gabriel Moreira, Davide Eynard, Fahd Husain, Jacopo Tagliabue

Abstract: EvalRS aims to bring together practitioners from industry and academia to foster a debate on rounded evaluation of recommender systems, with a focus on real-world impact across a multitude of deployment scenarios. Recommender systems are often evaluated only through accuracy metrics, which fall short of fully characterizing their generalization capabilities and miss important aspects, such as fairness, bias, usefulness, informativeness. This workshop builds on the success of last year's workshop at CIKM, but with a broader scope and an interactive format.

Submitted 22 July, 2023; v1 submitted 14 April, 2023; originally announced April 2023.

Comments: EvalRS 2023 is a workshop at KDD23. Code and hackathon materials: https://github.com/RecList/evalRS-KDD-2023

arXiv:2301.00213 [pdf]

Electrically Sign-Reversible Topological Hall Effect in a Top-Gated Topological Insulator (Bi,Sb)2Te3 on a Ferrimagnetic Insulator Europium Iron Garnet

Authors: Jyun-Fong Wong, Ko-Hsuan Mandy Chen, Jui-Min Chia, Zih-** Huang, Sheng-Xin Wang, Pei-Tze Chen, Lawrence Boyu Young, Yen-Hsun Glen Lin, Shang-Fan Lee, Chung-Yu Mou, Minghwei Hong, Jueinai Kwo

Abstract: Topological Hall effect (THE), an electrical transport signature of systems with chiral spin textures like skyrmions, has been observed recently in topological insulator (TI)-based magnetic heterostructures. However, the intriguing interplay between the topological surface state and THE is yet to be fully understood. In this work, we report a large THE of ~10 ohm (~4 micro-ohm*cm) at 2 K with an electrically reversible sign in a top-gated 4 nm TI (Bi0.3Sb0.7)2Te3 (BST) grown on a ferrimagnetic insulator (FI) europium iron garnet (EuIG). Temperature, external magnetic field angle, and top gate bias dependences of magnetotransport properties were investigated and consistent with a skyrmion-driven THE. Most importantly, a sign change in THE was discovered as the Fermi level was tuned from the upper to the lower parts of the gapped Dirac cone and vice versa. This discovery is anticipated to impact technological applications in ultralow power skyrmion-based spintronics.

Submitted 13 April, 2023; v1 submitted 31 December, 2022; originally announced January 2023.

arXiv:2207.05772 [pdf, ps, other]

EvalRS: a Rounded Evaluation of Recommender Systems

Authors: Jacopo Tagliabue, Federico Bianchi, Tobias Schnabel, Giuseppe Attanasio, Ciro Greco, Gabriel de Souza P. Moreira, Patrick John Chia

Abstract: Much of the complexity of Recommender Systems (RSs) comes from the fact that they are used as part of more complex applications and affect user experience through a varied range of user interfaces. However, research has focused almost exclusively on the ability of RSs to produce accurate item rankings while giving little attention to the evaluation of RS behavior in real-world scenarios. Such a narrow focus has limited the capacity of RSs to have a lasting impact in the real world and makes them vulnerable to undesired behavior, such as reinforcing data biases. We propose EvalRS as a new type of challenge, in order to foster this discussion among practitioners and build in the open new methodologies for testing RSs "in the wild".

Submitted 12 August, 2022; v1 submitted 12 July, 2022; originally announced July 2022.

Comments: CIKM 2022 Data Challenge Paper

arXiv:2204.03972 [pdf, other]

Contrastive language and vision learning of general fashion concepts

Authors: Patrick John Chia, Giuseppe Attanasio, Federico Bianchi, Silvia Terragni, Ana Rita Magalhães, Diogo Goncalves, Ciro Greco, Jacopo Tagliabue

Abstract: The steady rise of online shopping goes hand in hand with the development of increasingly complex ML and NLP models. While most use cases are cast as specialized supervised learning problems, we argue that practitioners would greatly benefit from more transferable representations of products. In this work, we build on recent developments in contrastive learning to train FashionCLIP, a CLIP-like model for the fashion industry. We showcase its capabilities for retrieval, classification and grounding, and release our model and code to the community.

Submitted 18 April, 2023; v1 submitted 8 April, 2022; originally announced April 2022.

Comments: Latest version available at https://www.nature.com/articles/s41598-022-23052-9; model available at https://huggingface.co/patrickjohncyh/fashion-clip

arXiv:2204.02473 [pdf, other]

"Does it come in black?" CLIP-like models are zero-shot recommenders

Authors: Patrick John Chia, Jacopo Tagliabue, Federico Bianchi, Ciro Greco, Diogo Goncalves

Abstract: Product discovery is a crucial component for online shopping. However, item-to-item recommendations today do not allow users to explore changes along selected dimensions: given a query item, can a model suggest something similar but in a different color? We consider item recommendations of a comparative nature (e.g. "something darker") and show how CLIP-based models can support this use case in a zero-shot manner. Leveraging a large model built for fashion, we introduce GradREC and its industry potential, and offer a first rounded assessment of its strengths and weaknesses.

Submitted 11 April, 2022; v1 submitted 5 April, 2022; originally announced April 2022.

Comments: Accepted at ACL 2022 (ECNLP)

arXiv:2112.00219 [pdf, other]

Scalable Primitives for Generalized Sensor Fusion in Autonomous Vehicles

Authors: Sammy Sidhu, Linda Wang, Tayyab Naseer, Ashish Malhotra, Jay Chia, Aayush Ahuja, Ella Rasmussen, Qiangui Huang, Ray Gao

Abstract: In autonomous driving, there has been an explosion in the use of deep neural networks for perception, prediction and planning tasks. As autonomous vehicles (AVs) move closer to production, multi-modal sensor inputs and heterogeneous vehicle fleets with different sets of sensor platforms are becoming increasingly common in the industry. However, neural network architectures typically target specific sensor platforms and are not robust to changes in input, making the problem of scaling and model deployment particularly difficult. Furthermore, most players still treat the problem of optimizing software and hardware as entirely independent problems. We propose a new end to end architecture, Generalized Sensor Fusion (GSF), which is designed in such a way that both sensor inputs and target tasks are modular and modifiable. This enables AV system designers to easily experiment with different sensor configurations and methods and opens up the ability to deploy on heterogeneous fleets using the same models that are shared across a large engineering organization. Using this system, we report experimental results where we demonstrate near-parity of an expensive high-density (HD) LiDAR sensor with a cheap low-density (LD) LiDAR plus camera setup in the 3D object detection task. This paves the way for the industry to jointly design hardware and software architectures as well as large fleets with heterogeneous configurations.

Submitted 30 November, 2021; originally announced December 2021.

Comments: Presented in Machine Learning for Autonomous Driving Workshop at the 35th Conference on Neural Information Processing Systems (NeurIPS 2021), Sydney, Australia. 11 pages, 8 figures

arXiv:2111.09963 [pdf, other]

doi 10.1145/3487553.3524215

Beyond NDCG: behavioral testing of recommender systems with RecList

Authors: Patrick John Chia, Jacopo Tagliabue, Federico Bianchi, Chloe He, Brian Ko

Abstract: As with most Machine Learning systems, recommender systems are typically evaluated through performance metrics computed over held-out data points. However, real-world behavior is undoubtedly nuanced: ad hoc error analysis and deployment-specific tests must be employed to ensure the desired quality in actual deployments. In this paper, we propose RecList, a behavioral-based testing methodology. RecList organizes recommender systems by use case and introduces a general plug-and-play procedure to scale up behavioral testing. We demonstrate its capabilities by analyzing known algorithms and black-box commercial systems, and we release RecList as an open source, extensible package for the community.

Submitted 27 March, 2022; v1 submitted 18 November, 2021; originally announced November 2021.

Comments: Paper accepted to the WebConf 2022

arXiv:2107.03256 [pdf, other]

"Are you sure?": Preliminary Insights from Scaling Product Comparisons to Multiple Shops

Authors: Patrick John Chia, Bingqing Yu, Jacopo Tagliabue

Abstract: Large eCommerce players introduced comparison tables as a new type of recommendation. However, building comparisons at scale without pre-existing training/taxonomy data remains an open challenge, especially within the operational constraints of shops in the long tail. We present preliminary results from building a comparison pipeline designed to scale in a multi-shop scenario: we describe our design choices and run extensive benchmarks on multiple shops to stress-test it. Finally, we run a small user study on property selection and conclude by discussing potential improvements and highlighting the questions that remain to be addressed.

Submitted 8 July, 2021; v1 submitted 7 July, 2021; originally announced July 2021.

Comments: Accepted for publication at SIGIR eCom 2021

arXiv:2104.09423 [pdf, ps, other]

SIGIR 2021 E-Commerce Workshop Data Challenge

Authors: Jacopo Tagliabue, Ciro Greco, Jean-Francis Roy, Bingqing Yu, Patrick John Chia, Federico Bianchi, Giovanni Cassani

Abstract: The 2021 SIGIR workshop on eCommerce is hosting the Coveo Data Challenge for "In-session prediction for purchase intent and recommendations". The challenge addresses the growing need for reliable predictions within the boundaries of a shopping session, as customer intentions can be different depending on the occasion. The need for efficient procedures for personalization is even clearer if we consider the e-commerce landscape more broadly: outside of giant digital retailers, the constraints of the problem are stricter, due to smaller user bases and the realization that most users are not frequently returning customers. We release a new session-based dataset including more than 30M fine-grained browsing events (product detail, add, purchase), enriched by linguistic behavior (queries made by shoppers, with items clicked and items not clicked after the query) and catalog meta-data (images, text, pricing information). On this dataset, we ask participants to showcase innovative solutions for two open problems: a recommendation task (where a model is shown some events at the start of a session, and it is asked to predict future product interactions); an intent prediction task, where a model is shown a session containing an add-to-cart event, and it is asked to predict whether the item will be bought before the end of the session.

Submitted 16 July, 2021; v1 submitted 19 April, 2021; originally announced April 2021.

Comments: SIGIR eCOM 2021 Data Challenge

arXiv:2103.16487 [pdf]

Enormous Berry-Curvature-Driven Anomalous Hall Effect in Topological Insulator (Bi,Sb)2Te3 on Ferrimagnetic Europium Iron Garnet beyond 400 K

Authors: Wei-Jhih Zou, Meng-Xin Guo, Jyun-Fong Wong, Zih-** Huang, Jui-Min Chia, Wei-Nien Chen, Sheng-Xin Wang, Keng-Yung Lin, Lawrence Boyu Young, Yen-Hsun Glen Lin, Mohammad Yahyavi, Chien-Ting Wu, Horng-Tay Jeng, Shang-Fan Lee, Tay-Rong Chang, Minghwei Hong, Jueinai Kwo

Abstract: To realize the quantum anomalous Hall effect (QAHE) at elevated temperatures, the approach of magnetic proximity effect (MPE) was adopted to break the time-reversal symmetry in the topological insulator (Bi0.3Sb0.7)2Te3 (BST) based heterostructures with a ferrimagnetic insulator europium iron garnet (EuIG) of perpendicular magnetic anisotropy. Here we demonstrate phenomenally large anomalous Hall resistance (RAHE) exceeding 8 Ω (ρAHE of 3.2 μΩ*cm) at 300 K and sustaining to 400 K in 35 BST/EuIG samples, surpassing the past record of 0.28 Ω (ρAHE of 0.14 μΩ*cm) at 300 K. The remarkably large RAHE is attributed to an atomically abrupt, Fe-rich interface between BST and EuIG. Importantly, the gate dependence of the AHE loops shows no sign change with varying chemical potential. This observation is supported by our first-principles calculations via applying a gradient Zeeman field plus a contact potential on BST. Our calculations further demonstrate that the AHE in this heterostructure is attributed to the intrinsic Berry curvature. Furthermore, for gate-biased 4 nm BST on EuIG, a pronounced topological Hall effect (THE) coexisting with AHE is observed at the negative top-gate voltage up to 15 K. Interface tuning with theoretical calculations has opened up new opportunities to realize topologically distinct phenomena in tailored magnetic TI-based heterostructures.

Submitted 30 September, 2021; v1 submitted 30 March, 2021; originally announced March 2021.

Comments: 66 pages, 16 figures

arXiv:2102.12173 [pdf]

Deep learning-based framework for cardiac function assessment in embryonic zebrafish from heart beating videos

Authors: Amir Mohammad Naderi, Haisong Bu, **gcheng Su, Mao-Hsiang Huang, Khuong Vo, Ramses Seferino Trigo Torres, J. -C. Chiao, Juhyun Lee, Michael P. H. Lau, Xiaolei Xu, Hung Cao

Abstract: Zebrafish is a powerful and widely-used model system for a host of biological investigations including cardiovascular studies and genetic screening. Zebrafish are readily assessable during developmental stages; however, the current methods for quantification and monitoring of cardiac functions mostly involve tedious manual work and inconsistent estimations. In this paper, we developed and validated a Zebrafish Automatic Cardiovascular Assessment Framework (ZACAF) based on a U-net deep learning model for automated assessment of cardiovascular indices, such as ejection fraction (EF) and fractional shortening (FS) from microscopic videos of wildtype and cardiomyopathy mutant zebrafish embryos. Our approach yielded favorable performance with accuracy above 90% compared with manual processing. We used only black and white regular microscopic recordings with frame rates of 5-20 frames per second (fps); thus, the framework could be widely applicable with any laboratory resources and infrastructure. Most importantly, the automatic feature holds promise to enable efficient, consistent and reliable processing and analysis capacity for large amounts of videos, which can be generated by diverse collaborating teams.

Submitted 24 February, 2021; originally announced February 2021.

arXiv:nlin/0306028 [pdf]

Steady State Multiplicity in a Polymer Electrolyte Membrane Fuel Cell

Authors: Ee-Sunn J. Chia, Jay B. Benziger, Ioannis G. Kevrekidis

Abstract: A simplified differential reactor model that embodies the essential physics controlling PEM fuel cell (PEM-FC) dynamics is presented. A remarkable analogy exists between water management in the differential PEM-FC and energy balance in the classical exothermic stirred tank reactor. Water, the reaction product in the PEM-FC, autocatalytically accelerates the reaction rate by enhancing proton transport through the PEM. Established analyses of heat autocatalyticity in a CSTR are modified to present water management autocatalyticity in a stirred tank reactor PEM-FC.

Submitted 16 June, 2003; originally announced June 2003.

Comments: 8 pages, 4 figures

Showing 1–14 of 14 results for author: Chia, J