A Simple, Solid, and Reproducible Baseline for Bridge Bidding AI ^†^†thanks: ^∗Equal contributions. 979-8-3503-5067-8/24/$31.00 ©2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. This study was supported by NEDO (JPNP20006) and by JSPS KAKENHI (No. 22H04998 and 23H04676), Japan.

Haruka Kita^∗ Kyoto University
Kyoto, Japan
[email protected] Sotetsu Koyamada^∗ ATR, Kyoto University
Kyoto, Japan
[email protected] Yotaro Yamaguchi LY Corporation
Tokyo, Japan
[email protected] Shin Ishii Kyoto University
Kyoto, Japan
[email protected]

Abstract

Contract bridge, a cooperative game characterized by imperfect information and multi-agent dynamics, poses significant challenges and serves as a critical benchmark in artificial intelligence (AI) research. Success in this domain requires agents to effectively cooperate with their partners. This study demonstrates that an appropriate combination of existing methods can perform surprisingly well in bridge bidding against WBridge5, a leading benchmark in the bridge bidding system and a multiple-time World Computer-Bridge Championship winner. Our approach is notably simple, yet it outperforms the current state-of-the-art methodologies in this field. Furthermore, we have made our code and models publicly available as open-source software. This initiative provides a strong starting foundation for future bridge AI research, facilitating the development and verification of new strategies and advancements in the field.

Index Terms:

reinforcement learning, imperfect information game, multi-agent, contract bridge

I Introduction

Throughout the history of artificial intelligence (AI) research, games have played pivotal roles as benchmarks for measuring progress. AIs have now achieved or even surpassed the skill levels of human experts in a variety of classic games. Notable examples include backgammon [1], chess [2], Go [3, 4, 2], poker [5, 6, 7], mahjong [8], and Atari 2600 [9].

Contract bridge joins the ranks of these classic games as a significant benchmark for AI [10, 11, 12, 13, 14, 15]. It presents complex sets of challenges due to its multi-agent nature, the imperfect information available to players, and the need for both cooperation within teams and competition against the opposing team. Bridge is somewhat akin to the game of Hanabi [16], where information sharing is crucial, though bridge also incorporates the competitive element of playing against another team like DouDiZhu [17]. Despite extensive research efforts, to our best knowledge, no AI has yet been demonstrated to consistently outperform top human players in bridge.

The game of bridge is structured around two main phases: bidding and playing. The bidding phase, in particular, is critical to success in the game [12] and is the focus of our study. Our contributions to this area are twofold:

•

We have discovered that a straightforward integration of existing techniques can achieve state-of-the-art (SOTA) performance in the bidding phase, specifically in tests against WBridge5¹¹1http://www.wbridge5.com/. This program is a multiple-time winner of the World Computer-Bridge Championship (2005, 2007, 2008, and 2016-2018) and serves as the standard benchmark for bridge AI research.
•

To foster further advancements in the field, we have made our code and trained models open-source. This allows our work to be easily reproduced and verified by others, offering a new baseline for future research in bridge AI, beyond the traditional evaluations using WBridge5.

II Background: Contract Bridge Overview

Here, we provide a simplified overview of the game’s flow rather than detailing all its rules. Bridge is a card game for four players, divided into two teams. Each player receives 13 cards from a standard 52-card deck, and these cards are kept secret from the other players. The game unfolds in two main stages: the bidding phase and the playing phase.

•

Bidding phase. In this auction-style stage, players predict how many tricks (sets of four cards, one from each player) their team can win, using bids as a form of communication to signal their hand’s strength and potential to their partner. Additionally, they select a suit to serve as trump, which can override other suits to win tricks. They make bids to set a “contract,” which outlines the number of tricks the team aims to win and identifies the “declarer” (the player who made the bid that established the final contract).
•

Playing phase. Players take turns playing one card at a time, with the highest card of the led suit or trump winning the trick. This process repeats for all 13 tricks.

The team’s score depends on meeting or exceeding their contract in tricks won, with penalties for falling short. Effective communication and strategy are key, as players must signal their hand’s potential to their partner through their bids to form a winning contract.

III Related Work

While advancements like those by Jack²²2https://www.jackbridge.com/eindex.htm, WBridge5, and in the work of Ginsberg et al.[10] have seen AI reach human-level performance in the playing phase, the bidding phase remains a more formidable challenge[12]. This complexity has guided much of the recent focus towards improving AI performance in the bidding aspect of bridge: Yeh et al. [12] pioneered the application of neural networks to bridge bidding, albeit under some simplified conditions such as a restricted number of bids and opponents that always pass. Rong et al. [13] developed a neural network-based bidding system free from these constraints. Their approach included both a policy network for decision-making and an estimation network to predict unseen hands, initially trained on data from human experts and later refined through reinforcement learning (RL) and self-play. Gong et al.[18] were the first to claim the creation of a strong bidding system developed without relying on human game data, achieving significant improvements over WBridge5. They utilized the A3C algorithm[19] for training their policy-value network entirely through self-play. Tian et al. [14] introduced a joint policy search (JPS) algorithm tailored for cooperative games, offering theoretical assurances that JPS-derived policies would at least match the performance of baseline strategies in purely cooperative settings. Despite these guarantees not strictly applying to bridge, their application of JPS led to enhanced bidding strategies. Lockhart et al. [15] focused on develo** AI policies capable of cooperating with human players, achieving SOTA results against WBridge5 through the use of search techniques and policy iteration on a pretrained model. To the best of our knowledge, their work represents the current benchmark in AI performance for bridge bidding. These studies collectively underscore the evolving landscape of AI research in bridge, highlighting a shift from foundational models to sophisticated strategies capable of navigating the game’s intricate dynamics.

IV Methods: Training Recipe

This section outlines the training process for our bridge bidding model, which involves two main stages:

•

Initially, we pretrain the neural network using supervised learning (SL). Further information is given in Sec. IV-B.
•

Next, we enhance the model using the Proximal Policy Optimization (PPO) algorithm [20], a popular reinforcement learning (RL) method, combined with fictitious self-play (FSP) [21]. Details are provided in Sec. IV-C.

IV-A Network Architecture and Input Features

Our model processes a 480-dimensional binary input vector, consistent with standards set by OpenSpiel [22] and Pgx [23]. The input features are detailed in Table I. The network architecture comprises a 4-layer multi-layer perceptron (MLP), each layer containing 1024 neurons and employing ReLU activation functions [24], following the design of Lockhart et al. [15]. Outputs include a policy head for 38 actions (35 bids, pass, double, redouble) and a value head.

TABLE I: Input features.

Feature	Size
Vulnerability	4
Pass before the opening bid	4
For each bid, who made it? (35 4-dim one-hot vector)	140
For each double, who made it? (35 4-dim one-hot vector)	140
For each redouble, who made it? (35 4-dim one-hot vector)	140
Current player’s hand	52
Total	480

IV-B Model Pretraining by Supervised Learning (SL)

Initial training utilizes a dataset from OpenSpiel³³3https://console.cloud.google.com/storage/browser/openspiel-data/bridge, also employed by Lockhart et al. [15]. This dataset, generated with WBridge5 but based on the SAYC bidding system⁴⁴4https://web2.acbl.org/documentlibrary/play/SP3%20(bk)%20single%20pages.pdf, a simple bidding system different from WBridge5’s own system. It includes 1M boards for training and 10K for evaluation, with 12.8M state-action pairs for training and 110K for evaluation. We used Adam [25] with a learning rate of $1.0\times 10^{-4}$ and a batch size of 128, running the training over 40 epochs.

IV-C Reinforcement Learning (RL)

For model enhancement, we applied the PPO algorithm [20], effective in cooperative multi-agent settings [26], and includes A2C as a special case [27]. To mitigate policy cycling common in self-play, we incorporated FSP [21], which samples the opponent uniformly from the checkpoints.

Reward function. Non-zero rewards are assigned only at the end of each game. The reward $z$ is calculated by $z=\text{score}/7600$ , where the score is derived from the double dummy solver (DDS)⁵⁵5https://github.com/dds-bridge/dds, a standard approximator for the playing phase, and 7600 represents the maximum absolute score.

DDS dataset. To bypass real-time DDS calculations during RL, we used a precomputed DDS dataset from Pgx [23], containing 12.5M boards for training and 100K for evaluation.

Invalid action masking. This technique, aimed at preventing the agent from selecting illegal actions, has been widely adopted in AI research; including notable implementations like Suphx [8], OpenAI Five [28], and AlphaStar [29], among others. For detailed insights, see [30].

Other details. Our PPO implementation is a fork of PureJaxRL⁶⁶6https://github.com/luchris429/purejaxrl [31]. After conducting preliminary tests without using the test DDS data, we established the following hyperparameters: 8192 vectorized environments, a rollout length of 32, GAE $\lambda$ of 0.95, a discount factor of 1.0, a clip ratio of 0.2, a value loss coefficient of 0.5, an entropy coefficient of $1.0\times 10^{-3}$ , a batch size of 1024, using Adam, with a learning rate of $1.0\times 10^{-6}$ . We trained the model for $10^{4}$ PPO update steps, in which each step has 10 epochs over rollout data.

V Results

V-A Performance against WBridge5

To assess our model’s effectiveness, trained as described in Sec. IV, we tested it against WBridge5, the leading benchmark in computer bridge. We utilized WBridge5 at its highest difficulty setting and with its native bidding system, which differs from the SAYC system used during our SL pretraining phase. The evaluation comprised 1K games, conducted over a day, reflecting the significant time needed because WBridge5 operates with a GUI and includes a playing phase.

The outcomes, detailed in Table II, also compare our model’s performance with that reported in prior studies. Our approach achieved an average of +1.24 International Match Points (IMPs)⁷⁷7Established in Law 78B: https://web2.acbl.org/documentlibrary/play/laws-of-duplicate-bridge.pdf. per board against WBridge5 across these games, surpassing the previous SOTA performance of +0.85 IMPs/board by Lockhart et al. [15]. This improvement of 0.39 IMPs/board is significant in the context of computer bridge competitiveness [11].

TABLE II: Performance against WBridge5.

Paper	IMPs/board ( $\pm$ SE)	# games
Rong et al. [13]	$+0.25$ ( $\pm$ N/A)	64
Gong et al. [18]	$+0.41$ $(\pm 0.27)$	64
Tian et al. [14]	$+0.63$ $(\pm 0.22)$	1K
Lockhert et al. [15]	$+0.85$ $(\pm 0.05)$	10K
Ours	$\mathbf{+1.24}$ $(\pm 0.19)$	1K

V-B Ablation Study

Our method combines SL pretraining with RL model improvement through FSP. To dissect the contribution of each component, we tested variations of our model lacking one of these elements against WBridge5, with findings summarized in Fig. 1. We used a learning rate 10 times larger for the model from scratch (i.e., w/o SL), as we found that it performs better than the original learning rate in those settings. We also trained the model from scratch with twice the number of steps to compensate for the lack of SL pretraining.

Key observations include:

1.

Removing SL pretraining drastically reduces performance, rendering the model unable to surpass the WBridge5 baseline.
2.

Integrating FSP enhances results post-SL pretraining but is ineffective on its own.

The first insight challenges Gong et al.’s [18] assertion that a model can outperform WBridge5 without SL pretraining, a claim we could not replicate despite extensive hyperparameter testing. We leave further exploration of this discrepancy for future work. We can offer a plausible explanation for the second observation. Starting from scratch, facing a random (or nearly random) opponent policy might slow the learning process. It is important to note that the bidding system used to create the dataset for SL pretraining differs from WBridge5’s system. Therefore, the model enhanced with FSP is not just learning to outperform a version that mimics WBridge5.

Refer to caption — Figure 1: Ablation of each training component.

To verify the mitigation of policy cycling by FSP, we organized a round-robin tournament among training checkpoints. Fig. 2 shows the results. Unlike standard self-play, where some later-stage models might struggle against earlier ones, FSP consistently demonstrated the ability to outperform its predecessors, underscoring its value in stable training.

VI Open-Source Software and Models

Our straightforward approach, as detailed in Sec. IV, has demonstrated SOTA performance against the most recognized benchmark in computer bridge. While effective, this method is not specifically optimized for bridge’s unique aspects, indicating potential areas for enhancement. To encourage continued advancement in bridge AI research, we are releasing our code and trained models as open-source resources:

https://github.com/harukaki/brl

This new baseline aims to overcome certain limitations associated with the current WBridge5 benchmark:

1.

Slow WBridge5 evaluation. Primarily designed for human interaction, WBridge5’s evaluation process, which relies on GUI operations and includes a playing phase, is notably time-consuming and resource-intensive. This was highlighted by Rong et al. [13], who manually tested their model against WBridge5.
2.

Potential weakness of WBridge5. As evidenced in Table II, recent advancements have significantly outperformed WBridge5, raising questions about the benchmark’s current competitiveness. Moreover, fairness in evaluation is a concern since WBridge5 does not incorporate DDS strategies, although recent studies trained their models with DDS datasets.

By addressing these issues, our baseline not only offers a more efficient and equitable framework for assessment but also enhances the diversity of bidding systems under consideration.

VII Limitations, Future Work, and Conclusion

Our study demonstrates that straightforward integration of existing techniques can outperform WBridge5, a leading benchmark in computer bridge bidding systems. However, our approach relies on SL pretraining to surpass WBridge5, contrasting with Gong et al. [18], who claimed to achieve superior results without SL, using only RL from scratch. Exploring the reasons behind this discrepancy presents a valuable opportunity for future research.

Additionally, our methodology, while effective, is not specifically designed with the unique aspects of bridge in mind. This suggests there may be room for further optimization and refinement tailored to bridge’s strategic complexities.

Despite these limitations, we are confident our work lays a solid foundation for subsequent studies in bridge AI. By providing our code and models as open-source resources, we aim to facilitate the development of more advanced AI systems capable of exceeding human expertise in bridge.

References

[1] G. Tesauro, “Temporal difference learning and TD-Gammon,” Commun. ACM, vol. 38, no. 3, pp. 58–68, 1995.
[2] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis, “A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play,” Science, vol. 362, no. 6419, pp. 1140–1144, 2018.
[3] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[4] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis, “Mastering the game of Go without human knowledge,” Nature, vol. 550, no. 7676, pp. 354–359, 2017.
[5] M. Moravčík, M. Schmid, N. Burch, V. Lisỳ, D. Morrill, N. Bard, T. Davis, K. Waugh, M. Johanson, and M. Bowling, “Deepstack: Expert-level artificial intelligence in heads-up no-limit poker,” Science, vol. 356, no. 6337, pp. 508–513, 2017.
[6] N. Brown and T. Sandholm, “Superhuman AI for heads-up no-limit poker: Libratus beats top professionals,” Science, vol. 359, no. 6374, pp. 418–424, 2018.
[7] ——, “Superhuman AI for multiplayer poker,” Science, vol. 365, no. 6456, pp. 885–890, 2019.
[8] J. Li, S. Koyamada, Q. Ye, G. Liu, C. Wang, R. Yang, L. Zhao, T. Qin, T.-Y. Liu, and H.-W. Hon, “Suphx: Mastering Mahjong with Deep Reinforcement Learning,” arXiv:2003.13590, 2020.
[9] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[10] M. L. Ginsberg, “GIB: Steps Toward an Expert-Level Bridge-Playing Program,” in IJCAI, 1999.
[11] V. Ventos, Y. Costel, O. Teytaud, and S. T. Ventos, “Boosting a Bridge Artificial Intelligence,” in IEEE ICTAI, 2017.
[12] C.-K. Yeh, C.-Y. Hsieh, and H.-T. Lin, “Automatic bridge bidding using deep reinforcement learning,” IEEE ToG, vol. 10, no. 4, pp. 365–377, 2018.
[13] J. Rong, T. Qin, and B. An, “Competitive Bridge Bidding with Deep Neural Networks,” in AAMAS, 2019.
[14] Y. Tian, Q. Gong, and Y. Jiang, “Joint Policy Search for Multi-agent Collaboration with Imperfect Information,” in NeurIPS, 2020.
[15] E. Lockhart, N. Burch, N. Bard, S. Borgeaud, T. Eccles, L. Smaira, and R. Smith, “Human-Agent Cooperation in Bridge Bidding,” in Cooperative AI Workshops at NeurIPS, 2020.
[16] N. Bard, J. N. Foerster, S. Chandar, N. Burch, M. Lanctot, H. F. Song, E. Parisotto, V. Dumoulin, S. Moitra, E. Hughes et al., “The Hanabi challenge: A new frontier for AI research,” Artificial Intelligence, vol. 280, p. 103216, 2020.
[17] D. Zha, J. Xie, W. Ma, S. Zhang, X. Lian, X. Hu, and J. Liu, “Douzero: Mastering doudizhu with self-play deep reinforcement learning,” in ICML, 2021.
[18] Q. Gong, Y. Jiang, and Y. Tian, “Simple is better: Training an end-to-end contract bridge bidding agent without human knowledge,” in Real-world Sequential Decision Making Workshop at ICML, 2019.
[19] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in ICML, 2016.
[20] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal Policy Optimization Algorithms,” arXiv:1707.06347, 2017.
[21] J. Heinrich, M. Lanctot, and D. Silver, “Fictitious self-play in extensive-form games,” in ICML, 2015.
[22] M. Lanctot, E. Lockhart, J.-B. Lespiau, V. Zambaldi, S. Upadhyay, J. Pérolat, S. Srinivasan, F. Timbers, K. Tuyls, S. Omidshafiei, D. Hennes, D. Morrill, P. Muller, T. Ewalds, R. Faulkner, J. Kramár, B. De Vylder, B. Saeta, J. Bradbury, D. Ding, S. Borgeaud, M. Lai, J. Schrittwieser, T. Anthony, E. Hughes, I. Danihelka, and J. Ryan-Davis, “OpenSpiel: A Framework for Reinforcement Learning in Games,” arXiv:1908.09453, 2019.
[23] S. Koyamada, S. Okano, S. Nishimori, Y. Murata, K. Habara, H. Kita, and S. Ishii, “Pgx: Hardware-Accelerated Parallel Game Simulators for Reinforcement Learning,” in NeurIPS, 2023.
[24] X. Glorot, A. Bordes, and Y. Bengio, “Deep Sparse Rectifier Neural Networks,” in AISTATS, 2011.
[25] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” in ICLR, 2015.
[26] C. Yu, A. Velu, E. Vinitsky, J. Gao, Y. Wang, A. Bayen, and Y. WU, “The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games,” in NeurIPS, 2022.
[27] S. Huang, A. Kanervisto, A. Raffin, W. Wang, S. Ontañón, and R. F. J. Dossa, “A2C is a special case of PPO,” arXiv:2205.09123, 2022.
[28] C. Berner, G. Brockman, B. Chan, V. Cheung, P. Debiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse, R. Józefowicz, S. Gray, C. Olsson, J. Pachocki, M. Petrov, H. P. de Oliveira Pinto, J. Raiman, T. Salimans, J. Schlatter, J. Schneider, S. Sidor, I. Sutskever, J. Tang, F. Wolski, and S. Zhang, “Dota 2 with large scale deep reinforcement learning,” arXiv:1912.06680, 2019.
[29] O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, J. Oh, D. Horgan, M. Kroiss, I. Danihelka, A. Huang, L. Sifre, T. Cai, J. P. Agapiou, M. Jaderberg, A. S. Vezhnevets, R. Leblond, T. Pohlen, V. Dalibard, D. Budden, Y. Sulsky, J. Molloy, T. L. Paine, C. Gulcehre, Z. Wang, T. Pfaff, Y. Wu, R. Ring, D. Yogatama, K. McKinney, O. Smith, T. Schaul, T. Lillicrap, K. Kavukcuoglu, D. Hassabis, C. Apps, and D. Silver, “Grandmaster level in StarCraft II using multi-agent reinforcement learning,” Nature, vol. 575, no. 7782, pp. 350–354, 2019.
[30] S. Huang and S. Ontañón, “A Closer Look at Invalid Action Masking in Policy Gradient Algorithms,” in FLAIRS, 2022.
[31] C. Lu, J. Kuba, A. Letcher, L. Metz, C. Schroeder de Witt, and J. Foerster, “Discovered Policy Optimisation,” in NeurIPS, 2022.