Whose wife is it anyway?
Assessing bias against same-gender relationships in machine translation

Ian Stewart
Pacific Northwest National Laboratory
[email protected]
Rada Mihalcea
University of Michigan
[email protected]

Abstract

Machine translation often suffers from biased data and algorithms that can lead to unacceptable errors in system output. While bias in gender norms has been investigated, less is known about whether MT systems encode bias about social relationships, e.g. sentences such as “the lawyer kissed her wife”. We investigate the degree of bias against same-gender relationships in MT systems, using generated template sentences drawn from several noun-gender languages (e.g. Spanish). We find that three popular MT services consistently fail to accurately translate sentences concerning relationships between nouns of the same gender. The error rate varies considerably based on the context, e.g. same-gender sentences referencing high female-representation occupations are translated with lower accuracy. We provide this work as a case study in the evaluation of intrinsic bias in NLP systems, with respect to social relationships.

1 Introduction

Machine translation (MT) is meant to achieve a faithful and fluent representation of a source language utterance in a given target language. While NLP research continues to improve the accuracy and robustness of MT systems Lai et al. (2022); Liu et al. (2020), the full space of possible translation failures remains to be determined, particularly with respect to gender Stanovsky et al. (2019). One example is how MT systems may generate masculine-gender words as the default for gendered languages Savoldi et al. (2021), which led Google Translate to provide simultaneous translations for all genders.

Figure 1: Example translation error of same-gender sentence between English and Spanish (Google Translate; accessed 1 November 2023).

Focusing on word-based bias in MT is a good start, but translation systems may also exhibit grammatical bias that can reflect social stereotypes Savoldi et al. (2021). We show an example of grammatical bias in Figure 1, where a sentence containing a same-gender relationship (“the lawyer kissed his husband”) is re-translated as a sentence with a different-gender relationship (“her husband”). This error seems to reveal the model’s bias toward fluent translation at the cost of faithfulness Feng et al. (2020): the output sentence has higher likelihood in the target language (“her husband”) but may not preserve the meaning of the source sentence. Furthermore, this kind of error can only be brought to light by focusing on relationships between entities, an issue that is more complicated than bias in individual entities like “doctor” but equally important. Addressing bias in translation of relationships is important for some social groups such as LGBTQ people, who often face discrimination for engaging in relationships with partners of the same gender Poushter and Kent (2020).

This study presents an analysis of the discrepancy in representation of same-gender vs. different-gender relationships in multilingual data and translation systems, focusing on languages with noun gender-marking. Our study proceeds as follows:

• We generate a curated data set of sentence templates for a variety of relationships in French, Italian, and Spanish, using the format “OCCUPATION RELATIONSHIP-VERB HIS/HER RELATIONSHIP-TARGET” (§ 3.1).
• We test several leading MT models on this dataset, and we find a consistent bias against same-gender relationships when translating to English (§ 3.2).
• We assess possible correlates of bias using social factors and find that sentences containing occupations with higher income have lower accuracy for same-gender relationships (§ 3.2.1).

This study not only highlights latent bias in MT, it also addresses the need to assess complex social constructs as part of bias testing, including relationships. Diagnosing and addressing this kind of bias can ensure that the needs of minority groups are addressed in the evaluation of common NLP methods Blodgett et al. (2020).

2 Related work

Traditionally, research in ML-related bias has focused on well-established social demographics that are protected by law such as gender, race, and religion Field et al. (2021); Nadeem et al. (2021); Rudinger et al. (2018). While demographics play an important role in understanding bias, many other facets of social identity can also explain what language models learn about humans from text Hovy and Yang (2021). Of particular interest is the representation of social relationships, including power dynamics Prabhakaran et al. (2012), friendship Krishnan and Eisenstein (2015), and romantic relationships Seraj et al. (2021). A system that can accurately process these relationships will require not just a piecewise understanding of the individual concepts (e.g. “man” and “woman”) but also a broad understanding of the social norms around the interactions between individuals (e.g. why two unrelated adults decide to live together) Bosselut et al. (2019); Choi et al. (2020).

While norms around social relationships vary widely between societies Miller et al. (2017), it is reasonable to assume that NLP systems should process romantic relationships consistently regardless of the demographics of the participants. Furthermore, relationships can form an important part of social identity for many people Sky and Jurgens (2021), including LGBTQ people whose self-image may be negatively impacted by stereotypes about their relationships Park et al. (2021). To fill the gap in the space of relationship-related bias, this study offers a path forward in assessing bias against same-gender relationships in NLP systems.

Word category	Examples	Count
Occupation	el abogado (M; “lawyer”); la abogada (F)	100
Relationship template	X besó a Y (“X kissed Y”)	5
Relationship target	el novio (M; “boyfriend”); la novia (F; “girlfriend”)	6
Sentence	El abogado besó a su novio. (“The lawyer kissed his boyfriend.”)	3000

Table 1: Summary of relationship sentences, for a single source language.

3 Assessing biases in relationship representation

3.1 Data generation

This study evaluates the presence of bias for same vs. different-gender relationships in machine translation. To our knowledge, prior work in MT has not developed a data set specifically to handle relationships based on pairs of grammatical gender, although some prior work has included relationships as part of their data in assessment of gender bias Kocmi et al. (2020); Troles and Schmid (2021). We therefore develop our own using a simple set of fixed sample sentences.

We generate a variety of sample sentences to test the ability of multilingual models to process human relationships. We begin with sentence templates that describe common activities in romantic relationships ranging from casual to serious, e.g. “X met Y on a date,” where each template has a subject X and an object Y. We fill in the subject position of the templates with occupation nouns which have different male and female versions in the source languages, e.g. Spanish “panadero” (“baker,” male) vs. “panadera” (female). The list of occupations is taken from a prior study of gender bias Gonen and Goldberg (2019).

We fill the object position of the templates with relationship targets, i.e. FRIEND (boy/girlfriend), ENGAGE (fiancé(e)), and SPOUSE (husband/wife). This procedure generates example sentences such as “El autor conoció a su esposo en una cita” (“The author met his husband on a date”). For each language we generate up to 3000 sentences to cover every combination of occupation, gender, template, and target; a summary is shown in Table 1.
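The generation procedure can be sketched as a Cartesian product over occupations, templates, and relationship targets. The snippet below is a minimal illustration for Spanish only, not the code used in the study: the word lists are a small hand-picked subset, the fiancé(e) nouns “prometido/prometida” and all variable and function names are our own assumptions, and the real lists are larger (see Table 1).

```python
from itertools import product

# Illustrative subset only; the full study uses 100 occupation forms,
# 5 templates, and 6 relationship targets per language (Table 1).
OCCUPATIONS = [
    ("el abogado", "M"), ("la abogada", "F"),
    ("el panadero", "M"), ("la panadera", "F"),
]
TEMPLATES = [
    "{subject} besó a su {target}.",
    "{subject} conoció a su {target} en una cita.",
]
# (noun, grammatical gender, target category); "prometido/a" is an assumed choice.
TARGETS = [
    ("novio", "M", "FRIEND"), ("novia", "F", "FRIEND"),
    ("prometido", "M", "ENGAGE"), ("prometida", "F", "ENGAGE"),
    ("esposo", "M", "SPOUSE"), ("esposa", "F", "SPOUSE"),
]


def generate_sentences():
    """Fill every template with every occupation/target combination."""
    rows = []
    for (subject, subj_gender), template, (target, tgt_gender, category) in product(
        OCCUPATIONS, TEMPLATES, TARGETS
    ):
        text = template.format(subject=subject, target=target)
        text = text[0].upper() + text[1:]  # capitalize the sentence-initial article
        rows.append({
            "text": text,
            "subject_gender": subj_gender,
            "target_gender": tgt_gender,
            "target_category": category,
            "same_gender": subj_gender == tgt_gender,
        })
    return rows


if __name__ == "__main__":
    for row in generate_sentences()[:3]:
        print(row["text"], row["same_gender"])
```

Because the Spanish possessive “su” is not marked for gender, the same template string serves both male and female subjects; only the subject and target nouns carry gender.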

3.2 Same-gender bias in translation

We test the ability of publicly available MT models to faithfully translate text about same-gender relationships. While we cannot cover all available translation services, we focus on several of the most popular services available to developers: Google Cloud Translation, Amazon Translate, and Microsoft Azure AI Translator Amazon (2023); Google (2023); Microsoft (2023).

We provide each generated sentence $S$ to the translation model with English as the target language. We count a translation as correct if the gender of the English possessive pronoun in the translated sentence matches the gender of the subject noun in the source language sentence. For example, for the Spanish sentence “la abogada besó a su esposa,” we expect the translated English sentence to contain the pronoun “her,” as in “the lawyer kissed her wife.”
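The correctness criterion above can be sketched as a simple pronoun check. This is a minimal illustration under our own assumptions, not the evaluation script of the study: the function names are hypothetical, only “his” and “her” are recognized, and any output without one of these pronouns (e.g. “their”) is treated as incorrect.

```python
import re

# Map English possessive pronouns to the gender labels used for subjects.
PRONOUN_GENDER = {"his": "M", "her": "F"}


def possessive_gender(english_sentence):
    """Return 'M' or 'F' for the first his/her token, or None if neither occurs."""
    for token in re.findall(r"[a-z']+", english_sentence.lower()):
        if token in PRONOUN_GENDER:
            return PRONOUN_GENDER[token]
    return None


def is_correct(subject_gender, english_sentence):
    """A translation counts as correct if the pronoun gender matches the subject gender."""
    return possessive_gender(english_sentence) == subject_gender


if __name__ == "__main__":
    # Source: "La abogada besó a su esposa." (female subject)
    print(is_correct("F", "The lawyer kissed her wife."))
    print(is_correct("F", "The lawyer kissed his wife."))
```

The check relies on the template structure: each generated sentence contains exactly one possessive, so the first “his” or “her” in the output can be read as the translation of “su”.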

We show the aggregate results in 1(a). The translation systems produce the correct subject gender at a lower rate for same-gender relationships than for different-gender relationships (1(a)). Furthermore, the accuracy is slightly better for female same-gender relationships than for male same-gender relationships (1(b)), which may mean that the MT models were trained with data that associated female gender in English with masculine-gender occupations (e.g. translating masculine “abogado” as a female “lawyer”).
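The aggregate comparison amounts to grouping translations by relationship type and averaging the correctness indicator. The sketch below shows one way to do this; the record layout and function names are assumptions made for illustration, and the records in the usage example are toy values, not results from the study.

```python
from collections import defaultdict


def relationship_type(subject_gender, target_gender):
    """Label a sentence by comparing the grammatical gender of subject and target."""
    return "same-gender" if subject_gender == target_gender else "different-gender"


def accuracy_by_type(records):
    """records: iterable of (subject_gender, target_gender, is_correct) tuples."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for subject_gender, target_gender, is_correct in records:
        key = relationship_type(subject_gender, target_gender)
        totals[key] += 1
        correct[key] += int(is_correct)
    return {key: correct[key] / totals[key] for key in totals}


if __name__ == "__main__":
    # Toy records for illustration only.
    toy = [("M", "M", False), ("M", "M", True), ("F", "M", True), ("M", "F", True)]
    print(accuracy_by_type(toy))
```

The same grouping can be repeated with the subject gender, MT model, source language, or occupation as the key to obtain the other breakdowns discussed in this section.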

Out of all the models, the Amazon MT model has the highest accuracy for same-gender relationships, but the gap between same-gender and different-gender relationships remains substantial with roughly 50% accuracy for all same-gender relationship sentences versus 100% accuracy for different-gender relationship sentences (1(c)). The considerable variation among models suggests substantial differences in the training data distribution, perhaps due to proprietary data collection practices. For languages (1(d)), we see the best performance for Spanish, followed by French and Italian, which could indicate imbalances in relationships in different MT training corpora (e.g. less representation of same-gender relationships in Italian MT documents).

Lastly, we see significant variation among different occupations (1(e), 1(f)). Occupations with higher income tend to see a very low accuracy for same-gender translations (e.g. “judge,” 15% accuracy), while occupations that are more public-facing have higher accuracy for same-gender translations (“athlete,” 64% accuracy), although the accuracy never reaches parity. This variation across occupations leads us to test the relative effect of different aspects of the occupations, to investigate social correlates of bias.

3.2.1 Assessing social correlates of bias

Prior work in system-internal bias has found correlations with language-external phenomena that impact the perception of different social groups, such as trends in immigration Garg et al. (2018). To that end, we conduct further analysis of the bias using several social variables that correspond to different occupations:

• Income level (high-income occupations may be more equitable);
• Female representation (high female-representation occupations may be more equitable);
• Age representation (youth-oriented occupations may be more equitable).

We collect the occupation-related variables using statistics from the US Department of Labor and Bureau of Labor Statistics BLS (2023); DOL (2023). We manually match each occupation to the corresponding official category: e.g. “boss” is mapped to “General and Operations Managers”.

We run a logistic regression to predict whether a sentence was translated with the correct subject gender, limiting the analysis to same-gender sentences to isolate correlates of the bias. We add categorical variables for the subject gender, source language, MT model, the relationship template, and the relationship target. We also include the occupation-related variables described above as scalar values, and we Z-normalize the values for fair comparison of effect sizes.
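For concreteness, the regression described above can be written as follows. The notation is our own and is meant only as a sketch of the model implied by the variable list: $y_i = 1$ if same-gender sentence $i$ is translated with the correct subject gender, each categorical coefficient is zero at its reference level, and $x_i^{v}$ is the raw value of occupation variable $v$ (income, female representation, age) with mean $\mu_v$ and standard deviation $\sigma_v$ across occupations.

$$
\log \frac{P(y_i = 1)}{1 - P(y_i = 1)} = \beta_0 + \beta_{\text{gender}[i]} + \beta_{\text{lang}[i]} + \beta_{\text{model}[i]} + \beta_{\text{template}[i]} + \beta_{\text{target}[i]} + \sum_{v \in \{\text{inc},\, \text{fem},\, \text{age}\}} \beta_v \, z_i^{v},
\qquad
z_i^{v} = \frac{x_i^{v} - \mu_v}{\sigma_v}
$$

Under this formulation each $\beta_v$ is the change in log-odds of a correct translation per one standard deviation of the occupation variable, which is what makes the three effect sizes comparable.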

The regression results are shown in Table 2. The model replicates the trends observed from aggregate comparisons: lower likelihood of correct subject-gender prediction for sentences with a male-gender subject, sentences in Italian, cases where the Microsoft MT model was used, and sentences with FRIEND or SPOUSE as relationship target. We also find a lower likelihood of correct subject-gender prediction for occupations that had a higher income, a higher female representation, and a higher age.

The negative correlation between female representation and accuracy is somewhat unexpected but could indicate that the MT systems were trained on text that portrayed high-female representation occupations (e.g. secretary, teacher) in the context of more traditional different-gender relationships, e.g. “the teacher and her husband” or “the man’s wife was a secretary.” As for the other occupation variables, the MT systems may have learned more socially conservative norms associated with high-income occupations (e.g. dentist, lawyer) and higher-age occupations (farmer, judge).

	$\beta$	SE	Z	$p$
Intercept	1.3091	0.067	19.642	*
Subject gender (default female)
Male	-0.5664	0.047	-12.024	*
Language (default French)
Italian	-0.5329	0.062	-8.632	*
Spanish	0.5156	0.055	9.294	*
Model (default Amazon)
Google	-0.7138	0.057	-12.598	*
Microsoft	-1.5303	0.060	-25.616	*
Relationship target (default ENGAGED)
FRIEND	-0.3981	0.051	-7.823	*
SPOUSE	-2.9832	0.073	-41.020	*
Occupation variables
Income	-0.1915	0.027	-6.993	*
Female representation	-0.3110	0.027	-11.516	*
Age	-0.1227	0.031	-3.930	*

Table 2: Logistic regression for correct pronoun prediction for same-gender sentences; positive coefficient means higher likelihood of correct pronoun prediction. d.f.=10, N=11070, LLR=3758 ($p<0.001$). * indicates $p<0.001$.
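To make the coefficients in Table 2 easier to interpret, they can be converted to odds ratios ($e^{\beta}$) and to predicted probabilities through the logistic function. The sketch below does this with the values copied from Table 2; it is an illustration rather than part of the original analysis, the function names are our own, and it assumes that the occupation variables sit at their mean (Z-score of 0) and that Table 2 lists every term in the model.

```python
import math

# Coefficients copied from Table 2 (log-odds scale).
COEFFICIENTS = {
    "Intercept": 1.3091,
    "Male": -0.5664,
    "Italian": -0.5329,
    "Spanish": 0.5156,
    "Google": -0.7138,
    "Microsoft": -1.5303,
    "FRIEND": -0.3981,
    "SPOUSE": -2.9832,
    "Income": -0.1915,
    "Female representation": -0.3110,
    "Age": -0.1227,
}


def odds_ratio(beta):
    """Multiplicative change in the odds of a correct translation."""
    return math.exp(beta)


def predicted_probability(active_terms):
    """Probability of a correct translation when the listed terms are active.

    Terms not listed stay at their reference level (female subject, French,
    Amazon, ENGAGED) or, for occupation variables, at the mean (z = 0).
    """
    logit = COEFFICIENTS["Intercept"] + sum(COEFFICIENTS[t] for t in active_terms)
    return 1.0 / (1.0 + math.exp(-logit))


if __name__ == "__main__":
    for name, beta in COEFFICIENTS.items():
        if name != "Intercept":
            print(f"{name}: odds ratio = {odds_ratio(beta):.3f}")
    print(f"reference case: {predicted_probability([]):.3f}")
    print(f"SPOUSE target:  {predicted_probability(['SPOUSE']):.3f}")
```

Read this way, the SPOUSE coefficient is by far the largest effect in the table: it reduces the odds of a correct translation to roughly one twentieth of the odds for the reference target.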

4 Conclusion

This study has identified consistent bias against same-gender relationships in machine translation among several Romance languages. Using three popular MT services (Amazon Translate, Google Cloud Translation, and Microsoft Azure AI Translator), we demonstrated consistent bias toward different-gender relationships across language, gender, topic, and subject type. Upon further investigation, we found that occupations with higher income, higher female representation, and higher median age tend to exhibit higher rates of bias. This suggests that the MT models may have been trained on data representing only certain types of relationships, and that future systems should take care to balance their training data to represent a wider range of relationships. Future work should broaden the investigation of how relationships are processed in multilingual models, including coreference resolution and natural language inference, to provide a more complete picture of the representation of gender-minority relationships.

Limitations

We acknowledge that the study is limited to a sub-set of the Romance languages, due to their use of grammatical gender that contributes to the potential for bias (gender marked on NP and unmarked on possessive pronouns). While this analysis is not appropriate for all languages, it can be adapted to fit other situations, e.g. identifying the inferred possessive pronoun when translating from a language without explicit possession marking (e.g. translating “she met ø wife” from Norwegian; Lødrup 2010) to a language with explicit possession marking. The study also only focuses on one direction of translation, even though the direction from no-gender-NP to gender-NP is known to exhibit gender bias Stanovsky et al. (2019). Future studies should assess bias in both translation directions, as well as to/from languages without any grammatical gender such as Chinese. Lastly, our analysis of occupations uses statistics from the United States, which may not match the statistics of the countries in which the languages under study (French, Italian, Spanish) are spoken. We assume that the relative ranking of occupations by the social variables will not be significantly different between countries (e.g. in many countries, a physician will earn more money than a nurse).

5 Ethical considerations

This study addresses the ethical ramifications of machine translation with respect to a large but only semi-visible population, namely people who participate in same-gender relationships. Although not all LGBTQ people engage in same-gender relationships, they represent a sizable proportion of the US population, around 5.6% by a recent estimate Jones (2021). People in same-gender relationships specifically have often faced considerable legal and social opposition within the US Avery et al. (2007); Soule (2004), and part of that opposition extends to the technology that supports communication in everyday life.

While our study does not address underlying issues facing LGBTQ people such as legal discrimination, it does provide a way forward to identify implicit bias in NLP systems. We hope that the study encourages NLP system researchers to take a broader view of “ethics” when it comes to the design and evaluation of such systems as MT, in order to include minority groups who are not always considered visible Hutchinson et al. (2020). More broadly, this study addresses an under-explored area of ethics in NLP, namely how social norms are implicitly encoded through a system’s grammatical behavior. Rather than assess fairness solely through explicit association tests on individual words, we encourage researchers to look for implicit and implied bias in their systems that emerge in subtle issues such as pronoun choice.

As a caveat around relationships, we want to emphasize that our study does not cover all types of relationships where gender plays an important role. In particular, we focus on grammatical gender rather than social gender, which may be an ethical concern. To illustrate this point, consider a situation where a person referred to as “el abogado” (Sp. masculine) identifies as female, which is an ongoing debate among speakers of noun-gender languages Burgen (2020); Horvath et al. (2016); Lipovsky (2014). In this case, a sentence with “el abogado” as subject noun and a masculine-gender target noun may in fact refer to a relationship between a female-gender person and a male-gender person. Having established this, we do not claim that Google Translate and other MT systems are necessarily biased with respect to the social or psychological construct of gender, only the grammatical construct of gender Alvanoudi (2014). In addition, we acknowledge that not all relationships should be considered ethical when testing MT systems, e.g. relationships with an imbalance in age or power which may be a sign of abuse Volpe et al. (2013).

In this analysis, we do not claim that the observed bias is malicious or even intentional, only that it is systematic and can be corrected. Engineers who build machine learning systems such as Google Translate are rarely aware of all possible downstream errors that their system can generate Nushi et al. (2017). The type of analysis that this study employs should not be used to blame individuals but instead highlight the kinds of “stress-testing” that machine translation systems need before being released for public use.

References

Alvanoudi (2014) Angeliki Alvanoudi. 2014. Grammatical gender in interaction: Cultural and cognitive aspects. Brill.
Amazon (2023) Amazon. 2023. Amazon Translate.
Avery et al. (2007) Alison Avery, Justin Chase, Linda Johansson, Samantha Litvak, Darrel Montero, and Michael Wydra. 2007. America’s changing attitudes toward homosexuality, civil unions, and same-gender marriage: 1977–2004. Social work, 52(1):71–79.
Blodgett et al. (2020) Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. 2020. Language (Technology) is Power: A Critical Survey of “Bias” in NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5454–5476.
BLS (2023) BLS. 2023. Labor force statistics from the current population survey.
Bosselut et al. (2019) Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Celikyilmaz, and Yejin Choi. 2019. COMET: Commonsense Transformers for Automatic Knowledge Graph Construction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4762–4779.
Burgen (2020) S Burgen. 2020. Masculine, feminist or neutral? The language battle that has split Spain. The Guardian.
Choi et al. (2020) Minje Choi, Luca Maria Aiello, Krisztián Zsolt Varga, and Daniele Quercia. 2020. Ten social dimensions of conversations and relationships. In Proceedings of The Web Conference 2020, pages 1514–1525.
DOL (2023) DOL. 2023. Employment and earnings by occupation.
Feng et al. (2020) Yang Feng, Wanying Xie, Shuhao Gu, Chenze Shao, Wen Zhang, Zhengxin Yang, and Dong Yu. 2020. Modeling fluency and faithfulness for diverse neural machine translation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 59–66.
Field et al. (2021) Anjalie Field, Su Lin Blodgett, Zeerak Waseem, and Yulia Tsvetkov. 2021. A Survey of Race, Racism, and Anti-Racism in NLP. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1905–1925.
Garg et al. (2018) Nikhil Garg, Londa Schiebinger, Dan Jurafsky, and James Zou. 2018. Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16):E3635–E3644.
Gonen and Goldberg (2019) Hila Gonen and Yoav Goldberg. 2019. Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings But do not Remove Them. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 609–614.
Google (2023) Google. 2023. Google Translate.
Horvath et al. (2016) Lisa K Horvath, Elisa F Merkel, Anne Maass, and Sabine Sczesny. 2016. Does gender-fair language pay off? the social perception of professions from a cross-linguistic perspective. Frontiers in psychology, 6:2018.
Hovy and Yang (2021) Dirk Hovy and Diyi Yang. 2021. The importance of modeling social factors of language: Theory and practice. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 588–602.
Hutchinson et al. (2020) Ben Hutchinson, Vinodkumar Prabhakaran, Emily Denton, Kellie Webster, Yu Zhong, and Stephen Denuyl. 2020. Social Biases in NLP Models as Barriers for Persons with Disabilities. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5491–5501.
Jones (2021) Jeffrey M Jones. 2021. LGBT identification rises to 5.6% in latest US estimate. Gallup News, 24.
Kocmi et al. (2020) Tom Kocmi, Tomasz Limisiewicz, and Gabriel Stanovsky. 2020. Gender Coreference and Bias Evaluation at WMT 2020. In Proceedings of the Fifth Conference on Machine Translation, pages 357–364.
Krishnan and Eisenstein (2015) Vinodh Krishnan and Jacob Eisenstein. 2015. “You’re Mr. Lebowski, I’m the Dude”: Inducing Address Term Formality in Signed Social Networks. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1616–1626.
Lai et al. (2022) Wen Lai, Jindřich Libovický, and Alexander Fraser. 2022. Improving both domain robustness and domain adaptability in machine translation. In Proceedings of the 29th International Conference on Computational Linguistics, pages 5191–5204, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Lipovsky (2014) Caroline Lipovsky. 2014. Gender-specification and occupational nouns: has linguistic change occurred in job advertisements since the French feminisation reforms? Gender & Language, 8(3).
Liu et al. (2020) Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–742.
Lødrup (2010) Helge Lødrup. 2010. Implicit possessives and reflexive binding in Norwegian. Transactions of the Philological Society, 108(2):89–109.
Microsoft (2023) Microsoft. 2023. Azure AI Translator.
Miller et al. (2017) Joan G Miller, Hiroko Akiyama, and Shagufa Kapadia. 2017. Cultural variation in communal versus exchange norms: Implications for social support. Journal of Personality and Social Psychology, 113(1):81.
Nadeem et al. (2021) Moin Nadeem, Anna Bethke, and Siva Reddy. 2021. StereoSet: Measuring stereotypical bias in pretrained language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5356–5371, Online. Association for Computational Linguistics.
Nushi et al. (2017) Besmira Nushi, Ece Kamar, Eric Horvitz, and Donald Kossmann. 2017. On human intellect and machine failures: Troubleshooting integrative machine learning systems. In Thirty-First AAAI Conference on Artificial Intelligence.
Park et al. (2021) Chan Young Park, Xinru Yan, Anjalie Field, and Yulia Tsvetkov. 2021. Multilingual Contextual Affective Analysis of LGBT People Portrayals in Wikipedia. In Proceedings of the International AAAI Conference on Web and Social Media, volume 15, pages 479–490.
Poushter and Kent (2020) Jacob Poushter and Nicholas Kent. 2020. The global divide on homosexuality persists. Pew Research Center, 25.
Prabhakaran et al. (2012) Vinodkumar Prabhakaran, Owen Rambow, and Mona Diab. 2012. Predicting overt display of power in written dialogs. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 518–522.
Rudinger et al. (2018) Rachel Rudinger, Jason Naradowsky, Brian Leonard, and Benjamin Van Durme. 2018. Gender bias in coreference resolution. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 8–14.
Savoldi et al. (2021) Beatrice Savoldi, Marco Gaido, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2021. Gender bias in machine translation. Transactions of the Association for Computational Linguistics, 9:845–874.
Seraj et al. (2021) Sarah Seraj, Kate G Blackburn, and James W Pennebaker. 2021. Language left behind on social media exposes the emotional and cognitive costs of a romantic breakup. Proceedings of the National Academy of Sciences, 118(7).
Sky and Jurgens (2021) CH-Wang Sky and David Jurgens. 2021. Using sociolinguistic variables to reveal changing attitudes towards sexuality and gender. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 9918–9938.
Soule (2004) Sarah A Soule. 2004. Going to the chapel? Same-sex marriage bans in the United States, 1973–2000. Social problems, 51(4):453–477.
Stanovsky et al. (2019) Gabriel Stanovsky, Noah A. Smith, and Luke Zettlemoyer. 2019. Evaluating gender bias in machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1679–1684, Florence, Italy. Association for Computational Linguistics.
Troles and Schmid (2021) Jonas-Dario Troles and Ute Schmid. 2021. Extending Challenge Sets to Uncover Gender Bias in Machine Translation: Impact of Stereotypical Verbs and Adjectives. In Proceedings of the Sixth Conference on Machine Translation, pages 531–541.
Volpe et al. (2013) Ellen M Volpe, Thomas L Hardie, Catherine Cerulli, Marilyn S Sommers, and Dianne Morrison-Beedy. 2013. What’s age got to do with it? Partner age difference, power, intimate partner violence, and sexual risk in urban adolescents. Journal of interpersonal violence, 28(10):2068–2087.
