Search | arXiv e-print repository

arXiv:2406.19951 [pdf, other]

Mining Reasons For And Against Vaccination From Unstructured Data Using Nichesourcing and AI Data Augmentation

Authors: Damián Ariel Furman, Juan Junqueras, Z. Burçe Gümüslü, Edgar Altszyler, Joaquin Navajas, Ophelia Deroy, Justin Sulik

Abstract: We present Reasons For and Against Vaccination (RFAV), a dataset for predicting reasons for and against vaccination, and scientific authorities used to justify them, annotated through nichesourcing and augmented using GPT4 and GPT3.5-Turbo. We show how it is possible to mine these reasons in non-structured text, under different task definitions, despite the high level of subjectivity involved and explore the impact of artificially augmented data using in-context learning with GPT4 and GPT3.5-Turbo. We publish the dataset and the trained models along with the annotation manual used to train annotators and define the task.

Submitted 28 June, 2024; originally announced June 2024.

Comments: 8 pages + references and appendix

arXiv:2301.00792 [pdf, other]

The Undesirable Dependence on Frequency of Gender Bias Metrics Based on Word Embeddings

Authors: Francisco Valentini, Germán Rosati, Diego Fernandez Slezak, Edgar Altszyler

Abstract: Numerous works use word embedding-based metrics to quantify societal biases and stereotypes in texts. Recent studies have found that word embeddings can capture semantic similarity but may be affected by word frequency. In this work we study the effect of frequency when measuring female vs. male gender bias with word embedding-based bias quantification methods. We find that Skip-gram with negative sampling and GloVe tend to detect male bias in high frequency words, while GloVe tends to return female bias in low frequency words. We show these behaviors still exist when words are randomly shuffled. This proves that the frequency-based effect observed in unshuffled corpora stems from properties of the metric rather than from word associations. The effect is spurious and problematic since bias metrics should depend exclusively on word co-occurrences and not individual word frequencies. Finally, we compare these results with the ones obtained with an alternative metric based on Pointwise Mutual Information. We find that this metric does not show a clear dependence on frequency, even though it is slightly skewed towards male bias across all frequencies.

Submitted 2 January, 2023; originally announced January 2023.

Comments: Camera Ready for EMNLP 2022 (Findings)

arXiv:2211.08203 [pdf, other]

Investigating the Frequency Distortion of Word Embeddings and Its Impact on Bias Metrics

Authors: Francisco Valentini, Juan Cruz Sosa, Diego Fernandez Slezak, Edgar Altszyler

Abstract: Recent research has shown that static word embeddings can encode word frequency information. However, little has been studied about this phenomenon and its effects on downstream tasks. In the present work, we systematically study the association between frequency and semantic similarity in several static word embeddings. We find that Skip-gram, GloVe and FastText embeddings tend to produce higher semantic similarity between high-frequency words than between other frequency combinations. We show that the association between frequency and similarity also appears when words are randomly shuffled. This proves that the patterns found are not due to real semantic associations present in the texts, but are an artifact produced by the word embeddings. Finally, we provide an example of how word frequency can strongly impact the measurement of gender bias with embedding-based metrics. In particular, we carry out a controlled experiment that shows that biases can even change sign or reverse their order by manipulating word frequencies.

Submitted 19 October, 2023; v1 submitted 15 November, 2022; originally announced November 2022.

Comments: Camera Ready for EMNLP 2023 (Findings)

arXiv:2104.06474 [pdf, other]

On the Interpretability and Significance of Bias Metrics in Texts: a PMI-based Approach

Authors: Francisco Valentini, Germán Rosati, Damián Blasi, Diego Fernandez Slezak, Edgar Altszyler

Abstract: In recent years, word embeddings have been widely used to measure biases in texts. Even if they have proven to be effective in detecting a wide variety of biases, metrics based on word embeddings lack transparency and interpretability. We analyze an alternative PMI-based metric to quantify biases in texts. It can be expressed as a function of conditional probabilities, which provides a simple interpretation in terms of word co-occurrences. We also prove that it can be approximated by an odds ratio, which allows estimating confidence intervals and statistical significance of textual biases. This approach produces similar results to metrics based on word embeddings when capturing gender gaps of the real world embedded in large corpora.

Submitted 18 July, 2023; v1 submitted 13 April, 2021; originally announced April 2021.

Comments: Camera Ready for ACL 2023 (main conference)

arXiv:2011.12096 [pdf]

doi 10.1080/14680777.2022.2047090

Gender bias in magazines oriented to men and women: a computational approach

Authors: Diego Kozlowski, Gabriela Lozano, Carla M. Felcher, Fernando Gonzalez, Edgar Altszyler

Abstract: Cultural products are a source to acquire individual values and behaviours. Therefore, the differences in the content of the magazines aimed specifically at women or men are a means to create and reproduce gender stereotypes. In this study, we compare the content of a women-oriented magazine with that of a men-oriented one, both produced by the same editorial group, over a decade (2008-2018). With Topic Modelling techniques we identify the main themes discussed in the magazines and quantify how much the presence of these topics differs between magazines over time. Then, we performed a word-frequency analysis to validate this methodology and extend the analysis to other subjects that did not emerge automatically. Our results show that the frequency of appearance of the topics Family, Business and Women as sex objects, present an initial bias that tends to disappear over time. Conversely, in Fashion and Science topics, the initial differences between both magazines are maintained. Besides, we show that in 2012, the content associated with horoscope increased in the women-oriented magazine, generating a new gap that remained open over time. Also, we show a strong increase in the use of words associated with feminism since 2015 and specifically the word abortion in 2018. Overall, these computational tools allowed us to analyse more than 24,000 articles. Up to our knowledge, this is the first study to compare magazines in such a large dataset, a task that would have been prohibitive using manual content analysis methodologies.

Submitted 24 November, 2020; originally announced November 2020.

Journal ref: Feminist Media Studies (2022)

arXiv:2009.13275 [pdf, other]

Zero-shot Multi-Domain Dialog State Tracking Using Descriptive Rules

Authors: Edgar Altszyler, Pablo Brusco, Nikoletta Basiou, John Byrnes, Dimitra Vergyri

Abstract: In this work, we present a framework for incorporating descriptive logical rules in state-of-the-art neural networks, enabling them to learn how to handle unseen labels without the introduction of any new training data. The rules are integrated into existing networks without modifying their architecture, through an additional term in the network's loss function that penalizes states of the network that do not obey the designed rules. As a case of study, the framework is applied to an existing neural-based Dialog State Tracker. Our experiments demonstrate that the inclusion of logical rules allows the prediction of unseen labels, without deteriorating the predictive capacity of the original system.

Submitted 17 September, 2020; originally announced September 2020.

arXiv:1712.10054 [pdf, ps, other]

Corpus specificity in LSA and Word2vec: the role of out-of-domain documents

Authors: Edgar Altszyler, Mariano Sigman, Diego Fernandez Slezak

Abstract: Latent Semantic Analysis (LSA) and Word2vec are some of the most widely used word embeddings. Despite the popularity of these techniques, the precise mechanisms by which they acquire new semantic relations between words remain unclear. In the present article we investigate whether LSA and Word2vec capacity to identify relevant semantic dimensions increases with size of corpus. One intuitive hypothesis is that the capacity to identify relevant dimensions should increase as the amount of data increases. However, if corpus size grow in topics which are not specific to the domain of interest, signal to noise ratio may weaken. Here we set to examine and distinguish these alternative hypothesis. To investigate the effect of corpus specificity and size in word-embeddings we study two ways for progressive elimination of documents: the elimination of random documents vs. the elimination of documents unrelated to a specific task. We show that Word2vec can take advantage of all the documents, obtaining its best performance when it is trained with the whole corpus. On the contrary, the specialization (removal of out-of-domain documents) of the training corpus, accompanied by a decrease of dimensionality, can increase LSA word-representation quality while speeding up the processing time. Furthermore, we show that the specialization without the decrease in LSA dimensionality can produce a strong performance reduction in specific tasks.
From a cognitive-modeling point of view, we point out that LSA's word-knowledge acquisitions may not be efficiently exploiting higher-order co-occurrences and global relations, whereas Word2vec does.

Submitted 28 December, 2017; originally announced December 2017.

Journal ref: Proceedings of the 3rd Workshop on Representation Learning for NLP, pages 1-10, 2018, ACL

arXiv:1610.01520 [pdf, other]

doi 10.1016/j.concog.2017.09.004

Comparative study of LSA vs Word2vec embeddings in small corpora: a case study in dreams database

Authors: Edgar Altszyler, Mariano Sigman, Sidarta Ribeiro, Diego Fernández Slezak

Abstract: Word embeddings have been extensively studied in large text datasets. However, only a few studies analyze semantic representations of small corpora, particularly relevant in single-person text production studies. In the present paper, we compare Skip-gram and LSA capabilities in this scenario, and we test both techniques to extract relevant semantic patterns in single-series dreams reports. LSA showed better performance than Skip-gram in small size training corpus in two semantic tests. As a study case, we show that LSA can capture relevant words associations in dream reports series, even in cases of small number of dreams or low-frequency words. We propose that LSA can be used to explore words associations in dreams reports, which could bring new insight into this classic research area of psychology

Submitted 11 April, 2017; v1 submitted 5 October, 2016; originally announced October 2016.

Journal ref: Conscious Cogn. 2017 Nov;56:178-187

arXiv:1608.08007 [pdf, ps, other]

doi 10.1371/journal.pone.0180083

Ultrasensitivity on signaling cascades revisited: Linking local and global ultrasensitivity estimations

Authors: Edgar Altszyler, Alejandra Ventura, Alejandro Colman-Lerner, Ariel Chernomoretz

Abstract: Ultrasensitive response motifs, which are capable of converting graded stimulus in binary responses, are very well-conserved in signal transduction networks. Although it has been shown that a cascade arrangement of multiple ultrasensitive modules can produce an enhancement of the system's ultrasensitivity, how the combination of layers affects the cascade's ultrasensitivity remains an open question for the general case. Here we introduced a methodology that allowed us to determine the presence of sequestration effects and to quantify the relative contribution of each module to the overall cascade's ultrasensitivity. The proposed analysis framework provides a natural link between global and local ultrasensitivity descriptors and is particularly well-suited to characterize and better understand mathematical models used to study real biological systems. As a case study we considered three mathematical models introduced by O'Shaughnessy et al. to study a tunable synthetic MAPK cascade, and showed how our methodology might help modelers to better understand modeling alternatives.

Submitted 3 April, 2017; v1 submitted 29 August, 2016; originally announced August 2016.

Journal ref: PLoS ONE 12(6), 2017

Showing 1–9 of 9 results for author: Altszyler, E