Showing 1–14 of 14 results for author: Santy, S

  1. arXiv:2405.16915  [pdf, other]

    cs.CV cs.LG

    Multilingual Diversity Improves Vision-Language Representations

    Authors: Thao Nguyen, Matthew Wallingford, Sebastin Santy, Wei-Chiu Ma, Sewoong Oh, Ludwig Schmidt, Pang Wei Koh, Ranjay Krishna

    Abstract: Massive web-crawled image-text datasets lay the foundation for recent progress in multimodal learning. These datasets are designed with the goal of training a model to do well on standard computer vision benchmarks, many of which, however, have been shown to be English-centric (e.g., ImageNet). Consequently, existing data curation techniques gravitate towards using predominantly English image-text…

    Submitted 27 May, 2024; originally announced May 2024.

  2. arXiv:2405.06783  [pdf, other]

    cs.HC cs.AI cs.CY

    BLIP: Facilitating the Exploration of Undesirable Consequences of Digital Technologies

    Authors: Rock Yuren Pang, Sebastin Santy, René Just, Katharina Reinecke

    Abstract: Digital technologies have positively transformed society, but they have also led to undesirable consequences not anticipated at the time of design or development. We posit that insights into past undesirable consequences can help researchers and practitioners gain awareness and anticipate potential adverse effects. To test this assumption, we introduce BLIP, a system that extracts real-world undes…

    Submitted 10 May, 2024; originally announced May 2024.

    Comments: To appear in the Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI '24), May 11–16, 2024, Honolulu, HI, USA

  3. arXiv:2404.10199  [pdf, other]

    cs.CL cs.AI

    CULTURE-GEN: Revealing Global Cultural Perception in Language Models through Natural Language Prompting

    Authors: Huihan Li, Liwei Jiang, Jena D. Hwang, Hyunwoo Kim, Sebastin Santy, Taylor Sorensen, Bill Yuchen Lin, Nouha Dziri, Xiang Ren, Yejin Choi

    Abstract: As the utilization of large language models (LLMs) has proliferated worldwide, it is crucial for them to have adequate knowledge and fair representation for diverse global cultures. In this work, we uncover culture perceptions of three SOTA models on 110 countries and regions on 8 culture-related topics through culture-conditioned generations, and extract symbols from these generations that are as…

    Submitted 26 April, 2024; v1 submitted 15 April, 2024; originally announced April 2024.

  4. arXiv:2310.14356  [pdf, other]

    cs.CV cs.CL cs.CY cs.HC

    Computer Vision Datasets and Models Exhibit Cultural and Linguistic Diversity in Perception

    Authors: Andre Ye, Sebastin Santy, Jena D. Hwang, Amy X. Zhang, Ranjay Krishna

    Abstract: Computer vision often treats human perception as homogeneous: an implicit assumption that visual stimuli are perceived similarly by everyone. This assumption is reflected in the way researchers collect datasets and train vision models. By contrast, literature in cross-cultural psychology and linguistics has provided evidence that people from different cultural backgrounds observe vastly different…

    Submitted 9 March, 2024; v1 submitted 22 October, 2023; originally announced October 2023.

  5. arXiv:2306.01943  [pdf, other]

    cs.CL cs.CY cs.HC

    NLPositionality: Characterizing Design Biases of Datasets and Models

    Authors: Sebastin Santy, Jenny T. Liang, Ronan Le Bras, Katharina Reinecke, Maarten Sap

    Abstract: Design biases in NLP systems, such as performance differences for different populations, often stem from their creator's positionality, i.e., views and lived experiences shaped by identity and background. Despite the prevalence and risks of design biases, they are hard to quantify because researcher, system, and dataset positionality is often unobserved. We introduce NLPositionality, a framework f…

    Submitted 2 June, 2023; originally announced June 2023.

    Comments: ACL 2023

  6. arXiv:2212.09045  [pdf, other]

    cs.CL cs.CY cs.HC cs.SI

    Task Preferences across Languages on Community Question Answering Platforms

    Authors: Sebastin Santy, Prasanta Bhattacharya, Rishabh Mehrotra

    Abstract: With the steady emergence of community question answering (CQA) platforms like Quora, StackExchange, and WikiHow, users now have unprecedented access to information on various kinds of queries and tasks. Moreover, the rapid proliferation and localization of these platforms spanning geographic and linguistic boundaries offer a unique opportunity to study the task requirements and preferences of u…

    Submitted 18 December, 2022; originally announced December 2022.

    Comments: 7 pages, 4 figures

  7. arXiv:2211.16172  [pdf, other]

    cs.CL cs.CY

    Learnings from Technological Interventions in a Low Resource Language: Enhancing Information Access in Gondi

    Authors: Devansh Mehta, Harshita Diddee, Ananya Saxena, Anurag Shukla, Sebastin Santy, Ramaravind Kommiya Mothilal, Brij Mohan Lal Srivastava, Alok Sharma, Vishnu Prasad, Venkanna U, Kalika Bali

    Abstract: The primary obstacle to developing technologies for low-resource languages is the lack of representative, usable data. In this paper, we report the deployment of technology-driven data collection methods for creating a corpus of more than 60,000 translations from Hindi to Gondi, a low-resource vulnerable language spoken by around 2.3 million tribal people in south and central India. During this pr…

    Submitted 29 November, 2022; originally announced November 2022.

    Comments: In Submission (Revised) to Language Resources and Evaluation Journal. arXiv admin note: text overlap with arXiv:2004.10270

  8. arXiv:2106.06292  [pdf, other]

    cs.CL

    A Discussion on Building Practical NLP Leaderboards: The Case of Machine Translation

    Authors: Sebastin Santy, Prasanta Bhattacharya

    Abstract: Recent advances in AI and ML applications have benefited from rapid progress in NLP research. Leaderboards have emerged as a popular mechanism to track and accelerate progress in NLP through competitive model development. While this has increased interest and participation, the over-reliance on single, accuracy-based metrics has shifted focus from other important metrics that might be equally…

    Submitted 30 December, 2022; v1 submitted 11 June, 2021; originally announced June 2021.

    Comments: pre-print

  9. arXiv:2106.01105  [pdf, other]

    cs.CL

    Use of Formal Ethical Reviews in NLP Literature: Historical Trends and Current Practices

    Authors: Sebastin Santy, Anku Rani, Monojit Choudhury

    Abstract: Ethical aspects of research in language technologies have received much attention recently. It is a standard practice to get a study involving human subjects reviewed and approved by a professional ethics committee/board of the institution. How commonly do we see mention of ethical approvals in NLP research? What types of research or aspects of studies are usually subject to such reviews? With the…

    Submitted 2 June, 2021; originally announced June 2021.

    Comments: Accepted at ACL 2021 Findings (7 pages)

  10. arXiv:2004.10270  [pdf, other]

    cs.CL cs.CY

    Learnings from Technological Interventions in a Low Resource Language: A Case-Study on Gondi

    Authors: Devansh Mehta, Sebastin Santy, Ramaravind Kommiya Mothilal, Brij Mohan Lal Srivastava, Alok Sharma, Anurag Shukla, Vishnu Prasad, Venkanna U, Amit Sharma, Kalika Bali

    Abstract: The primary obstacle to developing technologies for low-resource languages is the lack of usable data. In this paper, we report the adoption and deployment of 4 technology-driven methods of data collection for Gondi, a low-resource vulnerable language spoken by around 2.3 million tribal people in south and central India. In the process of data collection, we also help in its revival by expanding a…

    Submitted 26 January, 2021; v1 submitted 21 April, 2020; originally announced April 2020.

    Comments: Accepted at LREC 2020 (7 pages). D.M. and S.S. contributed equally

  11. arXiv:2004.09095  [pdf, other]

    cs.CL

    The State and Fate of Linguistic Diversity and Inclusion in the NLP World

    Authors: Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, Monojit Choudhury

    Abstract: Language technologies contribute to promoting multilingualism and linguistic diversity around the world. However, only a very small number of the over 7000 languages of the world are represented in the rapidly evolving language technologies and applications. In this paper we look at the relation between the types of languages, resources, and their representation in NLP conferences to understand th…

    Submitted 26 January, 2021; v1 submitted 20 April, 2020; originally announced April 2020.

    Comments: Accepted at ACL 2020 (10 pages + 2 pages Appendix). P.J., S.S. and A.B. contributed equally

  12. arXiv:1912.03457  [pdf, other]

    cs.CL cs.CY

    Unsung Challenges of Building and Deploying Language Technologies for Low Resource Language Communities

    Authors: Pratik Joshi, Christain Barnes, Sebastin Santy, Simran Khanuja, Sanket Shah, Anirudh Srinivasan, Satwik Bhattamishra, Sunayana Sitaram, Monojit Choudhury, Kalika Bali

    Abstract: In this paper, we examine and analyze the challenges associated with developing and introducing language technologies to low-resource language communities. While doing so, we bring to light the successes and failures of past work in this area, challenges being faced in doing so, and what they have achieved. Throughout this paper, we take a problem-facing approach and describe essential factors whi…

    Submitted 7 December, 2019; originally announced December 2019.

    Comments: Accepted at ICON 2019; 9 pages

  13. arXiv:1811.11833  [pdf, other]

    cs.IR cs.CV

    Towards Task Understanding in Visual Settings

    Authors: Sebastin Santy, Wazeer Zulfikar, Rishabh Mehrotra, Emine Yilmaz

    Abstract: We consider the problem of understanding real world tasks depicted in visual images. While most existing image captioning methods excel in producing natural language descriptions of visual scenes involving human tasks, there is often the need for an understanding of the exact task being undertaken rather than a literal description of the scene. We leverage insights from real world task understandi…

    Submitted 28 November, 2018; originally announced November 2018.

    Comments: Accepted as Student Abstract at 33rd AAAI Conference on Artificial Intelligence, 2019

  14. arXiv:1809.05611  [pdf, other]

    cs.CV

    A study on the use of Boundary Equilibrium GAN for Approximate Frontalization of Unconstrained Faces to aid in Surveillance

    Authors: Wazeer Zulfikar, Sebastin Santy, Sahith Dambekodi, Tirtharaj Dash

    Abstract: Face frontalization is the process of synthesizing frontal facing views of faces given their angled poses. We implement a generative adversarial network (GAN) with spherical linear interpolation (Slerp) for frontalization of unconstrained facial images. Our special focus is on generating approximate frontal faces from the side-posed images captured by surveillance cameras. Speci…

    Submitted 14 September, 2018; originally announced September 2018.