Search | arXiv e-print repository

CultureBank: An Online Community-Driven Knowledge Base Towards Culturally Aware Language Technologies

Authors: Weiyan Shi, Ryan Li, Yutong Zhang, Caleb Ziems, Chunhua yu, Raya Horesh, Rogério Abreu de Paula, Diyi Yang

Abstract: To enhance language models' cultural awareness, we design a generalizable pipeline to construct cultural knowledge bases from different online communities on a massive scale. With the pipeline, we construct CultureBank, a knowledge base built upon users' self-narratives with 12K cultural descriptors sourced from TikTok and 11K from Reddit. Unlike previous cultural knowledge resources, CultureBank… ▽ More To enhance language models' cultural awareness, we design a generalizable pipeline to construct cultural knowledge bases from different online communities on a massive scale. With the pipeline, we construct CultureBank, a knowledge base built upon users' self-narratives with 12K cultural descriptors sourced from TikTok and 11K from Reddit. Unlike previous cultural knowledge resources, CultureBank contains diverse views on cultural descriptors to allow flexible interpretation of cultural knowledge, and contextualized cultural scenarios to help grounded evaluation. With CultureBank, we evaluate different LLMs' cultural awareness, and identify areas for improvement. We also fine-tune a language model on CultureBank: experiments show that it achieves better performances on two downstream cultural tasks in a zero-shot setting. Finally, we offer recommendations based on our findings for future culturally aware language technologies. The project page is https://culturebank.github.io . The code and model is at https://github.com/SALT-NLP/CultureBank . The released CultureBank dataset is at https://huggingface.co/datasets/SALT-NLP/CultureBank . △ Less

Submitted 23 April, 2024; originally announced April 2024.

Comments: 32 pages, 7 figures, preprint

arXiv:2403.06009 [pdf, other]

Detectors for Safe and Reliable LLMs: Implementations, Uses, and Limitations

Authors: Swapnaja Achintalwar, Adriana Alvarado Garcia, Ateret Anaby-Tavor, Ioana Baldini, Sara E. Berger, Bishwaranjan Bhattacharjee, Djallel Bouneffouf, Subhajit Chaudhury, Pin-Yu Chen, Lamogha Chiazor, Elizabeth M. Daly, Kirushikesh DB, Rogério Abreu de Paula, Pierre Dognin, Eitan Farchi, Soumya Ghosh, Michael Hind, Raya Horesh, George Kour, Ja Young Lee, Nishtha Madaan, Sameep Mehta, Erik Miehling, Keerthiram Murugesan, Manish Nagireddy , et al. (13 additional authors not shown)

Abstract: Large language models (LLMs) are susceptible to a variety of risks, from non-faithful output to biased and toxic generations. Due to several limiting factors surrounding LLMs (training cost, API access, data availability, etc.), it may not always be feasible to impose direct safety constraints on a deployed model. Therefore, an efficient and reliable alternative is required. To this end, we presen… ▽ More Large language models (LLMs) are susceptible to a variety of risks, from non-faithful output to biased and toxic generations. Due to several limiting factors surrounding LLMs (training cost, API access, data availability, etc.), it may not always be feasible to impose direct safety constraints on a deployed model. Therefore, an efficient and reliable alternative is required. To this end, we present our ongoing efforts to create and deploy a library of detectors: compact and easy-to-build classification models that provide labels for various harms. In addition to the detectors themselves, we discuss a wide range of uses for these detector models - from acting as guardrails to enabling effective AI governance. We also deep dive into inherent challenges in their development and discuss future work aimed at making the detectors more reliable and broadening their scope. △ Less

Submitted 13 June, 2024; v1 submitted 9 March, 2024; originally announced March 2024.

arXiv:2401.17866 [pdf]

Making Sense of Knowledge Intensive Processes: an Oil & Gas Industry Scenario

Authors: Juliana Jansen Ferreira, Vinícius Segura, Ana Fucs, Rogério de Paula

Abstract: Sensemaking is a constant and ongoing process by which people associate meaning to experiences. It can be an individual process, known as abduction, or a group process by which people give meaning to collective experiences. The sensemaking of a group is influenced by the abduction process of each person about the experience. Every collaborative process needs some level of sensemaking to show resul… ▽ More Sensemaking is a constant and ongoing process by which people associate meaning to experiences. It can be an individual process, known as abduction, or a group process by which people give meaning to collective experiences. The sensemaking of a group is influenced by the abduction process of each person about the experience. Every collaborative process needs some level of sensemaking to show results. For a knowledge intensive process, sensemaking is central and related to most of its tasks. We present findings from a fieldwork executed in knowledge intensive process from the Oil and Gas industry. Our findings indicated that different types of knowledge can be combined to compose the result of a sensemaking process (e.g. decision, the need for more discussion, etc.). This paper presents an initial set of knowledge types that can be combined to compose the result of the sensemaking of a collaborative decision making process. We also discuss ideas for using systems powered by Artificial Intelligence to support sensemaking processes. △ Less

Submitted 31 January, 2024; originally announced January 2024.

Comments: 9 pages. This paper was presented at the Sensemaking in a Senseless World workshop during the 2018 ACM CHI Conference on Human Factors in Computing Systems

arXiv:2205.09508 [pdf, other]

doi 10.1145/3514094.3534183

Practical Skills Demand Forecasting via Representation Learning of Temporal Dynamics

Authors: Maysa M. Garcia de Macedo, Wyatt Clarke, Eli Lucherini, Tyler Baldwin, Dilermando Queiroz Neto, Rogerio de Paula, Subhro Das

Abstract: Rapid technological innovation threatens to leave much of the global workforce behind. Today's economy juxtaposes white-hot demand for skilled labor against stagnant employment prospects for workers unprepared to participate in a digital economy. It is a moment of peril and opportunity for every country, with outcomes measured in long-term capital allocation and the life satisfaction of billions o… ▽ More Rapid technological innovation threatens to leave much of the global workforce behind. Today's economy juxtaposes white-hot demand for skilled labor against stagnant employment prospects for workers unprepared to participate in a digital economy. It is a moment of peril and opportunity for every country, with outcomes measured in long-term capital allocation and the life satisfaction of billions of workers. To meet the moment, governments and markets must find ways to quicken the rate at which the supply of skills reacts to changes in demand. More fully and quickly understanding labor market intelligence is one route. In this work, we explore the utility of time series forecasts to enhance the value of skill demand data gathered from online job advertisements. This paper presents a pipeline which makes one-shot multi-step forecasts into the future using a decade of monthly skill demand observations based on a set of recurrent neural network methods. We compare the performance of a multivariate model versus a univariate one, analyze how correlation between skills can influence multivariate model results, and present predictions of demand for a selection of skills practiced by workers in the information technology industry. △ Less

Submitted 18 May, 2022; originally announced May 2022.

Comments: 15 pages, 5th AAAI/ACM Conference on AI, Ethics, and Society

arXiv:2204.01489 [pdf, other]

Towards a New Science of Disinformation

Authors: Claudio S. Pinhanez, German H. Flores, Marisa A. Vasconcelos, Mu Qiao, Nick Linck, Rogério de Paula, Yuya J. Ong

Abstract: How can we best address the dangerous impact that deep learning-generated fake audios, photographs, and videos (a.k.a. deepfakes) may have in personal and societal life? We foresee that the availability of cheap deepfake technology will create a second wave of disinformation where people will receive specific, personalized disinformation through different channels, making the current approaches to… ▽ More How can we best address the dangerous impact that deep learning-generated fake audios, photographs, and videos (a.k.a. deepfakes) may have in personal and societal life? We foresee that the availability of cheap deepfake technology will create a second wave of disinformation where people will receive specific, personalized disinformation through different channels, making the current approaches to fight disinformation obsolete. We argue that fake media has to be seen as an upcoming cybersecurity problem, and we have to shift from combating its spread to a prevention and cure framework where users have available ways to verify, challenge, and argue against the veracity of each piece of media they are exposed to. To create the technologies behind this framework, we propose that a new Science of Disinformation is needed, one which creates a theoretical framework both for the processes of communication and consumption of false content. Key scientific and technological challenges facing this research agenda are listed and discussed in the light of state-of-art technologies for fake media generation and detection, argument finding and construction, and how to effectively engage users in the prevention and cure processes. △ Less

Submitted 17 March, 2022; originally announced April 2022.

arXiv:2008.07363 [pdf, other]

Predicting Account Receivables with Machine Learning

Authors: Ana Paula Appel, Gabriel Louzada Malfatti, Renato Luiz de Freitas Cunha, Bruno Lima, Rogerio de Paula

Abstract: Being able to predict when invoices will be paid is valuable in multiple industries and supports decision-making processes in most financial workflows. However, due to the complexity of data related to invoices and the fact that the decision-making process is not registered in the accounts receivable system, performing this prediction becomes a challenge. In this paper, we present a prototype able… ▽ More Being able to predict when invoices will be paid is valuable in multiple industries and supports decision-making processes in most financial workflows. However, due to the complexity of data related to invoices and the fact that the decision-making process is not registered in the accounts receivable system, performing this prediction becomes a challenge. In this paper, we present a prototype able to support collectors in predicting the payment of invoices. This prototype is part of a solution developed in partnership with a multinational bank and it has reached up to 81% of prediction accuracy, which improved the prioritization of customers and supported the daily work of collectors. Our simulations show that the adoption of our model to prioritize the work o collectors saves up to ~1.75 million dollars per month. The methodology and results presented in this paper will allow researchers and practitioners in dealing with the problem of invoice payment prediction, providing insights and examples of how to tackle issues present in real data. △ Less

Submitted 11 August, 2020; originally announced August 2020.

Comments: 9 pages, 6 figures, Workshop Machine Learning in Finance. arXiv admin note: substantial text overlap with arXiv:1912.10828

arXiv:1912.10828 [pdf, other]

Optimize Cash Collection: Use Machine learning to Predicting Invoice Payment

Authors: Ana Paula Appel, Victor Oliveira, Bruno Lima, Gabriel Louzada Malfatti, Vagner Figueredo de Santana, Rogerio de Paula

Abstract: Predicting invoice payment is valuable in multiple industries and supports decision-making processes in most financial workflows. However, the challenge in this realm involves dealing with complex data and the lack of data related to decisions-making processes not registered in the accounts receivable system. This work presents a prototype developed as a solution devised during a partnership with… ▽ More Predicting invoice payment is valuable in multiple industries and supports decision-making processes in most financial workflows. However, the challenge in this realm involves dealing with complex data and the lack of data related to decisions-making processes not registered in the accounts receivable system. This work presents a prototype developed as a solution devised during a partnership with a multinational bank to support collectors in predicting invoices payment. The proposed prototype reached up to 77\% of accuracy, which improved the prioritization of customers and supported the daily work of collectors. With the presented results, one expects to support researchers dealing with the problem of invoice payment prediction to get insights and examples of how to tackle issues present in real data. △ Less

Submitted 20 December, 2019; originally announced December 2019.

Comments: 10 pages, 5 figures, 5 tables

arXiv:1808.06044 [pdf, ps, other]

Maximising Throughput in a Complex Coal Export System

Authors: Mateus Rocha de Paula, Natashia Boland, Andreas Ernst, Alexandre Mendes, Martin Savelsbergh

Abstract: The Port of Newcastle features three coal export terminals, operating primarily in cargo assembly mode, that share a rail network on their inbound side, and a channel on their outbound side. Maximising throughput at a single coal terminal, taking into account its layout, its equipment, and its operating policies, is already challenging, but maximising throughput of the Hunter Valley coal export sy… ▽ More The Port of Newcastle features three coal export terminals, operating primarily in cargo assembly mode, that share a rail network on their inbound side, and a channel on their outbound side. Maximising throughput at a single coal terminal, taking into account its layout, its equipment, and its operating policies, is already challenging, but maximising throughput of the Hunter Valley coal export system as a whole requires that terminals and inbound and outbound shared resources be considered simultaneously. Existing approaches to do so either lack realism or are too computationally demanding to be useful as an everyday planning tool. We present a parallel genetic algorithm to optimise the integrated system. The algorithm models activities in continuous time, can handle practical planning horizons efficiently, and generates solutions that match or improve solutions obtained with the state-of-the-art solvers, whilst vastly outperforming them both in memory usage and running time. △ Less

Submitted 22 August, 2018; v1 submitted 18 August, 2018; originally announced August 2018.

Showing 1–8 of 8 results for author: de Paula, R