Search | arXiv e-print repository

An Empirical Study of Gendered Stereotypes in Emotional Attributes for Bangla in Multilingual Large Language Models

Authors: Jayanta Sadhu, Maneesha Rani Saha, Rifat Shahriyar

Abstract: The influence of Large Language Models (LLMs) is rapidly growing, automating more jobs over time. Assessing the fairness of LLMs is crucial due to their expanding impact. Studies reveal the reflection of societal norms and biases in LLMs, which creates a risk of propagating societal stereotypes in downstream tasks. Many studies on bias in LLMs focus on gender bias in various NLP applications. Howe… ▽ More The influence of Large Language Models (LLMs) is rapidly growing, automating more jobs over time. Assessing the fairness of LLMs is crucial due to their expanding impact. Studies reveal the reflection of societal norms and biases in LLMs, which creates a risk of propagating societal stereotypes in downstream tasks. Many studies on bias in LLMs focus on gender bias in various NLP applications. However, there's a gap in research on bias in emotional attributes, despite the close societal link between emotion and gender. This gap is even larger for low-resource languages like Bangla. Historically, women are associated with emotions like empathy, fear, and guilt, while men are linked to anger, bravado, and authority. This pattern reflects societal norms in Bangla-speaking regions. We offer the first thorough investigation of gendered emotion attribution in Bangla for both closed and open source LLMs in this work. Our aim is to elucidate the intricate societal relationship between gender and emotion specifically within the context of Bangla. We have been successful in showing the existence of gender bias in the context of emotions in Bangla through analytical methods and also show how emotion attribution changes on the basis of gendered role selection in LLMs. All of our resources including code and data are made publicly available to support future research on Bangla NLP. Warning: This paper contains explicit stereotypical statements that many may find offensive. △ Less

Submitted 8 July, 2024; originally announced July 2024.

Comments: Accepted at the 5th Workshop on Gender Bias in Natural Language Processing at the ACL 2024 Conference

arXiv:2407.03536 [pdf, other]

Social Bias in Large Language Models For Bangla: An Empirical Study on Gender and Religious Bias

Authors: Jayanta Sadhu, Maneesha Rani Saha, Rifat Shahriyar

Abstract: The rapid growth of Large Language Models (LLMs) has put forward the study of biases as a crucial field. It is important to assess the influence of different types of biases embedded in LLMs to ensure fair use in sensitive fields. Although there have been extensive works on bias assessment in English, such efforts are rare and scarce for a major language like Bangla. In this work, we examine two t… ▽ More The rapid growth of Large Language Models (LLMs) has put forward the study of biases as a crucial field. It is important to assess the influence of different types of biases embedded in LLMs to ensure fair use in sensitive fields. Although there have been extensive works on bias assessment in English, such efforts are rare and scarce for a major language like Bangla. In this work, we examine two types of social biases in LLM generated outputs for Bangla language. Our main contributions in this work are: (1) bias studies on two different social biases for Bangla (2) a curated dataset for bias measurement benchmarking (3) two different probing techniques for bias detection in the context of Bangla. This is the first work of such kind involving bias assessment of LLMs for Bangla to the best of our knowledge. All our code and resources are publicly available for the progress of bias related research in Bangla NLP. △ Less

Submitted 3 July, 2024; originally announced July 2024.

arXiv:2406.17375 [pdf, other]

An Empirical Study on the Characteristics of Bias upon Context Length Variation for Bangla

Authors: Jayanta Sadhu, Ayan Antik Khan, Abhik Bhattacharjee, Rifat Shahriyar

Abstract: Pretrained language models inherently exhibit various social biases, prompting a crucial examination of their social impact across various linguistic contexts due to their widespread usage. Previous studies have provided numerous methods for intrinsic bias measurements, predominantly focused on high-resource languages. In this work, we aim to extend these investigations to Bangla, a low-resource l… ▽ More Pretrained language models inherently exhibit various social biases, prompting a crucial examination of their social impact across various linguistic contexts due to their widespread usage. Previous studies have provided numerous methods for intrinsic bias measurements, predominantly focused on high-resource languages. In this work, we aim to extend these investigations to Bangla, a low-resource language. Specifically, in this study, we (1) create a dataset for intrinsic gender bias measurement in Bangla, (2) discuss necessary adaptations to apply existing bias measurement methods for Bangla, and (3) examine the impact of context length variation on bias measurement, a factor that has been overlooked in previous studies. Through our experiments, we demonstrate a clear dependency of bias metrics on context length, highlighting the need for nuanced considerations in Bangla bias analysis. We consider our work as a step** stone for bias measurement in the Bangla Language and make all of our resources publicly available to support future research. △ Less

Submitted 25 June, 2024; originally announced June 2024.

Comments: Accepted in Findings of ACL, 2024

arXiv:2403.15952 [pdf, other]

IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models

Authors: Haz Sameen Shahgir, Khondker Salman Sayeed, Abhik Bhattacharjee, Wasi Uddin Ahmad, Yue Dong, Rifat Shahriyar

Abstract: The advent of Vision Language Models (VLM) has allowed researchers to investigate the visual understanding of a neural network using natural language. Beyond object classification and detection, VLMs are capable of visual comprehension and common-sense reasoning. This naturally led to the question: How do VLMs respond when the image itself is inherently unreasonable? To this end, we present Illusi… ▽ More The advent of Vision Language Models (VLM) has allowed researchers to investigate the visual understanding of a neural network using natural language. Beyond object classification and detection, VLMs are capable of visual comprehension and common-sense reasoning. This naturally led to the question: How do VLMs respond when the image itself is inherently unreasonable? To this end, we present IllusionVQA: a diverse dataset of challenging optical illusions and hard-to-interpret scenes to test the capability of VLMs in two distinct multiple-choice VQA tasks - comprehension and soft localization. GPT4V, the best-performing VLM, achieves 62.99% accuracy (4-shot) on the comprehension task and 49.7% on the localization task (4-shot and Chain-of-Thought). Human evaluation reveals that humans achieve 91.03% and 100% accuracy in comprehension and localization. We discover that In-Context Learning (ICL) and Chain-of-Thought reasoning substantially degrade the performance of GeminiPro on the localization task. Tangentially, we discover a potential weakness in the ICL capabilities of VLMs: they fail to locate optical illusions even when the correct answer is in the context window as a few-shot example. △ Less

Submitted 30 March, 2024; v1 submitted 23 March, 2024; originally announced March 2024.

arXiv:2210.05109 [pdf, other]

BanglaParaphrase: A High-Quality Bangla Paraphrase Dataset

Authors: Ajwad Akil, Najrin Sultana, Abhik Bhattacharjee, Rifat Shahriyar

Abstract: In this work, we present BanglaParaphrase, a high-quality synthetic Bangla Paraphrase dataset curated by a novel filtering pipeline. We aim to take a step towards alleviating the low resource status of the Bangla language in the NLP domain through the introduction of BanglaParaphrase, which ensures quality by preserving both semantics and diversity, making it particularly useful to enhance other B… ▽ More In this work, we present BanglaParaphrase, a high-quality synthetic Bangla Paraphrase dataset curated by a novel filtering pipeline. We aim to take a step towards alleviating the low resource status of the Bangla language in the NLP domain through the introduction of BanglaParaphrase, which ensures quality by preserving both semantics and diversity, making it particularly useful to enhance other Bangla datasets. We show a detailed comparative analysis between our dataset and models trained on it with other existing works to establish the viability of our synthetic paraphrase data generation pipeline. We are making the dataset and models publicly available at https://github.com/csebuetnlp/banglaparaphrase to further the state of Bangla NLP. △ Less

Submitted 10 October, 2022; originally announced October 2022.

Comments: AACL 2022 (camera-ready)

arXiv:2206.11249 [pdf, other]

GEMv2: Multilingual NLG Benchmarking in a Single Line of Code

Authors: Sebastian Gehrmann, Abhik Bhattacharjee, Abinaya Mahendiran, Alex Wang, Alexandros Papangelis, Aman Madaan, Angelina McMillan-Major, Anna Shvets, Ashish Upadhyay, Bingsheng Yao, Bryan Wilie, Chandra Bhagavatula, Chaobin You, Craig Thomson, Cristina Garbacea, Dakuo Wang, Daniel Deutsch, Deyi Xiong, Di **, Dimitra Gkatzia, Dragomir Radev, Elizabeth Clark, Esin Durmus, Faisal Ladhak, Filip Ginter , et al. (52 additional authors not shown)

Abstract: Evaluation in machine learning is usually informed by past choices, for example which datasets or metrics to use. This standardization enables the comparison on equal footing using leaderboards, but the evaluation choices become sub-optimal as better alternatives arise. This problem is especially pertinent in natural language generation which requires ever-improving suites of datasets, metrics, an… ▽ More Evaluation in machine learning is usually informed by past choices, for example which datasets or metrics to use. This standardization enables the comparison on equal footing using leaderboards, but the evaluation choices become sub-optimal as better alternatives arise. This problem is especially pertinent in natural language generation which requires ever-improving suites of datasets, metrics, and human evaluation to make definitive claims. To make following best model evaluation practices easier, we introduce GEMv2. The new version of the Generation, Evaluation, and Metrics Benchmark introduces a modular infrastructure for dataset, model, and metric developers to benefit from each others work. GEMv2 supports 40 documented datasets in 51 languages. Models for all datasets can be evaluated online and our interactive data card creation and rendering tools make it easier to add new datasets to the living benchmark. △ Less

Submitted 24 June, 2022; v1 submitted 22 June, 2022; originally announced June 2022.

arXiv:2205.11081 [pdf, other]

BanglaNLG and BanglaT5: Benchmarks and Resources for Evaluating Low-Resource Natural Language Generation in Bangla

Authors: Abhik Bhattacharjee, Tahmid Hasan, Wasi Uddin Ahmad, Rifat Shahriyar

Abstract: This work presents BanglaNLG, a comprehensive benchmark for evaluating natural language generation (NLG) models in Bangla, a widely spoken yet low-resource language. We aggregate six challenging conditional text generation tasks under the BanglaNLG benchmark, introducing a new dataset on dialogue generation in the process. Furthermore, using a clean corpus of 27.5 GB of Bangla data, we pretrain Ba… ▽ More This work presents BanglaNLG, a comprehensive benchmark for evaluating natural language generation (NLG) models in Bangla, a widely spoken yet low-resource language. We aggregate six challenging conditional text generation tasks under the BanglaNLG benchmark, introducing a new dataset on dialogue generation in the process. Furthermore, using a clean corpus of 27.5 GB of Bangla data, we pretrain BanglaT5, a sequence-to-sequence Transformer language model for Bangla. BanglaT5 achieves state-of-the-art performance in all of these tasks, outperforming several multilingual models by up to 9% absolute gain and 32% relative gain. We are making the new dialogue dataset and the BanglaT5 model publicly available at https://github.com/csebuetnlp/BanglaNLG in the hope of advancing future research on Bangla NLG. △ Less

Submitted 11 February, 2023; v1 submitted 23 May, 2022; originally announced May 2022.

Comments: Findings of EACL 2023 (camera-ready)

arXiv:2112.13760 [pdf, other]

A Crowd-enabled Solution for Privacy-Preserving and Personalized Safe Route Planning for Fixed or Flexible Destinations (Full Version)

Authors: Fariha Tabassum Islam, Tanzima Hashem, Rifat Shahriyar

Abstract: Ensuring travelers' safety on roads has become a research challenge in recent years. We introduce a novel safe route planning problem and develop an efficient solution to ensure the travelers' safety on roads. Though few research attempts have been made in this regard, all of them assume that people share their sensitive travel experiences with a centralized entity for finding the safest routes, w… ▽ More Ensuring travelers' safety on roads has become a research challenge in recent years. We introduce a novel safe route planning problem and develop an efficient solution to ensure the travelers' safety on roads. Though few research attempts have been made in this regard, all of them assume that people share their sensitive travel experiences with a centralized entity for finding the safest routes, which is not ideal in practice for privacy reasons. Furthermore, existing works formulate safe route planning in ways that do not meet a traveler's need for safe travel on roads. Our approach finds the safest routes within a user-specified distance threshold based on the personalized travel experience of the knowledgeable crowd without involving any centralized computation. We develop a privacy-preserving model to quantify the travel experience of a user into personalized safety scores. Our algorithms for finding the safest route further enhance user privacy by minimizing the exposure of personalized safety scores with others. Our safe route planner can find the safest routes for individuals and groups by considering both a fixed and a set of flexible destination locations. Extensive experiments using real datasets show that our approach finds the safest route in seconds. Compared to the direct algorithm, our iterative algorithm requires 47% less exposure of personalized safety scores. △ Less

Submitted 9 September, 2022; v1 submitted 27 December, 2021; originally announced December 2021.

arXiv:2112.08804 [pdf, other]

CrossSum: Beyond English-Centric Cross-Lingual Summarization for 1,500+ Language Pairs

Authors: Abhik Bhattacharjee, Tahmid Hasan, Wasi Uddin Ahmad, Yuan-Fang Li, Yong-Bin Kang, Rifat Shahriyar

Abstract: We present CrossSum, a large-scale cross-lingual summarization dataset comprising 1.68 million article-summary samples in 1,500+ language pairs. We create CrossSum by aligning parallel articles written in different languages via cross-lingual retrieval from a multilingual abstractive summarization dataset and perform a controlled human evaluation to validate its quality. We propose a multistage da… ▽ More We present CrossSum, a large-scale cross-lingual summarization dataset comprising 1.68 million article-summary samples in 1,500+ language pairs. We create CrossSum by aligning parallel articles written in different languages via cross-lingual retrieval from a multilingual abstractive summarization dataset and perform a controlled human evaluation to validate its quality. We propose a multistage data sampling algorithm to effectively train a cross-lingual summarization model capable of summarizing an article in any target language. We also introduce LaSE, an embedding-based metric for automatically evaluating model-generated summaries. LaSE is strongly correlated with ROUGE and, unlike ROUGE, can be reliably measured even in the absence of references in the target language. Performance on ROUGE and LaSE indicate that our proposed model consistently outperforms baseline models. To the best of our knowledge, CrossSum is the largest cross-lingual summarization dataset and the first ever that is not centered around English. We are releasing the dataset, training and evaluation scripts, and models to spur future research on cross-lingual summarization. The resources can be found at https://github.com/csebuetnlp/CrossSum △ Less

Submitted 25 May, 2023; v1 submitted 16 December, 2021; originally announced December 2021.

Comments: ACL 2023 (camera-ready)

arXiv:2107.09587 [pdf, other]

A Survey-Based Qualitative Study to Characterize Expectations of Software Developers from Five Stakeholders

Authors: Khalid Hasan, Partho Chakraborty, Rifat Shahriyar, Anindya Iqbal, Gias Uddin

Abstract: Background: Studies on developer productivity and well-being find that the perceptions of productivity in a software team can be a socio-technical problem. Intuitively, problems and challenges can be better handled by managing expectations in software teams. Aim: Our goal is to understand whether the expectations of software developers vary towards diverse stakeholders in software teams. Method: W… ▽ More Background: Studies on developer productivity and well-being find that the perceptions of productivity in a software team can be a socio-technical problem. Intuitively, problems and challenges can be better handled by managing expectations in software teams. Aim: Our goal is to understand whether the expectations of software developers vary towards diverse stakeholders in software teams. Method: We surveyed 181 professional software developers to understand their expectations from five different stakeholders: (1) organizations, (2) managers, (3) peers, (4) new hires, and (5) government and educational institutions. The five stakeholders are determined by conducting semi-formal interviews of software developers. We ask open-ended survey questions and analyze the responses using open coding. Results: We observed 18 multi-faceted expectations types. While some expectations are more specific to a stakeholder, other expectations are cross-cutting. For example, developers expect work-benefits from their organizations, but expect the adoption of standard software engineering (SE) practices from their organizations, peers, and new hires. Conclusion: Out of the 18 categories, three categories are related to career growth. This observation supports previous research that happiness cannot be assured by simply offering more money or a promotion. Among the most number of responses, we find expectations from educational institutions to offer relevant teaching and from governments to improve job stability, which indicate the increasingly important roles of these organizations to help software developers. This observation can be especially true during the COVID-19 pandemic. △ Less

Submitted 20 July, 2021; originally announced July 2021.

Comments: 15th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, ESEM 2021 (camera-ready)

arXiv:2106.13822 [pdf, other]

XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages

Authors: Tahmid Hasan, Abhik Bhattacharjee, Md Saiful Islam, Kazi Samin, Yuan-Fang Li, Yong-Bin Kang, M. Sohel Rahman, Rifat Shahriyar

Abstract: Contemporary works on abstractive text summarization have focused primarily on high-resource languages like English, mostly due to the limited availability of datasets for low/mid-resource ones. In this work, we present XL-Sum, a comprehensive and diverse dataset comprising 1 million professionally annotated article-summary pairs from BBC, extracted using a set of carefully designed heuristics. Th… ▽ More Contemporary works on abstractive text summarization have focused primarily on high-resource languages like English, mostly due to the limited availability of datasets for low/mid-resource ones. In this work, we present XL-Sum, a comprehensive and diverse dataset comprising 1 million professionally annotated article-summary pairs from BBC, extracted using a set of carefully designed heuristics. The dataset covers 44 languages ranging from low to high-resource, for many of which no public dataset is currently available. XL-Sum is highly abstractive, concise, and of high quality, as indicated by human and intrinsic evaluation. We fine-tune mT5, a state-of-the-art pretrained multilingual model, with XL-Sum and experiment on multilingual and low-resource summarization tasks. XL-Sum induces competitive results compared to the ones obtained using similar monolingual datasets: we show higher than 11 ROUGE-2 scores on 10 languages we benchmark on, with some of them exceeding 15, as obtained by multilingual training. Additionally, training on low-resource languages individually also provides competitive performance. To the best of our knowledge, XL-Sum is the largest abstractive summarization dataset in terms of the number of samples collected from a single source and the number of languages covered. We are releasing our dataset and models to encourage future research on multilingual abstractive summarization. The resources can be found at \url{https://github.com/csebuetnlp/xl-sum}. △ Less

Submitted 25 June, 2021; originally announced June 2021.

Comments: Findings of the Association for Computational Linguistics, ACL 2021 (camera-ready)

arXiv:2105.14220 [pdf, other]

CoDesc: A Large Code-Description Parallel Dataset

Authors: Masum Hasan, Tanveer Muttaqueen, Abdullah Al Ishtiaq, Kazi Sajeed Mehrab, Md. Mahim Anjum Haque, Tahmid Hasan, Wasi Uddin Ahmad, Anindya Iqbal, Rifat Shahriyar

Abstract: Translation between natural language and source code can help software development by enabling developers to comprehend, ideate, search, and write computer programs in natural language. Despite growing interest from the industry and the research community, this task is often difficult due to the lack of large standard datasets suitable for training deep neural models, standard noise removal method… ▽ More Translation between natural language and source code can help software development by enabling developers to comprehend, ideate, search, and write computer programs in natural language. Despite growing interest from the industry and the research community, this task is often difficult due to the lack of large standard datasets suitable for training deep neural models, standard noise removal methods, and evaluation benchmarks. This leaves researchers to collect new small-scale datasets, resulting in inconsistencies across published works. In this study, we present CoDesc -- a large parallel dataset composed of 4.2 million Java methods and natural language descriptions. With extensive analysis, we identify and remove prevailing noise patterns from the dataset. We demonstrate the proficiency of CoDesc in two complementary tasks for code-description pairs: code summarization and code search. We show that the dataset helps improve code search by up to 22\% and achieves the new state-of-the-art in code summarization. Furthermore, we show CoDesc's effectiveness in pre-training--fine-tuning setup, opening possibilities in building pretrained language models for Java. To facilitate future research, we release the dataset, a data processing tool, and a benchmark at \url{https://github.com/csebuetnlp/CoDesc}. △ Less

Submitted 29 May, 2021; originally announced May 2021.

Comments: Findings of the Association for Computational Linguistics, ACL 2021 (camera-ready)

arXiv:2105.01709 [pdf, other]

doi 10.1016/j.infsof.2021.106603

How do developers discuss and support new programming languages in technical Q&A site? An empirical study of Go, Swift, and Rust in Stack Overflow

Authors: Partha Chakraborty, Rifat Shahriyar, Anindya Iqbal, Gias Uddin

Abstract: New programming languages (e.g., Swift, Go, Rust, etc.) are being introduced to provide a better opportunity for the developers to make software development robust and easy. At the early stage, a programming language is likely to have resource constraints that encourage the developers to seek help frequently from experienced peers active in QA sites such as Stack Overflow (SO). In this study, we h… ▽ More New programming languages (e.g., Swift, Go, Rust, etc.) are being introduced to provide a better opportunity for the developers to make software development robust and easy. At the early stage, a programming language is likely to have resource constraints that encourage the developers to seek help frequently from experienced peers active in QA sites such as Stack Overflow (SO). In this study, we have formally studied the discussions on three popular new languages introduced after the inception of SO (2008) and match those with the relevant activities in GitHub whenever appropriate. For that purpose, we have mined 4,17,82,536 questions and answers from SO and 7,846 issue information along with 6,60,965 repository information from GitHub. Initially, the development of new languages is relatively slow compared to mature languages (e.g., C, C++, Java). The expected outcome of this study is to reveal the difficulties and challenges faced by the developers working with these languages so that appropriate measures can be taken to expedite the generation of relevant resources. We have used the LDA method on SO's questions and answers to identify different topics of new languages. We have extracted several features of the answer pattern of the new languages from SO to study their characteristics. These attributes were used to identify difficult topics. We explored the background of developers who are contributing to these languages. We have created a model by combining Stack Overflow data and issues, repository, user data of GitHub. Finally, we have used that model to identify factors that affect language evolution. We believe that the outcome of our study is likely to help the owner/sponsor of these languages to design better features and documentation. It will also help the software developers or students to prepare themselves to work on these languages in an informed way. △ Less

Submitted 4 May, 2021; originally announced May 2021.

Comments: Information and Software Technology

Journal ref: Information and Software Technology 137 (2021) 106603

arXiv:2104.08301 [pdf, other]

Text2App: A Framework for Creating Android Apps from Text Descriptions

Authors: Masum Hasan, Kazi Sajeed Mehrab, Wasi Uddin Ahmad, Rifat Shahriyar

Abstract: We present Text2App -- a framework that allows users to create functional Android applications from natural language specifications. The conventional method of source code generation tries to generate source code directly, which is impractical for creating complex software. We overcome this limitation by transforming natural language into an abstract intermediate formal language representing an ap… ▽ More We present Text2App -- a framework that allows users to create functional Android applications from natural language specifications. The conventional method of source code generation tries to generate source code directly, which is impractical for creating complex software. We overcome this limitation by transforming natural language into an abstract intermediate formal language representing an application with a substantially smaller number of tokens. The intermediate formal representation is then compiled into target source codes. This abstraction of programming details allows seq2seq networks to learn complex application structures with less overhead. In order to train sequence models, we introduce a data synthesis method grounded in a human survey. We demonstrate that Text2App generalizes well to unseen combination of app components and it is capable of handling noisy natural language instructions. We explore the possibility of creating applications from highly abstract instructions by coupling our system with GPT-3 -- a large pretrained language model. We perform an extensive human evaluation and identify the capabilities and limitations of our system. The source code, a ready-to-run demo notebook, and a demo video are publicly available at \url{https://github.com/text2app/Text2App}. △ Less

Submitted 7 July, 2021; v1 submitted 16 April, 2021; originally announced April 2021.

Comments: Submitted to EMNLP 2021 System Demonstrations

arXiv:2104.08017 [pdf, other]

BERT2Code: Can Pretrained Language Models be Leveraged for Code Search?

Authors: Abdullah Al Ishtiaq, Masum Hasan, Md. Mahim Anjum Haque, Kazi Sajeed Mehrab, Tanveer Muttaqueen, Tahmid Hasan, Anindya Iqbal, Rifat Shahriyar

Abstract: Millions of repetitive code snippets are submitted to code repositories every day. To search from these large codebases using simple natural language queries would allow programmers to ideate, prototype, and develop easier and faster. Although the existing methods have shown good performance in searching codes when the natural language description contains keywords from the code, they are still fa… ▽ More Millions of repetitive code snippets are submitted to code repositories every day. To search from these large codebases using simple natural language queries would allow programmers to ideate, prototype, and develop easier and faster. Although the existing methods have shown good performance in searching codes when the natural language description contains keywords from the code, they are still far behind in searching codes based on the semantic meaning of the natural language query and semantic structure of the code. In recent years, both natural language and programming language research communities have created techniques to embed them in vector spaces. In this work, we leverage the efficacy of these embedding models using a simple, lightweight 2-layer neural network in the task of semantic code search. We show that our model learns the inherent relationship between the embedding spaces and further probes into the scope of improvement by empirically analyzing the embedding methods. In this analysis, we show that the quality of the code embedding model is the bottleneck for our model's performance, and discuss future directions of study in this area. △ Less

Submitted 16 April, 2021; originally announced April 2021.

Comments: Submitted to ICANN2021

arXiv:2101.00204 [pdf, other]

BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla

Authors: Abhik Bhattacharjee, Tahmid Hasan, Wasi Uddin Ahmad, Kazi Samin, Md Saiful Islam, Anindya Iqbal, M. Sohel Rahman, Rifat Shahriyar

Abstract: In this work, we introduce BanglaBERT, a BERT-based Natural Language Understanding (NLU) model pretrained in Bangla, a widely spoken yet low-resource language in the NLP literature. To pretrain BanglaBERT, we collect 27.5 GB of Bangla pretraining data (dubbed `Bangla2B+') by crawling 110 popular Bangla sites. We introduce two downstream task datasets on natural language inference and question answ… ▽ More In this work, we introduce BanglaBERT, a BERT-based Natural Language Understanding (NLU) model pretrained in Bangla, a widely spoken yet low-resource language in the NLP literature. To pretrain BanglaBERT, we collect 27.5 GB of Bangla pretraining data (dubbed `Bangla2B+') by crawling 110 popular Bangla sites. We introduce two downstream task datasets on natural language inference and question answering and benchmark on four diverse NLU tasks covering text classification, sequence labeling, and span prediction. In the process, we bring them under the first-ever Bangla Language Understanding Benchmark (BLUB). BanglaBERT achieves state-of-the-art results outperforming multilingual and monolingual models. We are making the models, datasets, and a leaderboard publicly available at https://github.com/csebuetnlp/banglabert to advance Bangla NLP. △ Less

Submitted 10 May, 2022; v1 submitted 1 January, 2021; originally announced January 2021.

Comments: Findings of North American Chapter of the Association for Computational Linguistics, NAACL 2022 (camera-ready)

arXiv:2009.09359 [pdf, other]

Not Low-Resource Anymore: Aligner Ensembling, Batch Filtering, and New Datasets for Bengali-English Machine Translation

Authors: Tahmid Hasan, Abhik Bhattacharjee, Kazi Samin, Masum Hasan, Madhusudan Basak, M. Sohel Rahman, Rifat Shahriyar

Abstract: Despite being the seventh most widely spoken language in the world, Bengali has received much less attention in machine translation literature due to being low in resources. Most publicly available parallel corpora for Bengali are not large enough; and have rather poor quality, mostly because of incorrect sentence alignments resulting from erroneous sentence segmentation, and also because of a hig… ▽ More Despite being the seventh most widely spoken language in the world, Bengali has received much less attention in machine translation literature due to being low in resources. Most publicly available parallel corpora for Bengali are not large enough; and have rather poor quality, mostly because of incorrect sentence alignments resulting from erroneous sentence segmentation, and also because of a high volume of noise present in them. In this work, we build a customized sentence segmenter for Bengali and propose two novel methods for parallel corpus creation on low-resource setups: aligner ensembling and batch filtering. With the segmenter and the two methods combined, we compile a high-quality Bengali-English parallel corpus comprising of 2.75 million sentence pairs, more than 2 million of which were not available before. Training on neural models, we achieve an improvement of more than 9 BLEU score over previous approaches to Bengali-English machine translation. We also evaluate on a new test set of 1000 pairs made with extensive quality control. We release the segmenter, parallel corpus, and the evaluation set, thus elevating Bengali from its low-resource status. To the best of our knowledge, this is the first ever large scale study on Bengali-English machine translation. We believe our study will pave the way for future research on Bengali-English machine translation as well as other low-resource languages. Our data and code are available at https://github.com/csebuetnlp/banglanmt. △ Less

Submitted 7 October, 2020; v1 submitted 20 September, 2020; originally announced September 2020.

Comments: EMNLP 2020

arXiv:1912.03437 [pdf, ps, other]

doi 10.1016/j.infsof.2021.106756

Early Prediction for Merged vs Abandoned Code Changes in Modern Code Reviews

Authors: Md. Khairul Islam, Toufique Ahmed, Rifat Shahriyar, Anindya Iqbal, Gias Uddin

Abstract: The modern code review process is an integral part of the current software development practice. Considerable effort is given here to inspect code changes, find defects, suggest an improvement, and address the suggestions of the reviewers. In a code review process, usually, several iterations take place where an author submits code changes and a reviewer gives feedback until is happy to accept the… ▽ More The modern code review process is an integral part of the current software development practice. Considerable effort is given here to inspect code changes, find defects, suggest an improvement, and address the suggestions of the reviewers. In a code review process, usually, several iterations take place where an author submits code changes and a reviewer gives feedback until is happy to accept the change. In around 12% cases, the changes are abandoned, eventually wasting all the efforts. In this research, our objective is to design a tool that can predict whether a code change would be merged or abandoned at an early stage to reduce the waste of efforts of all stakeholders (e.g., program author, reviewer, project management, etc.) involved. The real-world demand for such a tool was formally identified by a study by Fan et al. [1]. We have mined 146,612 code changes from the code reviews of three large and popular open-source software and trained and tested a suite of supervised machine learning classifiers, both shallow and deep learning based. We consider a total of 25 features in each code change during the training and testing of the models. The best performing model named PredCR (Predicting Code Review), a LightGBM-based classifier achieves around 85% AUC score on average and relatively improves the state-of-the-art [1] by 14-23%. In our empirical study on the 146,612 code changes from the three software projects, we find that (1) The new features like reviewer dimensions that are introduced in PredCR are the most informative. (2) Compared to the baseline, PredCR is more effective towards reducing bias against new developers. (3) PredCR uses historical data in the code review repository and as such the performance of PredCR improves as a software system evolves with new and more data. △ Less

Submitted 30 August, 2021; v1 submitted 6 December, 2019; originally announced December 2019.

arXiv:1811.04169 [pdf, other]

Understanding the Motivations, Challenges and Needs of Blockchain Software Developers: A Survey

Authors: Amiangshu Bosu, Anindya Iqbal, Rifat Shahriyar, Partha Chakroborty

Abstract: The blockchain technology has potential applications in various areas such as smart-contracts, Internet of Things (IoT), land registry, supply chain management, storing medical data, and identity management. Although the Github currently hosts more than six thousand active Blockchain software (BCS) projects, few software engineering research has investigated these projects and its' contributors. A… ▽ More The blockchain technology has potential applications in various areas such as smart-contracts, Internet of Things (IoT), land registry, supply chain management, storing medical data, and identity management. Although the Github currently hosts more than six thousand active Blockchain software (BCS) projects, few software engineering research has investigated these projects and its' contributors. Although the number of BCS projects is growing rapidly, the motivations, challenges, and needs of BCS developers remain a puzzle. Therefore, the primary objective of this study is to understand the motivations, challenges, and needs of BCS developers and analyze the differences between BCS and non-BCS development. On this goal, we sent an online survey to 1,604 active BCS developers identified via mining the Github repositories of 145 popular BCS projects. The survey received 156 responses that met our criteria for analysis. The results suggest that the majority of the BCS developers are experienced in non-BCS development and are primarily motivated by the ideology of creating a decentralized financial system. Although most of the BCS projects are Open Source Software (OSS) projects by nature, more than 93% of our respondents found BCS development somewhat different from a non-BCS development as BCS projects have higher emphasis on security and reliability than most of the non-BCS projects. Other differences include: higher costs of defects, decentralized and hostile environment, technological complexity, and difficulty in upgrading the software after release. Software development tools that are tuned for non-BCS development are inadequate for BCS and the ecosystem needs an array of new or improved tools, such as: customized IDE for BCS development tasks, debuggers for smart-contracts, testing support, easily deployable simulators, and BCS domain specific design notations. △ Less

Submitted 19 March, 2019; v1 submitted 9 November, 2018; originally announced November 2018.

Journal ref: Empirical Software Engineering, 2019

Showing 1–19 of 19 results for author: Shahriyar, R