Search | arXiv e-print repository

arXiv:2404.01399 [pdf, other]

Safe and Responsible Large Language Model : Can We Balance Bias Reduction and Language Understanding in Large Language Models?

Authors: Shaina Raza, Oluwanifemi Bamgbose, Shardul Ghuge, Fatemeh Tavakol, Deepak John Reji, Syed Raza Bashir

Abstract: Large Language Models (LLMs) have significantly advanced various NLP tasks. However, these models often risk generating unsafe text that perpetuates biases. Current approaches to produce unbiased outputs from LLMs can reduce biases but at the expense of knowledge retention. In this research, we address the question of whether producing safe (unbiased) outputs through LLMs can retain knowledge and language understanding. In response, we developed the Safety and Responsible Large Language Model (SR$_{\text{LLM}}$), an LLM that has been instruction fine-tuned on top of already safe LLMs (e.g., Llama2 or related) to diminish biases in generated text. To achieve our goals, we compiled a specialized dataset designed to train our model in identifying and correcting biased text. We conduct experiments, both on this custom data and out-of-distribution test sets, to show the bias reduction and knowledge retention. The results confirm that SR$_{\text{LLM}}$ outperforms traditional fine-tuning and prompting methods in both reducing biases and preserving the integrity of language knowledge. The significance of our findings lies in demonstrating that instruction fine-tuning can provide a more robust solution for bias reduction in LLMs. We have made our code and data available at https://github.com/shainarazavi/Safe-Responsible-LLM.

Submitted 1 July, 2024; v1 submitted 1 April, 2024; originally announced April 2024.

arXiv:2401.11305 [pdf, other]

Progress in Privacy Protection: A Review of Privacy Preserving Techniques in Recommender Systems, Edge Computing, and Cloud Computing

Authors: Syed Raza Bashir, Shaina Raza, Vojislav Misic

Abstract: As digital technology evolves, the increasing use of connected devices brings both challenges and opportunities in the areas of mobile crowdsourcing, edge computing, and recommender systems. This survey focuses on these dynamic fields, emphasizing the critical need for privacy protection in our increasingly data-oriented world. It explores the latest trends in these interconnected areas, with a special emphasis on privacy and data security. Our method involves an in-depth analysis of various academic works, which helps us to gain a comprehensive understanding of these sectors and their shifting focus towards privacy concerns. This work presents new insights and marks a significant advancement in addressing privacy issues within these technologies. The survey is a valuable resource for researchers, industry practitioners, and policy makers, offering an extensive overview of these fields and their related privacy challenges, catering to a wide audience in the modern digital era.

Submitted 20 January, 2024; originally announced January 2024.

arXiv:2308.01681 [pdf, other]

NBIAS: A Natural Language Processing Framework for Bias Identification in Text

Authors: Shaina Raza, Muskan Garg, Deepak John Reji, Syed Raza Bashir, Chen Ding

Abstract: Bias in textual data can lead to skewed interpretations and outcomes when the data is used. These biases could perpetuate stereotypes, discrimination, or other forms of unfair treatment. An algorithm trained on biased data may end up making decisions that disproportionately impact a certain group of people. Therefore, it is crucial to detect and remove these biases to ensure the fair and ethical use of data. To this end, we develop a comprehensive and robust framework, NBIAS, that consists of four main layers: data, corpus construction, model development, and evaluation. The dataset is constructed by collecting diverse data from various domains, including social media, healthcare, and job hiring portals. We then apply a transformer-based token classification model that is able to identify biased words and phrases through a unique named entity, BIAS. In the evaluation procedure, we incorporate a blend of quantitative and qualitative measures to gauge the effectiveness of our models. We achieve accuracy improvements ranging from 1% to 8% compared to baselines. We are also able to generate a robust understanding of the model functioning. The proposed approach is applicable to a variety of biases and contributes to the fair and ethical use of textual data.

Submitted 29 August, 2023; v1 submitted 3 August, 2023; originally announced August 2023.

Comments: Under review

arXiv:2305.07041 [pdf, other]

Fairness in Machine Learning meets with Equity in Healthcare

Authors: Shaina Raza, Parisa Osivand Pour, Syed Raza Bashir

Abstract: With the growing utilization of machine learning in healthcare, there is increasing potential to enhance healthcare outcomes. However, this also brings the risk of perpetuating biases in data and model design that can harm certain demographic groups based on factors such as age, gender, and race. This study proposes an artificial intelligence framework, grounded in software engineering principles, for identifying and mitigating biases in data and models while ensuring fairness in healthcare settings. A case study is presented to demonstrate how systematic biases in data can lead to amplified biases in model predictions, and machine learning methods are suggested to prevent such biases. Future research aims to test and validate the proposed ML framework in real-world clinical settings to evaluate its impact on promoting health equity.

Submitted 14 August, 2023; v1 submitted 11 May, 2023; originally announced May 2023.

Comments: Accepted in Association for the Advancement of Artificial Intelligence (AAAI) 2023, Responsible Medical AI, Design, and Operationalization Symposium

arXiv:2303.13314 [pdf]

Leveraging Foundation Models for Clinical Text Analysis

Authors: Shaina Raza, Syed Raza Bashir

Abstract: Infectious diseases are a significant public health concern globally, and extracting relevant information from scientific literature can facilitate the development of effective prevention and treatment strategies. However, the large amount of clinical data available presents a challenge for information extraction. To address this challenge, this study proposes a natural language processing (NLP) framework that uses a pre-trained transformer model fine-tuned on task-specific data to extract key information related to infectious diseases from free-text clinical data. The proposed framework includes three components: a data layer for preparing datasets from clinical texts, a foundation model layer for entity extraction, and an assessment layer for performance analysis. The results of the evaluation indicate that the proposed method outperforms standard methods, and leveraging prior knowledge through the pre-trained transformer model makes it useful for investigating other infectious diseases in the future.

Submitted 20 March, 2023; originally announced March 2023.

arXiv:2303.07024 [pdf]

Addressing Biases in the Texts using an End-to-End Pipeline Approach

Authors: Shaina Raza, Syed Raza Bashir, Sneha, Urooj Qamar

Abstract: The concept of fairness is gaining popularity in academia and industry. Social media is especially vulnerable to media biases and toxic language and comments. We propose a fair ML pipeline that takes a text as input and determines whether it contains biases and toxic content. Then, based on pre-trained word embeddings, it suggests a set of new words to substitute for the biased words; the idea is to lessen the effects of those biases by replacing them with alternative words. We compare our approach to existing fairness models to determine its effectiveness. The results show that our proposed pipeline can detect, identify, and mitigate biases in social media data.

Submitted 13 March, 2023; originally announced March 2023.

Comments: Accepted in Bias @ ECIR 2023

arXiv:2208.01375 [pdf]

BERT4Loc: BERT for Location -- POI Recommender System

Authors: Syed Raza Bashir, Shaina Raza, Vojislav Misic

Abstract: Recommending points of interest (POIs) is a challenging task that requires extracting comprehensive location data from location-based social media platforms. To provide effective location-based recommendations, it's important to analyze users' historical behavior and preferences. In this study, we present a sophisticated location-aware recommendation system that uses Bidirectional Encoder Representations from Transformers (BERT) to offer personalized location-based suggestions. Our model combines location information and user preferences to provide more relevant recommendations compared to models that predict the next POI in a sequence. Our experiments on two benchmark datasets show that our BERT-based model outperforms various state-of-the-art sequential models. Moreover, additional experiments demonstrate the quality of the proposed model's recommendations.

Submitted 16 May, 2023; v1 submitted 2 August, 2022; originally announced August 2022.

arXiv:2207.03938 [pdf]

An Approach to Ensure Fairness in News Articles

Authors: Shaina Raza, Deepak John Reji, Dora D. Liu, Syed Raza Bashir, Usman Naseem

Abstract: Recommender systems, information retrieval, and other information access systems present unique challenges for examining and applying concepts of fairness and bias mitigation in unstructured text. This paper introduces Dbias, which is a Python package to ensure fairness in news articles. Dbias is a trained Machine Learning (ML) pipeline that can take a text (e.g., a paragraph or news story) and detect whether the text is biased or not. Then, it detects the biased words in the text, masks them, and recommends a set of sentences with new words that are bias-free or at least less biased. We incorporate the elements of data science best practices to ensure that this pipeline is reproducible and usable. We show in experiments that this pipeline can be effective for mitigating biases and outperforms the common neural network architectures in ensuring fairness in the news articles.

Submitted 8 July, 2022; originally announced July 2022.

Comments: Accepted in KDD 2022 Workshop on Data Science and Artificial Intelligence for Responsible Recommendations (DS4RRS)

arXiv:2202.08751 [pdf]

Improving Rating and Relevance with Point-of-Interest Recommender System

Authors: Syed Raza Bashir, Vojislav Misic

Abstract: The recommendation of points of interest (POIs) is essential in location-based social networks. It makes it easier for users and locations to share information. Recently, researchers have tended to recommend POIs by treating them as large-scale retrieval systems that require a large amount of training data representing query-item relevance. However, gathering user feedback in retrieval systems is an expensive task. Existing POI recommender systems make recommendations based solely on user and item (location) interactions. However, there are numerous sources of feedback to consider, for example, when the user visits a POI and what the POI is about. Integrating all these different types of feedback is essential when developing a POI recommender. In this paper, we propose using user and item information and auxiliary information to improve the recommendation modelling in a retrieval system. We develop a deep neural network architecture to model query-item relevance in the presence of both collaborative and content information. We also improve the quality of the learned representations of queries and items by including the contextual information from the user feedback data. The application of these learned representations to a large-scale dataset resulted in significant improvements.

Submitted 17 February, 2022; originally announced February 2022.

arXiv:2202.02824 [pdf]

A Summary of COVID-19 Datasets

Authors: Syed Raza Bashir, Shaina Raza, Vidhi Thakkar, Usman Naseem

Abstract: This research presents a review of the main datasets that have been developed for COVID-19 research. We hope this collection will continue to bring together members of the computing community, biomedical experts, and policymakers in the pursuit of effective COVID-19 treatments and management policies. Many organizations around the world, such as the World Health Organization (WHO), Johns Hopkins, the National Institutes of Health (NIH), the COVID-19 open science table, and others, have made numerous datasets available to the public. However, these datasets originate from a variety of different sources and initiatives. The purpose of this research is to summarize the open COVID-19 datasets to make them more accessible to the research community for health systems design and analysis.

Submitted 27 July, 2022; v1 submitted 6 February, 2022; originally announced February 2022.

Comments: Accepted in CAIML 2022: International Conference on Artificial Intelligence and Machine Learning

arXiv:2111.06003 [pdf]

Detecting Fake Points of Interest from Location Data

Authors: Syed Raza Bashir, Vojislav Misic

Abstract: The pervasiveness of GPS-enabled mobile devices and the widespread use of location-based services have resulted in the generation of massive amounts of geo-tagged data. In recent times, data analysis has gained access to more sources, including reviews, news, and images, which also raises questions about the reliability of Point-of-Interest (POI) data sources. While previous research attempted to detect fake POI data through various security mechanisms, the current work attempts to capture the fake POI data in a much simpler way. The proposed work is focused on supervised learning methods and their capability to find hidden patterns in location-based data. The ground truth labels are obtained through real-world data, and the fake data is generated using an API, so we get a dataset with both the real and fake labels on the location data. The objective is to predict the truth about a POI using the Multi-Layer Perceptron (MLP) method. In the proposed work, an MLP-based data classification technique is used to classify location data accurately. The proposed method is compared with traditional classification methods and with recent, robust deep neural methods. The results show that the proposed method is better than the baseline methods.

Submitted 10 November, 2021; originally announced November 2021.

Comments: Accepted in IEEE

Showing 1–11 of 11 results for author: Bashir, S R