Search | arXiv e-print repository

BD-SAT: High-resolution Land Use Land Cover Dataset & Benchmark Results for Developing Division: Dhaka, BD

Authors: Ovi Paul, Abu Bakar Siddik Nayem, Anis Sarker, Amin Ahsan Ali, M Ashraful Amin, AKM Mahbubur Rahman

Abstract: Land Use Land Cover (LULC) analysis on satellite images using deep learning-based methods is significantly helpful in understanding the geography, socio-economic conditions, poverty levels, and urban sprawl in developing countries. Recent works involve segmentation with LULC classes such as farmland, built-up areas, forests, meadows, water bodies, etc. Training deep learning methods on satellite images requires large sets of images annotated with LULC classes. However, annotated data for developing countries are scarce due to a lack of funding, absence of dedicated residential/industrial/economic zones, a large population, and diverse building materials. BD-SAT provides a high-resolution dataset that includes pixel-by-pixel LULC annotations for Dhaka metropolitan city and surrounding rural/urban areas. Using a strict and standardized procedure, the ground truth is created using Bing satellite imagery with a ground spatial distance of 2.22 meters per pixel. A three-stage, well-defined annotation process has been followed with support from GIS experts to ensure the reliability of the annotations. We performed several experiments to establish benchmark results. The results show that the annotated BD-SAT is sufficient to train large deep learning models with adequate accuracy for five major LULC classes: forest, farmland, built-up areas, water bodies, and meadows.

Submitted 9 June, 2024; originally announced June 2024.

Comments: 26 pages, 15 figures and 12 tables

arXiv:2405.19519 [pdf, other]

Two-layer retrieval augmented generation framework for low-resource medical question-answering: proof of concept using Reddit data

Authors: Sudeshna Das, Yao Ge, Yuting Guo, Swati Rajwal, JaMor Hairston, Jeanne Powell, Drew Walker, Snigdha Peddireddy, Sahithi Lakamana, Selen Bozkurt, Matthew Reyna, Reza Sameni, Yunyu Xiao, Sangmi Kim, Rasheeta Chandler, Natalie Hernandez, Danielle Mowery, Rachel Wightman, Jennifer Love, Anthony Spadaro, Jeanmarie Perrone, Abeed Sarker

Abstract: Retrieval augmented generation (RAG) provides the capability to constrain generative model outputs, and mitigate the possibility of hallucination, by providing relevant in-context text. The number of tokens a generative large language model (LLM) can incorporate as context is finite, thus limiting the volume of knowledge from which to generate an answer. We propose a two-layer RAG framework for query-focused answer generation and evaluate a proof-of-concept for this framework in the context of query-focused summary generation from social media forums, focusing on emerging drug-related information. The evaluations demonstrate the effectiveness of the two-layer framework in resource-constrained settings, enabling researchers to obtain near real-time data from users.

Submitted 29 May, 2024; originally announced May 2024.

arXiv:2405.18015 [pdf, other]

MultiADE: A Multi-domain Benchmark for Adverse Drug Event Extraction

Authors: Xiang Dai, Sarvnaz Karimi, Abeed Sarker, Ben Hachey, Cecile Paris

Abstract: Objective. Active adverse event surveillance monitors Adverse Drug Events (ADE) from different data sources, such as electronic health records, medical literature, social media and search engine logs. Over the years, many datasets have been created, and shared tasks have been organised, to facilitate active adverse event surveillance. However, most, if not all, datasets or shared tasks focus on extracting ADEs from a particular type of text. Domain generalisation, the ability of a machine learning model to perform well on new, unseen domains (text types), is under-explored. Given the rapid advancements in natural language processing, one unanswered question is how far we are from having a single ADE extraction model that is effective on various types of text, such as scientific literature and social media posts. Methods. We contribute to answering this question by building a multi-domain benchmark for adverse drug event extraction, which we named MultiADE. The new benchmark comprises several existing datasets sampled from different text types and our newly created dataset, CADECv2, which is an extension of CADEC (Karimi, et al., 2015), covering online posts regarding more diverse drugs than CADEC. Our new dataset is carefully annotated by human annotators following detailed annotation guidelines. Conclusion. Our benchmark results show that the generalisation of the trained models is far from perfect, making it infeasible to deploy them to process different types of text. In addition, although intermediate transfer learning is a promising approach to utilising existing resources, further investigation is needed on methods of domain adaptation, particularly cost-effective methods to select useful training instances.

Submitted 28 May, 2024; originally announced May 2024.

Comments: Under review; feedback welcome

arXiv:2405.06145 [pdf, other]

Reddit-Impacts: A Named Entity Recognition Dataset for Analyzing Clinical and Social Effects of Substance Use Derived from Social Media

Authors: Yao Ge, Sudeshna Das, Karen O'Connor, Mohammed Ali Al-Garadi, Graciela Gonzalez-Hernandez, Abeed Sarker

Abstract: Substance use disorders (SUDs) are a growing concern globally, necessitating enhanced understanding of the problem and its trends through data-driven research. Social media are unique and important sources of information about SUDs, particularly since the data in such sources are often generated by people with lived experiences. In this paper, we introduce Reddit-Impacts, a challenging Named Entity Recognition (NER) dataset curated from subreddits dedicated to discussions on prescription and illicit opioids, as well as medications for opioid use disorder. The dataset specifically concentrates on the lesser-studied, yet critically important, aspects of substance use: its clinical and social impacts. We collected data from chosen subreddits using the publicly available Application Programming Interface for Reddit. We manually annotated text spans representing clinical and social impacts reported by people who also reported personal nonmedical use of substances including but not limited to opioids, stimulants and benzodiazepines. Our objective is to create a resource that can enable the development of systems that can automatically detect clinical and social impacts of substance use from text-based social media data. The successful development of such systems may enable us to better understand how nonmedical use of substances affects individual health and societal dynamics, aiding the development of effective public health strategies. In addition to creating the annotated data set, we applied several machine learning models to establish baseline performances. Specifically, we experimented with transformer models such as BERT and RoBERTa, the few-shot learning model DANN using the full training dataset, and GPT-3.5 using one-shot learning, for automatic NER of clinical and social impacts. The dataset has been made available through the 2024 SMM4H shared tasks.

Submitted 9 May, 2024; originally announced May 2024.

Comments: 7 pages, 1 figure, 4 tables

arXiv:2405.05204 [pdf]

CARE-SD: Classifier-based analysis for recognizing and eliminating stigmatizing and doubt marker labels in electronic health records: model development and validation

Authors: Drew Walker, Annie Thorne, Sudeshna Das, Jennifer Love, Hannah LF Cooper, Melvin Livingston III, Abeed Sarker

Abstract: Objective: To detect and classify features of stigmatizing and biased language in intensive care electronic health records (EHRs) using natural language processing techniques. Materials and Methods: We first created a lexicon and regular expression lists from literature-driven stem words for linguistic features of stigmatizing patient labels, doubt markers, and scare quotes within EHRs. The lexicon was further extended using Word2Vec and GPT 3.5, and refined through human evaluation. These lexicons were used to search for matches across 18 million sentences from the de-identified Medical Information Mart for Intensive Care-III (MIMIC-III) dataset. For each linguistic bias feature, 1000 sentence matches were sampled, labeled by expert clinical and public health annotators, and used to train supervised learning classifiers. Results: Lexicon development from expanded literature stem-word lists resulted in a doubt marker lexicon containing 58 expressions, and a stigmatizing labels lexicon containing 127 expressions. Classifiers for doubt markers and stigmatizing labels had the highest performance, with macro F1-scores of .84 and .79, positive-label recall and precision values ranging from .71 to .86, and accuracies aligning closely with human annotator agreement (.87). Discussion: This study demonstrated the feasibility of supervised classifiers in automatically identifying stigmatizing labels and doubt markers in medical text, and identified trends in stigmatizing language use in an EHR setting. Additional labeled data may help improve the lower performance of the scare quote model. Conclusions: Classifiers developed in this study showed high model performance and can be applied to identify patterns and target interventions to reduce stigmatizing labels and doubt markers in healthcare systems.

Submitted 8 May, 2024; originally announced May 2024.

Comments: 28 pages, 3 figures, 4 tables. 5 Appendices

arXiv:2403.19031 [pdf]

Evaluating Large Language Models for Health-Related Text Classification Tasks with Public Social Media Data

Authors: Yuting Guo, Anthony Ovadje, Mohammed Ali Al-Garadi, Abeed Sarker

Abstract: Large language models (LLMs) have demonstrated remarkable success in NLP tasks. However, there is a paucity of studies that attempt to evaluate their performance on social media-based health-related natural language processing tasks, in which high scores have traditionally been difficult to achieve. We benchmarked one supervised classic machine learning model based on Support Vector Machines (SVMs), three supervised pretrained language models (PLMs) based on RoBERTa, BERTweet, and SocBERT, and two LLM-based classifiers (GPT-3.5 and GPT-4), across 6 text classification tasks. We developed three approaches for leveraging LLMs for text classification: employing LLMs as zero-shot classifiers, using LLMs as annotators to annotate training data for supervised classifiers, and utilizing LLMs with few-shot examples for augmentation of manually annotated data. Our comprehensive experiments demonstrate that employing data augmentation using LLMs (GPT-4) with relatively small human-annotated data to train lightweight supervised classification models achieves superior results compared to training with human-annotated data alone. Supervised learners also outperform GPT-4 and GPT-3.5 in zero-shot settings. By leveraging this data augmentation strategy, we can harness the power of LLMs to develop smaller, more effective domain-specific NLP models. Using LLM-annotated data without human guidance to train lightweight supervised classification models is an ineffective strategy. However, an LLM used as a zero-shot classifier shows promise in excluding false negatives and potentially reducing the human effort required for data annotation. Future investigations are imperative to explore optimal training data sizes and the optimal amounts of augmented data.

Submitted 27 March, 2024; originally announced March 2024.

arXiv:2403.15721 [pdf, other]

Design and Implementation of an Analysis Pipeline for Heterogeneous Data

Authors: Arup Kumar Sarker, Aymen Alsaadi, Niranda Perera, Mills Staylor, Gregor von Laszewski, Matteo Turilli, Ozgur Ozan Kilic, Mikhail Titov, Andre Merzky, Shantenu Jha, Geoffrey Fox

Abstract: Managing and preparing complex data for deep learning, a prevalent approach in large-scale data science, can be challenging. Data transfer for model training also presents difficulties, impacting scientific fields like genomics, climate modeling, and astronomy. A large-scale solution like Google Pathways with a distributed execution environment for deep learning models exists but is proprietary. Integrating existing open-source, scalable runtime tools and data frameworks on high-performance computing (HPC) platforms is crucial to address these challenges. Our objective is to establish a smooth and unified method of combining data engineering and deep learning frameworks with diverse execution capabilities that can be deployed on various high-performance computing platforms, including cloud and supercomputers. We aim to support heterogeneous systems with accelerators, where Cylon and other data engineering and deep learning frameworks can utilize heterogeneous execution. To achieve this, we propose Radical-Cylon, a heterogeneous runtime system with a parallel and distributed data framework to execute Cylon as a task of Radical Pilot. We thoroughly explain Radical-Cylon's design and development and the execution process of Cylon tasks using Radical Pilot. This approach enables the use of heterogeneous MPI-communicators across multiple nodes. Radical-Cylon achieves better performance than Bare-Metal Cylon with minimal and constant overhead. Radical-Cylon achieves 4% to 15% faster execution time than batch execution while performing similar join and sort operations with 35 million and 3.5 billion rows with the same resources. The approach aims to excel in both scientific and engineering research HPC systems while demonstrating robust performance on cloud infrastructures. This dual capability fosters collaboration and innovation within the open-source scientific research community.

Submitted 7 April, 2024; v1 submitted 23 March, 2024; originally announced March 2024.

Comments: 14 pages, 16 figures, 2 tables

ACM Class: H.2.4; D.2.7; D.2.2

arXiv:2403.00821 [pdf, other]

Social Media as a Sensor: Analyzing Twitter Data for Breast Cancer Medication Effects Using Natural Language Processing

Authors: Seibi Kobara, Alireza Rafiei, Masoud Nateghi, Selen Bozkurt, Rishikesan Kamaleswaran, Abeed Sarker

Abstract: Breast cancer is a significant public health concern and is the leading cause of cancer-related deaths among women. Despite advances in breast cancer treatments, medication non-adherence remains a major problem. As electronic health records do not typically capture patient-reported outcomes that may reveal information about medication-related experiences, social media presents an attractive resource for enhancing our understanding of the patients' treatment experiences. In this paper, we developed natural language processing (NLP) based methodologies to study information posted by an automatically curated breast cancer cohort from social media. We employed a transformer-based classifier to identify breast cancer patients/survivors on X (Twitter) based on their self-reported information, and we collected longitudinal data from their profiles. We then designed a multi-layer rule-based model to develop a breast cancer therapy-associated side effect lexicon and detect patterns of medication usage and associated side effects among breast cancer patients. 1,454,637 posts were available from 583,962 unique users, of which 62,042 were detected as breast cancer members using our transformer-based model. 198 cohort members mentioned breast cancer medications with tamoxifen as the most common. Our side effect lexicon identified well-known side effects of hormone and chemotherapy. Furthermore, it discovered a subject feeling towards cancer and medications, which may suggest a pre-clinical phase of side effects or emotional distress. This analysis highlighted not only the utility of NLP techniques in unstructured social media data to identify self-reported breast cancer posts, medication usage patterns, and treatment side effects but also the richness of social data on such clinical questions.

Submitted 26 February, 2024; originally announced March 2024.

arXiv:2402.01826 [pdf, other]

Leveraging Large Language Models for Analyzing Blood Pressure Variations Across Biological Sex from Scientific Literature

Authors: Yuting Guo, Seyedeh Somayyeh Mousavi, Reza Sameni, Abeed Sarker

Abstract: Hypertension, defined as blood pressure (BP) that is above normal, holds paramount significance in the realm of public health, as it serves as a critical precursor to various cardiovascular diseases (CVDs) and significantly contributes to elevated mortality rates worldwide. However, many existing BP measurement technologies and standards might be biased because they do not consider clinical outcomes, comorbidities, or demographic factors, making them inconclusive for diagnostic purposes. There is limited data-driven research focused on studying the variance in BP measurements across these variables. In this work, we employed GPT-35-turbo, a large language model (LLM), to automatically extract the mean and standard deviation values of BP for both males and females from a dataset comprising 25 million abstracts sourced from PubMed. 993 article abstracts met our predefined inclusion criteria (i.e., presence of references to blood pressure, units of blood pressure such as mmHg, and mention of biological sex). Based on the automatically-extracted information from these articles, we conducted an analysis of the variations of BP values across biological sex. Our results showed the viability of utilizing LLMs to study the BP variations across different demographic factors.

Submitted 2 February, 2024; originally announced February 2024.

arXiv:2402.01598 [pdf, other]

Learning from Two Decades of Blood Pressure Data: Demography-Specific Patterns Across 75 Million Patient Encounters

Authors: Seyedeh Somayyeh Mousavi, Yuting Guo, Abeed Sarker, Reza Sameni

Abstract: Hypertension is a global health concern with an increasing prevalence, underscoring the need for effective monitoring and analysis of blood pressure (BP) dynamics. We analyzed a substantial BP dataset comprising 75,636,128 records from 2,054,462 unique patients collected between 2000 and 2022 at Emory Healthcare in Georgia, USA, representing a demographically diverse population. We examined and compared population-wide statistics of bivariate changes in systolic BP (SBP) and diastolic BP (DBP) across sex, age, and race/ethnicity. The analysis revealed that males have higher BP levels than females and exhibit a distinct BP profile with age. Notably, average SBP consistently rises with age, whereas average DBP peaks in the forties age group. Among the ethnic groups studied, Blacks have marginally higher BPs and a greater standard deviation. We also discovered a significant correlation between SBP and DBP at the population level, a phenomenon not previously researched. These results emphasize the importance of demography-specific BP analysis for clinical diagnosis and provide valuable insights for developing personalized, demography-specific healthcare interventions.

Submitted 23 April, 2024; v1 submitted 2 February, 2024; originally announced February 2024.

arXiv:2308.10783 [pdf, other]

Zero- and Few-Shot Prompting with LLMs: A Comparative Study with Fine-tuned Models for Bangla Sentiment Analysis

Authors: Md. Arid Hasan, Shudipta Das, Afiyat Anjum, Firoj Alam, Anika Anjum, Avijit Sarker, Sheak Rashed Haider Noori

Abstract: The rapid expansion of the digital world has propelled sentiment analysis into a critical tool across diverse sectors such as marketing, politics, customer service, and healthcare. While there have been significant advancements in sentiment analysis for widely spoken languages, low-resource languages, such as Bangla, remain largely under-researched due to resource constraints. Furthermore, the recent unprecedented performance of Large Language Models (LLMs) in various applications highlights the need to evaluate them in the context of low-resource languages. In this study, we present a sizeable manually annotated dataset encompassing 33,606 Bangla news tweets and Facebook comments. We also investigate zero- and few-shot in-context learning with several language models, including Flan-T5, GPT-4, and Bloomz, offering a comparative analysis against fine-tuned models. Our findings suggest that monolingual transformer-based models consistently outperform other models, even in zero and few-shot scenarios. To foster continued exploration, we intend to make this dataset and our research tools publicly available to the broader research community.

Submitted 4 April, 2024; v1 submitted 21 August, 2023; originally announced August 2023.

Comments: Accepted at LREC-COLING 2024. Zero-Shot Prompting, Few-Shot Prompting, LLMs, Comparative Study, Fine-tuned Models, Bangla, Sentiment Analysis

MSC Class: 68T50

ACM Class: I.2.7

arXiv:2307.01394 [pdf, ps, other]

In-depth Analysis On Parallel Processing Patterns for High-Performance Dataframes

Authors: Niranda Perera, Arup Kumar Sarker, Mills Staylor, Gregor von Laszewski, Kaiying Shan, Supun Kamburugamuve, Chathura Widanage, Vibhatha Abeykoon, Thejaka Amila Kanewela, Geoffrey Fox

Abstract: The Data Science domain has expanded monumentally in both research and industry communities during the past decade, predominantly owing to the Big Data revolution. Artificial Intelligence (AI) and Machine Learning (ML) are bringing more complexities to data engineering applications, which are now integrated into data processing pipelines to process terabytes of data. Typically, a significant amount of time is spent on data preprocessing in these pipelines, and hence improving its efficiency directly impacts the overall pipeline performance. The community has recently embraced the concept of Dataframes as the de-facto data structure for data representation and manipulation. However, the most widely used serial Dataframes today (R, pandas) experience performance limitations while working on even moderately large data sets. We believe that there is plenty of room for improvement by taking a look at this problem from a high-performance computing point of view. In a prior publication, we presented a set of parallel processing patterns for distributed dataframe operators and the reference runtime implementation, Cylon [1]. In this paper, we are expanding on the initial concept by introducing a cost model for evaluating the said patterns. Furthermore, we evaluate the performance of Cylon on the ORNL Summit supercomputer.

Submitted 3 July, 2023; originally announced July 2023.

Report number: FGCS-D-23-00577R1

arXiv:2304.04314 [pdf, ps, other]

RIS-aided Mixed RF-FSO Wireless Networks: Secrecy Performance Analysis with Simultaneous Eavesdropping

Authors: Md. Mijanur Rahman, A. S. M. Badrudduza, Noor Ahmad Sarker, Md. Ibrahim, Imran Shafique Ansari

Abstract: The appearance of sixth-generation networks has resulted in the proposal of several solutions to tackle signal loss. One of these solutions is the utilization of reconfigurable intelligent surfaces (RIS), which can reflect or refract signals as required. This integration offers significant potential to improve the coverage area from the sender to the receiver. In this paper, we present, for the first time, a comprehensive framework for analyzing the secrecy performance of an RIS-aided mixed radio frequency (RF)-free space optics (FSO) system. Our study assumes that a secure message is transmitted from an RF transmitter to an FSO receiver through an intermediate relay. The RF link experiences Rician fading while the FSO link experiences Málaga distributed turbulence with pointing errors. We examine three scenarios: 1) RF-link eavesdropping, 2) FSO-link eavesdropping, and 3) a simultaneous eavesdropping attack on both RF and FSO links. We evaluate the secrecy performance using analytical expressions to compute secrecy metrics such as the average secrecy capacity, secrecy outage probability, strictly positive secrecy capacity, effective secrecy throughput, and intercept probability. Our results are confirmed via Monte-Carlo simulations and demonstrate that fading parameters, atmospheric turbulence conditions, pointing errors, and detection techniques play a crucial role in enhancing secrecy performance.

Submitted 9 April, 2023; originally announced April 2023.

Comments: No comments

arXiv:2301.11806 [pdf, other]

PCV: A Point Cloud-Based Network Verifier

Authors: Arup Kumar Sarker, Farzana Yasmin Ahmad, Matthew B. Dwyer

Abstract: 3D vision with real-time LiDAR-based point cloud data has become a vital part of autonomous system research, especially in the perception and prediction modules used for object classification, segmentation, and detection. Despite their success, point cloud-based network models are vulnerable to multiple adversarial attacks, where certain changes in the validation set cause a significant performance drop in well-trained networks. Most of the existing verifiers work perfectly on 2D convolution. Due to the complex architecture, the dimension of the hyper-parameters, and 3D convolution, no verifiers can perform the basic layer-wise verification. It is difficult to draw conclusions about the robustness of a 3D vision model without performing the verification, because there will always be corner cases and adversarial inputs that can compromise the model's effectiveness. In this project, we describe a point cloud-based network verifier that successfully handles the state-of-the-art 3D classifier PointNet and verifies its robustness by generating adversarial inputs. We have used extracted properties from the trained PointNet and changed certain factors for perturbation input. We calculate the impact on model accuracy versus property factor and can test the PointNet network's robustness against a small collection of perturbing input states resulting from adversarial attacks like the suggested hybrid reverse signed attack. The experimental results reveal that the resilience property of PointNet is affected by our hybrid reverse signed perturbation strategy.

Submitted 30 January, 2023; v1 submitted 27 January, 2023; originally announced January 2023.

Comments: 11 pages, 12 figures

ACM Class: D.2.2; D.2.3; D.2.4; D.2.5; I.2.10; I.5.4

arXiv:2301.07896 [pdf, other]

Supercharging Distributed Computing Environments For High Performance Data Engineering

Authors: Niranda Perera, Kaiying Shan, Supun Kamburugamuwe, Thejaka Amila Kanewela, Chathura Widanage, Arup Sarker, Mills Staylor, Tianle Zhong, Vibhatha Abeykoon, Geoffrey Fox

Abstract: The data engineering and data science community has embraced the idea of using Python & R dataframes for regular applications. Driven by the big data revolution and artificial intelligence, these applications are now essential in order to process terabytes of data. They can easily exceed the capabilities of a single machine, but also demand significant developer time & effort. Therefore it is essential to design scalable dataframe solutions. There have been multiple attempts to tackle this problem, the most notable being the dataframe systems developed using distributed computing environments such as Dask and Ray. Even though Dask/Ray distributed computing features look very promising, we perceive that the Dask Dataframes/Ray Datasets still have room for optimization. In this paper, we present CylonFlow, an alternative distributed dataframe execution methodology that enables state-of-the-art performance and scalability on the same Dask/Ray infrastructure (thereby supercharging them!). To achieve this, we integrate a high performance dataframe system Cylon, which was originally based on an entirely different execution paradigm, into Dask and Ray. Our experiments show that on a pipeline of dataframe operators, CylonFlow achieves 30x more distributed performance than Dask Dataframes. Interestingly, it also enables superior sequential performance due to the native C++ execution of Cylon. We believe the success of Cylon & CylonFlow extends beyond the data engineering domain, and can be used to consolidate high performance computing and distributed computing ecosystems.

Submitted 19 January, 2023; originally announced January 2023.

arXiv:2301.04591 [pdf, ps, other]

MVAM: Multi-variant Attacks on Memory for IoT Trust Computing

Authors: Arup Kumar Sarker, Md Khairul Islam, Yuan Tian

Abstract: With the significant development of the Internet of Things and low-cost cloud services, the sensory and data processing requirements of IoT systems are continually going up. TrustZone is a hardware-protected Trusted Execution Environment (TEE) for ARM processors specifically designed for IoT handheld systems. It provides memory isolation techniques to protect trusted application data from being exploited by malicious entities. In this work, we focus on identifying different vulnerabilities of the TrustZone extension of ARM Cortex-M processors. We then design and implement a threat model to execute those attacks. We have found that TrustZone is vulnerable to buffer overflow-based attacks. We have used this to create an attack called MOFlow and successfully leaked the data of another trusted app. This is done by intentionally overflowing the memory of one app to access the encrypted memory of other apps inside the secure world. We have also found that, by not validating the input parameters in the entry function, TrustZone has exposed a security weakness. We call this Achilles heel and present an attack model showing how to exploit this weakness too. Our proposed novel attacks are implemented and successfully tested on two recent ARM Cortex-M processors available on the market (M23 and M33).

Submitted 11 January, 2023; originally announced January 2023.

Comments: 12 pages, 6 figures, 6 code blocks

ACM Class: F.2.2; I.2.7

arXiv:2212.13732 [pdf, ps, other]

Hybrid Cloud and HPC Approach to High-Performance Dataframes

Authors: Kaiying Shan, Niranda Perera, Damitha Lenadora, Tianle Zhong, Arup Sarker, Supun Kamburugamuve, Thejaka Amila Kanewela, Chathura Widanage, Geoffrey Fox

Abstract: Data pre-processing is a fundamental component in any data-driven application. With the increasing complexity of data processing operations and volume of data, Cylon, a distributed dataframe system, is developed to facilitate data processing both as a standalone application and as a library, especially for Python applications. While Cylon shows promising performance results, we experienced difficulties trying to integrate with frameworks incompatible with the traditional Message Passing Interface (MPI). While MPI implementations encompass scalable and efficient communication routines, their process launching mechanisms work well with mainstream HPC systems but are incompatible with some environments that adopt their own resource management systems. In this work, we alleviated this issue by directly integrating the Unified Communication X (UCX) framework, which supports a variety of classic HPC and non-HPC process-bootstrapping mechanisms as our communication framework. While we experimented with our methodology on Cylon, the same technique can be used to bring MPI communication to other applications that do not employ MPI's built-in process management approach.

Submitted 29 December, 2022; v1 submitted 28 December, 2022; originally announced December 2022.

arXiv:2212.12454 [pdf]

Generalizable Natural Language Processing Framework for Migraine Reporting from Social Media

Authors: Yuting Guo, Swati Rajwal, Sahithi Lakamana, Chia-Chun Chiang, Paul C. Menell, Adnan H. Shahid, Yi-Chieh Chen, Nikita Chhabra, Wan-Ju Chao, Chieh-Ju Chao, Todd J. Schwedt, Imon Banerjee, Abeed Sarker

Abstract: Migraine is a high-prevalence and disabling neurological disorder. However, information on migraine management in real-world settings could be limited to traditional health information sources. In this paper, we (i) verify that there is substantial migraine-related chatter available on social media (Twitter and Reddit), self-reported by migraine sufferers; (ii) develop a platform-independent text classification system for automatically detecting self-reported migraine-related posts, and (iii) conduct analyses of the self-reported posts to assess the utility of social media for studying this problem. We manually annotated 5750 Twitter posts and 302 Reddit posts. Our system achieved an F1 score of 0.90 on Twitter and 0.93 on Reddit. Analysis of information posted by our 'migraine cohort' revealed the presence of a plethora of relevant information about migraine therapies and patient sentiments associated with them. Our study forms the foundation for conducting an in-depth analysis of migraine-related information using social media data.

Submitted 23 December, 2022; originally announced December 2022.

Comments: Accepted by AMIA 2023 Informatics Summit
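
For readers unfamiliar with the metric quoted in the abstract above, the F1 score is the harmonic mean of precision and recall. The following is a minimal sketch of that standard definition only; the function name `f1_score_from_counts` and the example counts are hypothetical illustrations and are not taken from the paper.

```python
def f1_score_from_counts(true_pos, false_pos, false_neg):
    """Return the F1 score (harmonic mean of precision and recall).

    The three arguments are raw counts for the positive class.
    Degenerate cases with no predictions or no positives return 0.0.
    """
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 90 correct detections, 10 false alarms, 10 misses.
example_f1 = f1_score_from_counts(90, 10, 10)
```

With equal precision and recall, as in the hypothetical counts above, the F1 score equals both of them.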

arXiv:2211.10443 [pdf]

Social media mining for toxicovigilance of prescription medications: End-to-end pipeline, challenges and future work

Authors: Abeed Sarker

Abstract: Substance use, substance use disorder, and overdoses related to substance use are major public health problems globally and in the United States. A key aspect of addressing these problems from a public health standpoint is improved surveillance. Traditional surveillance systems are laggy, and social media are potentially useful sources of timely data. However, mining knowledge from social media is challenging, and requires the development of advanced artificial intelligence, specifically natural language processing (NLP) and machine learning methods. We developed a sophisticated end-to-end pipeline for mining information about nonmedical prescription medication use from social media, namely Twitter and Reddit. Our pipeline employs supervised machine learning and NLP for filtering out noise and characterizing the chatter. In this paper, we describe our end-to-end pipeline developed over four years. In addition to describing our data mining infrastructure, we discuss existing challenges in social media mining for toxicovigilance, and possible future research directions.

Submitted 2 September, 2023; v1 submitted 18 November, 2022; originally announced November 2022.

arXiv:2210.01849 [pdf, other]

Link Partitioning on Simplicial Complexes Using Higher-Order Laplacians

Authors: Xinyi Wu, Arnab Sarker, Ali Jadbabaie

Abstract: Link partitioning is a popular approach in network science used for discovering overlapping communities by identifying clusters of strongly connected links. Current link partitioning methods are specifically designed for networks modelled by graphs representing pairwise relationships. Therefore, these methods are incapable of utilizing higher-order information about group interactions in network data which is increasingly available. Simplicial complexes extend the dyadic model of graphs and can model polyadic relationships which are ubiquitous and crucial in many complex social and technological systems. In this paper, we introduce a link partitioning method that leverages higher-order (i.e. triadic and higher) information in simplicial complexes for better community detection. Our method utilizes a novel random walk on links of simplicial complexes defined by the higher-order Laplacian--a generalization of the graph Laplacian that incorporates polyadic relationships of the network. We transform this random walk into a graph-based random walk on a lifted line graph--a dual graph in which links are nodes while nodes and higher-order connections are links--and optimize for the standard notion of modularity. We show that our method is guaranteed to provide interpretable link partitioning results under mild assumptions. We also offer new theoretical results on the spectral properties of simplicial complexes by studying the spectrum of the link random walk. Experiment results on real-world community detection tasks show that our higher-order approach significantly outperforms existing graph-based link partitioning methods.

Submitted 10 October, 2022; v1 submitted 4 October, 2022; originally announced October 2022.

Comments: Accepted to 22nd IEEE International Conference on Data Mining (ICDM 2022). Fixed some typos in v1

arXiv:2207.11335 [pdf, other]

Generalizing Homophily to Simplicial Complexes

Authors: Arnab Sarker, Natalie Northrup, Ali Jadbabaie

Abstract: Group interactions occur frequently in social settings, yet their properties beyond pairwise relationships in network models remain unexplored. In this work, we study homophily, the nearly ubiquitous phenomenon wherein similar individuals are more likely than random to form connections with one another, and define it on simplicial complexes, a generalization of network models that goes beyond dyadic interactions. While some group homophily definitions have been proposed in the literature, we provide theoretical and empirical evidence that prior definitions mostly inherit properties of homophily in pairwise interactions rather than capture the homophily of group dynamics. Hence, we propose a new measure, $k$-simplicial homophily, which properly identifies homophily in group dynamics. Across 16 empirical networks, $k$-simplicial homophily provides information uncorrelated with homophily measures on pairwise interactions. Moreover, we show the empirical value of $k$-simplicial homophily in identifying when metadata on nodes is useful for predicting group interactions, whereas previous measures are uninformative.

Submitted 22 July, 2022; originally announced July 2022.

Comments: Preprint submitted to International Conference on Complex Networks and their Applications

arXiv:2204.14081 [pdf]

Few-shot learning for medical text: A systematic review

Authors: Yao Ge, Yuting Guo, Yuan-Chi Yang, Mohammed Ali Al-Garadi, Abeed Sarker

Abstract: Objective: Few-shot learning (FSL) methods require small numbers of labeled instances for training. As many medical topics have limited annotated textual data in practical settings, FSL-based natural language processing (NLP) methods hold substantial promise. We aimed to conduct a systematic review to explore the state of FSL methods for medical NLP. Materials and Methods: We searched for articles published between January 2016 and August 2021 using PubMed/Medline, Embase, ACL Anthology, and IEEE Xplore Digital Library. To identify the latest relevant methods, we also searched other sources such as preprint servers (e.g., medRxiv) via Google Scholar. We included all articles that involved FSL and any type of medical text. We abstracted articles based on data source(s), aim(s), training set size(s), primary method(s)/approach(es), and evaluation method(s). Results: 31 studies met our inclusion criteria, all published after 2018; 22 (71%) since 2020. Concept extraction/named entity recognition was the most frequently addressed task (13/31; 42%), followed by text classification (10/31; 32%). Twenty-one (68%) studies reconstructed existing datasets to create few-shot scenarios synthetically, and MIMIC-III was the most frequently used dataset (7/31; 23%). Common methods included FSL with attention mechanisms (12/31; 39%), prototypical networks (8/31; 26%), and meta-learning (6/31; 19%). Discussion: Despite the potential for FSL in biomedical NLP, progress has been limited compared to domain-independent FSL. This may be due to the paucity of standardized, public datasets, and the relative underperformance of FSL methods on biomedical topics. Creation and release of specialized datasets for biomedical FSL may aid method development by enabling comparative analyses.

Submitted 21 April, 2022; originally announced April 2022.

arXiv:2201.04960 [pdf, other]

Unifying Epidemic Models with Mixtures

Authors: Arnab Sarker, Ali Jadbabaie, Devavrat Shah

Abstract: The COVID-19 pandemic has emphasized the need for a robust understanding of epidemic models. Current models of epidemics are classified as either mechanistic or non-mechanistic: mechanistic models make explicit assumptions on the dynamics of disease, whereas non-mechanistic models make assumptions on the form of observed time series. Here, we introduce a simple mixture-based model which bridges the two approaches while retaining benefits of both. The model represents time series of cases and fatalities as a mixture of Gaussian curves, providing a flexible function class to learn from data compared to traditional mechanistic models. Although the model is non-mechanistic, we show that it arises as the natural outcome of a stochastic process based on a networked SIR framework. This allows learned parameters to take on a more meaningful interpretation compared to similar non-mechanistic models, and we validate the interpretations using auxiliary mobility data collected during the COVID-19 pandemic. We provide a simple learning algorithm to identify model parameters and establish theoretical results which show the model can be efficiently learned from data. Empirically, we find the model to have low prediction error. The model is available live at covidpredictions.mit.edu. Ultimately, this allows us to systematically understand the impacts of interventions on COVID-19, which is critical in developing data-driven solutions to controlling epidemics.

Submitted 7 January, 2022; originally announced January 2022.
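
To illustrate the phrase "a mixture of Gaussian curves" in the abstract above, the sketch below evaluates a daily case curve as a sum of scaled Gaussian bumps. It is only an illustration of the function class, not the authors' learning algorithm; the names `gaussian_curve` and `mixture_curve` and the two-wave parameters are hypothetical.

```python
import math


def gaussian_curve(t, weight, mean, std):
    """Scaled Gaussian bump: weight times the normal density at time t."""
    z = (t - mean) / std
    return weight * math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mixture_curve(t, components):
    """Sum of Gaussian bumps; components is a list of (weight, mean, std)."""
    return sum(gaussian_curve(t, w, m, s) for (w, m, s) in components)


# Hypothetical two-wave epidemic: each tuple is (total cases, peak day, spread).
components = [(1000.0, 30.0, 7.0), (2500.0, 90.0, 12.0)]
daily_cases = [mixture_curve(day, components) for day in range(150)]
peak_day = max(range(150), key=lambda day: daily_cases[day])
```

In this parameterization each weight can be read as the approximate total number of cases contributed by one wave, since each bump integrates to its weight.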

arXiv:2112.07723 [pdf, other]

Autonomous Navigation System from Simultaneous Localization and Mapping

Authors: Micheal Caracciolo, Owen Casciotti, Christopher Lloyd, Ernesto Sola-Thomas, Matthew Weaver, Kyle Bielby, Md Abdul Baset Sarker, Masudul H. Imtiaz

Abstract: This paper presents the development of a Simultaneous Localization and Mapping (SLAM) based Autonomous Navigation system. The motivation for this study was to find a solution for navigating interior spaces autonomously. Interior navigation is challenging as it can be forever evolving. Solving this issue is necessary for a multitude of services, like cleaning, the health industry, and manufacturing industries. The focus of this paper is the description of the SLAM-based software architecture developed for this proposed autonomous system. A potential application of this system, oriented to a smart wheelchair, was evaluated. Current interior navigation solutions require some sort of guiding line, like a black line on the floor. With this proposed solution, interiors do not require renovation to accommodate this solution. The source code of this application has been made open source so that it could be re-purposed for a similar application. Also, this open-source project is envisioned to be improved upon by the broad open-source community beyond its current state.

Submitted 14 December, 2021; originally announced December 2021.

arXiv:2109.05171 [pdf, other]

On the Intercept Probability and Secure Outage Analysis of Mixed ($α$-$κ$-$μ$)-shadowed and Málaga Turbulent Model

Authors: N. A. Sarker, A. S. M. Badrudduza, S. M. R. Islam, S. H. Islam, M. K. Kundu, I. S. Ansari, K. -S. Kwak

Abstract: This work deals with the secrecy performance analysis of a dual-hop RF-FSO DF relaying network composed of a source, a relay, a destination, and an eavesdropper. We assume the eavesdropper is located close to the destination and overhears the relay's transmitted optical signal. The RF and FSO links undergo ($α$-$κ$-$μ$)-shadowed fading and unified Málaga turbulence with pointing error. The secrecy performance of the mixed system is studied by deriving closed-form analytical expressions of secure outage probability (SOP), strictly positive secrecy capacity (SPSC), and intercept probability (IP). Besides, we also derive the asymptotic SOP, SPSC, and IP upon utilizing the unfolding of Meijer's G function where the electrical SNR of the FSO link tends to infinity. Finally, the Monte-Carlo simulation is performed to corroborate the analytical expressions. Our results illustrate that fading, shadowing, detection techniques (i.e., heterodyne detection (HD) and intensity modulation and direct detection (IM/DD)), atmospheric turbulence, and pointing error significantly affect the secrecy performance. In addition, better performance is obtained exploiting the HD technique at the destination relative to IM/DD technique.

Submitted 10 September, 2021; originally announced September 2021.

arXiv:2108.02091 [pdf, other]

Which Bridges Are Weak Ties? Algebraic Topological Insights on Network Structure and Tie Strength

Authors: Arnab Sarker, Jean-Baptiste Seby, Austin R. Benson, Ali Jadbabaie

Abstract: Bridging relationships between individuals situated in different parts of a social network are important conduits for information and resources in social and organizational settings. Dyadic tie strength has often been used as an indicator for whether a relationship is bridging, under the assumption that bridging ties are always weak ties. However, recent empirical evidence suggests that bridging ties are often strong, forcing us to rethink the relationship between social network structure and dyadic tie strength. Here, we provide an analysis based on algebraic topology which clarifies this relationship between network structure and dyadic tie strength. Rather than model the network as a graph, we use a simplicial complex which can explicitly encode group interactions between three or more individuals. First, we show theoretically and empirically that Edge PageRank, an algebraic topological measure originally defined as an extension of the classical PageRank measure, is a valid continuous measure of how well a relationship acts as a bridge. Second, we use the tool of Hodge Decomposition, which allows us to decompose any flow in a simplicial complex into three orthogonal components, to clarify the relationship between dyadic tie strength and network structure. We find that individuals invest less in relationships associated with topological holes in the network, replicating and explaining recent empirical results that bridging relationships spanning short network distances tend to be weak, whereas those spanning longer distances are strong. Our results are validated on 15 large scale datasets and suggest the value of algebraic topological methods in empirical network analysis.

Submitted 5 January, 2023; v1 submitted 4 August, 2021; originally announced August 2021.

arXiv:2107.01284 [pdf, other]

doi 10.1109/ICPR48806.2021.9412504

A Novel Disaster Image Dataset and Characteristics Analysis using Attention Model

Authors: Fahim Faisal Niloy, Arif, Abu Bakar Siddik Nayem, Anis Sarker, Ovi Paul, M. Ashraful Amin, Amin Ahsan Ali, Moinul Islam Zaber, AKM Mahbubur Rahman

Abstract: The advancement of deep learning technology has enabled us to develop systems that outperform any other classification technique. However, success of any empirical system depends on the quality and diversity of the data available to train the proposed system. In this research, we have carefully accumulated a relatively challenging dataset that contains images collected from various sources for three different disasters: fire, water and land. Besides this, we have also collected images for various damaged infrastructure due to natural or man-made calamities and damaged human due to war or accidents. We have also accumulated image data for a class named non-damage that contains images with no such disaster or sign of damage in them. There are 13,720 manually annotated images in this dataset; each image is annotated by three individuals. We are also providing discriminating image class information annotated manually with bounding box for a set of 200 test images. Images are collected from different news portals, social media, and standard datasets made available by other researchers. A three layer attention model (TLAM) is trained and average five fold validation accuracy of 95.88% is achieved. Moreover, on the 200 unseen test images this accuracy is 96.48%. We also generate and compare attention maps for these test images to determine the characteristics of the trained attention model. Our dataset is available at https://niloy193.github.io/Disaster-Dataset

Submitted 2 July, 2021; originally announced July 2021.

Comments: ICPR 2020

arXiv:2106.06951 [pdf, other]

Effects of Eavesdropper on the Performance of Mixed η-μ and DGG Cooperative Relaying System

Authors: Noor Ahmed Sarker, A. S. M. Badrudduza, Milton Kumar Kundu, Imran Shafique Ansari

Abstract: Free-space optical (FSO) channel offers line-of-sight wireless communication with high data rates and high secrecy utilizing unlicensed optical spectrum and also paves the way to the solution of the last-mile access problem. Since atmospheric turbulence is a hindrance to an enhanced secrecy performance, the mixed radio frequency (RF)-FSO system is gaining enormous research interest in recent days. But conventional FSO models except for the double generalized Gamma (DGG) model can not demonstrate secrecy performance for all ranges of turbulence severity. This reason has led us to propose a dual-hop eta-mu and unified DGG mixed RF-FSO network while considering eavesdropping at both RF and FSO hops. The security of these proposed scenarios is investigated in terms of two metrics, i.e., strictly positive secrecy capacity and secure outage probability. Exploiting these expressions, we further investigate how the secrecy performance is affected by various system parameters, i.e., fading, turbulence, and pointing errors. A demonstration is made between heterodyne detection (HD) and intensity modulation and direct detection (IM/DD) techniques while exhibiting superior secrecy performance for HD technique over IM/DD technique. Finally, all analytical results are corroborated via Monte-Carlo simulations.

Submitted 13 June, 2021; originally announced June 2021.

arXiv:2104.14029 [pdf, other]

Reducing Risk and Uncertainty of Deep Neural Networks on Diagnosing COVID-19 Infection

Authors: Krishanu Sarker, Sharbani Pandit, Anupam Sarker, Saeid Belkasim, Shihao Ji

Abstract: Effective and reliable screening of patients via Computer-Aided Diagnosis can play a crucial part in the battle against COVID-19. Most of the existing works focus on developing sophisticated methods yielding high detection performance, yet not addressing the issue of predictive uncertainty. In this work, we introduce uncertainty estimation to detect confusing cases for expert referral to address the unreliability of state-of-the-art (SOTA) DNNs on COVID-19 detection. To the best of our knowledge, we are the first to address this issue on the COVID-19 detection problem. In this work, we investigate a number of SOTA uncertainty estimation methods on publicly available COVID dataset and present our experimental findings. In collaboration with medical professionals, we further validate the results to ensure the viability of the best performing method in clinical practice.

Submitted 28 April, 2021; originally announced April 2021.

Comments: AAAI, TAIH workshop, 2021

arXiv:2104.07849 [pdf, other]

Task Space Planning with Complementarity Constraint-based Obstacle Avoidance

Authors: Anirban Sinha, Anik Sarker, Nilanjan Chakraborty

Abstract: In this paper, we present a task space-based local motion planner that incorporates collision avoidance and constraints on end-effector motion during the execution of a task. Our key technical contribution is the development of a novel kinematic state evolution model of the robot where the collision avoidance is encoded as a complementarity constraint. We show that the kinematic state evolution with collision avoidance can be represented as a Linear Complementarity Problem (LCP). Using the LCP model along with Screw Linear Interpolation (ScLERP) in SE(3), we show that it may be possible to compute a path between two given task space poses by directly moving from the start to the goal pose, even if there are potential collisions with obstacles. The scalability of the planner is demonstrated with experiments using a physical robot. We present simulation and experimental results with both collision avoidance and task constraints to show the efficacy of our approach.

Submitted 15 April, 2021; originally announced April 2021.

arXiv:2103.16761 [pdf, ps, other]

Blockwise Phase Rotation-Aided Analog Transmit Beamforming for 5G mmWave Systems

Authors: Md. Abdul Latif Sarker, Igbafe Orikumhi, Dong Seog Han, Sunwoo Kim

Abstract: In this letter, we propose a blockwise phase rotation-aided analog transmit beamforming (BPR-ATB) scheme to improve the spectral efficiency and the bit-error-rate (BER) performance in millimeter wave (mmWave) communication systems. Due to the phase angle optimization issues of the conventional analog beamforming, we design the BPR-ATB for reducing the rotated beamspace of the equivalent channel and improving the minimum Euclidean distance. To verify the effectiveness of the proposed BPR-ATB scheme, we employ an Alamouti coding technique at the transmitter and evaluate the bit-error-rate performance for mmWave multiple-input and single-output systems. The simulation results show that the proposed BPR-ATB scheme outperforms the conventional discrete Fourier transform-based ATB scheme.

Submitted 27 July, 2021; v1 submitted 30 March, 2021; originally announced March 2021.

Comments: 5 pages, 3 figures, 2 tables, Submitted to IEEE Wireless Communications Letters
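
The abstract above mentions Alamouti coding at the transmitter of a multiple-input single-output link. As a reading aid, here is a minimal sketch of the textbook two-antenna Alamouti scheme over a noise-free channel. It is not the BPR-ATB scheme itself; the symbol values and channel gains are hypothetical, and the channel is assumed constant over both time slots.

```python
def alamouti_encode(s1, s2):
    """Return the (antenna 1, antenna 2) transmit pairs for time slots 1 and 2."""
    return [(s1, s2), (-s2.conjugate(), s1.conjugate())]


def channel(slots, h1, h2):
    """Noise-free 2x1 MISO channel, assumed constant over both slots."""
    return [h1 * x1 + h2 * x2 for x1, x2 in slots]


def alamouti_decode(r1, r2, h1, h2):
    """Linear combining at the single receive antenna; returns (s1, s2) estimates."""
    gain = abs(h1) ** 2 + abs(h2) ** 2
    s1_hat = (h1.conjugate() * r1 + h2 * r2.conjugate()) / gain
    s2_hat = (h2.conjugate() * r1 - h1 * r2.conjugate()) / gain
    return s1_hat, s2_hat


# Hypothetical QPSK-like symbols and channel gains, for illustration only.
s1, s2 = complex(1, 1), complex(-1, 1)
h1, h2 = complex(0.8, -0.3), complex(-0.2, 0.9)

r1, r2 = channel(alamouti_encode(s1, s2), h1, h2)
s1_hat, s2_hat = alamouti_decode(r1, r2, h1, h2)
```

Without noise, the combiner recovers both symbols up to floating-point error, because the cross terms in `h1` and `h2` cancel.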

arXiv:2011.12847 [pdf, other]

Deep-learning coupled with novel classification method to classify the urban environment of the developing world

Authors: Qianwei Cheng, AKM Mahbubur Rahman, Anis Sarker, Abu Bakar Siddik Nayem, Ovi Paul, Amin Ahsan Ali, M Ashraful Amin, Ryosuke Shibasaki, Moinul Zaber

Abstract: Rapid globalization and the interdependence of humanity engender a tremendous in-flow of human migration towards urban spaces. With the advent of high definition satellite images, high resolution data, computational methods such as deep neural networks, and capable hardware, urban planning is seeing a paradigm shift. Legacy data on urban environments are now being complemented with high-volume, high-frequency data. In this paper we propose a novel classification method that is readily usable for machine analysis and show the applicability of the methodology in a developing world setting. The state-of-the-art is mostly dominated by classification of building structures, building types etc. and largely represents the developed world, which is insufficient for developing countries such as Bangladesh where the surroundings are crucial for the classification. Moreover, the traditional methods propose small-scale classifications, which give limited information with poor scalability and are slow to compute. We categorize the urban area in terms of informal and formal spaces taking the surroundings into account. A 50 km x 50 km Google Earth image of Dhaka, Bangladesh was visually annotated and categorized by an expert. The classification is based broadly on two dimensions: urbanization and the architectural form of urban environment. Consequently, the urban space is divided into four classes: 1) highly informal; 2) moderately informal; 3) moderately formal; and 4) highly formal areas. In total 16 sub-classes were identified.
For semantic segmentation, Google's DeeplabV3+ model was used, which increases the field of view of the filters to incorporate larger context. Imagery encompassing 70% of the urban space was used for training and the remaining 30% was used for testing and validation. The model is able to segment with 75% accuracy and 60% Mean IoU.

Submitted 7 January, 2021; v1 submitted 25 November, 2020; originally announced November 2020.

Comments: Accepted paper at 2nd International Conference on Signal Processing and Machine Learning (SIGML 2021); 20 pages, 7 figures, 1 table

arXiv:2010.10192 [pdf, other]

A Particle Swarm Inspired Approach for Continuous Distributed Constraint Optimization Problems

Authors: Moumita Choudhury, Amit Sarker, Md. Mosaddek Khan, William Yeoh

Abstract: Distributed Constraint Optimization Problems (DCOPs) are a widely studied framework for coordinating interactions in cooperative multi-agent systems. In classical DCOPs, variables owned by agents are assumed to be discrete. However, in many applications, such as target tracking or sleep scheduling in sensor networks, continuous-valued variables are more suitable than discrete ones. To better model such applications, researchers have proposed Continuous DCOPs (C-DCOPs), an extension of DCOPs, that can explicitly model problems with continuous variables. The state-of-the-art approaches for solving C-DCOPs experience either onerous memory or computation overhead and are unsuitable for non-differentiable optimization problems. To address this issue, we propose a new C-DCOP algorithm, namely Particle Swarm Optimization Based C-DCOP (PCD), which is inspired by Particle Swarm Optimization (PSO), a well-known centralized population-based approach for solving continuous optimization problems. In recent years, population-based algorithms have gained significant attention in classical DCOPs due to their ability in producing high-quality solutions. Nonetheless, to the best of our knowledge, this class of algorithms has not been utilized to solve C-DCOPs and there has been no work evaluating the potential of PSO in solving classical DCOPs or C-DCOPs. In light of this observation, we adapted PSO, a centralized algorithm, to solve C-DCOPs in a decentralized manner.
The resulting PCD algorithm not only produces good-quality solutions but also finds solutions without any requirement for derivative calculations. Moreover, we design a crossover operator that can be used by PCD to further improve the quality of solutions found. Finally, we theoretically prove that PCD is an anytime algorithm and empirically evaluate PCD against the state-of-the-art C-DCOP algorithms in a wide variety of benchmarks.

Submitted 20 October, 2020; originally announced October 2020.

arXiv:2008.10736 [pdf]

LULC Segmentation of RGB Satellite Image Using FCN-8

Authors: Abu Bakar Siddik Nayem, Anis Sarker, Ovi Paul, Amin Ali, Md. Ashraful Amin, AKM Mahbubur Rahman

Abstract: This work presents the use of a Fully Convolutional Network (FCN-8) for semantic segmentation of high-resolution RGB earth surface satellite images into land use land cover (LULC) categories. Specifically, we propose a non-overlapping grid-based approach to train a Fully Convolutional Network (FCN-8) with VGG-16 weights to segment satellite images into four (forest, built-up, farmland and water) classes. The FCN-8 semantically projects the discriminating features in lower resolution learned by the encoder onto the pixel space in higher resolution to get a dense classification. We evaluated the proposed system on the Gaofen-2 image dataset, which contains 150 images of over 60 different cities in China. For comparison, we used available ground truth along with images segmented using a widely used commercial GIS software called eCognition. With the proposed non-overlapping grid-based approach, FCN-8 obtains significantly better performance than the eCognition software. Our model achieves an average accuracy of 91.0% and an average Intersection over Union (IoU) of 0.84. In contrast, eCognition's average accuracy is 74.0% and its IoU is 0.60. This paper also reports a detailed analysis of errors occurring at the LULC boundary.

Submitted 24 August, 2020; originally announced August 2020.

Comments: Accepted paper at 3rd SLAAI-International Conference on Artificial Intelligence; 13 pages, 7 figures, 3 tables
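
The abstract above reports average accuracy and average Intersection over Union (IoU). For readers unfamiliar with these segmentation metrics, the sketch below computes both on a tiny label grid. The grids and class ids are made up for illustration, and the averaging convention (unweighted mean over classes present in either grid) is an assumption rather than the paper's stated protocol.

```python
def class_iou(pred, truth, cls):
    """IoU for one class over two equally sized 2-D label grids (None if class absent)."""
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            in_pred, in_truth = p == cls, t == cls
            inter += in_pred and in_truth   # pixel labelled cls in both grids
            union += in_pred or in_truth    # pixel labelled cls in at least one grid
    return inter / union if union else None


def mean_iou(pred, truth, classes):
    """Unweighted mean of per-class IoU, skipping classes absent from both grids."""
    scores = [s for s in (class_iou(pred, truth, c) for c in classes) if s is not None]
    return sum(scores) / len(scores)


def pixel_accuracy(pred, truth):
    """Fraction of pixels whose predicted label equals the ground-truth label."""
    pairs = [(p, t) for pr, tr in zip(pred, truth) for p, t in zip(pr, tr)]
    return sum(p == t for p, t in pairs) / len(pairs)


# Hypothetical 2x3 label grids: 0 = forest, 1 = built-up, 2 = water.
truth = [[0, 0, 1],
         [1, 2, 2]]
pred = [[0, 1, 1],
        [1, 2, 0]]

miou = mean_iou(pred, truth, classes=[0, 1, 2])
acc = pixel_accuracy(pred, truth)
```

On this toy example the per-class IoUs are 1/3, 2/3, and 1/2, so the mean IoU is 0.5 while pixel accuracy is 4/6. IoU penalizes both missed and spurious pixels, which is why it is usually lower than accuracy.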

arXiv:2006.12687 [pdf, other]

Accurate Parameter Estimation for Risk-aware Autonomous Systems

Authors: Arnab Sarker, Peter Fisher, Joseph E. Gaudio, Anuradha M. Annaswamy

Abstract: Analysis and synthesis of safety-critical autonomous systems are carried out using models which are often dynamic. Two central features of these dynamic systems are parameters and unmodeled dynamics. This paper addresses the use of a spectral lines-based approach for estimating parameters of the dynamic model of an autonomous system. Existing literature has treated all unmodeled components of the dynamic system as sub-Gaussian noise and proposed parameter estimation using Gaussian noise-based exogenous signals. In contrast, we allow the unmodeled part to have deterministic unmodeled dynamics, which are almost always present in physical systems, in addition to sub-Gaussian noise. In addition, we propose a deterministic construction of the exogenous signal in order to carry out parameter estimation. We introduce a new tool kit which employs the theory of spectral lines, retains the stochastic setting, and leads to non-asymptotic bounds on the parameter estimation error. Unlike the existing stochastic approach, these bounds are tunable through an optimal choice of the spectrum of the exogenous signal leading to accurate parameter estimation. We also show that this estimation is robust to unmodeled dynamics, a property that is not assured by the existing approach. Finally, we show that under ideal conditions with no unmodeled dynamics, the proposed approach can ensure a $\tilde{O}(\sqrt{T})$ regret, matching existing literature.
Experiments are provided to support all theoretical derivations, which show that the spectral lines-based approach outperforms the Gaussian noise-based method when unmodeled dynamics are present, in terms of both parameter estimation error and regret obtained using the parameter estimates with a Linear Quadratic Regulator in feedback.

Submitted 16 March, 2022; v1 submitted 22 June, 2020; originally announced June 2020.

arXiv:2005.00072 [pdf, other]

Two Burning Questions on COVID-19: Did shutting down the economy help? Can we (partially) reopen the economy without risking the second wave?

Authors: Anish Agarwal, Abdullah Alomar, Arnab Sarker, Devavrat Shah, Dennis Shen, Cindy Yang

Abstract: As we reach the apex of the COVID-19 pandemic, the most pressing question facing us is: can we even partially reopen the economy without risking a second wave? We first need to understand if shutting down the economy helped. And if it did, is it possible to achieve similar gains in the war against the pandemic while partially opening up the economy? To do so, it is critical to understand the effects of the various interventions that can be put into place and their corresponding health and economic implications. Since many interventions exist, the key challenge facing policy makers is understanding the potential trade-offs between them, and choosing the particular set of interventions that works best for their circumstance. In this memo, we provide an overview of Synthetic Interventions (a natural generalization of Synthetic Control), a data-driven and statistically principled method to perform what-if scenario planning, i.e., for policy makers to understand the trade-offs between different interventions before having to actually enact them. In essence, the method leverages information from different interventions that have already been enacted across the world and fits it to a policy maker's setting of interest, e.g., to estimate the effect of mobility-restricting interventions on the U.S., we use daily death data from countries that enforced severe mobility restrictions to create a "synthetic low mobility U.S." and predict the counterfactual trajectory of the U.S. if it had indeed applied a similar intervention.
Using Synthetic Interventions, we find that lifting severe mobility restrictions and only retaining moderate mobility restrictions (at retail and transit locations), seems to effectively flatten the curve. We hope this provides guidance on weighing the trade-offs between the safety of the population, strain on the healthcare system, and impact on the economy.

Submitted 10 May, 2020; v1 submitted 30 April, 2020; originally announced May 2020.

arXiv:2002.12427 [pdf, other]

C-CoCoA: A Continuous Cooperative Constraint Approximation Algorithm to Solve Functional DCOPs

Authors: Amit Sarker, Abdullahil Baki Arif, Moumita Choudhury, Md. Mosaddek Khan

Abstract: Distributed Constraint Optimization Problems (DCOPs) have been widely used to coordinate interactions (i.e. constraints) in cooperative multi-agent systems. The traditional DCOP model assumes that variables owned by the agents can take only discrete values and constraints' cost functions are defined for every possible value assignment of a set of variables. While this formulation is often reasonable, there are many applications where the variables are continuous decision variables and constraints are in functional form. To overcome this limitation, the Functional DCOP (F-DCOP) model has been proposed, which is able to model problems with continuous variables. The existing F-DCOP algorithms experience huge computation and communication overhead. This paper applies continuous non-linear optimization methods to the Cooperative Constraint Approximation (CoCoA) algorithm. We empirically show that our algorithm is able to provide high-quality solutions with smaller communication cost and execution time than the existing F-DCOP algorithms.

Submitted 27 February, 2020; originally announced February 2020.

Comments: 7 pages, 4 figures

arXiv:1909.13184 [pdf, ps, other]

Towards Automatic Bot Detection in Twitter for Health-related Tasks

Authors: Anahita Davoudi, Ari Z. Klein, Abeed Sarker, Graciela Gonzalez-Hernandez

Abstract: With the increasing use of social media data for health-related research, the credibility of the information from this source has been questioned as the posts may originate from automated accounts or "bots". While automatic bot detection approaches have been proposed, there are none that have been evaluated on users posting health-related information. In this paper, we extend an existing bot detection system and customize it for health-related research. Using a dataset of Twitter users, we first show that the system, which was designed for political bot detection, underperforms when applied to health-related Twitter users. We then incorporate additional features and a statistical machine learning classifier to significantly improve bot detection performance. Our approach obtains F_1 scores of 0.7 for the "bot" class, representing improvements of 0.339. Our approach is customizable and generalizable for bot detection in other health-related social media cohorts.

Submitted 28 September, 2019; originally announced September 2019.

arXiv:1904.05308 [pdf]

doi 10.1093/jamia/ocz156

Deep Neural Networks Ensemble for Detecting Medication Mentions in Tweets

Authors: Davy Weissenbacher, Abeed Sarker, Ari Klein, Karen O'Connor, Arjun Magge Ranganatha, Graciela Gonzalez-Hernandez

Abstract: Objective: After years of research, Twitter posts are now recognized as an important source of patient-generated data, providing unique insights into population health. A fundamental step to incorporating Twitter data in pharmacoepidemiological research is to automatically recognize medication mentions in tweets. Given that lexical searches for medication names may fail due to misspellings or ambiguity with common words, we propose a more advanced method to recognize them. Methods: We present Kusuri, an Ensemble Learning classifier, able to identify tweets mentioning drug products and dietary supplements. Kusuri ("medication" in Japanese) is composed of two modules. First, four different classifiers (lexicon-based, spelling-variant-based, pattern-based and one based on a weakly-trained neural network) are applied in parallel to discover tweets potentially containing medication names. Second, an ensemble of deep neural networks encoding morphological, semantical and long-range dependencies of important words in the tweets discovered is used to make the final decision. Results: On a balanced (50-50) corpus of 15,005 tweets, Kusuri demonstrated performances close to human annotators with 93.7% F1-score, the best score achieved thus far on this corpus. On a corpus made of all tweets posted by 113 Twitter users (98,959 tweets, with only 0.26% mentioning medications), Kusuri obtained 76.3% F1-score. There is not a prior drug extraction system that compares running on such an extremely unbalanced dataset.
Conclusion: The system identifies tweets mentioning drug names with performance high enough to ensure its usefulness and ready to be integrated in larger natural language processing systems.

Submitted 30 September, 2019; v1 submitted 10 April, 2019; originally announced April 2019.

Comments: This is a pre-copy-editing, author-produced PDF of an article accepted for publication in JAMIA following peer review. The definitive publisher-authenticated version is "D. Weissenbacher, A. Sarker, A. Klein, K. O'Connor, A. Magge, G. Gonzalez-Hernandez, Deep neural networks ensemble for detecting medication mentions in tweets, Journal of the American Medical Informatics Association, ocz156, 2019"

Journal ref: Journal of the American Medical Informatics Association, ocz156, 2019

arXiv:1903.08712 [pdf, other]

doi 10.1109/TITS.2019.2892399

A Review of Sensing and Communication, Human Factors, and Controller Aspects for Information-Aware Connected and Automated Vehicles

Authors: Ankur Sarker, Haiying Shen, Mizanur Rahman, Mashrur Chowdhury, Kakan Dey, Fangjian Li, Yue Wang, Husnu S. Narman

Abstract: Information-aware connected and automated vehicles (CAVs) have drawn great attention in recent years due to their potentially significant positive impacts on roadway safety and operational efficiency. In this paper, we conduct an in-depth review of three basic and key interrelated aspects of a CAV: sensing and communication technologies, human factors, and information-aware controller design. First, different vehicular sensing and communication technologies and their protocol stacks, to provide reliable information to the information-aware CAV controller, are thoroughly discussed. Diverse human factor issues, such as user comfort, preferences, and reliability, to design the CAV systems for mass adaptation are also discussed. Then, different layers of a CAV controller (route planning, driving mode execution, and driving model selection) considering human factors and information through connectivity are reviewed. In addition, critical challenges for the sensing and communication technologies, human factors, and information-aware controller are identified to support the design of a safe and efficient CAV system while considering user acceptance and comfort. Finally, promising future research directions of these three aspects are discussed to overcome existing challenges to realize a safe and operationally efficient CAV.

Submitted 20 March, 2019; originally announced March 2019.

Comments: IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS

arXiv:1810.09506 [pdf]

doi 10.1038/s41746-019-0170-5

Automatically Detecting Self-Reported Birth Defect Outcomes on Twitter for Large-scale Epidemiological Research

Authors: Ari Z. Klein, Abeed Sarker, Davy Weissenbacher, Graciela Gonzalez-Hernandez

Abstract: In recent work, we identified and studied a small cohort of Twitter users whose pregnancies with birth defect outcomes could be observed via their publicly available tweets. Exploiting social media's large-scale potential to complement the limited methods for studying birth defects, the leading cause of infant mortality, depends on the further development of automatic methods. The primary objective of this study was to take the first step towards scaling the use of social media for observing pregnancies with birth defect outcomes, namely, developing methods for automatically detecting tweets by users reporting their birth defect outcomes. We annotated and pre-processed approximately 23,000 tweets that mention birth defects in order to train and evaluate supervised machine learning algorithms, including feature-engineered and deep learning-based classifiers. We also experimented with various under-sampling and over-sampling approaches to address the class imbalance. A Support Vector Machine (SVM) classifier trained on the original, imbalanced data set, with n-grams, word clusters, and structural features, achieved the best baseline performance for the positive classes: an F1-score of 0.65 for the "defect" class and 0.51 for the "possible defect" class.
Our contributions include (i) natural language processing (NLP) and supervised machine learning methods for automatically detecting tweets by users reporting their birth defect outcomes, (ii) a comparison of feature-engineered and deep learning-based classifiers trained on imbalanced, under-sampled, and over-sampled data, and (iii) an error analysis that could inform classification improvements using our publicly available corpus. Future work will focus on automating user-level analyses for cohort inclusion.

Submitted 22 October, 2018; originally announced October 2018.

Journal ref: npj Digital Medicine. 2019;2:96

arXiv:1806.00910 [pdf, other]

doi 10.1016/j.jbi.2018.11.007

An unsupervised and customizable misspelling generator for mining noisy health-related text sources

Authors: Abeed Sarker, Graciela Gonzalez-Hernandez

Abstract: In this paper, we present a customizable datacentric system that automatically generates common misspellings for complex health-related terms. The spelling variant generator relies on a dense vector model learned from large unlabeled text, which is used to find semantically close terms to the original/seed keyword, followed by the filtering of terms that are lexically dissimilar beyond a given threshold. The process is executed recursively, converging when no new terms similar (lexically and semantically) to the seed keyword are found. Weighting of intra-word character sequence similarities allows further problem-specific customization of the system. On a dataset prepared for this study, our system outperforms the current state-of-the-art for medication name variant generation with best F1-score of 0.69 and F1/4-score of 0.78. Extrinsic evaluation of the system on a set of cancer-related terms showed an increase of over 67% in retrieval rate from Twitter posts when the generated variants are included. Our proposed spelling variant generator has several advantages over the current state-of-the-art and other types of variant generators: (i) it is capable of filtering out lexically similar but semantically dissimilar terms, (ii) the number of variants generated is low as many low-frequency and ambiguous misspellings are filtered out, and (iii) the system is fully automatic, customizable and easily executable. While the base system is fully unsupervised, we show how supervision may be employed to adjust weights for task-specific customization.
The performance and significant relative simplicity of our proposed approach make it a much needed misspelling generation resource for health-related text mining from noisy sources. The source code for the system has been made publicly available for research purposes.

Submitted 3 June, 2018; originally announced June 2018.

Journal ref: J Biomed Inform. 2018 Dec;88:98-107. Epub 2018 Nov 13
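
The abstract above describes a recursive loop: take semantically close terms for a keyword, keep only those lexically similar to the seed, and repeat until no new terms appear. The sketch below illustrates that loop under loud assumptions. The `NEIGHBOURS` table is a hypothetical stand-in for nearest-neighbour queries against a dense vector model, the misspellings are invented, and `difflib.SequenceMatcher` is a swapped-in lexical similarity measure, not necessarily the one the authors use.

```python
from difflib import SequenceMatcher

# Hypothetical stand-in for "semantically close terms" returned by a vector model.
NEIGHBOURS = {
    "seroquel": ["seroquil", "quetiapine", "seroqul"],
    "seroquil": ["seroquel", "seroquell", "sleep"],
    "seroqul": ["seroquel"],
    "seroquell": ["seroquil"],
}


def lexical_similarity(a, b):
    """Similarity in [0, 1]; an assumed substitute for the paper's lexical measure."""
    return SequenceMatcher(None, a, b).ratio()


def generate_variants(seed, neighbours, threshold=0.8):
    """Expand from the seed until no new lexically similar neighbour is found."""
    variants = set()
    frontier = [seed]
    while frontier:
        term = frontier.pop()
        for candidate in neighbours.get(term, []):
            if candidate == seed or candidate in variants:
                continue
            # Keep only terms lexically close to the original seed keyword.
            if lexical_similarity(seed, candidate) >= threshold:
                variants.add(candidate)
                frontier.append(candidate)  # recurse on the accepted variant
    return variants


variants = generate_variants("seroquel", NEIGHBOURS)
```

In this toy run, `quetiapine` and `sleep` are semantically related but lexically distant from the seed, so the filter drops them, while `seroquell` is reached only through the accepted variant `seroquil`, which is the recursive step.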

arXiv:1710.09142 [pdf, ps, other]

doi 10.1109/LCOMM.2018.2864124

Distortion-free Golden-Hadamard Codebook Design for MISO Systems

Authors: Md. Abdul Latif Sarker, Md. Fazlul Kader, Moon Ho Lee, Dong Seog Han

Abstract: In this letter, a novel Golden-Hadamard codebook (GHC) scheme is proposed to improve the performance of the traditional precoded Alamouti coding for multiple-input and single-output systems. Although the traditional discrete Fourier transform codebook (DFTC) performs satisfactorily with Alamouti coding and offers numerous benefits for the Rayleigh fading channel, this scheme inherently generates huge codeword distortion, which leads to a lower minimum chordal distance (MCD). Furthermore, the uncertain format of all prior versions of codebooks results in poorer minimum determinant (MD) values. Hence, the proposed GHC scheme successfully deals with the issues of traditional DFTC to achieve a better codebook format that completely overcomes both MCD and MD problems. The effectiveness of the proposed GHC scheme is confirmed, in terms of bit-error-rate, through Monte Carlo simulations.

Submitted 10 October, 2018; v1 submitted 25 October, 2017; originally announced October 2017.

Comments: 4 pages, 4 figures, 2 tables, Published (Early Access) in IEEE Communications Letters

arXiv:1706.08162 [pdf, other]

Automated text summarisation and evidence-based medicine: A survey of two domains

Authors: Abeed Sarker, Diego Molla, Cecile Paris

Abstract: The practice of evidence-based medicine (EBM) urges medical practitioners to utilise the latest research evidence when making clinical decisions. Because of the massive and growing volume of published research on various medical topics, practitioners often find themselves overloaded with information. As such, natural language processing research has recently commenced exploring medical domain-specific automated text summarisation (ATS) techniques, targeted towards the task of condensing large medical texts. However, the development of effective summarisation techniques for this task requires cross-domain knowledge. We present a survey of EBM, the domain-specific needs for EBM, automated summarisation techniques, and how they have been applied hitherto. We envision that this survey will serve as a first resource for the development of future operational text summarisation techniques for EBM.

Submitted 25 June, 2017; originally announced June 2017.

arXiv:1702.02261 [pdf, ps, other]

Social media mining for identification and exploration of health-related information from pregnant women

Authors: Pramod Bharadwaj Chandrashekar, Arjun Magge, Abeed Sarker, Graciela Gonzalez

Abstract: Widespread use of social media has led to the generation of substantial amounts of information about individuals, including health-related information. Social media provides the opportunity to study health-related information about selected population groups who may be of interest for a particular study. In this paper, we explore the possibility of utilizing social media to perform targeted data collection and analysis from a particular population group -- pregnant women. We hypothesize that we can use social media to identify cohorts of pregnant women and follow them over time to analyze crucial health-related information. To identify potentially pregnant women, we employ simple rule-based searches that attempt to detect pregnancy announcements with moderate precision. To further filter out false positives and noise, we employ a supervised classifier using a small number of hand-annotated data. We then collect their posts over time to create longitudinal health timelines and attempt to divide the timelines into different pregnancy trimesters. Finally, we assess the usefulness of the timelines by performing a preliminary analysis to estimate drug intake patterns of our cohort at different trimesters. Our rule-based cohort identification technique collected 53,820 users over thirty months from Twitter. Our pregnancy announcement classification technique achieved an F-measure of 0.81 for the pregnancy class, resulting in 34,895 user timelines.
Analysis of the timelines revealed that pertinent health-related information, such as drug-intake and adverse reactions can be mined from the data. Our approach to using user timelines in this fashion has produced very encouraging results and can be employed for other important tasks where cohorts, for which health-related information may not be available from other sources, are required to be followed over time to derive population-based estimates.

Submitted 7 February, 2017; originally announced February 2017.

Comments: 9 pages

arXiv:1612.02145 [pdf]

A Unified Linear Precoding Design for Multi-user MIMO Systems

Authors: Md. Abdul Latif Sarker

Abstract: We address the problem of the bit error rate (BER) performance gap between the sub-optimal and optimal linear precoder (LP) for multiuser (MU) multiple-input multiple-output (MIMO) broadcast systems in this paper. Particularly, mobile users suffer a noise enhancement effect due to a sub-optimal LP that can be suppressed by an optimal LP matrix. A sub-optimal LP matrix such as a linear zero-forcing (LZF) precoder performs well only in the high signal-to-noise ratio (SNR) regime; in contrast, an optimal precoder, for instance a linear minimum mean-square-error (LMMSE) precoder, performs well in both low and high SNR scenarios. These precoders exhibit a BER gap of at least 0.1 when each is used on its own in MU MIMO systems. Thus, we propose and design a unified linear precoding (ULP) matrix using a precoding selection technique that combines the sub-optimal and optimal LP matrices for multi-user MIMO systems to ensure a zero BER performance gap. The numerical results show that our proposed ULP technique offers significant performance in both low and high SNR scenarios.

Submitted 7 December, 2016; originally announced December 2016.

Comments: 4

arXiv:1610.02567 [pdf]

Mining the Web for Pharmacovigilance: the Case Study of Duloxetine and Venlafaxine

Authors: Abbas Chokor, Abeed Sarker, Graciela Gonzalez

Abstract: Adverse reactions caused by drugs following their release into the market are among the leading causes of death in many countries. The rapid growth of electronically available health-related information, and the ability to process large volumes of it automatically, using natural language processing (NLP) and machine learning algorithms, have opened new opportunities for pharmacovigilance. Survey found that more than 70% of US Internet users consult the Internet when they require medical information. In recent years, research in this area has addressed Adverse Drug Reaction (ADR) pharmacovigilance using social media, mainly Twitter and medical forums and websites. This paper will show the information which can be collected from a variety of Internet data sources and search engines, mainly Google Trends and Google Correlate. While considering the case study of two popular Major Depressive Disorder (MDD) drugs, Duloxetine and Venlafaxine, we will provide a comparative analysis for their reactions using publicly-available alternative data sources.

Submitted 8 October, 2016; originally announced October 2016.

Comments: Masters project report

arXiv:1609.00775 [pdf]

An Error Covariance Splitting Technique for Multi-User MIMO Interference Environment

Authors: Md. Abdul Latif Sarker

Abstract: This paper investigates an error covariance matrix splitting technique for multiuser multiple input and multiple output (MIMO) interference downlink channel. Most of the related work has thus far considered the traditional error covariance matrix which has not been well-shaped for maximizing the system capacity. Thus, we split and propose a new iterative error covariance matrix to mitigate the system error and maximize the system capacity in this paper. Numerical results illustrate that our proposed method is strictly better than the traditional method.

Submitted 2 September, 2016; originally announced September 2016.

arXiv:1606.07137 [pdf, other]

Automated Extraction of Number of Subjects in Randomised Controlled Trials

Authors: Abeed Sarker

Abstract: We present a simple approach for automatically extracting the number of subjects involved in randomised controlled trials (RCT). Our approach first applies a set of rule-based techniques to extract candidate study sizes from the abstracts of the articles. Supervised classification is then performed over the candidates with support vector machines, using a small set of lexical, structural, and contextual features. With only a small annotated training set of 201 RCTs, we obtained an accuracy of 88%. We believe that this system will aid complex medical text processing tasks such as summarisation and question answering.

Submitted 22 June, 2016; originally announced June 2016.

Comments: unpublished

arXiv:1509.01880 [pdf]

Mean Capacity of Spatially Semi-Correlated MIMO Fading Channel

Authors: Md. Abdul Latif Sarker, Moon Ho Lee

Abstract: This study investigates the mean capacity of multiple-input multiple-output (MIMO) systems for spatially semi-correlated flat fading channels. In reality, the capacity degrades dramatically due to the channel covariance (CC) when correlations exist at the transmitter or receiver or on both sides. Most existing works have so far considered the traditional channel covariance matrices that have not been entirely constructed. Thus, we propose an iterative channel covariance (ICC) matrix using a matrix splitting (MS) technique with a guaranteed zero correlation coefficient in the case of the downlink correlated MIMO channel, to maximize the mean capacity. Our numerical results show that the proposed ICC method achieves the maximum channel gains in high signal-to-noise ratio (SNR) scenarios.

Submitted 23 September, 2016; v1 submitted 6 September, 2015; originally announced September 2015.

Comments: 4

Showing 1–50 of 56 results for author: Sarker, A