Search | arXiv e-print repository

CodeGemma: Open Code Models Based on Gemma

Authors: CodeGemma Team, Heri Zhao, Jeffrey Hui, Joshua Howland, Nam Nguyen, Siqi Zuo, Andrea Hu, Christopher A. Choquette-Choo, Jingyue Shen, Joe Kelley, Kshitij Bansal, Luke Vilnis, Mateo Wirth, Paul Michel, Peter Choy, Pratik Joshi, Ravin Kumar, Sarmad Hashmi, Shubham Agrawal, Zhitao Gong, Jane Fine, Tris Warkentin, Ale Jakse Hartman, Bin Ni, Kathy Korevec, et al. (2 additional authors not shown)

Abstract: This paper introduces CodeGemma, a collection of specialized open code models built on top of Gemma, capable of a variety of code and natural language generation tasks. We release three model variants. CodeGemma 7B pretrained (PT) and instruction-tuned (IT) variants have remarkably resilient natural language understanding, excel in mathematical reasoning, and match code capabilities of other open models. CodeGemma 2B is a state-of-the-art code completion model designed for fast code infilling and open-ended generation in latency-sensitive settings.

Submitted 18 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

Comments: v1: 11 pages, 4 figures, 5 tables. v2: Update metadata

arXiv:2309.14383 [pdf]

Towards using Cough for Respiratory Disease Diagnosis by leveraging Artificial Intelligence: A Survey

Authors: Aneeqa Ijaz, Muhammad Nabeel, Usama Masood, Tahir Mahmood, Mydah Sajid Hashmi, Iryna Posokhova, Ali Rizwan, Ali Imran

Abstract: Cough acoustics contain multitudes of vital information about pathomorphological alterations in the respiratory system. Reliable and accurate detection of cough events by investigating the underlying cough latent features and disease diagnosis can play an indispensable role in revitalizing the healthcare practices. The recent application of Artificial Intelligence (AI) and advances of ubiquitous computing for respiratory disease prediction has created an auspicious trend and myriad of future possibilities in the medical domain. In particular, there is an expeditiously emerging trend of Machine learning (ML) and Deep Learning (DL)-based diagnostic algorithms exploiting cough signatures. The enormous body of literature on cough-based AI algorithms demonstrate that these models can play a significant role for detecting the onset of a specific respiratory disease. However, it is pertinent to collect the information from all relevant studies in an exhaustive manner for the medical experts and AI scientists to analyze the decisive role of AI/ML. This survey offers a comprehensive overview of the cough data-driven ML/DL detection and preliminary diagnosis frameworks, along with a detailed list of significant features. We investigate the mechanism that causes cough and the latent cough features of the respiratory modalities. We also analyze the customized cough monitoring application, and their AI-powered recognition algorithms. Challenges and prospective future research directions to develop practical, robust, and ubiquitous solutions are also discussed in detail.

Submitted 24 September, 2023; originally announced September 2023.

Comments: 30 pages, 12 figures, 9 tables

arXiv:2306.08431 [pdf, other]

Team Composition in Software Engineering Education

Authors: Sajid Ibrahim Hashmi, Jouni Markkula

Abstract: One of the objectives of software engineering education is to make students to learn essential teamwork skills. This is done by having the students work in groups for course assignments. Student team composition plays a vital role in this, as it significantly affects learning outcomes, what is learned, and how. The study presented in this paper aims to better understand the student team composition in software engineering education and investigate the factors affecting it in the international software engineering education context. Those factors should be taken into consideration by software engineering teachers when they design group work assignments in their courses. In this paper, the initial findings of the ongoing Action research study are presented. The results give some identified principles that should be considered when designing student team composition in software engineering courses.

Submitted 14 June, 2023; originally announced June 2023.

arXiv:2205.08252 [pdf, other]

An Empirical Assessment of Security and Privacy Risks of Web based-Chatbots

Authors: Nazar Waheed, Muhammad Ikram, Saad Sajid Hashmi, Xiangjian He, Priyadarsi Nanda

Abstract: Web-based chatbots provide website owners with the benefits of increased sales, immediate response to their customers, and insight into customer behaviour. While Web-based chatbots are getting popular, they have not received much scrutiny from security researchers. The benefits to owners come at the cost of users' privacy and security. Vulnerabilities, such as tracking cookies and third-party domains, can be hidden in the chatbot's iFrame script. This paper presents a large-scale analysis of five Web-based chatbots among the top 1-million Alexa websites. Through our crawler tool, we identify the presence of chatbots in these 1-million websites. We discover that 13,515 out of the top 1-million Alexa websites (1.59%) use one of the five analysed chatbots. Our analysis reveals that the top 300k Alexa ranking websites are dominated by Intercom chatbots that embed the least number of third-party domains. LiveChat chatbots dominate the remaining websites and embed the highest samples of third-party domains. We also find that 850 (6.29%) of the chatbots use insecure protocols to transfer users' chats in plain text. Furthermore, some chatbots heavily rely on cookies for tracking and advertisement purposes. More than two-thirds (68.92%) of the identified cookies in chatbot iFrames are used for ads and tracking users. Our results show that, despite the promises for privacy, security, and anonymity given by the majority of the websites, millions of users may unknowingly be subject to poor security guarantees by chatbot service providers

Submitted 17 May, 2022; originally announced May 2022.

Comments: Submitted to WISE 2020 Conference

arXiv:2201.06220 [pdf]

Face Detection in Extreme Conditions: A Machine-learning Approach

Authors: Sameer Aqib Hashmi

Abstract: Face detection in unrestricted conditions has been a trouble for years due to various expressions, brightness, and coloration fringing. Recent studies show that deep learning knowledge of strategies can acquire spectacular performance inside the identification of different gadgets and patterns. This face detection in unconstrained surroundings is difficult due to various poses, illuminations, and occlusions. Figuring out someone with a picture has been popularized through the mass media. However, it's miles less sturdy to fingerprint or retina scanning. The latest research shows that deep mastering techniques can gain mind-blowing performance on those two responsibilities. In this paper, I recommend a deep cascaded multi-venture framework that exploits the inherent correlation among them to boost up their performance. In particular, my framework adopts a cascaded shape with 3 layers of cautiously designed deep convolutional networks that expect face and landmark region in a coarse-to-fine way. Besides, within the gaining knowledge of the procedure, I propose a new online tough sample mining method that can enhance the performance robotically without manual pattern choice.

Submitted 13 February, 2022; v1 submitted 17 January, 2022; originally announced January 2022.

Comments: 6 pages, 9 figures

arXiv:2112.02221 [pdf, other]

doi 10.1007/s00607-022-01095-0

Orientation Aware Weapons Detection In Visual Data : A Benchmark Dataset

Authors: Nazeef Ul Haq, Muhammad Moazam Fraz, Tufail Sajjad Shah Hashmi, Muhammad Shahzad

Abstract: Automatic detection of weapons is significant for improving security and well being of individuals, nonetheless, it is a difficult task due to large variety of size, shape and appearance of weapons. View point variations and occlusion also are reasons which makes this task more difficult. Further, the current object detection algorithms process rectangular areas, however a slender and long rifle may really cover just a little portion of area and the rest may contain unessential details. To overcome these problem, we propose a CNN architecture for Orientation Aware Weapons Detection, which provides oriented bounding box with improved weapons detection performance. The proposed model provides orientation not only using angle as classification problem by dividing angle into eight classes but also angle as regression problem. For training our model for weapon detection a new dataset comprising of total 6400 weapons images is gathered from the web and then manually annotated with position oriented bounding boxes. Our dataset provides not only oriented bounding box as ground truth but also horizontal bounding box. We also provide our dataset in multiple formats of modern object detectors for further research in this area. The proposed model is evaluated on this dataset, and the comparative analysis with off-the shelf object detectors yields superior performance of proposed model, measured with standard evaluation strategies. The dataset and the model implementation are made publicly available at this link: https://bit.ly/2TyZICF.

Submitted 3 December, 2021; originally announced December 2021.

Comments: Submitted this paper in Journal

arXiv:2110.09319 [pdf, other]

doi 10.1109/TIM.2021.3122172

Incremental Cross-Domain Adaptation for Robust Retinopathy Screening via Bayesian Deep Learning

Authors: Taimur Hassan, Bilal Hassan, Muhammad Usman Akram, Shahrukh Hashmi, Abdel Hakim Taguri, Naoufel Werghi

Abstract: Retinopathy represents a group of retinal diseases that, if not treated timely, can cause severe visual impairments or even blindness. Many researchers have developed autonomous systems to recognize retinopathy via fundus and optical coherence tomography (OCT) imagery. However, most of these frameworks employ conventional transfer learning and fine-tuning approaches, requiring a decent amount of well-annotated training data to produce accurate diagnostic performance. This paper presents a novel incremental cross-domain adaptation instrument that allows any deep classification model to progressively learn abnormal retinal pathologies in OCT and fundus imagery via few-shot training. Furthermore, unlike its competitors, the proposed instrument is driven via a Bayesian multi-objective function that not only enforces the candidate classification network to retain its prior learned knowledge during incremental training but also ensures that the network understands the structural and semantic relationships between previously learned pathologies and newly added disease categories to effectively recognize them at the inference stage. The proposed framework, evaluated on six public datasets acquired with three different scanners to screen thirteen retinal pathologies, outperforms the state-of-the-art competitors by achieving an overall accuracy and F1 score of 0.9826 and 0.9846, respectively.

Submitted 4 November, 2021; v1 submitted 18 October, 2021; originally announced October 2021.

Comments: Accepted in IEEE Transactions on Instrumentation and Measurement. Source code is available at https://github.com/taimurhassan/continual_learning/

Journal ref: IEEE Transactions on Instrumentation and Measurement, 2021

arXiv:2106.10035 [pdf, other]

Longitudinal Compliance Analysis of Android Applications with Privacy Policies

Authors: Saad Sajid Hashmi, Nazar Waheed, Gioacchino Tangari, Muhammad Ikram, Stephen Smith

Abstract: Contemporary mobile applications (apps) are designed to track, use, and share users' data, often without their consent, which results in potential privacy and transparency issues. To investigate whether mobile apps have always been (non-)transparent regarding how they collect information about users, we perform a longitudinal analysis of the historical versions of 268 Android apps. These apps comprise 5,240 app releases or versions between 2008 and 2016. We detect inconsistencies between apps' behaviors and the stated use of data collection in privacy policies to reveal compliance issues. We utilize machine learning techniques for the classification of the privacy policy text to identify the purported practices that collect and/or share users' personal information, such as phone numbers and email addresses. We then uncover the data leaks of an app through static and dynamic analysis. Over time, our results show a steady increase in the number of apps' data collection practices that are undisclosed in the privacy policies. This behavior is particularly troubling since privacy policy is the primary tool for describing the app's privacy protection practices. We find that newer versions of the apps are likely to be more non-compliant than their preceding versions. The discrepancies between the purported and the actual data practices show that privacy policies are often incoherent with the apps' behaviors, thus defying the 'notice and choice' principle when users install apps.

Submitted 28 July, 2021; v1 submitted 18 June, 2021; originally announced June 2021.

Comments: 11 pages

arXiv:2105.13553 [pdf, other]

doi 10.1021/acsami.1c19276

A Machine Learning and Computer Vision Approach to Rapidly Optimize Multiscale Droplet Generation

Authors: Alexander E. Siemenn, Evyatar Shaulsky, Matthew Beveridge, Tonio Buonassisi, Sara M. Hashmi, Iddo Drori

Abstract: Generating droplets from a continuous stream of fluid requires precise tuning of a device to find optimized control parameter conditions. It is analytically intractable to compute the necessary control parameter values of a droplet-generating device that produces optimized droplets. Furthermore, as the length scale of the fluid flow changes, the formation physics and optimized conditions that induce flow decomposition into droplets also change. Hence, a single proportional integral derivative controller is too inflexible to optimize devices of different length scales or different control parameters, while classification machine learning techniques take days to train and require millions of droplet images. Therefore, the question is posed, can a single method be created that universally optimizes multiple length-scale droplets using only a few data points and is faster than previous approaches? In this paper, a Bayesian optimization and computer vision feedback loop is designed to quickly and reliably discover the control parameter values that generate optimized droplets within different length-scale devices. This method is demonstrated to converge on optimum parameter values using 60 images in only 2.3 hours, 30x faster than previous approaches. Model implementation is demonstrated for two different length-scale devices: a milliscale inkjet device and a microfluidics device.

Submitted 15 January, 2022; v1 submitted 27 May, 2021; originally announced May 2021.

Comments: Published as a journal article in ACS Applied Materials & Interfaces

arXiv:2007.03841 [pdf, other]

doi 10.1504/IJITST.2020.10033617

Energy Efficient Cross Layer Time Synchronization in Cognitive Radio Networks

Authors: S. M. Usman Hashmi, Muntazir Hussain, S. M. Nashit Arshad, Kashif Inayat, Seong Oun Hwang

Abstract: Time synchronization is a vital concern for any Cognitive Radio Network (CRN) to perform dynamic spectrum management. Each Cognitive Radio (CR) node has to be environment aware and self adaptive and must have the ability to switch between multiple modulation schemes and frequencies. Achieving same notion of time within these CR nodes is essential to fulfill the requirements for simultaneous quiet periods for spectrum sensing. Current application layer time synchronization protocols require multiple timestamp exchanges to estimate skew between the clocks of CRN nodes. The proposed symbol timing recovery method already estimates the skew of hardware clock at the physical layer and use it for skew correction of application layer clock of each node. The heart of application layer clock is the hardware clock and hence application layer clock skew will be same as of physical layer and can be corrected from symbol timing recovery process. So one timestamp is enough to synchronize two CRN nodes. This conserves the energy utilized by application layer protocol and makes a CRN energy efficient and can achieve time synchronization in short span.

Submitted 7 July, 2020; originally announced July 2020.

Comments: International Journal of Internet Technology and Secured Transactions, 2020

arXiv:2006.14413 [pdf, other]

doi 10.1504/IJITST.2020.10033615

Implementation of Symbol Timing Recovery for Estimation of Clock Skew

Authors: S. M. Usman Hashmi, Muntazir Hussain, Fahad Bin Muslim, Kashif Inayat, Seong Oun Hwang

Abstract: Time synchronization in any distributed network can be achieved by using application layer protocols for time correction. Time synchronization method proposed in this article uses symbol timing recovery at the physical layer to correct application layer clock. This cross layer methodology diminishes the quantity of message trades needed by application layer for time synchronization thus resulting in energy saving. Precision of skew estimate can be increased by using multiple message exchanges. Examination of the cross layer strategy including the simulation results, the experimentation outcomes and mathematical analysis demonstrates that clock skew at physical layer is same as of application layer, which is actually the skew of hardware clock within the node.

Submitted 25 June, 2020; originally announced June 2020.

Journal ref: International Journal of Internet Technology and Secured Transactions (2020): https://www.inderscience.com/info/ingeneral/forthcoming.php?jcode=ijitst

arXiv:2004.10506 [pdf, ps, other]

doi 10.1109/ICTC52510.2021.9621035

A Non-Ideal NOMA-based mmWave D2D Networks with Hardware and CSI Imperfections

Authors: Leila Tlebaldiyeva, Galymzhan Nauryzbayev, Sultangali Arzykulov, Yerassyl Akhmetkaziyev, Mohammad S. Hashmi, Ahmed M. Eltawil

Abstract: This letter investigates a non-orthogonal multiple access (NOMA) assisted millimeter-wave device-to-device (D2D) network practically limited by multiple interference noises, transceiver hardware impairments, imperfect successive interference cancellation, and channel state information mismatch. Generalized outage probability expressions for NOMA-D2D users are deduced and achieved results, validated by Monte Carlo simulations, are compared with the orthogonal multiple access to show the superior performance of the proposed network model

Submitted 22 April, 2020; originally announced April 2020.

Comments: 4 pages, 3 figures

arXiv:2002.03488 [pdf, other]

Security and Privacy in IoT Using Machine Learning and Blockchain: Threats & Countermeasures

Authors: Nazar Waheed, Xiangjian He, Muhammad Ikram, Muhammad Usman, Saad Sajid Hashmi, Muhammad Usman

Abstract: Security and privacy of the users have become significant concerns due to the involvement of the Internet of things (IoT) devices in numerous applications. Cyber threats are growing at an explosive pace making the existing security and privacy measures inadequate. Hence, everyone on the Internet is a product for hackers. Consequently, Machine Learning (ML) algorithms are used to produce accurate outputs from large complex databases, where the generated outputs can be used to predict and detect vulnerabilities in IoT-based systems. Furthermore, Blockchain (BC) techniques are becoming popular in modern IoT applications to solve security and privacy issues. Several studies have been conducted on either ML algorithms or BC techniques. However, these studies target either security or privacy issues using ML algorithms or BC techniques, thus posing a need for a combined survey on efforts made in recent years addressing both security and privacy issues using ML algorithms and BC techniques. In this paper, we provide a summary of research efforts made in the past few years, starting from 2008 to 2019, addressing security and privacy issues using ML algorithms and BC techniques in the IoT domain. First, we discuss and categorize various security and privacy threats reported in the past twelve years in the IoT domain. Then, we classify the literature on security and privacy efforts based on ML algorithms and BC techniques in the IoT domain. Finally, we identify and illuminate several challenges and future research directions in using ML algorithms and BC techniques to address security and privacy issues in the IoT domain.

Submitted 5 August, 2020; v1 submitted 9 February, 2020; originally announced February 2020.

Comments: 35 pages, ACM CSUR Journal

arXiv:1906.00166 [pdf, other]

A Longitudinal Analysis of Online Ad-Blocking Blacklists

Authors: Saad Sajid Hashmi, Muhammad Ikram, Mohamed Ali Kaafar

Abstract: Websites employ third-party ads and tracking services leveraging cookies and JavaScript code, to deliver ads and track users' behavior, causing privacy concerns. To limit online tracking and block advertisements, several ad-blocking (black) lists have been curated consisting of URLs and domains of well-known ads and tracking services. Using Internet Archive's Wayback Machine in this paper, we collect a retrospective view of the Web to analyze the evolution of ads and tracking services and evaluate the effectiveness of ad-blocking blacklists. We propose metrics to capture the efficacy of ad-blocking blacklists to investigate whether these blacklists have been reactive or proactive in tackling the online ad and tracking services. We introduce a stability metric to measure the temporal changes in ads and tracking domains blocked by ad-blocking blacklists, and a diversity metric to measure the ratio of new ads and tracking domains detected. We observe that ads and tracking domains in websites change over time, and among the ad-blocking blacklists that we investigated, our analysis reveals that some blacklists were more informed with the existence of ads and tracking domains, but their rate of change was slower than other blacklists. Our analysis also shows that Alexa top 5K websites in the US, Canada, and the UK have the most number of ads and tracking domains per website, and have the highest proactive scores. This suggests that ad-blocking blacklists are updated by prioritizing ads and tracking domains reported in the popular websites from these countries.

Submitted 1 June, 2019; originally announced June 2019.

Comments: 9

arXiv:1810.05804 [pdf, ps, other]

doi 10.1109/GLOCOM.2018.8647409

User Transmit Power Minimization through Uplink Resource Allocation and User Association in HetNets

Authors: Umar Bin Farooq, Umair Sajid Hashmi, Junaid Qadir, Ali Imran, Adnan Noor Mian

Abstract: The popularity of cellular internet of things (IoT) is increasing day by day and billions of IoT devices will be connected to the internet. Many of these devices have limited battery life with constraints on transmit power. High user power consumption in cellular networks restricts the deployment of many IoT devices in 5G. To enable the inclusion of these devices, 5G should be supplemented with strategies and schemes to reduce user power consumption. Therefore, we present a novel joint uplink user association and resource allocation scheme for minimizing user transmit power while meeting the quality of service. We analyze our scheme for two-tier heterogeneous network (HetNet) and show an average transmit power of -2.8 dBm and 8.2 dBm for our algorithms compared to 20 dBm in state-of-the-art Max reference signal received power (RSRP) and channel individual offset (CIO) based association schemes.

Submitted 13 October, 2018; originally announced October 2018.

Journal ref: 2018 IEEE Global Communications Conference (GLOBECOM)

Showing 1–15 of 15 results for author: Hashmi, S