-
Teaching Requirements Engineering Concepts using Case-Based Learning
Authors:
Saurabh Tiwari,
Deepti Ameta,
Paramvir Singh,
Ashish Sureka
Abstract:
Requirements Engineering (RE) is known to be critical for the success of software projects, and hence forms an important part of any Software Engineering (SE) education curriculum offered at the tertiary level. In this paper, we report the results of an exploratory pilot study conducted to assess the effectiveness of the Case-Based Learning (CBL) methodology in facilitating the learning of several RE concepts. The evaluation was made on the basis of graduate students' responses to a set of questions representing various key learning principles, collected after the execution of two CBL sessions at DA-IICT, Gandhinagar (India). We investigate the perceived effectiveness of CBL in students' learning of various RE concepts, based on factors like case difference, gender diversity, and team size. Additionally, we collect and analyze the Teaching Assistants' (TAs) opinions about the conducted CBL sessions. The outcome of this CBL exercise was positive, as most students were able to achieve all five stated learning objectives. The authors also report various challenges, recommendations, and lessons learned while experiencing CBL sessions.
Submitted 5 April, 2018;
originally announced April 2018.
-
A Comparative Study of Different Source Code Metrics and Machine Learning Algorithms for Predicting Change Proneness of Object Oriented Systems
Authors:
Lov Kumar,
Ashish Sureka
Abstract:
Change-prone classes or modules are defined as software components in the source code which are likely to change in the future. Change-proneness prediction is useful to the maintenance team as they can optimize and focus their testing resources on the modules which have a higher likelihood of change. A change-proneness prediction model can be built by using source code metrics as predictors or features within a machine learning classification framework. In this paper, twenty-one source code metrics are computed to develop a statistical model for predicting change-prone modules. Since the performance of the change-proneness model depends on the source code metrics, they are used as independent variables or predictors for the change-proneness model. Eleven different feature selection techniques (including the usage of all the 21 proposed source code metrics described in the paper) are used to remove irrelevant features and select the best set of features. The effectiveness of the set of source code metrics is evaluated using eighteen different classification techniques and three ensemble techniques. Experimental results demonstrate that the model based on the selected set of source code metrics after applying feature selection techniques achieves better results as compared to the model using all source code metrics as predictors. Our experimental results reveal that the predictive model developed using LSSVM-RBF yields better results as compared to other classification techniques.
Submitted 21 December, 2017;
originally announced December 2017.
-
Using Source Code Metrics and Ensemble Methods for Fault Proneness Prediction
Authors:
Lov Kumar,
Santanu Rath,
Ashish Sureka
Abstract:
Software fault prediction models are employed to optimize testing resource allocation by identifying fault-prone classes before testing phases. Several researchers have validated the use of different classification techniques to develop predictive models for fault prediction. The performance of the statistical models has been shown to be influenced by the training and testing datasets. Ensemble learning algorithms have been widely used because they combine the capabilities of their constituent models on a dataset to come up with a potentially higher performance as compared to individual models (improving generalizability). In the study presented in this paper, three different ensemble methods have been applied to develop a model for predicting fault proneness. The efficacy and usefulness of a fault prediction model also depend on the source code metrics which are considered as the input for the model.
In this paper, we propose a framework to validate the source code metrics and select the right set of metrics with the objective of improving the performance of the fault prediction model. The fault prediction models are then validated using a cost evaluation framework. We conduct a series of experiments on 45 open-source project datasets. Key conclusions from our experiments are: (1) Majority Voting Ensemble (MVE) methods outperformed other methods; (2) the set of source code metrics selected using the suggested source code metrics validation framework, when used as the input, achieves better results compared to all other metrics; (3) the fault prediction method is effective for software projects with a percentage of faulty classes lower than the threshold value (low - 54.82%, medium - 41.04%, high - 28.10%).
Submitted 14 April, 2017;
originally announced April 2017.
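The majority voting idea named in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the three base classifiers, the metric names (`loc`, `cbo`, `wmc`) and every threshold are hypothetical placeholders chosen only to make the example runnable.

```python
from collections import Counter
from typing import Callable, Dict, List

# A base classifier maps a dict of source code metrics to a label:
# 1 = predicted fault-prone, 0 = predicted not fault-prone.
Classifier = Callable[[Dict[str, float]], int]


def by_size(metrics: Dict[str, float]) -> int:
    # Hypothetical rule: large classes are flagged (threshold is made up).
    return int(metrics["loc"] > 300)


def by_coupling(metrics: Dict[str, float]) -> int:
    # Hypothetical rule: highly coupled classes are flagged.
    return int(metrics["cbo"] > 10)


def by_complexity(metrics: Dict[str, float]) -> int:
    # Hypothetical rule: classes with many weighted methods are flagged.
    return int(metrics["wmc"] > 25)


def majority_vote(classifiers: List[Classifier], metrics: Dict[str, float]) -> int:
    """Return the label predicted by the largest number of base classifiers."""
    votes = [clf(metrics) for clf in classifiers]
    # most_common(1) yields the (label, count) pair with the highest count.
    return Counter(votes).most_common(1)[0][0]


if __name__ == "__main__":
    ensemble = [by_size, by_coupling, by_complexity]
    sample = {"loc": 500, "cbo": 12, "wmc": 5}
    print(majority_vote(ensemble, sample))
```

Using an odd number of base classifiers avoids ties for binary labels. In practice the base models would be trained classifiers rather than fixed rules.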
-
Empirical Analysis on Comparing the Performance of Alpha Miner Algorithm in SQL Query Language and NoSQL Column-Oriented Databases Using Apache Phoenix
Authors:
Kunal Gupta,
Astha Sachdev,
Ashish Sureka
Abstract:
A Process-Aware Information System (PAIS) is an IT system that supports business processes and generates large amounts of event logs from the execution of business processes. An event log is represented as a tuple of CaseID, Timestamp, Activity and Actor. Process Mining is a new and emerging field that aims at analyzing the event logs to discover, enhance and improve business processes and check conformance between run time and design time business processes. The large volume of event logs generated is stored in databases. Relational databases perform well for a certain class of applications. However, there is a certain class of applications for which relational databases are not able to scale. To handle such a class of applications, NoSQL database systems emerged. Discovering a process model (workflow model) from event logs is one of the most challenging and important Process Mining tasks. The α-miner algorithm is one of the first and most widely used Process Discovery techniques. Our objective is to investigate which of the databases (Relational or NoSQL) performs better for a Process Discovery application under Process Mining. We implement the α-miner algorithm on relational (row-oriented) and NoSQL (column-oriented) databases in database query languages so that our algorithm is tightly coupled to the database. We present a performance benchmarking and comparison of the α-miner algorithm on a row-oriented database and a NoSQL column-oriented database so that we can compare which database can efficiently store massive event logs and analyze them in seconds to discover a process model.
Submitted 16 March, 2017;
originally announced March 2017.
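The first step of the α-miner algorithm, deriving ordering relations between activities from an event log, can be sketched as follows. The paper implements the algorithm in database query languages. The Python below is only an in-memory illustration, the function names are hypothetical, and it covers the ordering relations only, not the construction of the final process model.

```python
from collections import defaultdict
from typing import Iterable, List, Set, Tuple

# One event is a tuple (CaseID, Timestamp, Activity, Actor), as in the abstract.
Event = Tuple[str, int, str, str]
Pair = Tuple[str, str]


def build_traces(events: Iterable[Event]) -> List[List[str]]:
    """Group events by case and order each case's activities by timestamp."""
    by_case = defaultdict(list)
    for case_id, timestamp, activity, _actor in events:
        by_case[case_id].append((timestamp, activity))
    return [[activity for _, activity in sorted(rows)] for rows in by_case.values()]


def ordering_relations(traces: List[List[str]]) -> Tuple[Set[Pair], Set[Pair], Set[Pair]]:
    """Compute the directly-follows, causal and parallel relations."""
    # a > b: b directly follows a in at least one trace.
    follows = {(a, b) for trace in traces for a, b in zip(trace, trace[1:])}
    # a -> b: a > b holds but b > a does not.
    causal = {(a, b) for a, b in follows if (b, a) not in follows}
    # a || b: both a > b and b > a hold.
    parallel = {(a, b) for a, b in follows if (b, a) in follows}
    return follows, causal, parallel


if __name__ == "__main__":
    log = [
        ("c1", 1, "A", "u1"), ("c1", 2, "B", "u2"), ("c1", 3, "C", "u1"), ("c1", 4, "D", "u3"),
        ("c2", 1, "A", "u1"), ("c2", 2, "C", "u2"), ("c2", 3, "B", "u1"), ("c2", 4, "D", "u3"),
    ]
    follows, causal, parallel = ordering_relations(build_traces(log))
    print(sorted(causal))
    print(sorted(parallel))
```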
-
Investigating the Application of Common-Sense Knowledge-Base for Identifying Term Obfuscation in Adversarial Communication
Authors:
Swati Agarwal,
Ashish Sureka
Abstract:
Word obfuscation or substitution means replacing one word with another word in a sentence to conceal the textual content or communication. Word obfuscation is used in adversarial communication by terrorists or criminals for conveying their messages without getting red-flagged by security and intelligence agencies intercepting or scanning messages (such as emails and telephone conversations). ConceptNet is a freely available semantic network represented as a directed graph consisting of nodes as concepts and edges as assertions of common sense about these concepts. We present a solution approach exploiting the vast amount of semantic knowledge in ConceptNet for addressing the technically challenging problem of word substitution in adversarial communication. We frame the given problem as a textual reasoning and context inference task and utilize ConceptNet's natural-language-processing tool-kit for determining word substitution. We use ConceptNet to compute the conceptual similarity between any two given terms and define a Mean Average Conceptual Similarity (MACS) metric to identify out-of-context terms. The test-bed to evaluate our proposed approach consists of the Enron email dataset (having over 600000 emails generated by 158 employees of Enron Corporation) and the Brown corpus (totaling about a million words drawn from a wide variety of sources). We implement word substitution techniques used by previous researchers to generate a test dataset. We conduct a series of experiments consisting of word substitution methods used in the past to evaluate our approach. Experimental results reveal that the proposed approach is effective.
Submitted 17 January, 2017;
originally announced January 2017.
-
Characterizing Linguistic Attributes for Automatic Classification of Intent Based Racist/Radicalized Posts on Tumblr Micro-Blogging Website
Authors:
Swati Agarwal,
Ashish Sureka
Abstract:
Research shows that many like-minded people use popular microblogging websites for posting hateful speech against various religions and races. Automatic identification of racist and hate promoting posts is required for building social media intelligence and security informatics based solutions. However, keyword spotting based techniques alone cannot be used to accurately identify the intent of a post. In this paper, we address the challenge of the presence of ambiguity in such posts by identifying the intent of the author. We conduct our study on the Tumblr microblogging website and develop a cascaded ensemble learning classifier for identifying the posts having racist or radicalized intent. We train our model by identifying various semantic, sentiment and linguistic features from free-form text. Our experimental results show that the proposed approach is effective and that the emotion tone, social tendencies, language cues and personality traits of a narrative are discriminatory features for identifying the racist intent behind a post.
Submitted 17 January, 2017;
originally announced January 2017.
-
Parichayana: An Eclipse Plugin for Detecting Exception Handling Anti-Patterns and Code Smells in Java Programs
Authors:
Ashish Sureka
Abstract:
Anti-patterns and code-smells are signs in the source code which are not defects (they do not prevent the program from functioning and do not cause compile errors) but are rather indicators of deeper and bigger problems. Exception handling is a programming construct designed to handle the occurrence of anomalous or exceptional conditions (that change the normal flow of program execution). In this paper, we present an Eclipse plug-in (called Parichayana) for detecting exception handling anti-patterns and code smells in Java programs. Parichayana is capable of automatically detecting several commonly occurring exception handling programming mistakes. We extend the Eclipse IDE and create new menu entries and associated actions via the Parichayana plug-in (free and open-source, hosted on GitHub). We compare and contrast Parichayana with several code smell detection tools and demonstrate that our tool provides unique capabilities relative to existing tools. We have created an update site and developers can use the Eclipse update manager to install Parichayana from our site. We used Parichayana on several large open-source Java based projects and detected the presence of exception handling anti-patterns.
Submitted 31 December, 2016;
originally announced January 2017.
-
Graph or Relational Databases: A Speed Comparison for Process Mining Algorithm
Authors:
Jeevan Joishi,
Ashish Sureka
Abstract:
Process-Aware Information Systems (PAIS) are IT systems that manage and support business processes and generate large event logs from the execution of business processes. An event log is represented as a tuple of the form CaseID, TimeStamp, Activity and Actor. Process Mining is an emerging area of research that deals with the study and analysis of business processes based on event logs. Process Mining aims at analyzing event logs to discover business process models, enhance them or check for conformance with an a priori model. The large volume of event logs generated is stored in databases. Relational databases perform well for a certain class of applications. However, there is a certain class of applications for which relational databases are not able to scale. A number of NoSQL databases have emerged to counter the challenges of scalability. Discovering a social network from event logs is one of the most challenging and important Process Mining tasks. Similar-Task and Sub-Contract algorithms are some of the most widely used Organizational Mining techniques. Our objective is to investigate which of the databases (Relational or Graph) performs better for Organizational Mining under Process Mining. An intersection of Process Mining and Graph Databases can be accomplished by modelling these Organizational Mining metrics with graph databases. We implement Similar-Task and Sub-Contract algorithms on relational and NoSQL (graph-oriented) databases using only query language constructs. We conduct empirical analysis on a large real-world data set to compare the performance of a row-oriented database and a NoSQL graph-oriented database. We benchmark performance factors like query execution time, CPU usage and disk/memory space usage for a NoSQL graph-oriented database against a row-oriented database.
Submitted 31 December, 2016;
originally announced January 2017.
-
Application of Case-Based Teaching and Learning in Compiler Design Course
Authors:
Divya Kundra,
Ashish Sureka
Abstract:
Compiler design is a course that discusses ideas used in the construction of programming language compilers. Students learn how a program written in a high-level programming language and designed for human understanding is systematically converted into low-level assembly language understood by machines. We propose and implement a Case-based and Project-based Learning environment for teaching important Compiler design concepts (CPLC) to B.Tech third year students of a Delhi University (India) college. A case is a text that describes a real-life situation, providing information but not a solution. Previous research shows that case-based teaching helps students to apply the principles discussed in the class for solving complex practical problems. We divide one main project into sub-projects to give to students in order to enhance their practical experience of designing a compiler. To measure the effectiveness of case-based discussions, students complete a survey on their perceptions of the benefits of case-based learning. The survey is analyzed using frequency distribution and the chi-square test of association. The results of the survey show that case-based teaching of compiler concepts does enhance students' skills of learning, critical thinking, engagement, communication and team work.
Submitted 1 November, 2016;
originally announced November 2016.
-
A Bibliometric Study of Asia Pacific Software Engineering Conference from 2010 to 2015
Authors:
Lov Kumar,
Saikrishna Sripada,
Ashish Sureka
Abstract:
The Asia-Pacific Software Engineering Conference (APSEC) is a reputed and long-running conference which has successfully completed more than two decades as of year 2015. We conduct a bibliometric and scientific publication mining based study of how the conference has evolved over the recent past six years (year 2010 to 2015). Our objective is to perform an in-depth examination of the state of APSEC so that the APSEC community can identify strengths, areas of improvement and future directions for the conference. Our empirical analysis is based on various perspectives such as: paper submission acceptance rate trends, conference location, scholarly productivity and contributions from various countries, analysis of keynotes, workshops, conference organizers and sponsors, tutorials, identification of prolific authors, computation of citation impact of papers and contributing authors, internal and external collaboration, university and industry participation and collaboration, measurement of gender imbalance, topical analysis, yearly author churn and program committee characteristics.
Submitted 30 October, 2016;
originally announced October 2016.
-
An Experimental Study on the Learning Outcome of Teaching Elementary Level Children using Lego Mindstorms EV3 Robotics Education Kit
Authors:
Vidushi Chaudhary,
Vishnu Agrawal,
Ashish Sureka
Abstract:
Skills like computational thinking, problem solving, handling complexity, team-work and project management are essential for future careers and need to be taught to students at the elementary level itself. Computer programming knowledge and skills, experiencing technology and conducting science and engineering experiments are also important for students at the elementary level. However, teaching such skills effectively through active learning can be challenging for educators. In this paper, we present our approach and experiences in teaching such skills to several elementary level children using the Lego Mindstorms EV3 robotics education kit. We describe our learning environment consisting of lessons, worksheets, hands-on activities and assessment. We taught students how to design, construct and program robots using components such as motors, sensors, wheels, axles, beams, connectors and gears. Students also gained knowledge of basic programming constructs such as control flow, loops, branches and conditions using a visual programming environment. We carefully observed how students performed various tasks and solved problems. We present experimental results which demonstrate that our teaching methodology, consisting of both the course content and pedagogy, was effective in imparting the desired skills and knowledge to elementary level children. The students also participated in a competitive World Robot Olympiad India event and qualified during the regional round, which is evidence of the effectiveness of the approach.
Submitted 30 October, 2016;
originally announced October 2016.
-
Thirteen Years of Mining Software Repositories (MSR) Conference - What is the Bibliography Data Telling Us?
Authors:
Lov Kumar,
Ashish Sureka
Abstract:
The Mining Software Repositories (MSR) conference is a reputed, long-running and flagship conference in the area of Software Analytics which has successfully completed more than one decade as of year 2016. We conduct a bibliometric and scientific publication mining based study of how the conference has evolved over the recent past 13 years (from 2004 to 2007 as a workshop and then from 2008 to 2016 as a conference). Our objective is to perform an examination of the state of MSR so that the MSR community can identify strengths, areas of improvement and future directions for the conference.
Submitted 20 September, 2016;
originally announced September 2016.
-
Spider and the Flies : Focused Crawling on Tumblr to Detect Hate Promoting Communities
Authors:
Swati Agarwal,
Ashish Sureka
Abstract:
Tumblr is one of the largest and most popular microblogging websites on the Internet. Studies show that due to high reachability among viewers, low publication barriers and social networking connectivity, microblogging websites are being misused by existing extremist groups as a platform to post hateful speech and recruit new members. Manual identification of such posts and communities is overwhelmingly impractical due to the large number of posts and blogs being published every day. We propose a topic based web crawler primarily consisting of multiple phases: training a text classifier model on examples of only hate promoting users, extracting posts of an unknown Tumblr micro-blogger, classifying hate promoting bloggers based on their activity feeds, crawling through the external links to other bloggers and performing a social network analysis on connected extremist bloggers. To investigate the effectiveness of our approach, we conduct experiments on a large real-world dataset. Experimental results reveal that the proposed approach is an effective method and has an F-score of 0.80. We apply social network analysis based techniques and identify influential and core bloggers in a community.
Submitted 30 March, 2016;
originally announced March 2016.
-
Anvaya: An Algorithm and Case-Study on Improving the Goodness of Software Process Models generated by Mining Event-Log Data in Issue Tracking System
Authors:
Prerna Juneja,
Divya Kundra,
Ashish Sureka
Abstract:
Issue Tracking Systems (ITS) such as Bugzilla can be viewed as Process Aware Information Systems (PAIS) generating event-logs during the life-cycle of a bug report. Process Mining consists of mining event logs generated from PAIS for process model discovery, conformance and enhancement. We apply process map discovery techniques to mine event trace data generated from the ITS of the open-source Firefox browser project to generate and study process models. Bug life-cycles exhibit diversity and variance. Therefore, the process models generated from the event-logs are spaghetti-like with a large number of edges, inter-connections and nodes. Such models are complex to analyse and difficult for a process analyst to comprehend. We improve the Goodness (fitness and structural complexity) of the process models by splitting the event-log into homogeneous subsets by clustering structurally similar traces. We adapt the K-Medoid clustering algorithm with two different distance metrics: Longest Common Subsequence (LCS) and Dynamic Time Warping (DTW). We evaluate the goodness of the process models generated from the clusters using complexity and fitness metrics. We study back-forth and self-loops, bug reopening, and bottlenecks in the clusters obtained and show that clustering enables better analysis. We also propose an algorithm to automate the clustering process; the algorithm takes as input the event log and returns the best cluster set.
Submitted 22 November, 2015;
originally announced November 2015.
-
Applying Social Media Intelligence for Predicting and Identifying On-line Radicalization and Civil Unrest Oriented Threats
Authors:
Swati Agarwal,
Ashish Sureka
Abstract:
Research shows that various social media platforms on the Internet such as Twitter, Tumblr (micro-blogging websites), Facebook (a popular social networking website), YouTube (largest video sharing and hosting website), blogs and discussion forums are being misused by extremist groups for spreading their beliefs and ideologies, promoting radicalization, recruiting members and creating online virtual communities sharing a common agenda. Popular microblogging websites such as Twitter are being used as a real-time platform for information sharing and communication during planning and mobilization of civil unrest related events. Applying social media intelligence for predicting and identifying online radicalization and civil unrest oriented threats is an area that has attracted several researchers' attention over the past 10 years. Several algorithms, techniques and tools have been proposed in existing literature to counter and combat cyber-extremism and to predict protest related events well in advance. In this paper, we conduct a literature review of all these existing techniques and do a comprehensive analysis to understand the state of the art, trends and research gaps. We present a one class classification approach to collect scholarly articles targeting the topics and subtopics of our research scope. We perform characterization, classification and an in-depth meta-analysis of about 100 conference and journal papers to gain a better understanding of existing literature.
Submitted 21 November, 2015;
originally announced November 2015.
-
Kernel Based Sequential Data Anomaly Detection in Business Process Event Logs
Authors:
Ashish Sureka
Abstract:
Business Process Management Systems (BPMS) log events and traces of activities during the execution of a process. Anomalies are defined as deviation or departure from the normal or common order. Anomaly detection in business process logs has several applications such as fraud detection and understanding the causes of process errors. In this paper, we present a novel approach for anomaly detection in business process logs. We model the event logs as sequential data and apply kernel based anomaly detection techniques to identify outliers and discordant observations. Our technique is unsupervised (does not require a pre-annotated training dataset) and employs a kNN (k-nearest neighbor) kernel based technique and a normalized longest common subsequence (LCS) similarity measure. We conduct experiments on recent, large, real-world incident management data of an enterprise and demonstrate that our approach is effective.
Submitted 5 July, 2015;
originally announced July 2015.
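A minimal sketch of the kind of technique the abstract above describes: score each trace by its distance to its k-th nearest neighbor, where distance is one minus a normalized LCS similarity. The abstract does not give the exact normalization or scoring rule, so the formula LCS / sqrt(|a| * |b|), the choice of k and all function names here are assumptions, not the paper's definitions.

```python
import math
from typing import List, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def nlcs_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Normalized LCS similarity in [0, 1] (assumed normalization)."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / math.sqrt(len(a) * len(b))


def knn_anomaly_scores(traces: List[Sequence[str]], k: int = 2) -> List[float]:
    """Score each trace by its distance to its k-th nearest other trace."""
    if len(traces) < 2:
        raise ValueError("need at least two traces")
    scores = []
    for i, trace in enumerate(traces):
        distances = sorted(
            1.0 - nlcs_similarity(trace, other)
            for j, other in enumerate(traces)
            if j != i
        )
        # Fall back to the farthest neighbor when fewer than k others exist.
        scores.append(distances[min(k, len(distances)) - 1])
    return scores


if __name__ == "__main__":
    normal = ["open", "assign", "resolve", "close"]
    odd = ["open", "close", "reopen", "reopen", "escalate"]
    scores = knn_anomaly_scores([normal, normal, normal, odd], k=2)
    print(scores)
```

Traces with the highest scores are the candidates for anomalies. No labeled training data is needed, which matches the unsupervised setting described in the abstract.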
-
Intention-Oriented Process Model Discovery from Incident Management Event Logs
Authors:
Ashish Sureka
Abstract:
Intention-oriented process mining is based on the belief that the fundamental nature of processes is mostly intentional (unlike activity-oriented processes) and aims at discovering strategy and intentional process models from event logs recorded during process enactment. In this paper, we present an application of intention-oriented process mining to the incident management domain of an Information Technology Infrastructure Library (ITIL) process. We apply the Map Miner Method (MMM) to a large real-world dataset to discover hidden and unobservable user behavior, strategies and intentions. We first discover user strategies from the given activity sequence data by applying a Hidden Markov Model (HMM) based unsupervised learning technique. We then process the emission and transition matrices of the discovered HMM to generate a coarse-grained Map Process Model. We present the first application of the emerging field of intention-oriented process mining to an incident management event-log dataset and discuss its applicability, effectiveness and challenges.
Submitted 4 July, 2015;
originally announced July 2015.
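The step of turning a transition matrix into a coarse-grained map, mentioned in the abstract above, can be illustrated with a small sketch. This is not the Map Miner Method itself: the state names, the matrix values, the thresholding rule and the function name are hypothetical, chosen only to show how strong transitions between hidden states could become edges of a map.

```python
from typing import List, Tuple


def transitions_to_edges(
    states: List[str],
    transition: List[List[float]],
    threshold: float = 0.2,
) -> List[Tuple[str, str, float]]:
    """Keep transitions between distinct states whose probability meets the threshold."""
    edges = []
    for i, src in enumerate(states):
        for j, dst in enumerate(states):
            if i != j and transition[i][j] >= threshold:
                edges.append((src, dst, transition[i][j]))
    return edges


# Hypothetical hidden states (strategies) and a made-up row-stochastic transition matrix.
states = ["diagnose", "escalate", "resolve"]
transition = [
    [0.60, 0.30, 0.10],
    [0.10, 0.50, 0.40],
    [0.05, 0.05, 0.90],
]
edges = transitions_to_edges(states, transition, threshold=0.2)
```

Raising the threshold gives a coarser map with fewer edges, while lowering it keeps rarer transitions.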
-
Survey Results on Threats To External Validity, Generalizability Concerns, Data Sharing and University-Industry Collaboration in Mining Software Repository (MSR) Research
Authors:
Ashish Sureka,
Ambika Tripathi,
Savita Dabral
Abstract:
Mining Software Repositories (MSR) is an applied and practice-oriented field aimed at solving real problems encountered by practitioners and bringing value to industry. Replication of results and findings, generalizability and external validity, University-Industry collaboration, data sharing and the creation of dataset repositories are important issues in MSR research. Research consisting of bibliometric analysis of MSR papers shows a lack of University-Industry collaboration, a deficiency of studies on closed or proprietary source datasets, and a lack of data as well as tool sharing by researchers. We conduct a survey of authors from the past three years of the MSR conference (2012, 2013 and 2014) to collect data on their views and suggestions to address the stated concerns. We asked 20 questions of more than 100 authors and received responses from 39 authors. Our results show that about one-third of the respondents always make their dataset publicly available and about one-third believe that data sharing should be a mandatory condition for publication in MSR conferences. Our survey reveals that more than 50% of authors used solely open-source software (OSS) datasets for their research. More than 50% of the respondents mentioned that the difficulty of sharing industrial datasets outside the company is one of the major impediments to University-Industry collaboration.
Submitted 4 June, 2015;
originally announced June 2015.
-
Chaff from the Wheat : Characterization and Modeling of Deleted Questions on Stack Overflow
Authors:
Denzil Correa,
Ashish Sureka
Abstract:
Stack Overflow is the most popular community question answering (CQA) site for programmers on the web, with 2.05M users, 5.1M questions and 9.4M answers. Stack Overflow has explicit, detailed guidelines on how to post questions and an ebullient moderation community. Despite these precise communications and safeguards, questions posted on Stack Overflow can be extremely off topic or very poor in quality. Such questions can be deleted from Stack Overflow at the discretion of experienced community members and moderators. We present the first study of deleted questions on Stack Overflow. We divide our study into two parts: (i) characterization of deleted questions over approximately 5 years (2008-2013) of data, and (ii) prediction of deletion at the time of question creation. Our characterization study reveals multiple insights on question deletion phenomena. We observe a significant increase in the number of deleted questions over time. We find that it takes substantial time to vote a question to be deleted but once voted, the community takes swift action. We also see that question authors delete their questions to salvage reputation points. We notice some instances of accidental deletion of good quality questions, but such questions are quickly voted back to be undeleted. We discover a pyramidal structure of question quality on Stack Overflow and find that deleted questions lie at the bottom (lowest quality) of the pyramid. We also build a predictive model to predict the deletion of a question at creation time. We experiment with 47 features based on User Profile, Community Generated, Question Content and Syntactic style and report an accuracy of 66%. Our feature analysis reveals that all four categories of features are important for the prediction task. Our findings reveal important suggestions for content quality maintenance on community-based question answering websites.
Submitted 2 January, 2014;
originally announced January 2014.
-
A Case-Study on Teaching Undergraduate-Level Software Engineering Course Using Inverted-Classroom, Large-Group, Real-Client and Studio-Based Instruction Model
Authors:
Ashish Sureka,
Monika Gupta,
Dipto Sarkar,
Vidushi Chaudhary
Abstract:
We present a case-study on teaching an undergraduate level course on Software Engineering (second year and fifth semester of bachelor's program in Computer Science) at a State University (New Delhi, India) using a novel teaching instruction model. Our approach has four main elements: inverted or flipped classroom, studio-based learning, real-client projects and deployment, and large teams with peer evaluation. We present our motivation and approach, challenges encountered, pedagogical benefits, findings (both positive and negative) and recommendations. Our motivation was to teach Software Engineering using active learning (significantly increasing engagement and collaboration with the instructor and other students in the class), team-work, a balance between theory and practice, the technical and managerial skills encountered in the real world, and problem-based learning (through an intensive semester-long project). We conduct a detailed survey (anonymous, optional and online) and present the results of student responses. Survey results reveal that for nearly every student (class size: 89) the instruction model was new, interesting and had a positive impact on motivation, in addition to meeting the learning outcomes of the course.
Submitted 2 September, 2013;
originally announced September 2013.
-
Fit or Unfit : Analysis and Prediction of 'Closed Questions' on Stack Overflow
Authors:
Denzil Correa,
Ashish Sureka
Abstract:
Stack Overflow is widely regarded as the most popular Community driven Question Answering (CQA) website for programmers. Questions posted on Stack Overflow that are not related to programming topics are marked as 'closed' by experienced users and community moderators. A question can be 'closed' for five reasons: duplicate, off-topic, subjective, not a real question and too localized. In this work, we present the first study of 'closed' questions on Stack Overflow. We download 4 years of publicly available data which contains 3.4 Million questions. We first analyze and characterize the complete set of 0.1 Million 'closed' questions. Next, we use a machine learning framework and build a predictive model to identify a 'closed' question at the time of question creation.
One of our key findings is that despite being marked as 'closed', subjective questions contain high information value and are very popular with the users. We observe an increasing trend in the percentage of closed questions over time and find that this increase is positively correlated with the number of newly registered users. In addition, we also see a decrease in community participation in marking 'closed' questions, which has led to an increase in moderation job time. We also find that questions closed with the Duplicate and Off Topic labels are relatively more prone to reputation gaming. For the 'closed' question prediction task, we make use of multiple genres of feature sets based on user profile, community process, textual style and question content. We use a state-of-the-art machine learning classifier based on an ensemble learning technique and achieve an overall accuracy of 73%. To the best of our knowledge, this is the first experimental study to analyze and predict 'closed' questions on Stack Overflow.
Submitted 27 July, 2013;
originally announced July 2013.
-
Solutions to Detect and Analyze Online Radicalization : A Survey
Authors:
Denzil Correa,
Ashish Sureka
Abstract:
Online Radicalization (also called Cyber-Terrorism or Extremism or Cyber-Racism or Cyber-Hate) is widespread and has become a major and growing concern to society, governments and law enforcement agencies around the world. Research shows that various platforms on the Internet (which offer a low barrier to publishing content, allow anonymity, provide exposure to millions of users and the potential for very quick and widespread diffusion of a message) such as YouTube (a popular video sharing website), Twitter (an online micro-blogging service), Facebook (a popular social networking website), online discussion forums and the blogosphere are being misused for malicious intent. Such platforms are being used to form hate groups and racist communities, spread extremist agendas, incite anger or violence, promote radicalization, recruit members and create virtual organizations and communities. Automatic detection of online radicalization is a technically challenging problem because of the vast amount of data, unstructured and noisy user-generated content, dynamically changing content and adversarial behavior. Several solutions have been proposed in the literature to combat and counter cyber-hate and cyber-extremism. In this survey, we review solutions to detect and analyze online radicalization. We review 40 papers published at 12 venues from June 2003 to November 2011. We present a novel classification scheme to classify these papers. We analyze these techniques, perform trend analysis, discuss limitations of existing techniques and identify research gaps.
Submitted 21 January, 2013;
originally announced January 2013.
-
Characterizing Pedophile Conversations on the Internet using Online Grooming
Authors:
Aditi Gupta,
Ponnurangam Kumaraguru,
Ashish Sureka
Abstract:
Cyber-crime targeting children, such as online pedophile activity, is a major and growing concern to society. A deep understanding of predatory chat conversations on the Internet has implications for designing effective solutions to automatically distinguish malicious conversations from regular conversations. We believe that a deeper understanding of the pedophile conversation can result in more sophisticated and robust surveillance systems than the majority of current systems, which rely only on shallow processing such as simple word-counting or key-word spotting.
In this paper, we study pedophile conversations from the perspective of online grooming theory and perform a series of linguistic-based empirical analyses on several pedophile chat conversations to gain useful insights and patterns. We manually annotated 75 pedophile chat conversations with six stages of online grooming and test several hypotheses on them. The results of our experiments reveal that relationship forming is the most dominant online grooming stage, in contrast to the sexual stage. We use a widely used word-counting program (LIWC) to create psycho-linguistic profiles for each of the six online grooming stages to discover interesting textual patterns useful for improving our understanding of the online pedophile phenomenon. Furthermore, we present empirical results that shed light on various aspects of a pedophile conversation, such as the probability of state transitions from one stage to another, the distribution of a pedophile chat conversation across various online grooming stages and correlations between pre-defined word categories and online grooming stages.
Submitted 17 August, 2012;
originally announced August 2012.
-
Mining User Comment Activity for Detecting Forum Spammers in YouTube
Authors:
Ashish Sureka
Abstract:
Research shows that comment spamming (comments which are unsolicited, unrelated, abusive, hateful, commercial advertisements, etc.) in online discussion forums has become a common phenomenon in Web 2.0 applications and there is a strong need to counter or combat comment spamming. We present a method to automatically detect comment spammers in YouTube (the largest and a popular video sharing website) forums. The proposed technique is based on mining the comment activity log of a user and extracting patterns (such as the time interval between subsequent comments and the presence of exactly the same comment across multiple unrelated videos) indicating spam behavior. We perform empirical analysis on data crawled from YouTube and demonstrate that the proposed method is effective for the task of comment spammer detection.
Submitted 25 March, 2011;
originally announced March 2011.
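The two activity-log patterns named in the abstract above (short intervals between comments, and the same comment repeated across videos) can be sketched as simple features. This is an illustrative sketch only: the tuple layout of a comment, the function names and the cut-off values `max_gap` and `min_repeat` are assumptions, not values from the paper.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, List, Optional, Tuple

# Assumed record layout: (timestamp in seconds, video id, comment text) for a single user.
Comment = Tuple[int, str, str]


def spam_signals(comments: List[Comment]) -> Dict[str, Optional[float]]:
    """Compute the median gap between comments and the widest reuse of one comment text."""
    ordered = sorted(comments)
    gaps = [later[0] - earlier[0] for earlier, later in zip(ordered, ordered[1:])]
    videos_per_text = defaultdict(set)
    for _, video_id, text in ordered:
        # Light normalization so trivial case or spacing changes still count as the same text.
        videos_per_text[text.strip().lower()].add(video_id)
    max_repeat = max((len(v) for v in videos_per_text.values()), default=0)
    return {
        "median_gap": median(gaps) if gaps else None,
        "max_videos_same_text": max_repeat,
    }


def looks_like_spammer(comments: List[Comment], max_gap: int = 30, min_repeat: int = 3) -> bool:
    """Flag a user whose comments are both rapid and repeated across several videos."""
    signals = spam_signals(comments)
    gap = signals["median_gap"]
    return gap is not None and gap <= max_gap and signals["max_videos_same_text"] >= min_repeat


suspect = [(0, "v1", "Buy now"), (10, "v2", "buy now"), (20, "v3", "Buy now ")]
regular = [(0, "v1", "nice"), (5000, "v2", "great song")]
flags = (looks_like_spammer(suspect), looks_like_spammer(regular))
```

A rule that requires both signals is deliberately conservative; a real detector would tune or learn these cut-offs from labeled data.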