
Teaching Software Metrology: The Science of Measurement for Software Engineering

Authors: Paul Ralph, Miikka Kuutila, Hera Arif, Bimpe Ayoola

Abstract: While the methodological rigor of computing research has improved considerably in the past two decades, quantitative software engineering research is hampered by immature measures and inattention to theory. Measurement, the principled assignment of numbers to phenomena, is intrinsically difficult because observation is predicated upon not only theoretical concepts but also the values and perspective of the research. Despite several previous attempts to raise awareness of more sophisticated approaches to measurement and the importance of quantitatively assessing reliability and validity, measurement issues continue to be widely ignored. The reasons are unknown, but differences in typical engineering and computer science graduate training programs (compared to psychology and management, for example) are involved. This chapter therefore reviews key concepts in the science of measurement and applies them to software engineering research. A series of exercises for applying important measurement concepts to the reader's research are included, and a sample dataset for the reader to try some of the statistical procedures mentioned is provided.

Submitted 20 June, 2024; originally announced June 2024.

arXiv:2406.08369 [pdf, other]

Teaching Literature Reviewing for Software Engineering Research

Authors: Sebastian Baltes, Paul Ralph

Abstract: The goal of this chapter is to support teachers in holistically introducing graduate students to literature reviews, with a particular focus on secondary research. It provides an overview of the overall literature review process and the different types of literature review before diving into guidelines for selecting and conducting different types of literature review. The chapter also provides recommendations for evaluating the quality of existing literature reviews and concludes with a summary of our learning goals and how the chapter supports teachers in addressing them.

Submitted 12 June, 2024; originally announced June 2024.

Comments: 27 pages, 1 figure, 2 tables. arXiv admin note: text overlap with arXiv:2205.01163

arXiv:2406.01966 [pdf, ps, other]

Creativity, Generative AI, and Software Development: A Research Agenda

Authors: Victoria Jackson, Bogdan Vasilescu, Daniel Russo, Paul Ralph, Maliheh Izadi, Rafael Prikladnicki, Sarah D'Angelo, Sarah Inman, Anielle Lisboa, Andre van der Hoek

Abstract: Creativity has always been considered a major differentiator to separate the good from the great, and we believe the importance of creativity for software development will only increase as GenAI becomes embedded in developer tool-chains and working practices. This paper uses the McLuhan tetrad alongside scenarios of how GenAI may disrupt software development more broadly, to identify potential impacts GenAI may have on creativity within software development. The impacts are discussed along with a future research agenda comprising six connected themes that consider how individual capabilities, team capabilities, the product, unintended consequences, society, and human aspects can be affected.

Submitted 4 June, 2024; originally announced June 2024.

arXiv:2404.15667 [pdf, other]

The Promise and Challenges of Using LLMs to Accelerate the Screening Process of Systematic Reviews

Authors: Aleksi Huotala, Miikka Kuutila, Paul Ralph, Mika Mäntylä

Abstract: Systematic review (SR) is a popular research method in software engineering (SE). However, conducting an SR takes an average of 67 weeks. Thus, automating any step of the SR process could reduce the effort associated with SRs. Our objective is to investigate if Large Language Models (LLMs) can accelerate title-abstract screening by simplifying abstracts for human screeners, and automating title-abstract screening. We performed an experiment where humans screened titles and abstracts for 20 papers with both original and simplified abstracts from a prior SR. The experiment with human screeners was reproduced with GPT-3.5 and GPT-4 LLMs to perform the same screening tasks. We also studied if different prompting techniques (Zero-shot (ZS), One-shot (OS), Few-shot (FS), and Few-shot with Chain-of-Thought (FS-CoT)) improve the screening performance of LLMs. Lastly, we studied if redesigning the prompt used in the LLM reproduction of screening leads to improved performance. Text simplification did not increase the screeners' screening performance, but reduced the time used in screening. Screeners' scientific literacy skills and researcher status predict screening performance. Some LLM and prompt combinations perform as well as human screeners in the screening tasks. Our results indicate that the GPT-4 LLM is better than its predecessor, GPT-3.5. Additionally, Few-shot and One-shot prompting outperform Zero-shot prompting. Using LLMs for text simplification in the screening process does not significantly improve human performance. Using LLMs to automate title-abstract screening seems promising, but current LLMs are not significantly more accurate than human screeners. To recommend the use of LLMs in the screening process of SRs, more research is needed. We recommend future SR studies publish replication packages with screening data to enable more conclusive experimenting with LLM screening.

Submitted 8 May, 2024; v1 submitted 24 April, 2024; originally announced April 2024.

Comments: Accepted to the International Conference on Evaluation and Assessment in Software Engineering (EASE), 2024 edition

arXiv:2310.14366 [pdf, other]

Bi-Encoders based Species Normalization -- Pairwise Sentence Learning to Rank

Authors: Zainab Awan, Tim Kahlke, Peter Ralph, Paul Kennedy

Abstract: Motivation: Biomedical named-entity normalization involves connecting biomedical entities with distinct database identifiers in order to facilitate data integration across various fields of biology. Existing systems for biomedical named entity normalization heavily rely on dictionaries, manually created rules, and high-quality representative features such as lexical or morphological characteristics. However, recent research has investigated the use of neural network-based models to reduce dependence on dictionaries, manually crafted rules, and features. Despite these advancements, the performance of these models is still limited due to the lack of sufficiently large training datasets. These models have a tendency to overfit small training corpora and exhibit poor generalization when faced with previously unseen entities, necessitating the redesign of rules and features. Contribution: We present a novel deep learning approach for named entity normalization, treating it as a pair-wise learning to rank problem. Our method utilizes the widely-used information retrieval algorithm Best Matching 25 to generate candidate concepts, followed by the application of Bidirectional Encoder Representations from Transformers (BERT) to re-rank the candidate list. Notably, our approach eliminates the need for feature-engineering or rule creation. We conduct experiments on species entity types and evaluate our method against state-of-the-art techniques using LINNAEUS and S800 biomedical corpora. Our proposed approach surpasses existing methods in linking entities to the NCBI taxonomy. To the best of our knowledge, there is no existing neural network-based approach for species normalization in the literature.

Submitted 22 October, 2023; originally announced October 2023.
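
The candidate-generation step mentioned in the abstract above (Best Matching 25, usually abbreviated BM25) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the whitespace tokenizer, the parameter defaults `k1=1.5` and `b=0.75`, and the four candidate names are all assumptions chosen for illustration, and the BERT re-ranking stage is not shown.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def bm25_rank(query, documents, k1=1.5, b=0.75):
    """Return (index, score) pairs sorted by descending BM25 score."""
    docs = [tokenize(d) for d in documents]
    if not docs:
        return []
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for i, d in enumerate(docs):
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Smoothed inverse document frequency (always positive).
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append((i, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Illustrative candidate concept names (assumed, not taken from the paper).
candidates = [
    "Escherichia coli",
    "Homo sapiens",
    "Mus musculus",
    "Escherichia albertii",
]
ranked = bm25_rank("escherichia coli", candidates)
```

In a pipeline of the kind the abstract describes, the top-scoring entries of `ranked` would form the candidate list handed to a re-ranking model.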

arXiv:2305.14488 [pdf, other]

Looking forwards and backwards: dynamics and genealogies of locally regulated populations

Authors: Alison M. Etheridge, Thomas G. Kurtz, Ian Letter, Peter L. Ralph, Terence Tsui Ho Lung

Abstract: We introduce a broad class of spatial models to describe how spatially heterogeneous populations live, die, and reproduce. Individuals are represented by points of a point measure, whose birth and death rates can depend both on spatial position and local population density, defined via the convolution of the point measure with a nonnegative kernel. We pass to three different scaling limits: an interacting superprocess, a nonlocal partial differential equation (PDE), and a classical PDE. The classical PDE is obtained both by first scaling time and population size to pass to the nonlocal PDE, and then scaling the kernel that determines local population density; and also (when the limit is a reaction-diffusion equation) by simultaneously scaling the kernel width, timescale and population size in our individual based model. A novelty of our model is that we explicitly model a juvenile phase: offspring are thrown off in a Gaussian distribution around the location of the parent, and reach (instant) maturity with a probability that can depend on the population density at the location at which they land. Although we only record mature individuals, a trace of this two-step description remains in our population models, resulting in novel limits governed by a nonlinear diffusion. Using a lookdown representation, we retain information about genealogies and, in the case of deterministic limiting models, use this to deduce the backwards in time motion of the ancestral lineage of a sampled individual. We observe that knowing the history of the population density is not enough to determine the motion of ancestral lineages in our model. We also investigate the behaviour of lineages for three different deterministic models of a population expanding its range as a travelling wave: the Fisher-KPP equation, the Allen-Cahn equation, and a porous medium equation with logistic growth.

Submitted 30 December, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

MSC Class: 60J25; 92D10; 92D15; 92D25; 92D40 (Primary) 60F05; 60G09; 60G55; 60G57; 60H15; 60J68 (Secondary)

arXiv:2303.06215 [pdf, ps, other]

Post-pandemic Resilience of Hybrid Software Teams

Authors: Ronnie de Souza Santos, Gianisa Adisaputri, Paul Ralph

Abstract: Background. The COVID-19 pandemic triggered a widespread transition to hybrid work models (combinations of co-located and remote work) as software professionals demanded more flexibility and improved work-life balance. However, hybrid work models reduce the spontaneous, informal face-to-face interactions that promote group maturation, cohesion, and resilience. Little is known about how software companies can successfully transition to a hybrid workforce or the factors that influence the resilience of hybrid software development teams. Goal. The purpose of this study is to explore the relationship between hybrid work and team resilience in the context of software development. Method. Constructivist Grounded Theory was used, based on interviews of 26 software professionals. This sample included professionals of different genders, ethnicities, sexual orientations, and levels of experience. Interviewees came from eight different companies, 22 different projects, and four different countries. Consistent with grounded theory methodology, data collection and analysis were conducted iteratively, in waves, using theoretical sampling, constant comparison, and initial, focused, and theoretical coding. Results. Software Team Resilience is the ability of a group of software professionals to continue working together effectively under adverse conditions. Resilience depends on the group's maturity. The configuration of a hybrid team (who works where and when) can promote or hinder group maturity depending on the level of intra-group interaction it supports. Conclusion. This paper presents the first study on the resilience of hybrid software teams. Software teams need resilience to maintain their performance in the face of disruptions and crises. Software professionals strongly value hybrid work; therefore, team resilience is a key factor to be considered in the software industry.

Submitted 10 March, 2023; originally announced March 2023.

arXiv:2301.11129 [pdf, other]

doi 10.1109/ICSE48619.2023.00169

Sustainability is Stratified: Toward a Better Theory of Sustainable Software Engineering

Authors: Sean McGuire, Erin Shultz, Bimpe Ayoola, Paul Ralph

Abstract: Background: Sustainable software engineering (SSE) means creating software in a way that meets present needs without undermining our collective capacity to meet our future needs. It is typically conceptualized as several intersecting dimensions or "pillars": environmental, social, economic, technical and individual. However, these pillars are theoretically underdeveloped and require refinement. Objectives: The objective of this paper is to generate a better theory of SSE. Method: First, a scoping review was conducted to understand the state of research on SSE and identify existing models thereof. Next, a meta-synthesis of qualitative research on SSE was conducted to critique and improve the existing models identified. Results: 961 potentially relevant articles were extracted from five article databases. These articles were de-duplicated and then screened independently by two screeners, leaving 243 articles to examine. Of these, 109 were non-empirical, the most common empirical method was systematic review, and no randomized controlled experiments were found. Most papers focus on ecological sustainability (158) and the sustainability of software products (148) rather than processes. A meta-synthesis of 36 qualitative studies produced several key propositions, most notably, that sustainability is stratified (has different meanings at different levels of abstraction) and multisystemic (emerges from interactions among multiple social, technical, and sociotechnical systems). Conclusion: The academic literature on SSE is surprisingly non-empirical. More empirical evaluations of specific sustainability interventions are needed. The sustainability of software development products and processes should be conceptualized as multisystemic and stratified, and assessed accordingly.

Submitted 26 January, 2023; originally announced January 2023.

Comments: 13 pages, 7 figures, 3 tables; accepted for presentation at ICSE 2023

Journal ref: 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), Melbourne, Australia, 2023, pp. 1996-2008

arXiv:2301.05379 [pdf, other]

Benefits and Limitations of Remote Work to LGBTQIA+ Software Professionals

Authors: Ronnie de Souza Santos, Cleyton Magalhaes, Paul Ralph

Abstract: Background. The mass transition to remote work amid the COVID-19 pandemic profoundly affected software professionals, who abruptly shifted into ostensibly temporary home offices. The effects of this transition on these professionals are complex, depending on the particularities of the context and individuals. Recent studies advocate for remote structures to create opportunities for many equity-deserving groups; however, remote work can also be challenging for some individuals, such as women and individuals with disabilities. Objective. This study aims to investigate the effects of remote work on LGBTQIA+ software professionals. Method. Grounded theory methodology was applied based on information collected from two main sources: a survey questionnaire with a sample of 57 LGBTQIA+ software professionals and nine follow-up interviews with individuals from this sample. This sample included professionals of different genders, ethnicities, sexual orientations, and levels of experience. Findings. Our findings demonstrate that (1) remote work benefits LGBTQIA+ people by increasing security and visibility; (2) remote work harms LGBTQIA+ software professionals through isolation and invisibility; (3) the benefits outweigh the drawbacks; (4) the drawbacks can be mitigated by supportive measures developed by software companies. Conclusion. This paper investigated how remote work can affect LGBTQIA+ software professionals and presented a set of recommendations on how software companies can address the benefits and limitations associated with this work model. In summary, we concluded that remote work is crucial in increasing diversity and inclusion in the software industry.

Submitted 4 June, 2023; v1 submitted 12 January, 2023; originally announced January 2023.

Comments: 10 pages

arXiv:2205.01163 [pdf, ps, other]

doi 10.1145/3540250.3560877

Paving the Way for Mature Secondary Research: The Seven Types of Literature Review

Authors: Paul Ralph, Sebastian Baltes

Abstract: Confusion over different kinds of secondary research, and their divergent purposes, is undermining the effectiveness and usefulness of secondary studies in software engineering. This short paper therefore explains the differences between ad hoc review, case survey, critical review, meta-analysis (aka systematic literature review), meta-synthesis (aka thematic analysis), rapid review and scoping review (aka systematic mapping study). These definitions and associated guidelines help researchers better select and describe their literature reviews, while helping reviewers select more appropriate evaluation criteria.

Submitted 5 September, 2022; v1 submitted 2 May, 2022; originally announced May 2022.

Comments: 5 pages, 1 table, 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2022)

arXiv:2204.11826 [pdf, other]

Personality Traits in Game Development

Authors: Miriam Sturdee, Matthew Ivory, David Ellis, Patrick Stacey, Paul Ralph

Abstract: Existing work on personality traits in software development excludes game developers as a discrete group. Whilst games are software, game development has unique considerations, so game developers may exhibit different personality traits from other software professionals. We assessed responses from 123 game developers on an International Personality Item Pool Five Factor Model scale and demographic questionnaire using factor analysis. Programmers reported lower Extraversion than designers, artists and production team members; lower Openness than designers and production; and higher Neuroticism than production, potentially linked to burnout and crunch time. Compared to published norms of software developers, game developers reported lower Openness, Conscientiousness, Extraversion and Agreeableness, but higher Neuroticism. These personality differences have many practical implications: differences in Extraversion among roles may precipitate communication breakdowns; differences in Openness may induce conflict between programmers and designers. Understanding the relationship between personality traits and roles can help recruiters steer new employees into appropriate roles, and help managers apply appropriate stress management techniques. To realise these benefits, individuals must be distinguished from roles: just because an individual occupies a role does not mean they possess personality traits associated with that role.

Submitted 25 April, 2022; originally announced April 2022.

Comments: 10 pages, 2 figures, 4 tables

Journal ref: In proceedings of the International Conference on Evaluation and Assessment in Software Engineering (EASE 2022), June 13-15, Gothenburg, Sweden

arXiv:2203.15950 [pdf, other]

doi 10.1145/3524842.3528032

Empirical Standards for Repository Mining

Authors: Preetha Chatterjee, Tushar Sharma, Paul Ralph

Abstract: The purpose of scholarly peer review is to evaluate the quality of scientific manuscripts. However, study after study demonstrates that peer review neither effectively nor reliably assesses research quality. Empirical standards attempt to address this problem by modelling a scientific community's expectations for each kind of empirical study conducted in that community. This should enhance not only the quality of research but also the reliability and predictability of peer review, as scientists adopt the standards in both their researcher and reviewer roles. However, these improvements depend on the quality and adoption of the standards. This tutorial will therefore present the empirical standard for mining software repositories, both to communicate its contents and to get feedback from the attendees. The tutorial will be organized into three parts: (1) brief overview of the empirical standards project; (2) detailed presentation of the repository mining standard; (3) discussion and suggestions for improvement.

Submitted 29 March, 2022; originally announced March 2022.

arXiv:2203.09626 [pdf]

Practices to Improve Teamwork in Software Development During the COVID-19 Pandemic: An Ethnographic Study

Authors: Ronnie E. de Souza Santos, Paul Ralph

Abstract: Context. Due to the COVID-19 pandemic, software professionals had to abruptly shift to ostensibly temporary home offices, which affected teamwork in several ways. Goal. This study aims to explore how these professionals coped with remote work during the pandemic and to identify practices that supported the team activities. Method. Ethnographic methods, including participant observation and qualitative data analysis, were used. Results. Three practices were created by the observed team to improve their engagement: costume meeting, second-language day, and project happy hour. These practices appear to increase individual involvement, improve team cohesion, reduce monotony, and create opportunities for knowledge acquisition. Conclusions. The three observed practices may help remote software teams cope with adversity. More research is needed to determine if these practices work in other settings, including remote-first and hybrid remote / on-site teams, post-pandemic.

Submitted 17 March, 2022; originally announced March 2022.

arXiv:2202.10445 [pdf, ps, other]

A Grounded Theory of Coordination in Remote-First and Hybrid Software Teams

Authors: Ronnie E. de Souza Santos, Paul Ralph

Abstract: While the long-term effects of the COVID-19 pandemic on software professionals and organizations are difficult to predict, it seems likely that working from home, remote-first teams, distributed teams, and hybrid (part-remote/part-office) teams will be more common. It is therefore important to investigate the challenges that software teams and organizations face with new remote and hybrid work. Consequently, this paper reports a year-long, participant-observation, constructivist grounded theory study investigating the impact of working from home on software development. This study resulted in a theory of software team coordination. Briefly, shifting from in-office to at-home work fundamentally altered coordination within software teams. While group cohesion and more effective communication appear protective, coordination is undermined by distrust, parenting and communication bricolage. Poor coordination leads to numerous problems including misunderstandings, help requests, lower job satisfaction among team members, and more ill-defined tasks. These problems, in turn, reduce overall project success and prompt professionals to alter their software development processes (in this case, from Scrum to Kanban). Our findings suggest that software organizations with many remote employees can improve performance by encouraging greater engagement within teams and supporting employees with family and childcare responsibilities.

Submitted 23 February, 2022; v1 submitted 21 February, 2022; originally announced February 2022.

arXiv:2202.07519 [pdf]

Social Science Theories in Software Engineering Research

Authors: Tobias Lorey, Paul Ralph, Michael Felderer

Abstract: As software engineering research becomes more concerned with the psychological, sociological and managerial aspects of software development, relevant theories from reference disciplines are increasingly important for understanding the field's core phenomena of interest. However, the degree to which software engineering research draws on relevant social sciences remains unclear. This study therefore investigates the use of social science theories in five influential software engineering journals over 13 years. It analyzes not only the extent of theory use but also what, how and where these theories are used. While 87 different theories are used, less than two percent of papers use a social science theory, most theories are used in only one paper, most social sciences are ignored, and the theories are rarely tested for applicability to software engineering contexts. Ignoring relevant social science theories may (1) undermine the community's ability to generate, elaborate and maintain a cumulative body of knowledge; and (2) lead to oversimplified models of software engineering phenomena. More attention to theory is needed for software engineering to mature as a scientific discipline.

Submitted 15 February, 2022; originally announced February 2022.

arXiv:2201.08058 [pdf, other]

doi 10.1145/3510003.3510100

What Makes Effective Leadership in Agile Software Development Teams?

Authors: Lucas Gren, Paul Ralph

Abstract: Effective leadership is one of the key drivers of business and project success, and one of the most active areas of management research. But how does leadership work in agile software development, which emphasizes self-management and self-organization and marginalizes traditional leadership roles? To find out, this study examines agile leadership from the perspective of thirteen professionals who identify as agile leaders, in different roles, at ten different software development companies of varying sizes. Data from semi-structured interviews reveals that leadership: (1) is dynamically shared among team members; (2) engenders a sense of belonging to the team; and (3) involves balancing competing organizational cultures (e.g. balancing the new agile culture with the old milestone-driven culture). In other words, agile leadership is a property of a team, not a role, and effectiveness depends on agile team members' identifying with the team, accepting responsibility, and being sensitive to cultural conflict.

Submitted 20 January, 2022; originally announced January 2022.

Journal ref: 44th International Conference on Software Engineering (Technical Track ICSE2022), May 21-29, 2022 Pittsburgh, PA, USA

arXiv:2107.10966 [pdf]

doi 10.1016/j.jmarsys.2019.103252

Dynamic variability of the phytoplankton electron requirement for carbon fixation in eastern Australian waters

Authors: David J. Hughes, Joseph R Crosswell, Martina A. Doblin, Kevin Oxborough, Peter J. Ralph, Deepa Varkey, David J. Suggett

Abstract: Fast Repetition Rate fluorometry (FRRf) generates high-resolution measures of phytoplankton primary productivity as electron transport rates (ETRs). How ETRs scale to corresponding inorganic carbon (C) uptake rates (the so-called electron requirement for carbon fixation, e,C), inherently describes the extent and effectiveness with which absorbed light energy drives C-fixation. However, it remains unclear whether and how e,C follows predictable patterns for oceanographic datasets spanning physically dynamic, and complex, environmental gradients. We utilise a unique high-throughput approach, coupling ETRs and 14C-incubations to produce a semi-continuous dataset of e,C (n = 80), predominantly from surface waters, along the Australian coast (Brisbane to the Tasman Sea), including the East Australian Current (EAC). Environmental conditions along this transect could be generally grouped into cooler, more nutrient-rich waters dominated by larger size-fractionated Chl-a (>10 um) versus warmer nutrient-poorer waters dominated by smaller size-fractionated Chl-a (< 2 um). Whilst e,C was higher for warmer water samples, environmental conditions alone explained less than 20% variance of e,C, and changes in predominant size-fraction(s) distributions of Chl-a (biomass) failed to explain variance of e,C. Instead, NPQNSV was a better predictor of e,C, explaining 55% of observed variability. NPQNSV is a physiological descriptor that accounts for changes in both long-term driven acclimation in non-radiative decay, and quasi-instantaneous PSII downregulation, and thus may prove a useful predictor of e,C across physically-dynamic regimes, provided the slope describing their relationship is predictable.

Submitted 22 July, 2021; originally announced July 2021.

Comments: 57 pages, 14 figures, accepted version

Journal ref: J. Mar. Syst. 2020:103252

arXiv:2010.03525 [pdf]

Empirical Standards for Software Engineering Research

Authors: Paul Ralph, Nauman bin Ali, Sebastian Baltes, Domenico Bianculli, Jessica Diaz, Yvonne Dittrich, Neil Ernst, Michael Felderer, Robert Feldt, Antonio Filieri, Breno Bernard Nicolau de França, Carlo Alberto Furia, Greg Gay, Nicolas Gold, Daniel Graziotin, Pinjia He, Rashina Hoda, Natalia Juristo, Barbara Kitchenham, Valentina Lenarduzzi, Jorge Martínez, Jorge Melegati, Daniel Mendez, Tim Menzies, Jefferson Molleri , et al. (18 additional authors not shown)

Abstract: Empirical Standards are natural-language models of a scientific community's expectations for a specific kind of study (e.g. a questionnaire survey). The ACM SIGSOFT Paper and Peer Review Quality Initiative generated empirical standards for research methods commonly used in software engineering. These living documents, which should be continuously revised to reflect evolving consensus around research best practices, will improve research quality and make peer review more effective, reliable, transparent and fair.

Submitted 4 March, 2021; v1 submitted 7 October, 2020; originally announced October 2020.

Comments: For the complete standards, supplements and other resources, see https://github.com/acmsigsoft/EmpiricalStandards

arXiv:2005.01127 [pdf, other]

doi 10.1007/s10664-020-09875-y

Pandemic Programming: How COVID-19 affects software developers and how their organizations can help

Authors: Paul Ralph, Sebastian Baltes, Gianisa Adisaputri, Richard Torkar, Vladimir Kovalenko, Marcos Kalinowski, Nicole Novielli, Shin Yoo, Xavier Devroey, Xin Tan, Minghui Zhou, Burak Turhan, Rashina Hoda, Hideaki Hata, Gregorio Robles, Amin Milani Fard, Rana Alkadhi

Abstract: Context. As a novel coronavirus swept the world in early 2020, thousands of software developers began working from home. Many did so on short notice, under difficult and stressful conditions. Objective. This study investigates the effects of the pandemic on developers' wellbeing and productivity. Method. A questionnaire survey was created mainly from existing, validated scales and translated into 12 languages. The data was analyzed using non-parametric inferential statistics and structural equation modeling. Results. The questionnaire received 2225 usable responses from 53 countries. Factor analysis supported the validity of the scales and the structural model achieved a good fit (CFI = 0.961, RMSEA = 0.051, SRMR = 0.067). Confirmatory results include: (1) the pandemic has had a negative effect on developers' wellbeing and productivity; (2) productivity and wellbeing are closely related; (3) disaster preparedness, fear related to the pandemic and home office ergonomics all affect wellbeing or productivity. Exploratory analysis suggests that: (1) women, parents and people with disabilities may be disproportionately affected; (2) different people need different kinds of support. Conclusions. To improve employee productivity, software companies should focus on maximizing employee wellbeing and improving the ergonomics of employees' home offices. Women, parents and disabled persons may require extra support.

Submitted 20 July, 2020; v1 submitted 3 May, 2020; originally announced May 2020.

Comments: 34 pages, 7 tables, 5 figures, to appear in Empirical Software Engineering

Journal ref: Empirical Software Engineering, 2020

arXiv:2002.07764 [pdf, ps, other]

Sampling in Software Engineering Research: A Critical Review and Guidelines

Authors: Sebastian Baltes, Paul Ralph

Abstract: Representative sampling appears rare in empirical software engineering research. Not all studies need representative samples, but a general lack of representative sampling undermines a scientific field. This article therefore reports a critical review of the state of sampling in recent, high-quality software engineering research. The key findings are: (1) random sampling is rare; (2) sophisticated sampling strategies are very rare; (3) sampling, representativeness and randomness often appear misunderstood. These findings suggest that software engineering research has a generalizability crisis. To address these problems, this paper synthesizes existing knowledge of sampling into a succinct primer and proposes extensive guidelines for improving the conduct, presentation and evaluation of sampling in software engineering research. It is further recommended that while researchers should strive for more representative samples, disparaging non-probability sampling is generally capricious and particularly misguided for predominately qualitative research.

Submitted 20 October, 2021; v1 submitted 18 February, 2020; originally announced February 2020.

Comments: 38 pages, 8 tables, accepted for publication in Empirical Software Engineering

arXiv:1904.09847 [pdf, other]

doi 10.1146/annurev-ecolsys-110316-022659

Spatial Population Genetics: It's About Time

Authors: Gideon S. Bradburd, Peter L. Ralph

Abstract: Many questions that we have about the history and dynamics of organisms have a geographical component: How many are there, and where do they live? How do they move and interbreed across the landscape? How were they moving a thousand years ago, and where were the ancestors of a particular individual alive today? Answers to these questions can have profound consequences for our understanding of history, ecology, and the evolutionary process. In this review, we discuss how geographic aspects of the distribution, movement, and reproduction of organisms are reflected in their pedigree across space and time. Because the structure of the pedigree is what determines patterns of relatedness in modern genetic variation, our aim is to thus provide intuition for how these processes leave an imprint in genetic data. We also highlight some current methods and gaps in the statistical toolbox of spatial population genetics.

Submitted 10 May, 2019; v1 submitted 22 April, 2019; originally announced April 2019.

Journal ref: Annual Review of Ecology, Evolution, and Systematics Vol. 50:427-449 (2019)

arXiv:1902.11278 [pdf]

doi 10.1109/TSE.2019.2909033

Requirements Framing Affects Design Creativity

Authors: Rahul Mohanani, Burak Turhan, Paul Ralph

Abstract: Design creativity, the originality and practicality of a solution concept, is critical for the success of many software projects. However, little research has investigated the relationship between the way desiderata are presented and design creativity. This study therefore investigates the impact of presenting desiderata as ideas, requirements or prioritized requirements on design creativity. Two between-subjects randomized controlled experiments were conducted with 42 and 34 participants. Participants were asked to create design concepts from a list of desiderata. Participants who received desiderata framed as requirements or prioritized requirements created designs that are, on average, less original but more practical than the designs created by participants who received desiderata framed as ideas. This suggests that more formal, structured presentations of desiderata are less appropriate where a creative solution is desired. The results also show that design performance is highly susceptible to minor changes in the vernacular used to communicate desiderata.

Submitted 28 February, 2019; originally announced February 2019.

arXiv:1802.06321 [pdf]

The Dangerous Dogmas of Software Engineering

Authors: Paul Ralph, Briony J. Oates

Abstract: To legitimize itself as a scientific discipline, the software engineering academic community must let go of its non-empirical dogmas. A dogma is a belief held regardless of evidence. This paper analyzes the nature and detrimental effects of four software engineering dogmas - 1) the belief that software has "requirements"; 2) the division of software engineering tasks into analysis, design, coding and testing; 3) the belief that software engineering is predominantly concerned with designing "software" systems; 4) the belief that software engineering follows methods effectively. Deconstructing these dogmas reveals that they each oversimplify and over-rationalize aspects of software engineering practice, which obscures underlying phenomena and misleads researchers and practitioners. Evidence-based practice is analyzed as a means to expose and repudiate non-empirical dogmas. This analysis results in several novel recommendations for overcoming the practical challenges of evidence-based practice.

Submitted 17 February, 2018; originally announced February 2018.

Comments: 12 pages

arXiv:1802.06319 [pdf]

Consensus in Software Engineering: A Cognitive Mapping Study

Authors: Pontus Johnson, Paul Ralph, Mathias Ekstedt, Iaakov Exman, Michael Goedicke

Abstract: Background: Philosophers of science including Collins, Feyerabend, Kuhn and Latour have all emphasized the importance of consensus within scientific communities of practice. Consensus is important for maintaining legitimacy with outsiders, orchestrating future research, developing educational curricula and agreeing industry standards. Low consensus contrastingly undermines a field's reputation and hinders peer review. Aim: This paper aims to investigate the degree of consensus within the software engineering academic community concerning members' implicit theories of software engineering. Method: A convenience sample of 60 software engineering researchers produced diagrams describing their personal understanding of causal relationships between core software engineering constructs. The diagrams were then analyzed for patterns and clusters. Results: At least three schools of thought may be forming; however, their interpretation is unclear since they do not correspond to known divisions within the community (e.g. Agile vs. Plan-Driven methods). Furthermore, over one third of participants do not belong to any cluster. Conclusion: Although low consensus is common in social sciences, the rapid pace of innovation observed in software engineering suggests that high consensus is achievable given renewed commitment to empiricism and evidence-based practice.

Submitted 17 February, 2018; originally announced February 2018.

Comments: 10 pages, 6 Figures, 6 Tables

arXiv:1707.03869 [pdf]

doi 10.1109/TSE.2018.2877759

Cognitive Biases in Software Engineering: A Systematic Mapping Study

Authors: Rahul Mohanani, Iflaah Salman, Burak Turhan, Pilar Rodriguez, Paul Ralph

Abstract: One source of software project challenges and failures is the systematic errors introduced by human cognitive biases. Although extensively explored in cognitive psychology, investigations concerning cognitive biases have only recently gained popularity in software engineering (SE) research. This paper therefore systematically maps, aggregates and synthesizes the literature on cognitive biases in software engineering to generate a comprehensive body of knowledge, understand state-of-the-art research and provide guidelines for future research and practice. Focusing on bias antecedents, effects and mitigation techniques, we identified 65 articles, which investigate 37 cognitive biases, published between 1990 and 2016. Despite strong and increasing interest, the results reveal a scarcity of research on mitigation techniques and poor theoretical foundations in understanding and interpreting cognitive biases. Although bias-related research has generated many new insights in the software engineering community, specific bias mitigation techniques are still needed for software professionals to overcome the deleterious effects of cognitive biases on their work.

Submitted 23 October, 2018; v1 submitted 12 July, 2017; originally announced July 2017.

Comments: Pre-print submitted to IEEE Transactions on Software Engineering

Journal ref: IEEE Transactions on Software Engineering, 46(12), 1318-1339 (2018)

arXiv:1505.05816 [pdf, other]

An empirical approach to demographic inference with genomic data

Authors: Peter L. Ralph

Abstract: Inference with population genetic data usually treats the population pedigree as a nuisance parameter, the unobserved product of a past history of random mating. However, the history of genetic relationships in a given population is a fixed, unobserved object, and so an alternative approach is to treat this network of relationships as a complex object we wish to learn about, by observing how genomes have been noisily passed down through it. This paper explores this point of view, showing how to translate questions about population genetic data into calculations with a Poisson process of mutations on all ancestral genomes. This method is applied to give a robust interpretation to the $f_4$ statistic used to identify admixture, and to design a new statistic that measures covariances in mean times to most recent common ancestor between two pairs of sequences. The method more generally interprets population genetic statistics in terms of sums of specific functions over ancestral genomes, thereby providing concrete, broadly interpretable interpretations for these statistics. This provides a method for describing demographic history without simplified demographic models. More generally, it brings into focus the population pedigree, which is averaged over in model-based demographic inference.

Submitted 1 April, 2019; v1 submitted 21 May, 2015; originally announced May 2015.

arXiv:1410.5313 [pdf]

doi 10.1534/g3.116.027581

Conflation of short identity-by-descent segments bias their inferred length distribution

Authors: Charleston W. K. Chiang, Peter Ralph, John Novembre

Abstract: Identity-by-descent (IBD) is a fundamental concept in genetics with many applications. In a common definition, two haplotypes are said to contain an IBD segment if they share a segment that is inherited from a recent shared common ancestor without intervening recombination. Long IBD segments (> 1cM) can be efficiently detected by a number of algorithms using high-density SNP array data from a population sample. However, these approaches detect IBD based on contiguous segments of identity-by-state, and such segments may exist due to the conflation of smaller, nearby IBD segments. We quantified this effect using coalescent simulations, finding that nearly 40% of inferred segments 1-2cM long are results of conflations of two or more shorter segments, under demographic scenarios typical for modern humans. This biases the inferred IBD segment length distribution, and so can affect downstream inferences. We observed this conflation effect universally across different IBD detection programs and human demographic histories, and found inference of segments longer than 2cM to be much more reliable (less than 5% conflation rate). As an example of how this can negatively affect downstream analyses, we present and analyze a novel estimator of the de novo mutation rate using IBD segments, and demonstrate that the biased length distribution of the IBD segments due to conflation can lead to inflated estimates if the conflation is not modeled. Understanding the conflation effect in detail will make its correction in future methods more tractable.

Submitted 17 August, 2015; v1 submitted 20 October, 2014; originally announced October 2014.

Journal ref: G3 May 1, 2016 vol. 6 no. 5 1287-1296

arXiv:1308.1042 [pdf]

Beyond Gamification: Implications of Purposeful Games for the Information Systems Discipline

Authors: Kafui Monu, Paul Ralph

Abstract: Gamification is an emerging design principle for information systems where game design elements are applied to non-game contexts. IS researchers have suggested that the IS discipline must study this area but there are other applications such as serious games, and simulations that also use games in non-game contexts. Specifically, the management field has been using games and simulations for years and these applications are now being supported by information systems. We propose in this paper that we must think beyond gamification, towards other uses of games in non-gaming contexts, which we call purposeful gaming. In this paper we identify how the IS discipline can adapt to purposeful gaming. Specifically, we show how IT artifacts, IS design, and IS theories can be used in the purposeful gaming area. We also provide three conceptual dimensions of purposeful gaming that can aid IS practitioners and researchers to classify and understand purposeful games.

Submitted 2 August, 2013; originally announced August 2013.

arXiv:1307.1019 [pdf]

Software Engineering Process Theory: A Multi-Method Comparison of Sensemaking-Coevolution-Implementation Theory and Function-Behavior-Structure Theory

Authors: Paul Ralph

Abstract: Many academics have called for increasing attention to theory in software engineering. Consequently, this paper empirically evaluates two dissimilar software development process theories - one expressing a more traditional, methodical view (FBS) and one expressing an alternative, more improvisational view (SCI). A primarily quantitative survey of more than 1300 software developers is combined with four qualitative case studies to achieve a simultaneously broad and deep empirical evaluation. Case data analysis using a closed-ended, a priori coding scheme based on the two theories strongly supports SCI, as does analysis of questionnaire response distributions (p<0.001; chi-square goodness of fit test). Furthermore, case-questionnaire triangulation found no evidence that support for SCI varied by participants' gender, education, experience, nationality or the size or nature of their projects. This suggests that instead of iteration between weakly-coupled phases (analysis, design, coding, testing), it is more accurate and useful to conceptualize development as ad hoc oscillation between organizing perceptions of the project context (Sensemaking), simultaneously improving mental pictures of the context and design artifact (Coevolution) and constructing, debugging and deploying software artifacts (Implementation).

Submitted 3 July, 2013; originally announced July 2013.

Comments: 49 pages, 73 references, 12 tables, 4 figures, 4 appendices

arXiv:1304.0116 [pdf]

doi 10.1063/1.4792587

The Illusion of Requirements in Software Development

Authors: Paul Ralph

Abstract: It is widely accepted that understanding system requirements is important for software development project success. However, this paper presents two novel challenges to the requirements concept. First, where many plausible approaches to achieving a goal are evident, there may be insufficient overlap between approaches to form requirements. Second, while all plausible approaches may have sufficient overlap to state requirements, we cannot know that unless all approaches are identified and we are sure that none have been missed. This suggests that many, if not most, software projects may have too few requirements to drive the design process, and that analysts may misrepresent design decisions as requirements to compensate.

Submitted 30 March, 2013; originally announced April 2013.

Comments: 5 pages, 1 table, 1 figure; accepted for publication in Requirements Engineering: http://link.springer.com/article/10.1007%2Fs00766-012-0161-4

ACM Class: D.2.1

Journal ref: Requirements Engineering, 18(3), 293-296 (2013)

arXiv:1303.5938 [pdf]

The Two Paradigms of Software Design

Authors: Paul Ralph

Abstract: The dominant view of design in information systems and software engineering, the Rational Design Paradigm, views software development as a methodical, plan-centered, approximately rational process of optimizing a design candidate for known constraints and objectives. This paper synthesizes an Alternative Design Paradigm, which views software development as an amethodical, improvisational, emotional process of simultaneously framing the problem and building artifacts to address it. These conflicting paradigms are manifestations of a deeper philosophical conflict between rationalism and empiricism. The paper clarifies the nature, components and assumptions of each paradigm and explores the implications of the paradigmatic conflict for research, practice and education.

Submitted 24 March, 2013; originally announced March 2013.

Comments: 38 pages, 3 tables, 4 figures. A previous version of this paper was published at the 2010 Mediterranean Conference on Information Systems

arXiv:1302.4061 [pdf]

doi 10.1016/j.scico.2014.11.007

The Sensemaking-Coevolution-Implementation Theory of Software Design

Authors: Paul Ralph

Abstract: Understanding software design practice is critical to understanding modern information systems development. New developments in empirical software engineering, information systems design science and the interdisciplinary design literature combined with recent advances in process theory and testability have created a situation ripe for innovation. Consequently, this paper utilizes these breakthroughs to formulate a process theory of software design practice: Sensemaking-Coevolution-Implementation Theory explains how complex software systems are created by collocated software development teams in organizations. It posits that an independent agent (design team) creates a software system by alternating between three activities: organizing their perceptions about the context, mutually refining their understandings of the context and design space, and manifesting their understanding of the design space in a technological artifact. This theory development paper defines and illustrates Sensemaking-Coevolution-Implementation Theory, grounds its concepts and relationships in existing literature, conceptually evaluates the theory and situates it in the broader context of information systems development.

Submitted 17 February, 2013; originally announced February 2013.

Comments: 7 tables, 7 Figures, 157 references

Journal ref: Science of Computer Programming Volume 101, 1 April 2015, Pages 21-41

arXiv:1302.3274 [pdf, other]

Disentangling the effects of geographic and ecological isolation on genetic differentiation

Authors: Gideon Bradburd, Peter Ralph, Graham Coop

Abstract: Populations can be genetically isolated both by geographic distance and by differences in their ecology or environment that decrease the rate of successful migration. Empirical studies often seek to investigate the relationship between genetic differentiation and some ecological variable(s) while accounting for geographic distance, but common approaches to this problem (such as the partial Mantel test) have a number of drawbacks. In this article, we present a Bayesian method that enables users to quantify the relative contributions of geographic distance and ecological distance to genetic differentiation between sampled populations or individuals. We model the allele frequencies in a set of populations at a set of unlinked loci as spatially correlated Gaussian processes, in which the covariance structure is a decreasing function of both geographic and ecological distance. Parameters of the model are estimated using a Markov chain Monte Carlo algorithm. We call this method Bayesian Estimation of Differentiation in Alleles by Spatial Structure and Local Ecology (BEDASSLE), and have implemented it in a user-friendly format in the statistical platform R. We demonstrate its utility with a simulation study and empirical applications to human and teosinte datasets.

Submitted 11 September, 2013; v1 submitted 13 February, 2013; originally announced February 2013.

arXiv:1207.3815 [pdf, other]

doi 10.1371/journal.pbio.1001555

The geography of recent genetic ancestry across Europe

Authors: Peter Ralph, Graham Coop

Abstract: The recent genealogical history of human populations is a complex mosaic formed by individual migration, large-scale population movements, and other demographic events. Population genomics datasets can provide a window into this recent history, as rare traces of recent shared genetic ancestry are detectable due to long segments of shared genomic material. We make use of genomic data for 2,257 Europeans (the POPRES dataset) to conduct one of the first surveys of recent genealogical ancestry over the past three thousand years at a continental scale. We detected 1.9 million shared genomic segments, and used the lengths of these to infer the distribution of shared ancestors across time and geography. We find that a pair of modern Europeans living in neighboring populations share around 10-50 genetic common ancestors from the last 1500 years, and upwards of 500 genetic ancestors from the previous 1000 years. These numbers drop off exponentially with geographic distance, but since genetic ancestry is rare, individuals from opposite ends of Europe are still expected to share millions of common genealogical ancestors over the last 1000 years. There is substantial regional variation in the number of shared genetic ancestors: especially high numbers of common ancestors between many eastern populations likely date to the Slavic and/or Hunnic expansions, while much lower levels of common ancestry in the Italian and Iberian peninsulas may indicate weaker demographic effects of Germanic expansions into these areas and/or more stably structured populations.
Recent shared ancestry in modern Europeans is ubiquitous, and clearly shows the impact of both small-scale migration and large historical events. Population genomic datasets have considerable power to uncover recent demographic history, and will allow a much fuller picture of the close genealogical kinship of individuals across the world.

Submitted 8 May, 2013; v1 submitted 16 July, 2012; originally announced July 2012.

Comments: Full size figures available from http://www.eve.ucdavis.edu/~plralph/research.html; or html version at http://ralphlab.usc.edu/ibd/ibd-paper/ibd-writeup.xhtml

MSC Class: 92D25 62M99

Journal ref: PLoS Biology 11(5) 2013: e1001555

arXiv:1112.5218 [pdf, ps, other]

doi 10.1534/genetics.112.141861

Patterns of neutral diversity under general models of selective sweeps

Authors: Graham Coop, Peter Ralph

Abstract: Two major sources of stochasticity in the dynamics of neutral alleles result from resampling of finite populations (genetic drift) and the random genetic background of nearby selected alleles on which the neutral alleles are found (linked selection). There is now good evidence that linked selection plays an important role in shaping polymorphism levels in a number of species. One of the best investigated models of linked selection is the recurrent full sweep model, in which newly arisen selected alleles fix rapidly. However, the bulk of selected alleles that sweep into the population may not be destined for rapid fixation. Here we develop a general model of recurrent selective sweeps in a coalescent framework, one that generalizes the recurrent full sweep model to the case where selected alleles do not sweep to fixation. We show that in a large population, only the initial rapid increase of a selected allele affects the genealogy at partially linked sites, which under fairly general assumptions are unaffected by the subsequent fate of the selected allele. We also apply the theory to a simple model to investigate the impact of recurrent partial sweeps on levels of neutral diversity, and find that for a given reduction in diversity, the impact of recurrent partial sweeps on the frequency spectrum at neutral sites is determined primarily by the frequencies achieved by the selected alleles.
Consequently, recurrent sweeps of selected alleles to low frequencies can have a profound effect on levels of diversity but can leave the frequency spectrum relatively unperturbed. In fact, the limiting coalescent model under a high rate of sweeps to low frequency is identical to the standard neutral model. The general model of selective sweeps we describe goes some way towards providing a more flexible framework to describe genomic patterns of diversity than is currently available.

Submitted 13 January, 2013; v1 submitted 21 December, 2011; originally announced December 2011.

Comments: 44 pages. 5 figures

Journal ref: Genetics September 1, 2012 vol. 192 no. 1 205-224

arXiv:1110.4944 [pdf, other]

doi 10.1111/j.1558-5646.2012.01574.x

Is your phylogeny informative? Measuring the power of comparative methods

Authors: Carl Boettiger, Graham Coop, Peter Ralph

Abstract: Phylogenetic comparative methods may fail to produce meaningful results when either the underlying model is inappropriate or the data contain insufficient information to inform the inference. The ability to measure the statistical power of these methods has become crucial to ensure that data quantity keeps pace with growing model complexity. Through simulations, we show that commonly applied model choice methods based on information criteria can have remarkably high error rates; this can be a problem because methods to estimate the uncertainty or power are not widely known or applied. Furthermore, the power of comparative methods can depend significantly on the structure of the data. We describe a Monte Carlo based method which addresses both of these challenges, and show how this approach both quantifies and substantially reduces errors relative to information criteria. The method also produces meaningful confidence intervals for model parameters. We illustrate how the power to distinguish different models, such as varying levels of selection, varies both with number of taxa and structure of the phylogeny. We provide an open-source implementation in the pmc ("Phylogenetic Monte Carlo") package for the R programming language. We hope such power analysis becomes a routine part of model comparison in comparative methods.

Submitted 22 October, 2011; originally announced October 2011.

Comments: 19 pages, 6 figures, 2 tables

Journal ref: Evolution (2012)

arXiv:1105.2280 [pdf, other]

doi 10.1007/s00285-012-0514-0

Stochastic population growth in spatially heterogeneous environments

Authors: Steven N. Evans, Peter L. Ralph, Sebastian J. Schreiber, Arnab Sen

Abstract: Classical ecological theory predicts that environmental stochasticity increases extinction risk by reducing the average per-capita growth rate of populations. To understand the interactive effects of environmental stochasticity, spatial heterogeneity, and dispersal on population growth, we study the following model for population abundances in $n$ patches: the conditional law of $X_{t+dt}$ given $X_t=x$ is such that when $dt$ is small the conditional mean of $X_{t+dt}^i-X_t^i$ is approximately $[x^iμ_i+\sum_j(x^j D_{ji}-x^i D_{ij})]dt$, where $X_t^i$ and $μ_i$ are the abundance and per capita growth rate in the $i$-th patch respectively, and $D_{ij}$ is the dispersal rate from the $i$-th to the $j$-th patch, and the conditional covariance of $X_{t+dt}^i-X_t^i$ and $X_{t+dt}^j-X_t^j$ is approximately $x^i x^j σ_{ij}dt$. We show for such a spatially extended population that if $S_t=(X_t^1+...+X_t^n)$ is the total population abundance, then $Y_t=X_t/S_t$, the vector of patch proportions, converges in law to a random vector $Y_\infty$ as $t\to\infty$, and the stochastic growth rate $\lim_{t\to\infty}t^{-1}\log S_t$ equals the space-time average per-capita growth rate $\sum_iμ_i\E[Y_\infty^i]$ experienced by the population minus half of the space-time average temporal variation $\E[\sum_{i,j}σ_{ij}Y_\infty^i Y_\infty^j]$ experienced by the population.
We derive analytic results for the law of $Y_\infty$, find which choice of the dispersal mechanism $D$ produces an optimal stochastic growth rate for a freely dispersing population, and investigate the effect on the stochastic growth rate of constraints on dispersal rates. Our results provide fundamental insights into "ideal free" movement in the face of uncertainty, the persistence of coupled sink populations, the evolution of dispersal rates, and the single large or several small (SLOSS) debate in conservation biology.

Submitted 2 February, 2012; v1 submitted 11 May, 2011; originally announced May 2011.

Comments: 47 pages, 4 figures

MSC Class: 92D40 (Primary); 92D25; 60H10 (Secondary)

Journal ref: Journal of Mathematical Biology February 2013, Volume 66, Issue 3, pp 423-476

arXiv:1103.2397 [pdf, ps, other]

doi 10.1371/journal.pcbi.1001136

Transcriptional regulation: Effects of promoter proximal pausing on speed, synchrony and reliability

Authors: Alistair N. Boettiger, Peter L. Ralph, Steven N. Evans

Abstract: Recent whole genome polymerase binding assays have shown that a large proportion of unexpressed genes have pre-assembled RNA pol II transcription initiation complex stably bound to their promoters. Some such promoter proximally paused genes are regulated at transcription elongation rather than at initiation; it has been proposed that this difference allows these genes to both express faster and achieve more synchronous expression across populations of cells, thus overcoming molecular "noise" arising from low copy number factors. It has been established experimentally that genes which are regulated at elongation tend to express faster and more synchronously; however, it has not been shown directly whether or not it is the change in the regulated step per se that causes this increase in speed and synchrony. We investigate this question by proposing and analyzing a continuous-time Markov chain model of polymerase complex assembly regulated at one of two steps: initial polymerase association with DNA, or release from a paused, transcribing state. Our analysis demonstrates that, over a wide range of physical parameters, increased speed and synchrony are functional consequences of elongation control. Further, we make new predictions about the effect of elongation regulation on the consistent control of total transcript number between cells, and identify which elements in the transcription induction pathway are most sensitive to molecular noise and thus may be most evolutionarily constrained.
Our methods produce symbolic expressions for quantities of interest with reasonable computational effort and can be used to explore the interplay between interaction topology and molecular noise in a broader class of biochemical networks. We provide general-purpose code implementing these methods.

Submitted 11 March, 2011; originally announced March 2011.

Comments: 21 pages, 6 figures; to be published in PLoS Computational Biology

MSC Class: 60J22; 60J28; 92C42

Journal ref: PLoS Comput Biol 7(5): e1001136 (2011)

arXiv:1005.0554 [pdf, ps, other]

doi 10.1534/genetics.110.119594

Parallel adaptation: One or many waves of advance of an advantageous allele?

Authors: Peter Ralph, Graham Coop

Abstract: Our models for detecting the effect of adaptation on population genomic diversity are often predicated on a single newly arisen mutation sweeping rapidly to fixation. However, a population can also adapt to a new situation by multiple mutations of similar phenotypic effect that arise in parallel. These mutations can each quickly reach intermediate frequency, preventing any single one from rapidly sweeping to fixation globally (a "soft" sweep). Here we study models of parallel mutation in a geographically spread population adapting to a global selection pressure. The slow geographic spread of a selected allele can allow other selected alleles to arise and spread elsewhere in the species range. When these different selected alleles meet, their spread can slow dramatically, and so form a geographic patchwork which could be mistaken for a signal of local adaptation. This random spatial tessellation will dissipate over time due to mixing by migration, leaving a set of partial sweeps within the global population. We show that the spatial tessellation initially formed by mutational types is closely connected to Poisson process models of crystallization, which we extend. We find that the probability of parallel mutation and the spatial scale on which parallel mutation occurs is captured by a single characteristic length that reflects the expected distance a spreading allele travels before it encounters a different spreading allele.
This characteristic length depends on the mutation rate, the dispersal parameter, the effective local density of individuals, and to a much lesser extent the strength of selection. We argue that even in widely dispersing species, such parallel geographic sweeps may be surprisingly common. Thus, we predict, as more data becomes available, many more examples of intra-species parallel adaptation will be uncovered.

Submitted 20 July, 2010; v1 submitted 4 May, 2010; originally announced May 2010.

Comments: 52 pages, 10 figures

MSC Class: 92D15 (primary) 60G55 (secondary)

Journal ref: Genetics October 2010 vol. 186 no. 2 647-668

arXiv:0812.1302 [pdf, ps, other]

doi 10.1214/09-AAP616

Dynamics of the time to the most recent common ancestor in a large branching population

Authors: Steven N. Evans, Peter L. Ralph

Abstract: If we follow an asexually reproducing population through time, then the amount of time that has passed since the most recent common ancestor (MRCA) of all current individuals lived will change as time progresses. The resulting "MRCA age" process has been studied previously when the population has a constant large size and evolves via the diffusion limit of standard Wright--Fisher dynamics. For any population model, the sample paths of the MRCA age process are made up of periods of linear upward drift with slope +1 punctuated by downward jumps. We build other Markov processes that have such paths from Poisson point processes on $\mathbb{R}_{++}\times\mathbb{R}_{++}$ with intensity measures of the form $λ\otimesμ$ where $λ$ is Lebesgue measure, and $μ$ (the "family lifetime measure") is an arbitrary, absolutely continuous measure satisfying $μ((0,\infty))=\infty$ and $μ((x,\infty))<\infty$ for all $x>0$. Special cases of this construction describe the time evolution of the MRCA age in $(1+β)$-stable continuous state branching processes conditioned on nonextinction--a particular case of which, $β=1$, is Feller's continuous state branching process conditioned on nonextinction. As well as the continuous time process, we also consider the discrete time Markov chain that records the value of the continuous process just before and after its successive jumps. We find transition probabilities for both the continuous and discrete time processes, determine when these processes are transient and recurrent and compute stationary distributions when they exist.

Submitted 13 January, 2010; v1 submitted 6 December, 2008; originally announced December 2008.

Comments: Published in at http://dx.doi.org/10.1214/09-AAP616 the Annals of Applied Probability (http://www.imstat.org/aap/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AAP-AAP616

MSC Class: 92D10; 60J80; 60G55; 60G18 (Primary)

Journal ref: Annals of Applied Probability 2010, Vol. 20, No. 1, 1-25

arXiv:math/0509270 [pdf, ps, other]

doi 10.1214/193940307000000383

Brownian motion on disconnected sets, basic hypergeometric functions, and some continued fractions of Ramanujan

Authors: Shankar Bhamidi, Steven N. Evans, Ron Peled, Peter Ralph

Abstract: Motivated by Lévy's characterization of Brownian motion on the line, we propose an analogue of Brownian motion that has as its state space an arbitrary closed subset of the line that is unbounded above and below: such a process will be a martingale, will have the identity function as its quadratic variation process, and will be "continuous" in the sense that its sample paths don't skip over points. We show that there is a unique such process, which turns out to be automatically a reversible Feller-Dynkin Markov process. We find its generator, which is a natural generalization of the operator $f\mapsto{1/2}f''$. We then consider the special case where the state space is the self-similar set $\{\pm q^k:k\in \mathbb{Z}\}\cup\{0\}$ for some $q>1$. Using the scaling properties of the process, we represent the Laplace transforms of various hitting times as certain continued fractions that appear in Ramanujan's "lost" notebook and evaluate these continued fractions in terms of basic hypergeometric functions (that is, $q$-analogues of classical hypergeometric functions). The process has 0 as a regular instantaneous point, and hence its sample paths can be decomposed into a Poisson process of excursions from 0 using the associated continuous local time. Using the reversibility of the process with respect to the natural measure on the state space, we find the entrance laws of the corresponding Itô excursion measure and the Laplace exponent of the inverse local time -- both again in terms of basic hypergeometric functions.
By combining these ingredients, we obtain explicit formulae for the resolvent of the process. We also compute the moments of the process in closed form. Some of our results involve $q$-analogues of classical distributions such as the Poisson distribution that have appeared elsewhere in the literature.

Submitted 20 May, 2008; v1 submitted 12 September, 2005; originally announced September 2005.

Comments: Published in at http://dx.doi.org/10.1214/193940307000000383 the IMS Collections (http://www.imstat.org/publications/imscollections.htm) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-COLL2-IMSCOLL205

MSC Class: 60J65; 60J75 (Primary) 30B70; 30D15 (Secondary)

Journal ref: IMS Collections 2008, Vol. 2, 42-75

Showing 1–41 of 41 results for author: Ralph, P