Showing 1–2 of 2 results for author: Simmons, A J

Search v0.5.6 released 2020-02-24

arXiv:2007.08978 [pdf, other]

cs.SE

doi 10.1145/3382494.3410680

A large-scale comparative analysis of Coding Standard conformance in Open-Source Data Science projects

Authors: Andrew J. Simmons, Scott Barnett, Jessica Rivera-Villicana, Akshat Bajaj, Rajesh Vasa

Abstract: Background: Meeting the growing industry demand for Data Science requires cross-disciplinary teams that can translate machine learning research into production-ready code. Software engineering teams value adherence to coding standards as an indication of code readability, maintainability, and developer expertise. However, there are no large-scale empirical studies of coding standards focused specifically on Data Science projects. Aims: This study investigates the extent to which Data Science projects follow code standards. In particular, which standards are followed, which are ignored, and how does this differ to traditional software projects? Method: We compare a corpus of 1048 Open-Source Data Science projects to a reference group of 1099 non-Data Science projects with a similar level of quality and maturity. Results: Data Science projects suffer from a significantly higher rate of functions that use an excessive number of parameters and local variables. Data Science projects also follow different variable naming conventions to non-Data Science projects. Conclusions: The differences indicate that Data Science codebases are distinct from traditional software codebases and do not follow traditional software engineering conventions. Our conjecture is that this may be because traditional software engineering conventions are inappropriate in the context of Data Science projects.

Submitted 28 July, 2020; v1 submitted 17 July, 2020; originally announced July 2020.

Comments: 11 pages, 7 figures. To appear in ESEM 2020. Updated based on peer review

arXiv:1812.05804 [pdf, other]

cs.DB cs.HC

Data Provenance for Sport

Authors: Andrew J. Simmons, Scott Barnett, Simon Vajda, Rajesh Vasa

Abstract: Data analysts often discover irregularities in their underlying dataset, which need to be traced back to the original source and corrected. Standards for representing data provenance (i.e. the origins of the data), such as the W3C PROV standard, can assist with this process, however require a mapping between abstract provenance concepts and the domain of use in order to apply them effectively. We propose a custom notation for expressing provenance of information in the sport performance analysis domain, and map our notation to concepts in the W3C PROV standard where possible. We evaluate the functionality of W3C PROV (without specialisations) and the VisTrails workflow manager (without extensions), and find that as is, neither are able to fully capture sport performance analysis workflows, notably due to limitations surrounding capture of automated and manual activities respectively. Furthermore, their notations suffer from ineffective use of visual design space, and present potential usability issues as their terminology is unlikely to match that of sport practitioners. Our findings suggest that one-size-fits-all provenance and workflow systems are a poor fit in practice, and that their notation and functionality need to be optimised for the domain of use.

Submitted 14 December, 2018; originally announced December 2018.

Comments: 12 pages, 6 figures
