-
DocGen: Generating Detailed Parameter Docstrings in Python
Authors:
Vatsal Venkatkrishna,
Durga Shree Nagabushanam,
Emmanuel Iko-Ojo Simon,
Melina Vidoni
Abstract:
Documentation debt hinders the effective utilization of open-source software. Although code summarization tools have been helpful for developers, most would prefer a detailed account of each parameter in a function rather than a high-level summary. However, generating such a summary is too intricate for a single generative model to produce reliably due to the lack of high-quality training data. Thus, we propose a multi-step approach that combines multiple task-specific models, each adept at producing a specific section of a docstring. The combination of these models ensures the inclusion of each section in the final docstring. We compared the results from our approach with existing generative models using both automatic metrics and a human-centred evaluation with 17 participating developers, which proves the superiority of our approach over existing methods.
Submitted 17 November, 2023; v1 submitted 10 November, 2023;
originally announced November 2023.
-
Developers Need Protection, Too: Perspectives and Research Challenges for Privacy in Social Coding Platforms
Authors:
Nicolás E. Díaz Ferreyra,
Abdessamad Imine,
Melina Vidoni,
Riccardo Scandariato
Abstract:
Social Coding Platforms (SCPs) like GitHub have become central to modern software engineering thanks to their collaborative and version-control features. Like in mainstream Online Social Networks (OSNs) such as Facebook, users of SCPs are subjected to privacy attacks and threats given the high amounts of personal and project-related data available in their profiles and software repositories. However, unlike in OSNs, the privacy concerns and practices of SCP users have not been extensively explored or documented in the current literature. In this work, we present the preliminary results of an online survey (N=105) addressing developers' concerns and perceptions about privacy threats stemming from SCPs. Our results suggest that, although users express concern about social and organisational privacy threats, they often feel safe sharing personal and project-related information on these platforms. Moreover, attacks targeting the inference of sensitive attributes are considered more likely than those seeking to re-identify source-code contributors. Based on these findings, we propose a set of recommendations for future investigations addressing privacy and identity management in SCPs.
Submitted 3 March, 2023;
originally announced March 2023.
-
Cybersecurity Discussions in Stack Overflow: A Developer-Centred Analysis of Engagement and Self-Disclosure Behaviour
Authors:
Nicolás E. Díaz Ferreyra,
Melina Vidoni,
Maritta Heisel,
Riccardo Scandariato
Abstract:
Stack Overflow (SO) is a popular platform among developers seeking advice on various software-related topics, including privacy and security. As with many knowledge-sharing websites, the value of SO depends largely on users' engagement, namely their willingness to answer, comment or post technical questions. Still, many of these questions (including cybersecurity-related ones) remain unanswered, putting the site's relevance and reputation into question. Hence, it is important to understand users' participation in privacy and security discussions to promote engagement and foster the exchange of such expertise. Objective: Based on prior findings on online social networks, this work elaborates on the interplay between users' engagement and their privacy practices in SO. Particularly, it analyses developers' self-disclosure behaviour regarding profile visibility and their involvement in discussions related to privacy and security. Method: We followed a mixed-methods approach by (i) analysing SO data from 1239 cybersecurity-tagged questions along with 7048 user profiles, and (ii) conducting an anonymous online survey (N=64). Results: About 33% of the questions we retrieved had no answer, whereas more than 50% had no accepted answer. We observed that "proactive" users tend to disclose significantly less information in their profiles than "reactive" and "unengaged" ones. However, no correlations were found between these engagement categories and privacy-related constructs such as Perceived Control or General Privacy Concerns. Implications: These findings contribute to (i) a better understanding of developers' engagement with privacy and security topics, and (ii) shaping strategies that promote the exchange of cybersecurity expertise in SO.
Submitted 4 July, 2022;
originally announced July 2022.
-
Should I Get Involved? On the Privacy Perils of Mining Software Repositories for Research Participants
Authors:
Melina Vidoni,
Nicolás E. Díaz Ferreyra
Abstract:
Mining Software Repositories (MSRs) is an evidence-based methodology that cross-links data to uncover actionable information about software systems. Empirical studies in software engineering often leverage MSR techniques as they allow researchers to unveil issues and flaws in software development so as to analyse the different factors contributing to them. Hence, counting on fine-grained information about the repositories and sources being mined (e.g., server names and contributors' identities) is essential for the reproducibility and transparency of MSR studies. However, this can also introduce threats to participants' privacy as their identities may be linked to flawed/sub-optimal programming practices (e.g., code smells, improper documentation), or vice-versa. Moreover, this can extend to close collaborators and community members, resulting in "guilt by association". This position paper aims to start a discussion about indirect participation in MSR investigations, the dichotomy of 'privacy vs. utility' regarding sharing non-aggregated data, and its effects on privacy restrictions and ethical considerations for participant involvement.
Submitted 24 February, 2022;
originally announced February 2022.
-
Technical Debt in the Peer-Review Documentation of R Packages: a rOpenSci Case Study
Authors:
Zadia Codabux,
Melina Vidoni,
Fatemeh H. Fard
Abstract:
Context: Technical Debt (TD) is a metaphor used to describe code that is "not quite right." Although TD studies have gained momentum, TD has yet to be studied as thoroughly in non-Object-Oriented (OO) or scientific software such as R. R is a multi-paradigm programming language whose popularity in data science and statistical applications has grown in recent years. Due to R's inherent ability to expand through user-contributed packages, several community-led organizations were created to organize and peer-review packages in a concerted effort to increase their quality. Nonetheless, it is well-known that most R users do not have a technical programming background, being from multiple disciplines. Objective: The goal of this study is to investigate TD in the peer-review documentation of R packages led by rOpenSci. Method: We collected over 5000 comments from 157 packages that had been reviewed and approved to be published at rOpenSci. We manually analyzed a sample dataset of these comments posted by package authors, editors of rOpenSci, and reviewers during the review process to investigate the TD types present in these reviews. Results: The findings of our study include (i) a taxonomy of TD derived from our analysis of the peer-reviews, (ii) documentation debt as the most prevalent type of debt, and (iii) different user roles being concerned with different types of TD. For instance, reviewers tend to report some TD types more than other roles, and the TD types they report are different from those reported by the authors of a package. Conclusion: TD analysis in scientific software or peer-review is almost non-existent. Our study is pioneering, but limited to the context of R packages. However, our findings can serve as a starting point for replication studies, given our public datasets, to perform similar analyses in other scientific software or to investigate the rationale behind our findings.
Submitted 16 March, 2021;
originally announced March 2021.
-
Orthogonal Language and Task Adapters in Zero-Shot Cross-Lingual Transfer
Authors:
Marko Vidoni,
Ivan Vulić,
Goran Glavaš
Abstract:
Adapter modules, additional trainable parameters that enable efficient fine-tuning of pretrained transformers, have recently been used for language specialization of multilingual transformers, improving downstream zero-shot cross-lingual transfer. In this work, we propose orthogonal language and task adapters (dubbed orthoadapters) for cross-lingual transfer. They are trained to encode language- and task-specific information that is complementary (i.e., orthogonal) to the knowledge already stored in the pretrained transformer's parameters. Our zero-shot cross-lingual transfer experiments, involving three tasks (POS-tagging, NER, NLI) and a set of 10 diverse languages, 1) point to the usefulness of orthoadapters in cross-lingual transfer, especially for the most complex NLI task, but also 2) indicate that the optimal adapter configuration highly depends on the task and the target language. We hope that our work will motivate a wider investigation of the usefulness of orthogonality constraints in language- and task-specific fine-tuning of pretrained transformers.
Submitted 11 December, 2020;
originally announced December 2020.