Showing 1–5 of 5 results for author: Al-Kaswan, A

  1. arXiv:2312.11658  [pdf, other]

    cs.CR cs.AI cs.SE

    Traces of Memorisation in Large Language Models for Code

    Authors: Ali Al-Kaswan, Maliheh Izadi, Arie van Deursen

    Abstract: Large language models have gained significant popularity because of their ability to generate human-like text and potential applications in various fields, such as Software Engineering. Large language models for code are commonly trained on large unsanitised corpora of source code scraped from the internet. The content of these datasets is memorised and can be extracted by attackers with data extr…

    Submitted 15 January, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

    Comments: ICSE 2024 Research Track

  2. arXiv:2302.13681  [pdf, ps, other]

    cs.SE cs.AI

    The (ab)use of Open Source Code to Train Large Language Models

    Authors: Ali Al-Kaswan, Maliheh Izadi

    Abstract: In recent years, Large Language Models (LLMs) have gained significant popularity due to their ability to generate human-like text and their potential applications in various fields, such as Software Engineering. LLMs for Code are commonly trained on large unsanitized corpora of source code scraped from the Internet. The content of these datasets is memorized and emitted by the models, often in a v…

    Submitted 28 February, 2023; v1 submitted 27 February, 2023; originally announced February 2023.

  3. arXiv:2302.13149  [pdf, other]

    cs.SE cs.AI cs.CL

    STACC: Code Comment Classification using SentenceTransformers

    Authors: Ali Al-Kaswan, Maliheh Izadi, Arie van Deursen

    Abstract: Code comments are a key resource for information about software artefacts. Depending on the use case, only some types of comments are useful. Thus, automatic approaches to classify these comments have been proposed. In this work, we address this need by proposing STACC, a set of SentenceTransformers-based binary classifiers. These lightweight classifiers are trained and tested on the NLBSE Code C…

    Submitted 7 March, 2023; v1 submitted 25 February, 2023; originally announced February 2023.

  4. arXiv:2302.07735  [pdf, other]

    cs.CL cs.AI cs.CR

    Targeted Attack on GPT-Neo for the SATML Language Model Data Extraction Challenge

    Authors: Ali Al-Kaswan, Maliheh Izadi, Arie van Deursen

    Abstract: Previous work has shown that Large Language Models are susceptible to so-called data extraction attacks. This allows an attacker to extract a sample that was contained in the training data, which has massive privacy implications. The construction of data extraction attacks is challenging, current attacks are quite inefficient, and there exists a significant gap in the extraction capabilities of un…

    Submitted 13 February, 2023; originally announced February 2023.

  5. arXiv:2301.01701  [pdf, other]

    cs.CR cs.AI cs.LG cs.SE

    Extending Source Code Pre-Trained Language Models to Summarise Decompiled Binaries

    Authors: Ali Al-Kaswan, Toufique Ahmed, Maliheh Izadi, Anand Ashok Sawant, Premkumar Devanbu, Arie van Deursen

    Abstract: Reverse engineering binaries is required to understand and analyse programs for which the source code is unavailable. Decompilers can transform the largely unreadable binaries into a more readable source code-like representation. However, reverse engineering is time-consuming, much of which is taken up by labelling the functions with semantic information. While the automated summarisation of dec…

    Submitted 13 January, 2023; v1 submitted 4 January, 2023; originally announced January 2023.

    Comments: SANER 2023 Technical Track Camera Ready