AnnoCTR: A Dataset for Detecting and Linking Entities, Tactics, and Techniques in Cyber Threat Reports

Authors: Lukas Lange, Marc Müller, Ghazaleh Haratinezhad Torbati, Dragan Milchevski, Patrick Grau, Subhash Pujari, Annemarie Friedrich

Abstract: Monitoring the threat landscape to be aware of actual or potential attacks is of utmost importance to cybersecurity professionals. Information about cyber threats is typically distributed using natural language reports. Natural language processing can help with managing this large amount of unstructured information, yet to date, the topic has received little attention. With this paper, we present AnnoCTR, a new CC-BY-SA-licensed dataset of cyber threat reports. The reports have been annotated by a domain expert with named entities, temporal expressions, and cybersecurity-specific concepts including implicitly mentioned techniques and tactics. Entities and concepts are linked to Wikipedia and the MITRE ATT&CK knowledge base, the most widely-used taxonomy for classifying types of attacks. Prior datasets linking to MITRE ATT&CK either provide a single label per document or annotate sentences out-of-context; our dataset annotates entire documents in a much finer-grained way. In an experimental study, we model the annotations of our dataset using state-of-the-art neural models. In our few-shot scenario, we find that for identifying the MITRE ATT&CK concepts that are mentioned explicitly or implicitly in a text, concept descriptions from MITRE ATT&CK are an effective source for training data augmentation.

Submitted 11 April, 2024; originally announced April 2024.

Comments: Accepted at LREC-COLING 2024. Corpus available at https://github.com/boschresearch/anno-ctr-lrec-coling-2024

arXiv:2312.10349 [pdf, other]

A Comparative Analysis of Large Language Models for Code Documentation Generation

Authors: Shubhang Shekhar Dvivedi, Vyshnav Vijay, Sai Leela Rahul Pujari, Shoumik Lodh, Dhruv Kumar

Abstract: This paper presents a comprehensive comparative analysis of Large Language Models (LLMs) for generation of code documentation. Code documentation is an essential part of the software writing process. The paper evaluates models such as GPT-3.5, GPT-4, Bard, Llama2, and Starchat on various parameters like Accuracy, Completeness, Relevance, Understandability, Readability and Time Taken for different levels of code documentation. Our evaluation employs a checklist-based system to minimize subjectivity, providing a more objective assessment. We find that, barring Starchat, all LLMs consistently outperform the original documentation. Notably, closed-source models GPT-3.5, GPT-4, and Bard exhibit superior performance across various parameters compared to open-source/source-available LLMs, namely LLama 2 and StarChat. Considering the time taken for generation, GPT-4 demonstrated the longest duration, followed by Llama2, Bard, with ChatGPT and Starchat having comparable generation times. Additionally, file level documentation had a considerably worse performance across all parameters (except for time taken) as compared to inline and function level documentation.

Submitted 27 April, 2024; v1 submitted 16 December, 2023; originally announced December 2023.

Comments: Under review

arXiv:2204.09781 [pdf]

Multi-label classification for biomedical literature: an overview of the BioCreative VII LitCovid Track for COVID-19 literature topic annotations

Authors: Qingyu Chen, Alexis Allot, Robert Leaman, Rezarta Islamaj Doğan, Jingcheng Du, Li Fang, Kai Wang, Shuo Xu, Yuefu Zhang, Parsa Bagherzadeh, Sabine Bergler, Aakash Bhatnagar, Nidhir Bhavsar, Yung-Chun Chang, Sheng-Jie Lin, Wentai Tang, Hongtong Zhang, Ilija Tavchioski, Senja Pollak, Shubo Tian, Jinfeng Zhang, Yulia Otmakhova, Antonio Jimeno Yepes, Hang Dong, Honghan Wu, et al. (14 additional authors not shown)

Abstract: The COVID-19 pandemic has been severely impacting global society since December 2019. Massive research has been undertaken to understand the characteristics of the virus and design vaccines and drugs. The related findings have been reported in biomedical literature at a rate of about 10,000 articles on COVID-19 per month. Such rapid growth significantly challenges manual curation and interpretation. For instance, LitCovid is a literature database of COVID-19-related articles in PubMed, which has accumulated more than 200,000 articles with millions of accesses each month by users worldwide. One primary curation task is to assign up to eight topics (e.g., Diagnosis and Treatment) to the articles in LitCovid. Despite the continuing advances in biomedical text mining methods, few have been dedicated to topic annotations in COVID-19 literature. To close the gap, we organized the BioCreative LitCovid track to call for a community effort to tackle automated topic annotation for COVID-19 literature. The BioCreative LitCovid dataset, consisting of over 30,000 articles with manually reviewed topics, was created for training and testing. It is one of the largest multilabel classification datasets in biomedical scientific literature. 19 teams worldwide participated and made 80 submissions in total. Most teams used hybrid systems based on transformers. The highest performing submissions achieved 0.8875, 0.9181, and 0.9394 for macro F1-score, micro F1-score, and instance-based F1-score, respectively. The level of participation and results demonstrate a successful track and help close the gap between dataset curation and method development. The dataset is publicly available via https://ftp.ncbi.nlm.nih.gov/pub/lu/LitCovid/biocreative/ for benchmarking and further development.

Submitted 3 June, 2022; v1 submitted 20 April, 2022; originally announced April 2022.
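
Note on the metrics: the LitCovid abstract above reports macro, micro, and instance-based F1. The sketch below shows how these three scores are commonly computed for multi-label predictions. It is an illustration only: the function names and the toy topic labels are assumptions, and it is not the track's official evaluation script.

```python
from typing import Dict, List, Set


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; returns 0.0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_f1(gold: List[Set[str]], pred: List[Set[str]]) -> Dict[str, float]:
    """Compute macro, micro, and instance-based F1 for multi-label outputs.

    gold[i] and pred[i] are the sets of topics for document i.
    """
    labels = sorted(set().union(*gold, *pred))
    per_label = []
    tot_tp = tot_fp = tot_fn = 0
    for lab in labels:
        tp = sum(1 for g, p in zip(gold, pred) if lab in g and lab in p)
        fp = sum(1 for g, p in zip(gold, pred) if lab not in g and lab in p)
        fn = sum(1 for g, p in zip(gold, pred) if lab in g and lab not in p)
        per_label.append(f1(tp, fp, fn))
        tot_tp, tot_fp, tot_fn = tot_tp + tp, tot_fp + fp, tot_fn + fn

    # Macro: average the per-label F1 scores (every label counts equally).
    macro = sum(per_label) / len(per_label) if per_label else 0.0
    # Micro: pool the counts over all labels, then compute a single F1.
    micro = f1(tot_tp, tot_fp, tot_fn)
    # Instance-based: compute F1 per document, then average over documents.
    per_doc = [f1(len(g & p), len(p - g), len(g - p)) for g, p in zip(gold, pred)]
    instance = sum(per_doc) / len(per_doc) if per_doc else 0.0
    return {"macro": macro, "micro": micro, "instance": instance}


# Toy example with hypothetical topic assignments for three documents.
gold = [{"Treatment", "Diagnosis"}, {"Prevention"}, {"Treatment"}]
pred = [{"Treatment"}, {"Prevention", "Diagnosis"}, {"Treatment"}]
scores = multilabel_f1(gold, pred)
```

Macro F1 weights rare and frequent topics equally, micro F1 is dominated by frequent topics, and instance-based F1 measures how well each article's full topic set is recovered, which is why a single system can score differently on the three.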

arXiv:1701.01276 [pdf, other]

Temporal Effects on Hashtag Reuse in Twitter: A Cognitive-Inspired Hashtag Recommendation Approach

Authors: Dominik Kowald, Subhash Pujari, Elisabeth Lex

Abstract: Hashtags have become a powerful tool in social platforms such as Twitter to categorize and search for content, and to spread short messages across members of the social network. In this paper, we study temporal hashtag usage practices in Twitter with the aim of designing a cognitive-inspired hashtag recommendation algorithm we call BLLi,s. Our main idea is to incorporate the effect of time on (i) individual hashtag reuse (i.e., reusing own hashtags), and (ii) social hashtag reuse (i.e., reusing hashtags, which has been previously used by a followee) into a predictive model. For this, we turn to the Base-Level Learning (BLL) equation from the cognitive architecture ACT-R, which accounts for the time-dependent decay of item exposure in human memory. We validate BLLi,s using two crawled Twitter datasets in two evaluation scenarios: firstly, only temporal usage patterns of past hashtag assignments are utilized and secondly, these patterns are combined with a content-based analysis of the current tweet. In both scenarios, we find not only that temporal effects play an important role for both individual and social hashtag reuse but also that BLLi,s provides significantly better prediction accuracy and ranking results than current state-of-the-art hashtag recommendation methods.

Submitted 5 January, 2017; originally announced January 2017.

Comments: Accepted at WWW 2017
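
Note on the BLL equation: the hashtag-reuse abstract above builds on the Base-Level Learning equation from ACT-R, whose standard form is B_i = ln(sum_j t_j^(-d)), where t_j is the time elapsed since the j-th use of item i and d is a decay parameter. The sketch below ranks hashtags by that activation. It is a minimal illustration of the standard equation, not the paper's full algorithm: the function names, the toy timestamps, and the decay value d = 0.5 (a conventional ACT-R default) are assumptions, and how the paper sets d or combines individual and social reuse is not shown here.

```python
import math
from typing import Dict, List


def bll_activation(use_times: List[float], now: float, d: float = 0.5) -> float:
    """Base-level activation: ln of the sum of (elapsed time)^(-d) over past uses.

    Uses at or after `now` are ignored; an item with no past use gets -inf.
    """
    total = sum((now - t) ** -d for t in use_times if t < now)
    return math.log(total) if total > 0 else float("-inf")


def rank_hashtags(history: Dict[str, List[float]], now: float, d: float = 0.5) -> List[str]:
    """Order hashtags from highest to lowest activation."""
    scores = {tag: bll_activation(times, now, d) for tag, times in history.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical usage history: timestamps in hours, evaluated at hour 100.
history = {
    "#old": [0.0, 1.0],   # used twice, but long ago
    "#recent": [99.0],    # used once, one hour ago
}
ranking = rank_hashtags(history, now=100.0)
```

Because each past use contributes a power-law-decaying term, a single recent use can outweigh several old ones, which is the temporal effect the abstract describes.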

arXiv:1203.2272 [pdf]

System on Programable Chip for Performance Estimation of Loom Machine

Authors: Gurpreet Singh, Ajay Kumar Roy, Surekha K S, S. Pujari

Abstract: System on programmable chip for the performance estimation of loom machine, which calculates the efficiency and meter count for weaved cloth automatically. Also it calculates the efficiency of loom machine. Previously the same was done using manual process which was not efficient. This article is intended for loom machines which are not modern.

Submitted 10 March, 2012; originally announced March 2012.

arXiv:1002.2321 [pdf, ps, other]

Exploiting Grids for applications in Condensed Matter Physics

Authors: Bhalchandra S. Pujari

Abstract: Grids - the collection of heterogeneous computers spread across the globe - present a new paradigm for the large scale problems in variety of fields. We discuss two representative cases in the area of condensed matter physics outlining the widespread applications of the Grids. Both the problems involve calculations based on commonly used Density Functional Theory and hence can be considered to be of general interest. We demonstrate the suitability of Grids for the problems discussed and provide a general algorithm to implement and manage such large scale problems.

Submitted 11 February, 2010; originally announced February 2010.

Journal ref: ICTP Lecture Notes Series, Volume 24 (ISBN 92-95003-42-X) - November 2009

Showing 1–6 of 6 results for author: Pujari, S