Search | arXiv e-print repository

DEArt: Dataset of European Art

Authors: Artem Reshetnikov, Maria-Cristina Marinescu, Joaquim More Lopez

Abstract: Large datasets that were made publicly available to the research community over the last 20 years have been a key enabling factor for the advances in deep learning algorithms for NLP or computer vision. These datasets are generally pairs of aligned image / manually annotated metadata, where images are photographs of everyday life. Scholarly and historical content, on the other hand, treat subjects… ▽ More Large datasets that were made publicly available to the research community over the last 20 years have been a key enabling factor for the advances in deep learning algorithms for NLP or computer vision. These datasets are generally pairs of aligned image / manually annotated metadata, where images are photographs of everyday life. Scholarly and historical content, on the other hand, treat subjects that are not necessarily popular to a general audience, they may not always contain a large number of data points, and new data may be difficult or impossible to collect. Some exceptions do exist, for instance, scientific or health data, but this is not the case for cultural heritage (CH). The poor performance of the best models in computer vision - when tested over artworks - coupled with the lack of extensively annotated datasets for CH, and the fact that artwork images depict objects and actions not captured by photographs, indicate that a CH-specific dataset would be highly valuable for this community. We propose DEArt, at this point primarily an object detection and pose classification dataset meant to be a reference for paintings between the XIIth and the XVIIIth centuries. It contains more than 15000 images, about 80% non-iconic, aligned with manual annotations for the bounding boxes identifying all instances of 69 classes as well as 12 possible poses for boxes identifying human-like objects. Of these, more than 50 classes are CH-specific and thus do not appear in other datasets; these reflect imaginary beings, symbolic entities and other categories related to art. Additionally, existing datasets do not include pose annotations. Our results show that object detectors for the cultural heritage domain can achieve a level of precision comparable to state-of-art models for generic images via transfer learning. △ Less

Submitted 3 November, 2022; v1 submitted 2 November, 2022; originally announced November 2022.

Comments: VISART VI. Workshop at the European Conference of Computer Vision (ECCV)

arXiv:2012.02325 [pdf]

doi 10.1371/journal.pcbi.1009041

Ten Simple Rules for making a vocabulary FAIR

Authors: Simon J D Cox, Alejandra N Gonzalez-Beltran, Barbara Magagna, Maria-Cristina Marinescu

Abstract: We present ten simple rules that support converting a legacy vocabulary -- a list of terms available in a print-based glossary or table not accessible using web standards -- into a FAIR vocabulary. Various pathways may be followed to publish the FAIR vocabulary, but we emphasise particularly the goal of providing a distinct IRI for each term or concept. A standard representation of the concept sho… ▽ More We present ten simple rules that support converting a legacy vocabulary -- a list of terms available in a print-based glossary or table not accessible using web standards -- into a FAIR vocabulary. Various pathways may be followed to publish the FAIR vocabulary, but we emphasise particularly the goal of providing a distinct IRI for each term or concept. A standard representation of the concept should be returned when the individual IRI is de-referenced, using SKOS or OWL serialised in an RDF-based representation for machine-interchange, or in a web-page for human consumption. Guidelines for vocabulary and item metadata are provided, as well as development and maintenance considerations. By following these rules you can achieve the outcome of converting a legacy vocabulary into a standalone FAIR vocabulary, which can be used for unambiguous data annotation. In turn, this increases data interoperability and enables data integration. △ Less

Submitted 3 December, 2020; originally announced December 2020.

Comments: 13 pages

Journal ref: PLoS Comput Biol 17(6): e1009041 (2021)

arXiv:2006.14994 [pdf]

ProVe -- Self-supervised pipeline for automated product replacement and cold-starting based on neural language models

Authors: Andrei Ionut Damian, Laurentiu Piciu, Cosmin Mihai Marinescu

Abstract: In retail vertical industries, businesses are dealing with human limitation of quickly understanding and adapting to new purchasing behaviors. Moreover, retail businesses need to overcome the human limitation of properly managing a massive selection of products/brands/categories. These limitations lead to deficiencies from both commercial (e.g. loss of sales, decrease in customer satisfaction) and… ▽ More In retail vertical industries, businesses are dealing with human limitation of quickly understanding and adapting to new purchasing behaviors. Moreover, retail businesses need to overcome the human limitation of properly managing a massive selection of products/brands/categories. These limitations lead to deficiencies from both commercial (e.g. loss of sales, decrease in customer satisfaction) and operational perspective (e.g. out-of-stock, over-stock). In this paper, we propose a pipeline approach based on Natural Language Understanding, for recommending the most suitable replacements for products that are out-of-stock. Moreover, we will propose a solution for managing products that were newly introduced in a retailer's portfolio with almost no transactional history. This solution will help businesses: automatically assign the new products to the right category; recommend complementary products for cross-sell from day 1; perform sales predictions even with almost no transactional history. Finally, the vector space model resulted by applying the pipeline presented in this paper is directly used as semantic information in deep learning-based demand forecasting solutions, leading to more accurate predictions. The whole research and experimentation process have been done using real-life private transactional data, however the source code is available on https://github.com/Lummetry/ProVe △ Less

Submitted 12 January, 2021; v1 submitted 26 June, 2020; originally announced June 2020.

arXiv:1806.10741 [pdf, other]

Robust Neural Malware Detection Models for Emulation Sequence Learning

Authors: Rakshit Agrawal, Jack W. Stokes, Mady Marinescu, Karthik Selvaraj

Abstract: Malicious software, or malware, presents a continuously evolving challenge in computer security. These embedded snippets of code in the form of malicious files or hidden within legitimate files cause a major risk to systems with their ability to run malicious command sequences. Malware authors even use polymorphism to reorder these commands and create several malicious variations. However, if exec… ▽ More Malicious software, or malware, presents a continuously evolving challenge in computer security. These embedded snippets of code in the form of malicious files or hidden within legitimate files cause a major risk to systems with their ability to run malicious command sequences. Malware authors even use polymorphism to reorder these commands and create several malicious variations. However, if executed in a secure environment, one can perform early malware detection on emulated command sequences. The models presented in this paper leverage this sequential data derived via emulation in order to perform Neural Malware Detection. These models target the core of the malicious operation by learning the presence and pattern of co-occurrence of malicious event actions from within these sequences. Our models can capture entire event sequences and be trained directly using the known target labels. These end-to-end learning models are powered by two commonly used structures - Long Short-Term Memory (LSTM) Networks and Convolutional Neural Networks (CNNs). Previously proposed sequential malware classification models process no more than 200 events. Attackers can evade detection by delaying any malicious activity beyond the beginning of the file. We present specialized models that can handle extremely long sequences while successfully performing malware detection in an efficient way. We present an implementation of the Convoluted Partitioning of Long Sequences approach in order to tackle this vulnerability and operate on long sequences. We present our results on a large dataset consisting of 634,249 file sequences, with extremely long file sequences. △ Less

Submitted 27 June, 2018; originally announced June 2018.

arXiv:1712.05919 [pdf, other]

Attack and Defense of Dynamic Analysis-Based, Adversarial Neural Malware Classification Models

Authors: Jack W. Stokes, De Wang, Mady Marinescu, Marc Marino, Brian Bussone

Abstract: Recently researchers have proposed using deep learning-based systems for malware detection. Unfortunately, all deep learning classification systems are vulnerable to adversarial attacks. Previous work has studied adversarial attacks against static analysis-based malware classifiers which only classify the content of the unknown file without execution. However, since the majority of malware is eith… ▽ More Recently researchers have proposed using deep learning-based systems for malware detection. Unfortunately, all deep learning classification systems are vulnerable to adversarial attacks. Previous work has studied adversarial attacks against static analysis-based malware classifiers which only classify the content of the unknown file without execution. However, since the majority of malware is either packed or encrypted, malware classification based on static analysis often fails to detect these types of files. To overcome this limitation, anti-malware companies typically perform dynamic analysis by emulating each file in the anti-malware engine or performing in-depth scanning in a virtual machine. These strategies allow the analysis of the malware after unpacking or decryption. In this work, we study different strategies of crafting adversarial samples for dynamic analysis. These strategies operate on sparse, binary inputs in contrast to continuous inputs such as pixels in images. We then study the effects of two, previously proposed defensive mechanisms against crafted adversarial samples including the distillation and ensemble defenses. We also propose and evaluate the weight decay defense. Experiments show that with these three defensive strategies, the number of successfully crafted adversarial samples is reduced compared to a standard baseline system without any defenses. In particular, the ensemble defense is the most resilient to adversarial attacks. Importantly, none of the defenses significantly reduce the classification accuracy for detecting malware. Finally, we demonstrate that while adding additional hidden layers to neural models does not significantly improve the malware classification accuracy, it does significantly increase the classifier's robustness to adversarial attacks. △ Less

Submitted 16 December, 2017; originally announced December 2017.

Showing 1–5 of 5 results for author: Marinescu, M