-
"Vorbeşti Româneşte?" A Recipe to Train Powerful Romanian LLMs with English Instructions
Authors:
Mihai Masala,
Denis C. Ilie-Ablachim,
Alexandru Dima,
Dragos Corlatescu,
Miruna Zavelca,
Ovio Olaru,
Simina Terian,
Andrei Terian,
Marius Leordeanu,
Horia Velicu,
Marius Popescu,
Mihai Dascalu,
Traian Rebedea
Abstract:
In recent years, Large Language Models (LLMs) have achieved almost human-like performance on various tasks. While some LLMs have been trained on multilingual data, most of the training data is in English; hence, their performance in English greatly exceeds other languages. To our knowledge, we are the first to collect and translate a large collection of texts, instructions, and benchmarks and train, evaluate, and release open-source LLMs tailored for Romanian. We evaluate our methods on four different categories, including academic benchmarks, MT-Bench (manually translated), and a professionally built historical, cultural, and social benchmark adapted to Romanian. We argue for the usefulness and high performance of RoLLMs by obtaining state-of-the-art results across the board. We publicly release all resources (i.e., data, training and evaluation code, models) to support and encourage research on Romanian LLMs while concurrently creating a generalizable recipe, adequate for other low- or less-resourced languages.
Submitted 27 June, 2024; v1 submitted 26 June, 2024;
originally announced June 2024.
-
OpenLLM-Ro -- Technical Report on Open-source Romanian LLMs
Authors:
Mihai Masala,
Denis C. Ilie-Ablachim,
Dragos Corlatescu,
Miruna Zavelca,
Marius Leordeanu,
Horia Velicu,
Marius Popescu,
Mihai Dascalu,
Traian Rebedea
Abstract:
In recent years, Large Language Models (LLMs) have achieved almost human-like performance on various tasks. While some LLMs have been trained on multilingual data, most of the training data is in English. Hence, their performance in English greatly exceeds their performance in other languages. This document presents our approach to training and evaluating the first foundational and chat LLM specialized for Romanian.
Submitted 17 May, 2024; v1 submitted 13 May, 2024;
originally announced May 2024.
-
EMBERSim: A Large-Scale Databank for Boosting Similarity Search in Malware Analysis
Authors:
Dragos Georgian Corlatescu,
Alexandru Dinu,
Mihaela Gaman,
Paul Sumedrea
Abstract:
In recent years, there has been a shift from heuristics-based malware detection towards machine learning, which proves to be more robust in the current heavily adversarial threat landscape. While we acknowledge machine learning to be better equipped to mine for patterns in the increasingly large numbers of similar-looking files, we also note a remarkable scarcity of the data available for similarity-targeted research. Moreover, we observe that the focus in the few related works falls on quantifying similarity in malware, often overlooking the clean data. This one-sided quantification is especially dangerous in the context of detection bypass. We propose to address the deficiencies in the space of similarity research on binary files, starting from EMBER, one of the largest malware classification data sets. We enhance EMBER with similarity information as well as malware class tags, to enable further research in the similarity space. Our contribution is threefold: (1) we publish EMBERSim, an augmented version of EMBER that includes similarity-informed tags; (2) we enrich EMBERSim with automatically determined malware class tags using the open-source tool AVClass on VirusTotal data; and (3) we describe and share the implementation for our class scoring technique and leaf similarity method.
Submitted 3 October, 2023;
originally announced October 2023.