-
The Falcon Series of Open Language Models
Authors:
Ebtesam Almazrouei,
Hamza Alobeidli,
Abdulaziz Alshamsi,
Alessandro Cappelli,
Ruxandra Cojocaru,
Mérouane Debbah,
Étienne Goffinet,
Daniel Hesslow,
Julien Launay,
Quentin Malartic,
Daniele Mazzotta,
Badreddine Noune,
Baptiste Pannier,
Guilherme Penedo
Abstract:
We introduce the Falcon series: 7B, 40B, and 180B parameters causal decoder-only models trained on a diverse high-quality corpora predominantly assembled from web data. The largest model, Falcon-180B, has been trained on over 3.5 trillion tokens of text--the largest openly documented pretraining run. Falcon-180B significantly outperforms models such as PaLM or Chinchilla, and improves upon concurr…
▽ More
We introduce the Falcon series: 7B, 40B, and 180B parameters causal decoder-only models trained on a diverse high-quality corpora predominantly assembled from web data. The largest model, Falcon-180B, has been trained on over 3.5 trillion tokens of text--the largest openly documented pretraining run. Falcon-180B significantly outperforms models such as PaLM or Chinchilla, and improves upon concurrently developed models such as LLaMA 2 or Inflection-1. It nears the performance of PaLM-2-Large at a reduced pretraining and inference cost, making it, to our knowledge, one of the three best language models in the world along with GPT-4 and PaLM-2-Large. We report detailed evaluations, as well as a deep dive into the methods and custom tooling employed to pretrain Falcon. Notably, we report on our custom distributed training codebase, allowing us to efficiently pretrain these models on up to 4,096 A100s on cloud AWS infrastructure with limited interconnect. We release a 600B tokens extract of our web dataset, as well as the Falcon-7/40/180B models under a permissive license to foster open-science and accelerate the development of an open ecosystem of large language models.
△ Less
Submitted 29 November, 2023; v1 submitted 28 November, 2023;
originally announced November 2023.
-
Do VSR Models Generalize Beyond LRS3?
Authors:
Yasser Abdelaziz Dahou Djilali,
Sanath Narayan,
Eustache Le Bihan,
Haithem Boussaid,
Ebtessam Almazrouei,
Merouane Debbah
Abstract:
The Lip Reading Sentences-3 (LRS3) benchmark has primarily been the focus of intense research in visual speech recognition (VSR) during the last few years. As a result, there is an increased risk of overfitting to its excessively used test set, which is only one hour duration. To alleviate this issue, we build a new VSR test set named WildVSR, by closely following the LRS3 dataset creation process…
▽ More
The Lip Reading Sentences-3 (LRS3) benchmark has primarily been the focus of intense research in visual speech recognition (VSR) during the last few years. As a result, there is an increased risk of overfitting to its excessively used test set, which is only one hour duration. To alleviate this issue, we build a new VSR test set named WildVSR, by closely following the LRS3 dataset creation processes. We then evaluate and analyse the extent to which the current VSR models generalize to the new test data. We evaluate a broad range of publicly available VSR models and find significant drops in performance on our test set, compared to their corresponding LRS3 results. Our results suggest that the increase in word error rates is caused by the models inability to generalize to slightly harder and in the wild lip sequences than those found in the LRS3 test set. Our new test benchmark is made public in order to enable future research towards more robust VSR models.
△ Less
Submitted 23 November, 2023;
originally announced November 2023.
-
Emergent Communication in Multi-Agent Reinforcement Learning for Future Wireless Networks
Authors:
Marwa Chafii,
Salmane Naoumi,
Reda Alami,
Ebtesam Almazrouei,
Mehdi Bennis,
Merouane Debbah
Abstract:
In different wireless network scenarios, multiple network entities need to cooperate in order to achieve a common task with minimum delay and energy consumption. Future wireless networks mandate exchanging high dimensional data in dynamic and uncertain environments, therefore implementing communication control tasks becomes challenging and highly complex. Multi-agent reinforcement learning with em…
▽ More
In different wireless network scenarios, multiple network entities need to cooperate in order to achieve a common task with minimum delay and energy consumption. Future wireless networks mandate exchanging high dimensional data in dynamic and uncertain environments, therefore implementing communication control tasks becomes challenging and highly complex. Multi-agent reinforcement learning with emergent communication (EC-MARL) is a promising solution to address high dimensional continuous control problems with partially observable states in a cooperative fashion where agents build an emergent communication protocol to solve complex tasks. This paper articulates the importance of EC-MARL within the context of future 6G wireless networks, which imbues autonomous decision-making capabilities into network entities to solve complex tasks such as autonomous driving, robot navigation, flying base stations network planning, and smart city applications. An overview of EC-MARL algorithms and their design criteria are provided while presenting use cases and research opportunities on this emerging topic.
△ Less
Submitted 12 September, 2023;
originally announced September 2023.
-
Joint Semantic-Native Communication and Inference via Minimal Simplicial Structures
Authors:
Qiyang Zhao,
Hang Zou,
Mehdi Bennis,
Merouane Debbah,
Ebtesam Almazrouei,
Faouzi Bader
Abstract:
In this work, we study the problem of semantic communication and inference, in which a student agent (i.e. mobile device) queries a teacher agent (i.e. cloud sever) to generate higher-order data semantics living in a simplicial complex. Specifically, the teacher first maps its data into a k-order simplicial complex and learns its high-order correlations. For effective communication and inference,…
▽ More
In this work, we study the problem of semantic communication and inference, in which a student agent (i.e. mobile device) queries a teacher agent (i.e. cloud sever) to generate higher-order data semantics living in a simplicial complex. Specifically, the teacher first maps its data into a k-order simplicial complex and learns its high-order correlations. For effective communication and inference, the teacher seeks minimally sufficient and invariant semantic structures prior to conveying information. These minimal simplicial structures are found via judiciously removing simplices selected by the Hodge Laplacians without compromising the inference query accuracy. Subsequently, the student locally runs its own set of queries based on a masked simplicial convolutional autoencoder (SCAE) leveraging both local and remote teacher's knowledge. Numerical results corroborate the effectiveness of the proposed approach in terms of improving inference query accuracy under different channel conditions and simplicial structures. Experiments on a coauthorship dataset show that removing simplices by ranking the Laplacian values yields a 85% reduction in payload size without sacrificing accuracy. Joint semantic communication and inference by masked SCAE improves query accuracy by 25% compared to local student based query and 15% compared to remote teacher based query. Finally, incorporating channel semantics is shown to effectively improve inference accuracy, notably at low SNR values.
△ Less
Submitted 31 August, 2023;
originally announced August 2023.
-
Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Map**
Authors:
Yasser Abdelaziz Dahou Djilali,
Sanath Narayan,
Haithem Boussaid,
Ebtessam Almazrouei,
Merouane Debbah
Abstract:
Visual Speech Recognition (VSR) differs from the common perception tasks as it requires deeper reasoning over the video sequence, even by human experts. Despite the recent advances in VSR, current approaches rely on labeled data to fully train or finetune their models predicting the target speech. This hinders their ability to generalize well beyond the training set and leads to performance degene…
▽ More
Visual Speech Recognition (VSR) differs from the common perception tasks as it requires deeper reasoning over the video sequence, even by human experts. Despite the recent advances in VSR, current approaches rely on labeled data to fully train or finetune their models predicting the target speech. This hinders their ability to generalize well beyond the training set and leads to performance degeneration under out-of-distribution challenging scenarios. Unlike previous works that involve auxiliary losses or complex training procedures and architectures, we propose a simple approach, named Lip2Vec that is based on learning a prior model. Given a robust visual speech encoder, this network maps the encoded latent representations of the lip sequence to their corresponding latents from the audio pair, which are sufficiently invariant for effective text decoding. The generated audio representation is then decoded to text using an off-the-shelf Audio Speech Recognition (ASR) model. The proposed model compares favorably with fully-supervised learning methods on the LRS3 dataset achieving 26 WER. Unlike SoTA approaches, our model keeps a reasonable performance on the VoxCeleb test set. We believe that reprogramming the VSR as an ASR task narrows the performance gap between the two and paves the way for more flexible formulations of lip reading.
△ Less
Submitted 11 August, 2023;
originally announced August 2023.
-
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
Authors:
Guilherme Penedo,
Quentin Malartic,
Daniel Hesslow,
Ruxandra Cojocaru,
Alessandro Cappelli,
Hamza Alobeidli,
Baptiste Pannier,
Ebtesam Almazrouei,
Julien Launay
Abstract:
Large language models are commonly trained on a mixture of filtered web data and curated high-quality corpora, such as social media conversations, books, or technical papers. This curation process is believed to be necessary to produce performant models with broad zero-shot generalization abilities. However, as larger models requiring pretraining on trillions of tokens are considered, it is unclea…
▽ More
Large language models are commonly trained on a mixture of filtered web data and curated high-quality corpora, such as social media conversations, books, or technical papers. This curation process is believed to be necessary to produce performant models with broad zero-shot generalization abilities. However, as larger models requiring pretraining on trillions of tokens are considered, it is unclear how scalable is curation and whether we will run out of unique high-quality data soon. At variance with previous beliefs, we show that properly filtered and deduplicated web data alone can lead to powerful models; even significantly outperforming models from the state-of-the-art trained on The Pile. Despite extensive filtering, the high-quality data we extract from the web is still plentiful, and we are able to obtain five trillion tokens from CommonCrawl. We publicly release an extract of 600 billion tokens from our RefinedWeb dataset, and 1.3/7.5B parameters language models trained on it.
△ Less
Submitted 1 June, 2023;
originally announced June 2023.
-
A Comprehensive Review and a Taxonomy of Edge Machine Learning: Requirements, Paradigms, and Techniques
Authors:
Wenbin Li,
Hakim Hacid,
Ebtesam Almazrouei,
Merouane Debbah
Abstract:
The union of Edge Computing (EC) and Artificial Intelligence (AI) has brought forward the Edge AI concept to provide intelligent solutions close to the end-user environment, for privacy preservation, low latency to real-time performance, and resource optimization. Machine Learning (ML), as the most advanced branch of AI in the past few years, has shown encouraging results and applications in the e…
▽ More
The union of Edge Computing (EC) and Artificial Intelligence (AI) has brought forward the Edge AI concept to provide intelligent solutions close to the end-user environment, for privacy preservation, low latency to real-time performance, and resource optimization. Machine Learning (ML), as the most advanced branch of AI in the past few years, has shown encouraging results and applications in the edge environment. Nevertheless, edge-powered ML solutions are more complex to realize due to the joint constraints from both edge computing and AI domains, and the corresponding solutions are expected to be efficient and adapted in technologies such as data processing, model compression, distributed inference, and advanced learning paradigms for Edge ML requirements. Despite the fact that a great deal of the attention garnered by Edge ML is gained in both the academic and industrial communities, we noticed the lack of a complete survey on existing Edge ML technologies to provide a common understanding of this concept. To tackle this, this paper aims at providing a comprehensive taxonomy and a systematic review of Edge ML techniques, focusing on the soft computing aspects of existing paradigms and techniques. We start by identifying the Edge ML requirements driven by the joint constraints. We then extensively survey more than twenty paradigms and techniques along with their representative work, covering two main parts: edge inference, and edge learning. In particular, we analyze how each technique fits into Edge ML by meeting a subset of the identified requirements. We also summarize Edge ML frameworks and open issues to shed light on future directions for Edge ML.
△ Less
Submitted 15 September, 2023; v1 submitted 16 February, 2023;
originally announced February 2023.
-
Internet of NanoThings: Concepts and Applications
Authors:
Ebtesam Almazrouei,
Raed M. Shubair,
Fabrice Saffre
Abstract:
This chapter focuses on Internet of Things from the nanoscale point of view. The chapter starts with section 1 which provides an introduction of nanothings and nanotechnologies. The nanoscale communication paradigms and the different approaches are discussed for nanodevices development. Nanodevice characteristics are discussed and the architecture of wireless nanodevices are outlined. Section 2 de…
▽ More
This chapter focuses on Internet of Things from the nanoscale point of view. The chapter starts with section 1 which provides an introduction of nanothings and nanotechnologies. The nanoscale communication paradigms and the different approaches are discussed for nanodevices development. Nanodevice characteristics are discussed and the architecture of wireless nanodevices are outlined. Section 2 describes Internet of NanoThing(IoNT), its network architecture, and the challenges of nanoscale communication which is essential for enabling IoNT. Section 3 gives some practical applications of IoNT. The internet of Bio-NanoThing (IoBNT) and relevant biomedical applications are discussed. Other Applications such as military, industrial, and environmental applications are also outlined.
△ Less
Submitted 20 September, 2018;
originally announced September 2018.
-
Assessment of SFSDP Cooperative Localization Algorithm for WLAN Environment
Authors:
Ebtesam Almazrouei,
Nazar Ali,
Saleh Al-Araji
Abstract:
Cooperative localization for indoor WiFi networks have received little attention thus far. Many cooperative location algorithms exist for Wireless Sensor Network Applications but their suitability for WiFi based networks has not been studied. In this paper the performance of the Sparse Finite Semi Definite Program (SFSDP) has been examined using real measurements data and under different indoor co…
▽ More
Cooperative localization for indoor WiFi networks have received little attention thus far. Many cooperative location algorithms exist for Wireless Sensor Network Applications but their suitability for WiFi based networks has not been studied. In this paper the performance of the Sparse Finite Semi Definite Program (SFSDP) has been examined using real measurements data and under different indoor conditions. Effects of other network parameters such as varying number of anchors and blind nodes are also included.
△ Less
Submitted 10 January, 2018;
originally announced January 2018.