-
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Authors:
Gemini Team,
Petko Georgiev,
Ving Ian Lei,
Ryan Burnell,
Libin Bai,
Anmol Gulati,
Garrett Tanzer,
Damien Vincent,
Zhufeng Pan,
Shibo Wang,
Soroosh Mariooryad,
Yifan Ding,
Xinyang Geng,
Fred Alcober,
Roy Frostig,
Mark Omernick,
Lexi Walker,
Cosmin Paduraru,
Christina Sorokin,
Andrea Tacchetti,
Colin Gaffney,
Samira Daruki,
Olcan Sercinoglu,
Zach Gleicher,
Juliette Love
, et al. (1092 additional authors not shown)
Abstract:
In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February…
▽ More
In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.
△ Less
Submitted 14 June, 2024; v1 submitted 8 March, 2024;
originally announced March 2024.
-
MiQA: A Benchmark for Inference on Metaphorical Questions
Authors:
Iulia-Maria Comsa,
Julian Martin Eisenschlos,
Srini Narayanan
Abstract:
We propose a benchmark to assess the capability of large language models to reason with conventional metaphors. Our benchmark combines the previously isolated topics of metaphor detection and commonsense reasoning into a single task that requires a model to make inferences by accurately selecting between the literal and metaphorical register. We examine the performance of state-of-the-art pre-trai…
▽ More
We propose a benchmark to assess the capability of large language models to reason with conventional metaphors. Our benchmark combines the previously isolated topics of metaphor detection and commonsense reasoning into a single task that requires a model to make inferences by accurately selecting between the literal and metaphorical register. We examine the performance of state-of-the-art pre-trained models on binary-choice tasks and find a large discrepancy between the performance of small and very large models, going from chance to near-human level. We also analyse the largest model in a generative setting and find that although human performance is approached, careful multiple-shot prompting is required.
△ Less
Submitted 14 October, 2022;
originally announced October 2022.
-
Prediction of Dilatory Behavior in eLearning: A Comparison of Multiple Machine Learning Models
Authors:
Christof Imhof,
Ioan-Sorin Comsa,
Martin Hlosta,
Behnam Parsaeifard,
Ivan Moser,
Per Bergamin
Abstract:
Procrastination, the irrational delay of tasks, is a common occurrence in online learning. Potential negative consequences include higher risk of drop-outs, increased stress, and reduced mood. Due to the rise of learning management systems and learning analytics, indicators of such behavior can be detected, enabling predictions of future procrastination and other dilatory behavior. However, resear…
▽ More
Procrastination, the irrational delay of tasks, is a common occurrence in online learning. Potential negative consequences include higher risk of drop-outs, increased stress, and reduced mood. Due to the rise of learning management systems and learning analytics, indicators of such behavior can be detected, enabling predictions of future procrastination and other dilatory behavior. However, research focusing on such predictions is scarce. Moreover, studies involving different types of predictors and comparisons between the predictive performance of various methods are virtually non-existent. In this study, we aim to fill these research gaps by analyzing the performance of multiple machine learning algorithms when predicting the delayed or timely submission of online assignments in a higher education setting with two categories of predictors: subjective, questionnaire-based variables and objective, log-data based indicators extracted from a learning management system. The results show that models with objective predictors consistently outperform models with subjective predictors, and a combination of both variable types perform slightly better. For each of these three options, a different approach prevailed (Gradient Boosting Machines for the subjective, Bayesian multilevel models for the objective, and Random Forest for the combined predictors). We conclude that careful attention should be paid to the selection of predictors and algorithms before implementing such models in learning management systems.
△ Less
Submitted 30 June, 2022;
originally announced June 2022.
-
Zuckerli: A New Compressed Representation for Graphs
Authors:
Luca Versari,
Iulia M. Comsa,
Alessio Conte,
Roberto Grossi
Abstract:
Zuckerli is a scalable compression system meant for large real-world graphs. Graphs are notoriously challenging structures to store efficiently due to their linked nature, which makes it hard to separate them into smaller, compact components. Therefore, effective compression is crucial when dealing with large graphs, which can have billions of nodes and edges. Furthermore, a good compression syste…
▽ More
Zuckerli is a scalable compression system meant for large real-world graphs. Graphs are notoriously challenging structures to store efficiently due to their linked nature, which makes it hard to separate them into smaller, compact components. Therefore, effective compression is crucial when dealing with large graphs, which can have billions of nodes and edges. Furthermore, a good compression system should give the user fast and reasonably flexible access to parts of the compressed data without requiring full decompression, which may be unfeasible on their system. Zuckerli improves multiple aspects of WebGraph, the current state-of-the-art in compressing real-world graphs, by using advanced compression techniques and novel heuristic graph algorithms. It can produce both a compressed representation for storage and one which allows fast direct access to the adjacency lists of the compressed graph without decompressing the entire graph. We validate the effectiveness of Zuckerli on real-world graphs with up to a billion nodes and 90 billion edges, conducting an extensive experimental evaluation of both compression density and decompression performance. We show that Zuckerli-compressed graphs are 10% to 29% smaller, and more than 20% in most cases, with a resource usage for decompression comparable to that of WebGraph.
△ Less
Submitted 2 September, 2020;
originally announced September 2020.
-
Intelligent Matrix Exponentiation
Authors:
Thomas Fischbacher,
Iulia M. Comsa,
Krzysztof Potempa,
Moritz Firsching,
Luca Versari,
Jyrki Alakuijala
Abstract:
We present a novel machine learning architecture that uses the exponential of a single input-dependent matrix as its only nonlinearity. The mathematical simplicity of this architecture allows a detailed analysis of its behaviour, providing robustness guarantees via Lipschitz bounds. Despite its simplicity, a single matrix exponential layer already provides universal approximation properties and ca…
▽ More
We present a novel machine learning architecture that uses the exponential of a single input-dependent matrix as its only nonlinearity. The mathematical simplicity of this architecture allows a detailed analysis of its behaviour, providing robustness guarantees via Lipschitz bounds. Despite its simplicity, a single matrix exponential layer already provides universal approximation properties and can learn fundamental functions of the input, such as periodic functions or multivariate polynomials. This architecture outperforms other general-purpose architectures on benchmark problems, including CIFAR-10, using substantially fewer parameters.
△ Less
Submitted 10 August, 2020;
originally announced August 2020.
-
Committee Draft of JPEG XL Image Coding System
Authors:
Alexander Rhatushnyak,
Jan Wassenberg,
Jon Sneyers,
Jyrki Alakuijala,
Lode Vandevenne,
Luca Versari,
Robert Obryk,
Zoltan Szabadka,
Evgenii Kliuchnikov,
Iulia-Maria Comsa,
Krzysztof Potempa,
Martin Bruse,
Moritz Firsching,
Renata Khasanova,
Ruud van Asseldonk,
Sami Boukortt,
Sebastian Gomez,
Thomas Fischbacher
Abstract:
JPEG XL is a practical approach focused on scalable web distribution and efficient compression of high-quality images. It provides various benefits compared to existing image formats: 60% size reduction at equivalent subjective quality; fast, parallelizable decoding and encoding configurations; features such as progressive, lossless, animation, and reversible transcoding of existing JPEG with 22%…
▽ More
JPEG XL is a practical approach focused on scalable web distribution and efficient compression of high-quality images. It provides various benefits compared to existing image formats: 60% size reduction at equivalent subjective quality; fast, parallelizable decoding and encoding configurations; features such as progressive, lossless, animation, and reversible transcoding of existing JPEG with 22% size reduction; support for high-quality applications including wide gamut, higher resolution/bit depth/dynamic range, and visually lossless coding. The JPEG XL architecture is traditional block-transform coding with upgrades to each component.
△ Less
Submitted 13 August, 2019; v1 submitted 12 August, 2019;
originally announced August 2019.
-
Temporal Coding in Spiking Neural Networks with Alpha Synaptic Function: Learning with Backpropagation
Authors:
Iulia M. Comsa,
Krzysztof Potempa,
Luca Versari,
Thomas Fischbacher,
Andrea Gesmundo,
Jyrki Alakuijala
Abstract:
The timing of individual neuronal spikes is essential for biological brains to make fast responses to sensory stimuli. However, conventional artificial neural networks lack the intrinsic temporal coding ability present in biological networks. We propose a spiking neural network model that encodes information in the relative timing of individual neuron spikes. In classification tasks, the output of…
▽ More
The timing of individual neuronal spikes is essential for biological brains to make fast responses to sensory stimuli. However, conventional artificial neural networks lack the intrinsic temporal coding ability present in biological networks. We propose a spiking neural network model that encodes information in the relative timing of individual neuron spikes. In classification tasks, the output of the network is indicated by the first neuron to spike in the output layer. This temporal coding scheme allows the supervised training of the network with backpropagation, using locally exact derivatives of the postsynaptic spike times with respect to presynaptic spike times. The network operates using a biologically-plausible alpha synaptic transfer function. Additionally, we use trainable synchronisation pulses that provide bias, add flexibility during training and exploit the decay part of the alpha function. We show that such networks can be trained successfully on noisy Boolean logic tasks and on the MNIST dataset encoded in time. The results show that the spiking neural network outperforms comparable spiking models on MNIST and achieves similar quality to fully connected conventional networks with the same architecture. We also find that the spiking network spontaneously discovers two operating regimes, mirroring the accuracy-speed trade-off observed in human decision-making: a slow regime, where a decision is taken after all hidden neurons have spiked and the accuracy is very high, and a fast regime, where a decision is taken very fast but the accuracy is lower. These results demonstrate the computational power of spiking networks with biological characteristics that encode information in the timing of individual neurons. By studying temporal coding in spiking networks, we aim to create building blocks towards energy-efficient and more complex biologically-inspired neural architectures.
△ Less
Submitted 16 November, 2020; v1 submitted 30 July, 2019;
originally announced July 2019.
-
SO(8) Supergravity and the Magic of Machine Learning
Authors:
Iulia M. Comsa,
Moritz Firsching,
Thomas Fischbacher
Abstract:
Using de Wit-Nicolai $D=4\;\mathcal{N}=8\;SO(8)$ supergravity as an example, we show how modern Machine Learning software libraries such as Google's TensorFlow can be employed to greatly simplify the analysis of high-dimensional scalar sectors of some M-Theory compactifications.
We provide detailed information on the location, symmetries, and particle spectra and charges of 192 critical points o…
▽ More
Using de Wit-Nicolai $D=4\;\mathcal{N}=8\;SO(8)$ supergravity as an example, we show how modern Machine Learning software libraries such as Google's TensorFlow can be employed to greatly simplify the analysis of high-dimensional scalar sectors of some M-Theory compactifications.
We provide detailed information on the location, symmetries, and particle spectra and charges of 192 critical points on the scalar manifold of SO(8) supergravity, including one newly discovered $\mathcal{N}=1$ vacuum with $SO(3)$ residual symmetry, one new potentially stabilizable non-supersymmetric solution, and examples for "Galois conjugate pairs" of solutions, i.e. solution-pairs that share the same gauge group embedding into~$SO(8)$ and minimal polynomials for the cosmological constant. Where feasible, we give analytic expressions for solution coordinates and cosmological constants.
As the authors' aspiration is to present the discussion in a form that is accessible to both the Machine Learning and String Theory communities and allows adopting our methods towards the study of other models, we provide an introductory overview over the relevant Physics as well as Machine Learning concepts. This includes short pedagogical code examples. In particular, we show how to formulate a requirement for residual Supersymmetry as a Machine Learning loss function and effectively guide the numerical search towards supersymmetric critical points. Numerical investigations suggest that there are no further supersymmetric vacua beyond this newly discovered fifth solution.
△ Less
Submitted 19 July, 2019; v1 submitted 1 June, 2019;
originally announced June 2019.
-
Evaluating QoS Parameters for IPTV Distribution in Heterogeneous Networks
Authors:
Ioan Sorin Comsa,
Radu Arsinte
Abstract:
The present work presents an architecture developed to evaluate the QoS parameters for the IPTV heterogeneous network. At its very basic level lie two software technologies: Video LAN and Windows Media Services with two operating systems: Windows and Linux. Three types of streams are analyzed, which will be transmitted to a Linux VLC client through means of the aggregation and access servers. The…
▽ More
The present work presents an architecture developed to evaluate the QoS parameters for the IPTV heterogeneous network. At its very basic level lie two software technologies: Video LAN and Windows Media Services with two operating systems: Windows and Linux. Three types of streams are analyzed, which will be transmitted to a Linux VLC client through means of the aggregation and access servers. The first stream is generated in real time by a capture camera, processed by the encapsulated VC-1 encoder and sent to the Media Server, while the second one is of VoD(Video on Demand) type and the third one will be handled by DVBViewer through the MPEG TS form. The first stream is transcoded in H.264-AAC such that the Linux stations will recognize its format. Through the simultaneous transmission of the three streams, we are analyzing their performance from a QoS parameters point of view by means of an application implemented in C programming language. The stream transporting the DVB-S television content was proven to ensure the best performance regarding loss of packets, delays and jitter.
△ Less
Submitted 21 February, 2015;
originally announced February 2015.