
Gemini: A Family of Highly Capable Multimodal Models

Authors: Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee, et al. (1325 additional authors not shown)

Abstract: This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI.

Submitted 17 June, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

arXiv:2212.13535 [pdf, other]

From Single-Visit to Multi-Visit Image-Based Models: Single-Visit Models are Enough to Predict Obstructive Hydronephrosis

Authors: Stanley Bryan Z. Hua, Mandy Rickard, John Weaver, Alice Xiang, Daniel Alvarez, Kyla N. Velear, Kunj Sheth, Gregory E. Tasian, Armando J. Lorenzo, Anna Goldenberg, Lauren Erdman

Abstract: Previous work has shown the potential of deep learning to predict renal obstruction using kidney ultrasound images. However, these image-based classifiers have been trained with the goal of single-visit inference in mind. We compare methods from video action recognition (i.e. convolutional pooling, LSTM, TSM) to adapt single-visit convolutional models to handle multiple visit inference. We demonstrate that incorporating images from a patient's past hospital visits provides only a small benefit for the prediction of obstructive hydronephrosis. Therefore, inclusion of prior ultrasounds is beneficial, but prediction based on the latest ultrasound is sufficient for patient risk stratification.

Submitted 27 December, 2022; originally announced December 2022.

Comments: Paper accepted to SIPAIM 2022 (in Valparaiso, Chile)

arXiv:2210.03453 [pdf, other]

Key Information Extraction in Purchase Documents using Deep Learning and Rule-based Corrections

Authors: Roberto Arroyo, Javier Yebes, Elena Martínez, Héctor Corrales, Javier Lorenzo

Abstract: Deep Learning (DL) is dominating the fields of Natural Language Processing (NLP) and Computer Vision (CV) in recent times. However, DL commonly relies on the availability of large data annotations, so other alternative or complementary pattern-based techniques can help to improve results. In this paper, we build upon Key Information Extraction (KIE) in purchase documents using both DL and rule-based corrections. Our system initially relies on Optical Character Recognition (OCR) and text understanding based on entity tagging to identify purchase facts of interest (e.g., product codes, descriptions, quantities, or prices). These facts are then linked to the same product group, which is recognized by means of line detection and some grouping heuristics. Once these DL approaches are processed, we contribute several mechanisms consisting of rule-based corrections for improving the baseline DL predictions. We prove the enhancements provided by these rule-based corrections over the baseline DL results in the presented experiments for purchase documents from public and NielsenIQ datasets.

Submitted 7 October, 2022; originally announced October 2022.

Comments: Conference on Computational Linguistics (COLING 2022). PAN-DL Workshop

arXiv:2112.13922 [pdf]

Predicting Breakdown Risk Based on Historical Maintenance Data for Air Force Ground Vehicles

Authors: Jeff Jang, Dilan Nana, Jack Hochschild, Jordi Vila Hernandez de Lorenzo

Abstract: Unscheduled maintenance has contributed to longer downtime for vehicles and increased costs for Logistic Readiness Squadrons (LRSs) in the Air Force. When vehicles are in need of repair outside of their scheduled time, depending on their priority level, the entire squadron's slated repair schedule is transformed negatively. The repercussions of unscheduled maintenance are specifically seen in the increase of man hours required to maintain vehicles that should have been working well: this can include more man hours spent on maintenance itself, waiting for parts to arrive, hours spent re-organizing the repair schedule, and more. The dominant trend in the current maintenance system at LRSs is that they do not have predictive maintenance infrastructure to counteract the influx of unscheduled repairs they experience currently, and as a result, their readiness and performance levels are lower than desired. We use data pulled from the Defense Property and Accountability System (DPAS), which the LRSs currently use to store their vehicle maintenance information. Using historical vehicle maintenance data we receive from DPAS, we apply three different algorithms independently to construct an accurate predictive system to optimize maintenance schedules at any given time. Through the application of Logistic Regression, Random Forest, and Gradient Boosted Trees algorithms, we found that a Logistic Regression algorithm, fitted to our data, produced the most accurate results. Our findings indicate that not only would continuing the use of Logistic Regression be prudent for our research purposes, but that there is opportunity to further tune and optimize our Logistic Regression model for higher accuracy.

Submitted 22 December, 2021; originally announced December 2021.

Comments: 15 pages, 8 figures

arXiv:2105.08647 [pdf, other]

IntFormer: Predicting pedestrian intention with the aid of the Transformer architecture

Authors: J. Lorenzo, I. Parra, M. A. Sotelo

Abstract: Understanding pedestrian crossing behavior is an essential goal in intelligent vehicle development, leading to an improvement in their security and traffic flow. In this paper, we developed a method called IntFormer. It is based on transformer architecture and a novel convolutional video classification model called RubiksNet. Following the evaluation procedure in a recent benchmark, we show that our model reaches state-of-the-art results with good performance ($\approx 40$ seq. per second) and size ($8\times$ smaller than the best performing model), making it suitable for real-time usage. We also explore each of the input features, finding that ego-vehicle speed is the most important variable, possibly due to the similarity in crossing cases in PIE dataset.

Submitted 18 May, 2021; originally announced May 2021.

Comments: 5 pages, 2 figures

arXiv:2008.11647 [pdf, other]

RNN-based Pedestrian Crossing Prediction using Activity and Pose-related Features

Authors: Javier Lorenzo, Ignacio Parra, Florian Wirth, Christoph Stiller, David Fernandez Llorca, Miguel Angel Sotelo

Abstract: Pedestrian crossing prediction is a crucial task for autonomous driving. Numerous studies show that an early estimation of the pedestrian's intention can decrease or even avoid a high percentage of accidents. In this paper, different variations of a deep learning system are proposed to attempt to solve this problem. The proposed models are composed of two parts: a CNN-based feature extractor and an RNN module. All the models were trained and tested on the JAAD dataset. The results obtained indicate that the choice of the feature extraction method, the inclusion of additional variables such as pedestrian gaze direction and discrete orientation, and the chosen RNN type have a significant impact on the final performance.

Submitted 26 August, 2020; originally announced August 2020.

Comments: 6 pages, 5 figures. This work has been accepted for publication at IEEE Intelligent Vehicle Symposium 2020

arXiv:1809.10937 [pdf, other]

New Thread Migration Strategies for NUMA Systems

Authors: O. G. Lorenzo, M. L. Becoña, T. F. Pena, J. C. Cabaleiro, J. A. Lorenzo, F. F. Rivera

Abstract: Multicore systems present on-board memory hierarchies and communication networks that influence performance when executing shared memory parallel codes. Characterising this influence is complex, and understanding the effect of particular hardware configurations on different codes is of paramount importance. In previous works, monitoring information extracted from hardware counters at runtime has been used to characterise the behaviour of each thread in the parallel code in terms of the number of floating point operations per second, operational intensity, and latency of memory access. We propose to use this information to guide thread migration strategies that improve execution efficiency by increasing locality and affinity. Different configurations of NAS Parallel OpenMP benchmarks on multicores were used to validate the benefits of the proposed thread migration strategies. Our proposed strategies produce up to 70% improvement in scenarios where locality and affinity are low, there being a small degradation in performance for codes with high locality and affinity.

Submitted 28 September, 2018; originally announced September 2018.

Comments: Unpublished work

arXiv:1603.05581 [pdf, other]

doi 10.1109/LAWP.2017.2718242

Norm-1 Regularized Consensus-based ADMM for Imaging with a Compressive Antenna

Authors: Juan Heredia Juesas, Ali Molaei, Luis Tirado, William Blackwell, Jose A Martinez Lorenzo

Abstract: This paper presents a novel norm-one-regularized, consensus-based imaging algorithm, based on the Alternating Direction Method of Multipliers (ADMM). This algorithm is capable of imaging composite dielectric and metallic targets using a limited amount of data. The distributed capabilities of the ADMM accelerate the convergence of the imaging. Recently, a Compressive Reflector Antenna (CRA) has been proposed as a way to provide high sensing capacity with a minimum cost and complexity in the hardware architecture. The ADMM algorithm applied to the imaging capabilities of the Compressive Antenna (CA) outperforms current state-of-the-art iterative reconstruction algorithms, such as Nesterov-based methods, in terms of computational cost; and it ultimately enables the use of a CA in quasi-real-time, compressive sensing imaging applications.

Submitted 16 March, 2016; originally announced March 2016.

Comments: 4 pages, 4 figures

Showing 1–8 of 8 results for author: Lorenzo, J