-
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Authors:
Gemini Team,
Petko Georgiev,
Ving Ian Lei,
Ryan Burnell,
Libin Bai,
Anmol Gulati,
Garrett Tanzer,
Damien Vincent,
Zhufeng Pan,
Shibo Wang,
Soroosh Mariooryad,
Yifan Ding,
Xinyang Geng,
Fred Alcober,
Roy Frostig,
Mark Omernick,
Lexi Walker,
Cosmin Paduraru,
Christina Sorokin,
Andrea Tacchetti,
Colin Gaffney,
Samira Daruki,
Olcan Sercinoglu,
Zach Gleicher,
Juliette Love, et al. (1092 additional authors not shown)
Abstract:
In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.
Submitted 14 June, 2024; v1 submitted 8 March, 2024;
originally announced March 2024.
-
Bayesian estimation and reconstruction of marine surface contaminant dispersion
Authors:
Yang Liu,
Christopher M. Harvey,
Frederick E. Hamlyn,
Cunjia Liu
Abstract:
Discharge of hazardous substances into the marine environment poses a substantial risk to both public health and the ecosystem. In such incidents, it is imperative to accurately estimate the release strength of the source and reconstruct the spatio-temporal dispersion of the substances based on the collected measurements. In this study, we propose an integrated estimation framework to tackle this challenge, which can be used in conjunction with a sensor network or a mobile sensor for environment monitoring. We employ the fundamental convection-diffusion partial differential equation (PDE) to represent the general dispersion of a physical quantity in a non-uniform flow field. The PDE model is spatially discretised into a linear state-space model using the dynamic transient finite-element method (FEM) so that the characterisation of time-varying dispersion can be cast into the problem of inferring the model states from sensor measurements. We also consider imperfect sensing phenomena, including miss-detection and signal quantisation, which are frequently encountered when using a sensor network. This complicated sensor process introduces nonlinearity into the Bayesian estimation process. A Rao-Blackwellised particle filter (RBPF) is designed to provide an effective solution by exploiting the linear structure of the state-space model, whereas the nonlinearity of the measurement model can be handled by Monte Carlo approximation with particles. The proposed framework is validated using a simulated oil spill incident in the Baltic Sea with real ocean flow data. The results show the efficacy of the developed spatio-temporal dispersion model and estimation schemes in the presence of imperfect measurements. Moreover, the parameter selection process is discussed, along with some comparison studies to illustrate the advantages of the proposed algorithm over existing methods.
Submitted 4 September, 2023; v1 submitted 1 September, 2023;
originally announced September 2023.
-
The Barriers to Online Clothing Websites for Visually Impaired People: An Interview and Observation Approach to Understanding Needs
Authors:
Amnah Alluqmani,
Morgan Harvey,
Ziqi Zhang
Abstract:
Visually impaired (VI) people often face challenges when performing everyday tasks and identify shopping for clothes as one of the most challenging. Many engage in online shopping, which eliminates some challenges of physical shopping. However, clothes shopping online suffers from many other limitations and barriers. More research is needed to address these challenges, and extant works often base their findings on interviews alone, providing only subjective, recall-biased information. We conducted two complementary studies using both observational and interview approaches to fill a gap in understanding about VI people's behaviour when selecting and purchasing clothes online. Our findings show that shopping websites suffer from inaccurate, misleading, and contradictory clothing descriptions; that VI people mainly rely on (unreliable) search tools and check product descriptions by reviewing customer comments. Our findings also indicate that VI people are hesitant to accept assistance from automated systems, but that trust in such systems could be improved if researchers can develop systems that better accommodate users' needs and preferences.
Submitted 19 May, 2023;
originally announced May 2023.
-
An Analysis of Classification Approaches for Hit Song Prediction using Engineered Metadata Features with Lyrics and Audio Features
Authors:
Mengyisong Zhao,
Morgan Harvey,
David Cameron,
Frank Hopfgartner,
Valerie J. Gillet
Abstract:
Hit song prediction, one of the emerging fields in music information retrieval (MIR), remains a considerable challenge. Being able to understand what makes a given song a hit is clearly beneficial to the whole music industry. Previous approaches to hit song prediction have focused on using audio features of a record. This study aims to improve the prediction result of the top 10 hits among Billboard Hot 100 songs using more alternative metadata, including song audio features provided by Spotify, song lyrics, and novel metadata-based features (title topic, popularity continuity and genre class). Five machine learning approaches are applied: k-nearest neighbours, Naive Bayes, Random Forest, Logistic Regression and Multilayer Perceptron. Our results show that Random Forest (RF) and Logistic Regression (LR) with all features (including novel features, song audio features and lyrics features) outperform other models, achieving 89.1% and 87.2% accuracy, and 0.91 and 0.93 AUC, respectively. Our findings also demonstrate the utility of our novel music metadata features, which contributed most to the models' discriminative performance.
Submitted 31 January, 2023;
originally announced January 2023.
-
Unfoldings and Nets of Regular Polytopes
Authors:
Satyan L. Devadoss,
Matthew Harvey
Abstract:
Over a decade ago, it was shown that every edge unfolding of the Platonic solids was without self-overlap, yielding a valid net. We consider this property for regular polytopes in arbitrary dimensions, notably the simplex, cube, and orthoplex. It was recently proven that all unfoldings of the $n$-cube yield nets. We show this is also true for the $n$-simplex and the $4$-orthoplex but demonstrate its surprising failure for any orthoplex of higher dimension.
Submitted 1 November, 2021;
originally announced November 2021.
-
Understanding Mobile Search Task Relevance and User Behaviour in Context
Authors:
Mohammad Aliannejadi,
Morgan Harvey,
Luca Costa,
Matthew Pointon,
Fabio Crestani
Abstract:
Improvements in mobile technologies have led to a dramatic change in how and when people access and use information, and are having a profound impact on how users address their daily information needs. Smart phones are rapidly becoming our main method of accessing information and are frequently used to perform 'on-the-go' search tasks. As research into information retrieval continues to evolve, evaluating search behaviour in context is relatively new. Previous research has studied the effects of context through either self-reported diary studies or quantitative log analysis; however, neither approach is able to accurately capture context of use at the time of searching. In this study, we aim to gain a better understanding of task relevance and search behaviour via a task-based user study (n=31) employing a bespoke Android app. The app allowed us to accurately capture the user's context when completing tasks at different times of the day over the period of a week. Through analysis of the collected data, we gain a better understanding of how using smart phones on the go impacts search behaviour, search performance and task relevance and whether or not the actual context is an important factor.
Submitted 13 January, 2019; v1 submitted 17 December, 2018;
originally announced December 2018.
-
Dimensionality reduction methods for molecular simulations
Authors:
Stefan Doerr,
Igor Ariz-Extreme,
Matthew J. Harvey,
Gianni De Fabritiis
Abstract:
Molecular simulations produce very high-dimensional data-sets with millions of data points. As analysis methods are often unable to cope with so many dimensions, it is common to use dimensionality reduction and clustering methods to reach a reduced representation of the data. Yet these methods often fail to capture the most important features necessary for the construction of a Markov model. Here we demonstrate the results of various dimensionality reduction methods on two simulation data-sets, one of protein folding and another of protein-ligand binding. The methods tested include a k-means clustering variant, a non-linear auto encoder, principal component analysis and tICA. The dimension-reduced data is then used to estimate the implied timescales of the slowest process by a Markov state model analysis to assess the quality of the projection. The projected dimensions learned from the data are visualized to demonstrate which conformations the various methods choose to represent the molecular process.
Submitted 2 November, 2017; v1 submitted 29 October, 2017;
originally announced October 2017.
-
Inference for Belief Networks Using Coupling From the Past
Authors:
Michael Harvey,
Radford M. Neal
Abstract:
Inference for belief networks using Gibbs sampling produces a distribution for unobserved variables that differs from the correct distribution by a (usually) unknown error, since convergence to the right distribution occurs only asymptotically. The method of "coupling from the past" samples from exactly the correct distribution by (conceptually) running dependent Gibbs sampling simulations from every possible starting state from a time far enough in the past that all runs reach the same state at time t=0. Explicitly considering every possible state is intractable for large networks, however. We propose a method for layered noisy-or networks that uses a compact, but often imprecise, summary of a set of states. This method samples from exactly the correct distribution, and requires only about twice the time per step as ordinary Gibbs sampling, but it may require more simulation steps than would be needed if chains were tracked exactly.
Submitted 16 January, 2013;
originally announced January 2013.
-
Folksonomic Tag Clouds as an Aid to Content Indexing
Authors:
Morgan Harvey,
Mark Baillie,
Ian Ruthven,
David Elsweiler
Abstract:
Social tagging systems have recently developed as a popular method of data organisation on the Internet. These systems allow users to organise their content in a way that makes sense to them, rather than forcing them to use a pre-determined and rigid set of categorisations. These folksonomies provide well populated sources of unstructured tags describing web resources which could potentially be used as semantic index terms for these resources. However, getting people to agree on what tags best describe a resource is a difficult problem; therefore, any feature which increases the consistency and stability of terms chosen would be extremely beneficial. We investigate how the provision of a tag cloud, a weighted list of terms commonly used to assist in browsing a folksonomy, during the tagging process itself influences the tags produced and how difficult the user perceived the task to be. We show that illustrating the most popular tags to users assists in the tagging process and encourages a stable and consistent folksonomy to form.
Submitted 21 November, 2009;
originally announced November 2009.