Search | arXiv e-print repository

Gemini: A Family of Highly Capable Multimodal Models

Authors: Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee , et al. (1325 additional authors not shown)

Abstract: This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultr… ▽ More This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI. △ Less

Submitted 17 June, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

arXiv:2301.05339 [pdf, other]

doi 10.1111/cgf.14776

A Comprehensive Review of Data-Driven Co-Speech Gesture Generation

Authors: Simbarashe Nyatsanga, Taras Kucherenko, Chaitanya Ahuja, Gustav Eje Henter, Michael Neff

Abstract: Gestures that accompany speech are an essential part of natural and efficient embodied human communication. The automatic generation of such co-speech gestures is a long-standing problem in computer animation and is considered an enabling technology in film, games, virtual social spaces, and for interaction with social robots. The problem is made challenging by the idiosyncratic and non-periodic n… ▽ More Gestures that accompany speech are an essential part of natural and efficient embodied human communication. The automatic generation of such co-speech gestures is a long-standing problem in computer animation and is considered an enabling technology in film, games, virtual social spaces, and for interaction with social robots. The problem is made challenging by the idiosyncratic and non-periodic nature of human co-speech gesture motion, and by the great diversity of communicative functions that gestures encompass. Gesture generation has seen surging interest recently, owing to the emergence of more and larger datasets of human gesture motion, combined with strides in deep-learning-based generative models, that benefit from the growing availability of data. This review article summarizes co-speech gesture generation research, with a particular focus on deep generative models. First, we articulate the theory describing human gesticulation and how it complements speech. Next, we briefly discuss rule-based and classical statistical gesture synthesis, before delving into deep learning approaches. We employ the choice of input modalities as an organizing principle, examining systems that generate gestures from audio, text, and non-linguistic input. We also chronicle the evolution of the related training data sets in terms of size, diversity, motion quality, and collection method. Finally, we identify key research challenges in gesture generation, including data availability and quality; producing human-like motion; grounding the gesture in the co-occurring speech in interaction with other speakers, and in the environment; performing gesture evaluation; and integration of gesture synthesis into applications. We highlight recent approaches to tackling the various key challenges, as well as the limitations of these approaches, and point toward areas of future development. △ Less

Submitted 10 April, 2023; v1 submitted 12 January, 2023; originally announced January 2023.

Comments: Accepted for EUROGRAPHICS 2023

ACM Class: I.3.7

arXiv:2208.08080 [pdf, other]

Multimodal Lecture Presentations Dataset: Understanding Multimodality in Educational Slides

Authors: Dong Won Lee, Chaitanya Ahuja, Paul Pu Liang, Sanika Natu, Louis-Philippe Morency

Abstract: Lecture slide presentations, a sequence of pages that contain text and figures accompanied by speech, are constructed and presented carefully in order to optimally transfer knowledge to students. Previous studies in multimedia and psychology attribute the effectiveness of lecture presentations to their multimodal nature. As a step toward develo** AI to aid in student learning as intelligent teac… ▽ More Lecture slide presentations, a sequence of pages that contain text and figures accompanied by speech, are constructed and presented carefully in order to optimally transfer knowledge to students. Previous studies in multimedia and psychology attribute the effectiveness of lecture presentations to their multimodal nature. As a step toward develo** AI to aid in student learning as intelligent teacher assistants, we introduce the Multimodal Lecture Presentations dataset as a large-scale benchmark testing the capabilities of machine learning models in multimodal understanding of educational content. Our dataset contains aligned slides and spoken language, for 180+ hours of video and 9000+ slides, with 10 lecturers from various subjects (e.g., computer science, dentistry, biology). We introduce two research tasks which are designed as step** stones towards AI agents that can explain (automatically captioning a lecture presentation) and illustrate (synthesizing visual figures to accompany spoken explanations) educational content. We provide manual annotations to help implement these two research tasks and evaluate state-of-the-art models on them. Comparing baselines and human student performances, we find that current models struggle in (1) weak crossmodal alignment between slides and spoken text, (2) learning novel visual mediums, (3) technical language, and (4) long-range sequences. Towards addressing this issue, we also introduce PolyViLT, a multimodal transformer trained with a multi-instance learning loss that is more effective than current approaches. We conclude by shedding light on the challenges and opportunities in multimodal understanding of educational presentations. △ Less

Submitted 17 August, 2022; originally announced August 2022.

Comments: 9 pages, 5 figures

arXiv:2007.12553 [pdf, other]

Style Transfer for Co-Speech Gesture Animation: A Multi-Speaker Conditional-Mixture Approach

Authors: Chaitanya Ahuja, Dong Won Lee, Yukiko I. Nakano, Louis-Philippe Morency

Abstract: How can we teach robots or virtual assistants to gesture naturally? Can we go further and adapt the gesturing style to follow a specific speaker? Gestures that are naturally timed with corresponding speech during human communication are called co-speech gestures. A key challenge, called gesture style transfer, is to learn a model that generates these gestures for a speaking agent 'A' in the gestur… ▽ More How can we teach robots or virtual assistants to gesture naturally? Can we go further and adapt the gesturing style to follow a specific speaker? Gestures that are naturally timed with corresponding speech during human communication are called co-speech gestures. A key challenge, called gesture style transfer, is to learn a model that generates these gestures for a speaking agent 'A' in the gesturing style of a target speaker 'B'. A secondary goal is to simultaneously learn to generate co-speech gestures for multiple speakers while remembering what is unique about each speaker. We call this challenge style preservation. In this paper, we propose a new model, named Mix-StAGE, which trains a single model for multiple speakers while learning unique style embeddings for each speaker's gestures in an end-to-end manner. A novelty of Mix-StAGE is to learn a mixture of generative models which allows for conditioning on the unique gesture style of each speaker. As Mix-StAGE disentangles style and content of gestures, gesturing styles for the same input speech can be altered by simply switching the style embeddings. Mix-StAGE also allows for style preservation when learning simultaneously from multiple speakers. We also introduce a new dataset, Pose-Audio-Transcript-Style (PATS), designed to study gesture generation and style transfer. Our proposed Mix-StAGE model significantly outperforms the previous state-of-the-art approach for gesture generation and provides a path towards performing gesture style transfer across multiple speakers. Link to code, data, and videos: http://chahuja.com/mix-stage △ Less

Submitted 24 July, 2020; originally announced July 2020.

Comments: 24 pages, 12 figures

Journal ref: European Conference on Computer Vision 2020

arXiv:1910.02181 [pdf, other]

To React or not to React: End-to-End Visual Pose Forecasting for Personalized Avatar during Dyadic Conversations

Authors: Chaitanya Ahuja, Shugao Ma, Louis-Philippe Morency, Yaser Sheikh

Abstract: Non verbal behaviours such as gestures, facial expressions, body posture, and para-linguistic cues have been shown to complement or clarify verbal messages. Hence to improve telepresence, in form of an avatar, it is important to model these behaviours, especially in dyadic interactions. Creating such personalized avatars not only requires to model intrapersonal dynamics between a avatar's speech a… ▽ More Non verbal behaviours such as gestures, facial expressions, body posture, and para-linguistic cues have been shown to complement or clarify verbal messages. Hence to improve telepresence, in form of an avatar, it is important to model these behaviours, especially in dyadic interactions. Creating such personalized avatars not only requires to model intrapersonal dynamics between a avatar's speech and their body pose, but it also needs to model interpersonal dynamics with the interlocutor present in the conversation. In this paper, we introduce a neural architecture named Dyadic Residual-Attention Model (DRAM), which integrates intrapersonal (monadic) and interpersonal (dyadic) dynamics using selective attention to generate sequences of body pose conditioned on audio and body pose of the interlocutor and audio of the human operating the avatar. We evaluate our proposed model on dyadic conversational data consisting of pose and audio of both participants, confirming the importance of adaptive attention between monadic and dyadic dynamics when predicting avatar pose. We also conduct a user study to analyze judgments of human observers. Our results confirm that the generated body pose is more natural, models intrapersonal dynamics and interpersonal dynamics better than non-adaptive monadic/dyadic models. △ Less

Submitted 4 October, 2019; originally announced October 2019.

arXiv:1907.01108 [pdf, other]

Language2Pose: Natural Language Grounded Pose Forecasting

Authors: Chaitanya Ahuja, Louis-Philippe Morency

Abstract: Generating animations from natural language sentences finds its applications in a a number of domains such as movie script visualization, virtual human animation and, robot motion planning. These sentences can describe different kinds of actions, speeds and direction of these actions, and possibly a target destination. The core modeling challenge in this language-to-pose application is how to map… ▽ More Generating animations from natural language sentences finds its applications in a a number of domains such as movie script visualization, virtual human animation and, robot motion planning. These sentences can describe different kinds of actions, speeds and direction of these actions, and possibly a target destination. The core modeling challenge in this language-to-pose application is how to map linguistic concepts to motion animations. In this paper, we address this multimodal problem by introducing a neural architecture called Joint Language to Pose (or JL2P), which learns a joint embedding of language and pose. This joint embedding space is learned end-to-end using a curriculum learning approach which emphasizes shorter and easier sequences first before moving to longer and harder ones. We evaluate our proposed model on a publicly available corpus of 3D pose data and human-annotated sentences. Both objective metrics and human judgment evaluation confirm that our proposed approach is able to generate more accurate animations and are deemed visually more representative by humans than other data driven approaches. △ Less

Submitted 27 November, 2019; v1 submitted 1 July, 2019; originally announced July 2019.

arXiv:1809.06215 [pdf]

An Automatic Method for Complete Brain Matter Segmentation from Multislice CT scan

Authors: Soumi Ray, Vinod Kumar, Chirag Ahuja, Niranjan Khandelwal

Abstract: Computed tomography imaging is well accepted for its imaging speed, image contrast & resolution and cost. Thus it has wide use in detection and diagnosis of brain diseases. But unfortunately reported works on CT segmentation is not very significant. In this paper, a robust automatic segmentation system is presented which is capable of segment complete brain matter from CT slices, without any lose… ▽ More Computed tomography imaging is well accepted for its imaging speed, image contrast & resolution and cost. Thus it has wide use in detection and diagnosis of brain diseases. But unfortunately reported works on CT segmentation is not very significant. In this paper, a robust automatic segmentation system is presented which is capable of segment complete brain matter from CT slices, without any lose in information. The proposed method is simple, fast, accurate and completely automatic. It can handle multislice CT scan in single run. From a given multislice CT dataset, one slice is selected automatically to form masks for segmentation. Two types of masks are created to handle nasal slices in a better way. Masks are created from selected reference slice using automatic seed point selection and region growing technique. One mask is designed for brain matter and another includes the skull of the reference slice. This second mask is used as global reference mask for all slices whereas the brain matter mask is implemented on only adjacent slices and continuously modified for better segmentation. Slices in given dataset are divided into two batches, before reference slice and after reference slice. Each batch segmented separately. Successive propagation of brain matter mask has demonstrated very high potential in reported segmentation. Presented result shows highest sensitivity and more than 96% accuracy in all cases. Resulted segmented images can be used for any brain disease diagnosis or further image analysis. △ Less

Submitted 22 October, 2018; v1 submitted 11 September, 2018; originally announced September 2018.

arXiv:1710.02254 [pdf, other]

Lattice Recurrent Unit: Improving Convergence and Statistical Efficiency for Sequence Modeling

Authors: Chaitanya Ahuja, Louis-Philippe Morency

Abstract: Recurrent neural networks have shown remarkable success in modeling sequences. However low resource situations still adversely affect the generalizability of these models. We introduce a new family of models, called Lattice Recurrent Units (LRU), to address the challenge of learning deep multi-layer recurrent models with limited resources. LRU models achieve this goal by creating distinct (but cou… ▽ More Recurrent neural networks have shown remarkable success in modeling sequences. However low resource situations still adversely affect the generalizability of these models. We introduce a new family of models, called Lattice Recurrent Units (LRU), to address the challenge of learning deep multi-layer recurrent models with limited resources. LRU models achieve this goal by creating distinct (but coupled) flow of information inside the units: a first flow along time dimension and a second flow along depth dimension. It also offers a symmetry in how information can flow horizontally and vertically. We analyze the effects of decoupling three different components of our LRU model: Reset Gate, Update Gate and Projected State. We evaluate this family on new LRU models on computational convergence rates and statistical efficiency. Our experiments are performed on four publicly-available datasets, comparing with Grid-LSTM and Recurrent Highway networks. Our results show that LRU has better empirical computational convergence rates and statistical efficiency values, along with learning more accurate language models. △ Less

Submitted 22 November, 2017; v1 submitted 5 October, 2017; originally announced October 2017.

Comments: 8 pages, 7 figures

arXiv:1705.09406 [pdf, other]

Multimodal Machine Learning: A Survey and Taxonomy

Authors: Tadas Baltrušaitis, Chaitanya Ahuja, Louis-Philippe Morency

Abstract: Our experience of the world is multimodal - we see objects, hear sounds, feel texture, smell odors, and taste flavors. Modality refers to the way in which something happens or is experienced and a research problem is characterized as multimodal when it includes multiple such modalities. In order for Artificial Intelligence to make progress in understanding the world around us, it needs to be able… ▽ More Our experience of the world is multimodal - we see objects, hear sounds, feel texture, smell odors, and taste flavors. Modality refers to the way in which something happens or is experienced and a research problem is characterized as multimodal when it includes multiple such modalities. In order for Artificial Intelligence to make progress in understanding the world around us, it needs to be able to interpret such multimodal signals together. Multimodal machine learning aims to build models that can process and relate information from multiple modalities. It is a vibrant multi-disciplinary field of increasing importance and with extraordinary potential. Instead of focusing on specific multimodal applications, this paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy. We go beyond the typical early and late fusion categorization and identify broader challenges that are faced by multimodal machine learning, namely: representation, translation, alignment, fusion, and co-learning. This new taxonomy will enable researchers to better understand the state of the field and identify directions for future research. △ Less

Submitted 1 August, 2017; v1 submitted 25 May, 2017; originally announced May 2017.

arXiv:1411.6741 [pdf, other]

A Complex Matrix Factorization approach to Joint Modeling of Magnitude and Phase for Source Separation

Authors: Chaitanya Ahuja, Karan Nathwani, Rajesh M. Hegde

Abstract: Conventional NMF methods for source separation factorize the matrix of spectral magnitudes. Spectral Phase is not included in the decomposition process of these methods. However, phase of the speech mixture is generally used in reconstructing the target speech signal. This results in undesired traces of interfering sources in the target signal. In this paper the spectral phase is incorporated in t… ▽ More Conventional NMF methods for source separation factorize the matrix of spectral magnitudes. Spectral Phase is not included in the decomposition process of these methods. However, phase of the speech mixture is generally used in reconstructing the target speech signal. This results in undesired traces of interfering sources in the target signal. In this paper the spectral phase is incorporated in the decomposition process itself. Additionally, the complex matrix factorization problem is reduced to an NMF problem using simple transformations. This results in effective separation of speech mixtures since both magnitude and phase are utilized jointly in the separation process. Improvement in source separation results are demonstrated using objective quality evaluations on the GRID corpus. △ Less

Submitted 25 November, 2014; originally announced November 2014.

Comments: 5 pages, 3 figures

Showing 1–10 of 10 results for author: Ahuja, C