Search | arXiv e-print repository

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Authors: Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, Soroosh Mariooryad, Yifan Ding, Xinyang Geng, Fred Alcober, Roy Frostig, Mark Omernick, Lexi Walker, Cosmin Paduraru, Christina Sorokin, Andrea Tacchetti, Colin Gaffney, Samira Daruki, Olcan Sercinoglu, Zach Gleicher, Juliette Love , et al. (1092 additional authors not shown)

Abstract: In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February… ▽ More In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content. △ Less

Submitted 14 June, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

arXiv:2402.05881 [pdf, other]

Localized and Distributed Beyond Diagonal Reconfigurable Intelligent Surfaces with Lossy Interconnections: Modeling and Optimization

Authors: Matteo Nerini, Golsa Ghiaasi, Bruno Clerckx

Abstract: Reconfigurable intelligent surface (RIS) is a key technology to control the communication environment in future wireless networks. Recently, beyond diagonal RIS (BD-RIS) emerged as a generalization of RIS achieving larger coverage through additional tunable impedance components interconnecting the RIS elements. However, conventional RIS and BD-RIS can effectively serve only users in their proximit… ▽ More Reconfigurable intelligent surface (RIS) is a key technology to control the communication environment in future wireless networks. Recently, beyond diagonal RIS (BD-RIS) emerged as a generalization of RIS achieving larger coverage through additional tunable impedance components interconnecting the RIS elements. However, conventional RIS and BD-RIS can effectively serve only users in their proximity, resulting in limited coverage. To overcome this limitation, in this paper, we investigate distributed RIS, whose elements are distributed over a wide region, in opposition to localized RIS commonly considered in the literature. The scaling laws of distributed BD-RIS reveal that it offers significant gains over distributed conventional RIS and localized BD-RIS, enabled by its interconnections allowing signal propagation within the BD-RIS. To assess the practical performance of distributed BD-RIS, we model and optimize BD-RIS with lossy interconnections through transmission line theory. Our model accounts for phase changes and losses over the BD-RIS interconnections arising when the interconnection lengths are not much smaller than the wavelength. Numerical results show that the performance of localized BD-RIS is only slightly impacted by losses, given the short interconnection lengths. Besides, distributed BD-RIS can achieve orders of magnitude of gains over conventional RIS, even in the presence of low losses. △ Less

Submitted 8 February, 2024; originally announced February 2024.

Comments: Submitted to IEEE for publication

arXiv:2312.11805 [pdf, other]

Gemini: A Family of Highly Capable Multimodal Models

Authors: Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee , et al. (1325 additional authors not shown)

Abstract: This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultr… ▽ More This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI. △ Less

Submitted 17 June, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

arXiv:2306.01736 [pdf, other]

DaTaSeg: Taming a Universal Multi-Dataset Multi-Task Segmentation Model

Authors: Xiuye Gu, Yin Cui, Jonathan Huang, Abdullah Rashwan, Xuan Yang, Xingyi Zhou, Golnaz Ghiasi, Weicheng Kuo, Huizhong Chen, Liang-Chieh Chen, David A Ross

Abstract: Observing the close relationship among panoptic, semantic and instance segmentation tasks, we propose to train a universal multi-dataset multi-task segmentation model: DaTaSeg.We use a shared representation (mask proposals with class predictions) for all tasks. To tackle task discrepancy, we adopt different merge operations and post-processing for different tasks. We also leverage weak-supervision… ▽ More Observing the close relationship among panoptic, semantic and instance segmentation tasks, we propose to train a universal multi-dataset multi-task segmentation model: DaTaSeg.We use a shared representation (mask proposals with class predictions) for all tasks. To tackle task discrepancy, we adopt different merge operations and post-processing for different tasks. We also leverage weak-supervision, allowing our segmentation model to benefit from cheaper bounding box annotations. To share knowledge across datasets, we use text embeddings from the same semantic embedding space as classifiers and share all network parameters among datasets. We train DaTaSeg on ADE semantic, COCO panoptic, and Objects365 detection datasets. DaTaSeg improves performance on all datasets, especially small-scale datasets, achieving 54.0 mIoU on ADE semantic and 53.5 PQ on COCO panoptic. DaTaSeg also enables weakly-supervised knowledge transfer on ADE panoptic and Objects365 instance segmentation. Experiments show DaTaSeg scales with the number of training datasets and enables open-vocabulary segmentation through direct transfer. In addition, we annotate an Objects365 instance segmentation set of 1,000 images and will release it as a public benchmark. △ Less

Submitted 2 June, 2023; originally announced June 2023.

arXiv:2203.12683 [pdf, other]

Revisiting Multi-Scale Feature Fusion for Semantic Segmentation

Authors: Tianjian Meng, Golnaz Ghiasi, Reza Mahjourian, Quoc V. Le, Mingxing Tan

Abstract: It is commonly believed that high internal resolution combined with expensive operations (e.g. atrous convolutions) are necessary for accurate semantic segmentation, resulting in slow speed and large memory usage. In this paper, we question this belief and demonstrate that neither high internal resolution nor atrous convolutions are necessary. Our intuition is that although segmentation is a dense… ▽ More It is commonly believed that high internal resolution combined with expensive operations (e.g. atrous convolutions) are necessary for accurate semantic segmentation, resulting in slow speed and large memory usage. In this paper, we question this belief and demonstrate that neither high internal resolution nor atrous convolutions are necessary. Our intuition is that although segmentation is a dense per-pixel prediction task, the semantics of each pixel often depend on both nearby neighbors and far-away context; therefore, a more powerful multi-scale feature fusion network plays a critical role. Following this intuition, we revisit the conventional multi-scale feature space (typically capped at P5) and extend it to a much richer space, up to P9, where the smallest features are only 1/512 of the input size and thus have very large receptive fields. To process such a rich feature space, we leverage the recent BiFPN to fuse the multi-scale features. Based on these insights, we develop a simplified segmentation model, named ESeg, which has neither high internal resolution nor expensive atrous convolutions. Perhaps surprisingly, our simple method can achieve better accuracy with faster speed than prior art across multiple datasets. In real-time settings, ESeg-Lite-S achieves 76.0% mIoU on CityScapes [12] at 189 FPS, outperforming FasterSeg [9] (73.1% mIoU at 170 FPS). Our ESeg-Lite-L runs at 79 FPS and achieves 80.1% mIoU, largely closing the gap between real-time and high-performance segmentation models. △ Less

Submitted 14 June, 2022; v1 submitted 23 March, 2022; originally announced March 2022.

arXiv:2112.12143 [pdf, other]

Scaling Open-Vocabulary Image Segmentation with Image-Level Labels

Authors: Golnaz Ghiasi, Xiuye Gu, Yin Cui, Tsung-Yi Lin

Abstract: We design an open-vocabulary image segmentation model to organize an image into meaningful regions indicated by arbitrary texts. Recent works (CLIP and ALIGN), despite attaining impressive open-vocabulary classification accuracy with image-level caption labels, are unable to segment visual concepts with pixels. We argue that these models miss an important step of visual grou**, which organizes p… ▽ More We design an open-vocabulary image segmentation model to organize an image into meaningful regions indicated by arbitrary texts. Recent works (CLIP and ALIGN), despite attaining impressive open-vocabulary classification accuracy with image-level caption labels, are unable to segment visual concepts with pixels. We argue that these models miss an important step of visual grou**, which organizes pixels into groups before learning visual-semantic alignments. We propose OpenSeg to address the above issue while still making use of scalable image-level supervision of captions. First, it learns to propose segmentation masks for possible organizations. Then it learns visual-semantic alignments by aligning each word in a caption to one or a few predicted masks. We find the mask representations are the key to support learning image segmentation from captions, making it possible to scale up the dataset and vocabulary sizes. OpenSeg significantly outperforms the recent open-vocabulary method of LSeg by +19.9 mIoU on PASCAL dataset, thanks to its scalability. △ Less

Submitted 20 July, 2022; v1 submitted 22 December, 2021; originally announced December 2021.

Comments: Accepted at ECCV 2022

arXiv:2111.10050 [pdf, other]

Combined Scaling for Zero-shot Transfer Learning

Authors: Hieu Pham, Zihang Dai, Golnaz Ghiasi, Kenji Kawaguchi, Hanxiao Liu, Adams Wei Yu, Jiahui Yu, Yi-Ting Chen, Minh-Thang Luong, Yonghui Wu, Mingxing Tan, Quoc V. Le

Abstract: We present a combined scaling method - named BASIC - that achieves 85.7% top-1 accuracy on the ImageNet ILSVRC-2012 validation set without learning from any labeled ImageNet example. This accuracy surpasses best published similar models - CLIP and ALIGN - by 9.3%. Our BASIC model also shows significant improvements in robustness benchmarks. For instance, on 5 test sets with natural distribution sh… ▽ More We present a combined scaling method - named BASIC - that achieves 85.7% top-1 accuracy on the ImageNet ILSVRC-2012 validation set without learning from any labeled ImageNet example. This accuracy surpasses best published similar models - CLIP and ALIGN - by 9.3%. Our BASIC model also shows significant improvements in robustness benchmarks. For instance, on 5 test sets with natural distribution shifts such as ImageNet-{A,R,V2,Sketch} and ObjectNet, our model achieves 84.3% top-1 average accuracy, only a small drop from its original ImageNet accuracy. To achieve these results, we scale up the contrastive learning framework of CLIP and ALIGN in three dimensions: data size, model size, and batch size. Our dataset has 6.6B noisy image-text pairs, which is 4x larger than ALIGN, and 16x larger than CLIP. Our largest model has 3B weights, which is 3.75x larger in parameters and 8x larger in FLOPs than ALIGN and CLIP. Finally, our batch size is 65536 which is 2x more than CLIP and 4x more than ALIGN. We encountered two main challenges with the scaling rules of BASIC. First, the main challenge with implementing the combined scaling rules of BASIC is the limited memory of accelerators, such as GPUs and TPUs. To overcome the memory limit, we propose two simple methods which make use of gradient checkpointing and model parallelism. Second, while increasing the dataset size and the model size has been the defacto method to improve the performance of deep learning models like BASIC, the effect of a large contrastive batch size on such contrastive-trained image-text models is not well-understood. To shed light on the benefits of large contrastive batch sizes, we develop a theoretical framework which shows that larger contrastive batch sizes lead to smaller generalization gaps for image-text models such as BASIC. △ Less

Submitted 12 April, 2023; v1 submitted 19 November, 2021; originally announced November 2021.

arXiv:2108.11353 [pdf, other]

Multi-Task Self-Training for Learning General Representations

Authors: Golnaz Ghiasi, Barret Zoph, Ekin D. Cubuk, Quoc V. Le, Tsung-Yi Lin

Abstract: Despite the fast progress in training specialized models for various tasks, learning a single general model that works well for many tasks is still challenging for computer vision. Here we introduce multi-task self-training (MuST), which harnesses the knowledge in independent specialized teacher models (e.g., ImageNet model on classification) to train a single general student model. Our approach h… ▽ More Despite the fast progress in training specialized models for various tasks, learning a single general model that works well for many tasks is still challenging for computer vision. Here we introduce multi-task self-training (MuST), which harnesses the knowledge in independent specialized teacher models (e.g., ImageNet model on classification) to train a single general student model. Our approach has three steps. First, we train specialized teachers independently on labeled datasets. We then use the specialized teachers to label an unlabeled dataset to create a multi-task pseudo labeled dataset. Finally, the dataset, which now contains pseudo labels from teacher models trained on different datasets/tasks, is then used to train a student model with multi-task learning. We evaluate the feature representations of the student model on 6 vision tasks including image recognition (classification, detection, segmentation)and 3D geometry estimation (depth and surface normal estimation). MuST is scalable with unlabeled or partially labeled datasets and outperforms both specialized supervised models and self-supervised models when training on large scale datasets. Lastly, we show MuST can improve upon already strong checkpoints trained with billions of examples. The results suggest self-training is a promising direction to aggregate labeled and unlabeled training data for learning general feature representations. △ Less

Submitted 25 August, 2021; originally announced August 2021.

Comments: ICCV 2021

arXiv:2106.06080 [pdf, other]

Gradual Domain Adaptation in the Wild:When Intermediate Distributions are Absent

Authors: Samira Abnar, Rianne van den Berg, Golnaz Ghiasi, Mostafa Dehghani, Nal Kalchbrenner, Hanie Sedghi

Abstract: We focus on the problem of domain adaptation when the goal is shifting the model towards the target distribution, rather than learning domain invariant representations. It has been shown that under the following two assumptions: (a) access to samples from intermediate distributions, and (b) samples being annotated with the amount of change from the source distribution, self-training can be success… ▽ More We focus on the problem of domain adaptation when the goal is shifting the model towards the target distribution, rather than learning domain invariant representations. It has been shown that under the following two assumptions: (a) access to samples from intermediate distributions, and (b) samples being annotated with the amount of change from the source distribution, self-training can be successfully applied on gradually shifted samples to adapt the model toward the target distribution. We hypothesize having (a) is enough to enable iterative self-training to slowly adapt the model to the target distribution, by making use of an implicit curriculum. In the case where (a) does not hold, we observe that iterative self-training falls short. We propose GIFT, a method that creates virtual samples from intermediate distributions by interpolating representations of examples from source and target domains. We evaluate an iterative-self-training method on datasets with natural distribution shifts, and show that when applied on top of other domain adaptation methods, it improves the performance of the model on the target dataset. We run an analysis on a synthetic dataset to show that in the presence of (a) iterative-self-training naturally forms a curriculum of samples. Furthermore, we show that when (a) does not hold, GIFT performs better than iterative self-training. △ Less

Submitted 13 July, 2021; v1 submitted 10 June, 2021; originally announced June 2021.

arXiv:2012.07177 [pdf, other]

Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation

Authors: Golnaz Ghiasi, Yin Cui, Aravind Srinivas, Rui Qian, Tsung-Yi Lin, Ekin D. Cubuk, Quoc V. Le, Barret Zoph

Abstract: Building instance segmentation models that are data-efficient and can handle rare object categories is an important challenge in computer vision. Leveraging data augmentations is a promising direction towards addressing this challenge. Here, we perform a systematic study of the Copy-Paste augmentation ([13, 12]) for instance segmentation where we randomly paste objects onto an image. Prior studies… ▽ More Building instance segmentation models that are data-efficient and can handle rare object categories is an important challenge in computer vision. Leveraging data augmentations is a promising direction towards addressing this challenge. Here, we perform a systematic study of the Copy-Paste augmentation ([13, 12]) for instance segmentation where we randomly paste objects onto an image. Prior studies on Copy-Paste relied on modeling the surrounding visual context for pasting the objects. However, we find that the simple mechanism of pasting objects randomly is good enough and can provide solid gains on top of strong baselines. Furthermore, we show Copy-Paste is additive with semi-supervised methods that leverage extra data through pseudo labeling (e.g. self-training). On COCO instance segmentation, we achieve 49.1 mask AP and 57.3 box AP, an improvement of +0.6 mask AP and +1.5 box AP over the previous state-of-the-art. We further demonstrate that Copy-Paste can lead to significant improvements on the LVIS benchmark. Our baseline model outperforms the LVIS 2020 Challenge winning entry by +3.6 mask AP on rare categories. △ Less

Submitted 23 June, 2021; v1 submitted 13 December, 2020; originally announced December 2020.

Comments: Accepted at CVPR 2021 (Oral)

arXiv:2006.06882 [pdf, other]

Rethinking Pre-training and Self-training

Authors: Barret Zoph, Golnaz Ghiasi, Tsung-Yi Lin, Yin Cui, Hanxiao Liu, Ekin D. Cubuk, Quoc V. Le

Abstract: Pre-training is a dominant paradigm in computer vision. For example, supervised ImageNet pre-training is commonly used to initialize the backbones of object detection and segmentation models. He et al., however, show a surprising result that ImageNet pre-training has limited impact on COCO object detection. Here we investigate self-training as another method to utilize additional data on the same… ▽ More Pre-training is a dominant paradigm in computer vision. For example, supervised ImageNet pre-training is commonly used to initialize the backbones of object detection and segmentation models. He et al., however, show a surprising result that ImageNet pre-training has limited impact on COCO object detection. Here we investigate self-training as another method to utilize additional data on the same setup and contrast it against ImageNet pre-training. Our study reveals the generality and flexibility of self-training with three additional insights: 1) stronger data augmentation and more labeled data further diminish the value of pre-training, 2) unlike pre-training, self-training is always helpful when using stronger data augmentation, in both low-data and high-data regimes, and 3) in the case that pre-training is helpful, self-training improves upon pre-training. For example, on the COCO object detection dataset, pre-training benefits when we use one fifth of the labeled data, and hurts accuracy when we use all labeled data. Self-training, on the other hand, shows positive improvements from +1.3 to +3.4AP across all dataset sizes. In other words, self-training works well exactly on the same setup that pre-training does not work (using ImageNet to help COCO). On the PASCAL segmentation dataset, which is a much smaller dataset than COCO, though pre-training does help significantly, self-training improves upon the pre-trained model. On COCO object detection, we achieve 54.3AP, an improvement of +1.5AP over the strongest SpineNet model. On PASCAL segmentation, we achieve 90.5 mIOU, an improvement of +1.5% mIOU over the previous state-of-the-art result by DeepLabv3+. △ Less

Submitted 15 November, 2020; v1 submitted 11 June, 2020; originally announced June 2020.

Comments: Accepted for publication at the Thirty-fourth Conference on Neural Information Processing Systems (NeurIPS 2020)

arXiv:2005.04959 [pdf, other]

A Bandwidth Scalable Millimetre Wave Over-The-Air Test System with Low Complexity

Authors: Erich Zöchmann, Terje Mathiesen, Thomas Blazek, Herbert Groll, Golsa Ghiaasi

Abstract: In this work, we show the design and validation of a testbed for over-the-air testing of millimetre waves equipments. We have extended the frequency capabilities of a baseband channel emulator which is capable of emulating non-stationary channels, to higher frequencies. To assure that the propagation between devices-under-test and the RF-frontend of the emulator is only a line-of-sight link, we ha… ▽ More In this work, we show the design and validation of a testbed for over-the-air testing of millimetre waves equipments. We have extended the frequency capabilities of a baseband channel emulator which is capable of emulating non-stationary channels, to higher frequencies. To assure that the propagation between devices-under-test and the RF-frontend of the emulator is only a line-of-sight link, we have isolated the devices-under-test in two anechoic chambers. We characterize the testbed at 57 GHz by means of frequency sweeps for two artificial cases: when the emulator is replicating a one-tap channel and a two-tap channel. Finally we demonstrate the system functionality through reproduction of channels which were acquired during a vehicular millimetre wave measurement campaign conducted in Vienna in 2018. △ Less

Submitted 11 May, 2020; originally announced May 2020.

Comments: EuCap2020

arXiv:2001.08068 [pdf, other]

doi 10.23919/EuCAP48036.2020.9135714

A Hardware-in-the-Loop Evaluation of the Impact of the V2X Channel on the Traffic-Safety Versus Efficiency Trade-offs

Authors: Alessandro Bazzi, Thomas Blazek, Michele Menarini, Barbara M. Masini, Alberto Zanella, Christoph Mecklenbrauker, Golsa Ghiaasi

Abstract: Vehicles are increasingly becoming connected and short-range wireless communications promise to introduce a radical change in the drivers' behaviors. Among the main use cases, the intersection management is surely one of those that could mostly impact on both traffic safety and efficiency. In this work, we consider an intersection collision warning application and exploit an hardware-in-the-loop (… ▽ More Vehicles are increasingly becoming connected and short-range wireless communications promise to introduce a radical change in the drivers' behaviors. Among the main use cases, the intersection management is surely one of those that could mostly impact on both traffic safety and efficiency. In this work, we consider an intersection collision warning application and exploit an hardware-in-the-loop (HIL) platform to verify the impact on the risk of accidents as well as the average time to travel a given distance. Besides including real ITS-G5 compliant message exchanges, the platform also includes a channel emulator with real signals. Results show that the risk of collisions can be drastically reduced, with an overall trade-off between safety and traffic efficiency. At the same time, it is shown that the presence of real channel conditions cannot guarantee the same condition of zero-risk as with ideal channel propagation, remarking the importance of channel conditions and signal processing. △ Less

Submitted 22 January, 2020; originally announced January 2020.

arXiv:1912.05027 [pdf, other]

SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization

Authors: Xianzhi Du, Tsung-Yi Lin, Pengchong **, Golnaz Ghiasi, Mingxing Tan, Yin Cui, Quoc V. Le, Xiaodan Song

Abstract: Convolutional neural networks typically encode an input image into a series of intermediate features with decreasing resolutions. While this structure is suited to classification tasks, it does not perform well for tasks requiring simultaneous recognition and localization (e.g., object detection). The encoder-decoder architectures are proposed to resolve this by applying a decoder network onto a b… ▽ More Convolutional neural networks typically encode an input image into a series of intermediate features with decreasing resolutions. While this structure is suited to classification tasks, it does not perform well for tasks requiring simultaneous recognition and localization (e.g., object detection). The encoder-decoder architectures are proposed to resolve this by applying a decoder network onto a backbone model designed for classification tasks. In this paper, we argue encoder-decoder architecture is ineffective in generating strong multi-scale features because of the scale-decreased backbone. We propose SpineNet, a backbone with scale-permuted intermediate features and cross-scale connections that is learned on an object detection task by Neural Architecture Search. Using similar building blocks, SpineNet models outperform ResNet-FPN models by ~3% AP at various scales while using 10-20% fewer FLOPs. In particular, SpineNet-190 achieves 52.5% AP with a MaskR-CNN detector and achieves 52.1% AP with a RetinaNet detector on COCO for a single model without test-time augmentation, significantly outperforms prior art of detectors. SpineNet can transfer to classification tasks, achieving 5% top-1 accuracy improvement on a challenging iNaturalist fine-grained dataset. Code is at: https://github.com/tensorflow/tpu/tree/master/models/official/detection. △ Less

Submitted 17 June, 2020; v1 submitted 10 December, 2019; originally announced December 2019.

Comments: CVPR 2020

arXiv:1912.01106 [pdf, other]

MnasFPN: Learning Latency-aware Pyramid Architecture for Object Detection on Mobile Devices

Authors: Bo Chen, Golnaz Ghiasi, Hanxiao Liu, Tsung-Yi Lin, Dmitry Kalenichenko, Hartwig Adams, Quoc V. Le

Abstract: Despite the blooming success of architecture search for vision tasks in resource-constrained environments, the design of on-device object detection architectures have mostly been manual. The few automated search efforts are either centered around non-mobile-friendly search spaces or not guided by on-device latency. We propose MnasFPN, a mobile-friendly search space for the detection head, and comb… ▽ More Despite the blooming success of architecture search for vision tasks in resource-constrained environments, the design of on-device object detection architectures have mostly been manual. The few automated search efforts are either centered around non-mobile-friendly search spaces or not guided by on-device latency. We propose MnasFPN, a mobile-friendly search space for the detection head, and combine it with latency-aware architecture search to produce efficient object detection models. The learned MnasFPN head, when paired with MobileNetV2 body, outperforms MobileNetV3+SSDLite by 1.8 mAP at similar latency on Pixel. It is also both 1.0 mAP more accurate and 10% faster than NAS-FPNLite. Ablation studies show that the majority of the performance gain comes from innovations in the search space. Further explorations reveal an interesting coupling between the search space design and the search algorithm, and that the complexity of MnasFPN search space may be at a local optimum. △ Less

Submitted 30 July, 2020; v1 submitted 2 December, 2019; originally announced December 2019.

Comments: 10 pages, 7 figures

arXiv:1906.11172 [pdf, other]

Learning Data Augmentation Strategies for Object Detection

Authors: Barret Zoph, Ekin D. Cubuk, Golnaz Ghiasi, Tsung-Yi Lin, Jonathon Shlens, Quoc V. Le

Abstract: Data augmentation is a critical component of training deep learning models. Although data augmentation has been shown to significantly improve image classification, its potential has not been thoroughly investigated for object detection. Given the additional cost for annotating images for object detection, data augmentation may be of even greater importance for this computer vision task. In this w… ▽ More Data augmentation is a critical component of training deep learning models. Although data augmentation has been shown to significantly improve image classification, its potential has not been thoroughly investigated for object detection. Given the additional cost for annotating images for object detection, data augmentation may be of even greater importance for this computer vision task. In this work, we study the impact of data augmentation on object detection. We first demonstrate that data augmentation operations borrowed from image classification may be helpful for training detection models, but the improvement is limited. Thus, we investigate how learned, specialized data augmentation policies improve generalization performance for detection models. Importantly, these augmentation policies only affect training and leave a trained model unchanged during evaluation. Experiments on the COCO dataset indicate that an optimized data augmentation policy improves detection accuracy by more than +2.3 mAP, and allow a single inference model to achieve a state-of-the-art accuracy of 50.7 mAP. Importantly, the best policy found on COCO may be transferred unchanged to other detection datasets and models to improve predictive accuracy. For example, the best augmentation policy identified with COCO improves a strong baseline on PASCAL-VOC by +2.7 mAP. Our results also reveal that a learned augmentation policy is superior to state-of-the-art architecture regularization methods for object detection, even when considering strong baselines. Code for training with the learned policy is available online at https://github.com/tensorflow/tpu/tree/master/models/official/detection △ Less

Submitted 26 June, 2019; originally announced June 2019.

arXiv:1904.07392 [pdf, other]

NAS-FPN: Learning Scalable Feature Pyramid Architecture for Object Detection

Authors: Golnaz Ghiasi, Tsung-Yi Lin, Ruoming Pang, Quoc V. Le

Abstract: Current state-of-the-art convolutional architectures for object detection are manually designed. Here we aim to learn a better architecture of feature pyramid network for object detection. We adopt Neural Architecture Search and discover a new feature pyramid architecture in a novel scalable search space covering all cross-scale connections. The discovered architecture, named NAS-FPN, consists of… ▽ More Current state-of-the-art convolutional architectures for object detection are manually designed. Here we aim to learn a better architecture of feature pyramid network for object detection. We adopt Neural Architecture Search and discover a new feature pyramid architecture in a novel scalable search space covering all cross-scale connections. The discovered architecture, named NAS-FPN, consists of a combination of top-down and bottom-up connections to fuse features across scales. NAS-FPN, combined with various backbone models in the RetinaNet framework, achieves better accuracy and latency tradeoff compared to state-of-the-art object detection models. NAS-FPN improves mobile detection accuracy by 2 AP compared to state-of-the-art SSDLite with MobileNetV2 model in [32] and achieves 48.3 AP which surpasses Mask R-CNN [10] detection accuracy with less computation time. △ Less

Submitted 15 April, 2019; originally announced April 2019.

Comments: Accepted at CVPR 2019

arXiv:1812.05463 [pdf, other]

doi 10.1109/PIMRC.2018.8581026

Measured Channel Hardening in an Indoor Multiband Scenario

Authors: Golsa Ghiaasi, Jens Abraham, Egil Eide, Torbjörn Ekman

Abstract: A study of channel hardening in a large-scale antenna system has been carried out by means of indoor channel measurements over four frequency bands, namely 1.472 GHz, 2.6 GHz, 3.82 GHz and 4.16 GHz. NTNU's Reconfigurable Radio Network Platform has been used to record the channel estimates for 40 single user non-line of sight radio links to a 64 element wide-band antenna array. By examining the rms… ▽ More A study of channel hardening in a large-scale antenna system has been carried out by means of indoor channel measurements over four frequency bands, namely 1.472 GHz, 2.6 GHz, 3.82 GHz and 4.16 GHz. NTNU's Reconfigurable Radio Network Platform has been used to record the channel estimates for 40 single user non-line of sight radio links to a 64 element wide-band antenna array. By examining the rms delay spread and the ratio of the normalized subcarrier to average SISO link power of the radio channel received by a single user after combination, the hardening of this equivalent channel is analyzed for various numbers of nodes. The channel hardening merits show consistent behaviour throughout the frequency bands. By combining 16 antennas the rms delay spread of the equivalent channel is reduced from above 100 ns to below 35 ns with significantly reduced variation in the channel power. △ Less

Submitted 13 December, 2018; originally announced December 2018.

Comments: 6 pages, 7 figures, presented at IEEE PIMRC 2018

arXiv:1811.08560 [pdf, other]

Adjustable Real-time Style Transfer

Authors: Mohammad Babaeizadeh, Golnaz Ghiasi

Abstract: Artistic style transfer is the problem of synthesizing an image with content similar to a given image and style similar to another. Although recent feed-forward neural networks can generate stylized images in real-time, these models produce a single stylization given a pair of style/content images, and the user doesn't have control over the synthesized output. Moreover, the style transfer depends… ▽ More Artistic style transfer is the problem of synthesizing an image with content similar to a given image and style similar to another. Although recent feed-forward neural networks can generate stylized images in real-time, these models produce a single stylization given a pair of style/content images, and the user doesn't have control over the synthesized output. Moreover, the style transfer depends on the hyper-parameters of the model with varying "optimum" for different input images. Therefore, if the stylized output is not appealing to the user, she/he has to try multiple models or retrain one with different hyper-parameters to get a favorite stylization. In this paper, we address these issues by proposing a novel method which allows adjustment of crucial hyper-parameters, after the training and in real-time, through a set of manually adjustable parameters. These parameters enable the user to modify the synthesized outputs from the same pair of style/content images, in search of a favorite stylized image. Our quantitative and qualitative experiments indicate how adjusting these parameters is comparable to retraining the model with different hyper-parameters. We also demonstrate how these parameters can be randomized to generate results which are diverse but still very similar in style and content. △ Less

Submitted 20 November, 2018; originally announced November 2018.

arXiv:1810.12890 [pdf, other]

DropBlock: A regularization method for convolutional networks

Authors: Golnaz Ghiasi, Tsung-Yi Lin, Quoc V. Le

Abstract: Deep neural networks often work well when they are over-parameterized and trained with a massive amount of noise and regularization, such as weight decay and dropout. Although dropout is widely used as a regularization technique for fully connected layers, it is often less effective for convolutional layers. This lack of success of dropout for convolutional layers is perhaps due to the fact that a… ▽ More Deep neural networks often work well when they are over-parameterized and trained with a massive amount of noise and regularization, such as weight decay and dropout. Although dropout is widely used as a regularization technique for fully connected layers, it is often less effective for convolutional layers. This lack of success of dropout for convolutional layers is perhaps due to the fact that activation units in convolutional layers are spatially correlated so information can still flow through convolutional networks despite dropout. Thus a structured form of dropout is needed to regularize convolutional networks. In this paper, we introduce DropBlock, a form of structured dropout, where units in a contiguous region of a feature map are dropped together. We found that applying DropbBlock in skip connections in addition to the convolution layers increases the accuracy. Also, gradually increasing number of dropped units during training leads to better accuracy and more robust to hyperparameter choices. Extensive experiments show that DropBlock works better than dropout in regularizing convolutional networks. On ImageNet classification, ResNet-50 architecture with DropBlock achieves $78.13\%$ accuracy, which is more than $1.6\%$ improvement on the baseline. On COCO detection, DropBlock improves Average Precision of RetinaNet from $36.8\%$ to $38.4\%$. △ Less

Submitted 30 October, 2018; originally announced October 2018.

Comments: Accepted at NIPS 2018

arXiv:1705.06830 [pdf, other]

Exploring the structure of a real-time, arbitrary neural artistic stylization network

Authors: Golnaz Ghiasi, Honglak Lee, Manjunath Kudlur, Vincent Dumoulin, Jonathon Shlens

Abstract: In this paper, we present a method which combines the flexibility of the neural algorithm of artistic style with the speed of fast style transfer networks to allow real-time stylization using any content/style image pair. We build upon recent work leveraging conditional instance normalization for multi-style transfer networks by learning to predict the conditional instance normalization parameters… ▽ More In this paper, we present a method which combines the flexibility of the neural algorithm of artistic style with the speed of fast style transfer networks to allow real-time stylization using any content/style image pair. We build upon recent work leveraging conditional instance normalization for multi-style transfer networks by learning to predict the conditional instance normalization parameters directly from a style image. The model is successfully trained on a corpus of roughly 80,000 paintings and is able to generalize to paintings previously unobserved. We demonstrate that the learned embedding space is smooth and contains a rich structure and organizes semantic information associated with paintings in an entirely unsupervised manner. △ Less

Submitted 24 August, 2017; v1 submitted 18 May, 2017; originally announced May 2017.

Comments: Accepted as an oral presentation at British Machine Vision Conference (BMVC) 2017

arXiv:1605.02264 [pdf, other]

Laplacian Pyramid Reconstruction and Refinement for Semantic Segmentation

Authors: Golnaz Ghiasi, Charless C. Fowlkes

Abstract: CNN architectures have terrific recognition performance but rely on spatial pooling which makes it difficult to adapt them to tasks that require dense, pixel-accurate labeling. This paper makes two contributions: (1) We demonstrate that while the apparent spatial resolution of convolutional feature maps is low, the high-dimensional feature representation contains significant sub-pixel localization… ▽ More CNN architectures have terrific recognition performance but rely on spatial pooling which makes it difficult to adapt them to tasks that require dense, pixel-accurate labeling. This paper makes two contributions: (1) We demonstrate that while the apparent spatial resolution of convolutional feature maps is low, the high-dimensional feature representation contains significant sub-pixel localization information. (2) We describe a multi-resolution reconstruction architecture based on a Laplacian pyramid that uses skip connections from higher resolution feature maps and multiplicative gating to successively refine segment boundaries reconstructed from lower-resolution maps. This approach yields state-of-the-art semantic segmentation results on the PASCAL VOC and Cityscapes segmentation benchmarks without resorting to more complex random-field inference or instance detection driven architectures. △ Less

Submitted 30 July, 2016; v1 submitted 7 May, 2016; originally announced May 2016.

arXiv:1506.08347 [pdf, other]

Occlusion Coherence: Detecting and Localizing Occluded Faces

Authors: Golnaz Ghiasi, Charless C. Fowlkes

Abstract: The presence of occluders significantly impacts object recognition accuracy. However, occlusion is typically treated as an unstructured source of noise and explicit models for occluders have lagged behind those for object appearance and shape. In this paper we describe a hierarchical deformable part model for face detection and landmark localization that explicitly models part occlusion. The propo… ▽ More The presence of occluders significantly impacts object recognition accuracy. However, occlusion is typically treated as an unstructured source of noise and explicit models for occluders have lagged behind those for object appearance and shape. In this paper we describe a hierarchical deformable part model for face detection and landmark localization that explicitly models part occlusion. The proposed model structure makes it possible to augment positive training data with large numbers of synthetically occluded instances. This allows us to easily incorporate the statistics of occlusion patterns in a discriminatively trained model. We test the model on several benchmarks for landmark localization and detection including challenging new data sets featuring significant occlusion. We find that the addition of an explicit occlusion model yields a detection system that outperforms existing approaches for occluded instances while maintaining competitive accuracy in detection and landmark localization for unoccluded instances. △ Less

Submitted 24 August, 2016; v1 submitted 27 June, 2015; originally announced June 2015.

Showing 1–23 of 23 results for author: Ghiaasi, G