-
Enhancing Performance of Vision Transformers on Small Datasets through Local Inductive Bias Incorporation
Authors:
Ibrahim Batuhan Akkaya,
Senthilkumar S. Kathiresan,
Elahe Arani,
Bahram Zonooz
Abstract:
Vision transformers (ViTs) achieve remarkable performance on large datasets, but tend to perform worse than convolutional neural networks (CNNs) when trained from scratch on smaller datasets, possibly due to a lack of local inductive bias in the architecture. Recent studies have therefore added locality to the architecture and demonstrated that it can help ViTs achieve performance comparable to CNNs in the small-size dataset regime. Existing methods, however, are architecture-specific or have higher computational and memory costs. Thus, we propose a module called Local InFormation Enhancer (LIFE) that extracts patch-level local information and incorporates it into the embeddings used in the self-attention block of ViTs. Our proposed module is memory and computation efficient, as well as flexible enough to process auxiliary tokens such as the classification and distillation tokens. Empirical results show that the addition of the LIFE module improves the performance of ViTs on small image classification datasets. We further demonstrate how the effect can be extended to downstream tasks, such as object detection and semantic segmentation. In addition, we introduce a new visualization method, Dense Attention Roll-Out, specifically designed for dense prediction tasks, allowing the generation of class-specific attention maps utilizing the attention maps of all tokens.
Submitted 15 May, 2023;
originally announced May 2023.
-
GPT-4 Technical Report
Authors:
OpenAI,
Josh Achiam,
Steven Adler,
Sandhini Agarwal,
Lama Ahmad,
Ilge Akkaya,
Florencia Leoni Aleman,
Diogo Almeida,
Janko Altenschmidt,
Sam Altman,
Shyamal Anadkat,
Red Avila,
Igor Babuschkin,
Suchir Balaji,
Valerie Balcom,
Paul Baltescu,
Haiming Bao,
Mohammad Bavarian,
Jeff Belgum,
Irwan Bello,
Jake Berdine,
Gabriel Bernadett-Shapiro,
Christopher Berner,
Lenny Bogdonoff,
Oleg Boiko
, et al. (256 additional authors not shown)
Abstract:
We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based model pre-trained to predict the next token in a document. The post-training alignment process results in improved performance on measures of factuality and adherence to desired behavior. A core component of this project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to accurately predict some aspects of GPT-4's performance based on models trained with no more than 1/1,000th the compute of GPT-4.
Submitted 4 March, 2024; v1 submitted 15 March, 2023;
originally announced March 2023.
-
Self-training via Metric Learning for Source-Free Domain Adaptation of Semantic Segmentation
Authors:
Ibrahim Batuhan Akkaya,
Ugur Halici
Abstract:
Unsupervised source-free domain adaptation methods aim to train a model for the target domain utilizing a pretrained source-domain model and unlabeled target-domain data, particularly when accessibility to source data is restricted due to intellectual property or privacy concerns. Traditional methods usually use self-training with pseudo-labeling, which is often subjected to thresholding based on prediction confidence. However, such thresholding limits the effectiveness of self-training due to insufficient supervision. This issue becomes more severe in a source-free setting, where supervision comes solely from the predictions of the pre-trained source model. In this study, we propose a novel approach by incorporating a mean-teacher model, wherein the student network is trained using all predictions from the teacher network. Instead of employing thresholding on predictions, we introduce a method to weight the gradients calculated from pseudo-labels based on the reliability of the teacher's predictions. To assess reliability, we introduce a novel approach using proxy-based metric learning. Our method is evaluated in synthetic-to-real and cross-city scenarios, demonstrating superior performance compared to existing state-of-the-art methods.
Submitted 9 April, 2024; v1 submitted 8 December, 2022;
originally announced December 2022.
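
The self-training scheme in this abstract (a mean-teacher model whose pseudo-labels are all used, with gradients weighted by reliability instead of thresholded) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: `ema_update`, `weighted_pseudo_label_loss`, the `momentum` value, and the `reliability` list are hypothetical, and the reliability scores are simply given as numbers in [0, 1], whereas the paper derives them from proxy-based metric learning, which is not reproduced. Scaling each sample's loss by a weight scales its gradient by the same weight.

```python
import math


def ema_update(teacher, student, momentum=0.99):
    """Mean-teacher step: move each teacher weight toward the student weight."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_pseudo_label_loss(student_logits, teacher_logits, reliability):
    """Cross-entropy against the teacher's argmax labels.

    Every prediction is used (no confidence threshold); each sample's
    loss, and therefore its gradient, is scaled by its reliability weight.
    """
    total = 0.0
    for s, t, w in zip(student_logits, teacher_logits, reliability):
        label = max(range(len(t)), key=t.__getitem__)  # teacher pseudo-label
        total += -w * math.log(softmax(s)[label])
    return total / len(student_logits)
```

With a reliability of 0 a sample contributes nothing, and with a reliability of 1 it contributes the ordinary cross-entropy, so hard thresholding is the special case where weights are restricted to 0 and 1.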
-
Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos
Authors:
Bowen Baker,
Ilge Akkaya,
Peter Zhokhov,
Joost Huizinga,
Jie Tang,
Adrien Ecoffet,
Brandon Houghton,
Raul Sampedro,
Jeff Clune
Abstract:
Pretraining on noisy, internet-scale datasets has been heavily studied as a technique for training models with broad, general capabilities for text, images, and other modalities. However, for many sequential decision domains such as robotics, video games, and computer use, publicly available data does not contain the labels required to train behavioral priors in the same way. We extend the internet-scale pretraining paradigm to sequential decision domains through semi-supervised imitation learning wherein agents learn to act by watching online unlabeled videos. Specifically, we show that with a small amount of labeled data we can train an inverse dynamics model accurate enough to label a huge unlabeled source of online data -- here, online videos of people playing Minecraft -- from which we can then train a general behavioral prior. Despite using the native human interface (mouse and keyboard at 20Hz), we show that this behavioral prior has nontrivial zero-shot capabilities and that it can be fine-tuned, with both imitation learning and reinforcement learning, to hard-exploration tasks that are impossible to learn from scratch via reinforcement learning. For many tasks our models exhibit human-level performance, and we are the first to report computer agents that can craft diamond tools, which can take proficient humans upwards of 20 minutes (24,000 environment actions) of gameplay to accomplish.
Submitted 23 June, 2022;
originally announced June 2022.
-
Focus-and-Detect: A Small Object Detection Framework for Aerial Images
Authors:
Onur Can Koyun,
Reyhan Kevser Keser,
İbrahim Batuhan Akkaya,
Behçet Uğur Töreyin
Abstract:
Despite recent advances, object detection in aerial images is still a challenging task. Specific problems in aerial images make the detection problem harder, such as small objects, densely packed objects, and objects of different sizes and orientations. To address the small object detection problem, we propose a two-stage object detection framework called "Focus-and-Detect". The first stage, which consists of an object detector network supervised by a Gaussian Mixture Model, generates clusters of objects constituting the focused regions. The second stage, which is also an object detector network, predicts objects within the focal regions. An Incomplete Box Suppression (IBS) method is also proposed to overcome the truncation effect of the region search approach. Results indicate that the proposed two-stage framework achieves an AP score of 42.06 on the VisDrone validation dataset, surpassing all other state-of-the-art small object detection methods reported in the literature, to the best of the authors' knowledge.
Submitted 24 March, 2022;
originally announced March 2022.
-
Self-training Guided Adversarial Domain Adaptation For Thermal Imagery
Authors:
Ibrahim Batuhan Akkaya,
Fazil Altinel,
Ugur Halici
Abstract:
Deep models trained on large-scale RGB image datasets have shown tremendous success. It is important to apply such deep models to real-world problems. However, these models suffer from a performance bottleneck under illumination changes. Thermal IR cameras are more robust against such changes, and thus can be very useful for the real-world problems. In order to investigate efficacy of combining feature-rich visible spectrum and thermal image modalities, we propose an unsupervised domain adaptation method which does not require RGB-to-thermal image pairs. We employ large-scale RGB dataset MS-COCO as source domain and thermal dataset FLIR ADAS as target domain to demonstrate results of our method. Although adversarial domain adaptation methods aim to align the distributions of source and target domains, simply aligning the distributions cannot guarantee perfect generalization to the target domain. To this end, we propose a self-training guided adversarial domain adaptation method to promote generalization capabilities of adversarial domain adaptation methods. To perform self-training, pseudo labels are assigned to the samples on the target thermal domain to learn more generalized representations for the target domain. Extensive experimental analyses show that our proposed method achieves better results than the state-of-the-art adversarial domain adaptation methods. The code and models are publicly available.
Submitted 14 June, 2021;
originally announced June 2021.
-
Asymmetric self-play for automatic goal discovery in robotic manipulation
Authors:
OpenAI,
Matthias Plappert,
Raul Sampedro,
Tao Xu,
Ilge Akkaya,
Vineet Kosaraju,
Peter Welinder,
Ruben D'Sa,
Arthur Petron,
Henrique P. d. O. Pinto,
Alex Paino,
Hyeonwoo Noh,
Lilian Weng,
Qiming Yuan,
Casey Chu,
Wojciech Zaremba
Abstract:
We train a single, goal-conditioned policy that can solve many robotic manipulation tasks, including tasks with previously unseen goals and objects. We rely on asymmetric self-play for goal discovery, where two agents, Alice and Bob, play a game. Alice is asked to propose challenging goals and Bob aims to solve them. We show that this method can discover highly diverse and complex goals without any human priors. Bob can be trained with only sparse rewards, because the interaction between Alice and Bob results in a natural curriculum and Bob can learn from Alice's trajectory when relabeled as a goal-conditioned demonstration. Finally, our method scales, resulting in a single policy that can generalize to many unseen tasks such as setting a table, stacking blocks, and solving simple puzzles. Videos of a learned policy are available at https://robotics-self-play.github.io.
Submitted 13 January, 2021;
originally announced January 2021.
-
GAIT: Gradient Adjusted Unsupervised Image-to-Image Translation
Authors:
Ibrahim Batuhan Akkaya,
Ugur Halici
Abstract:
Image-to-image translation (IIT) has made much progress recently with the development of adversarial learning. In most of the recent work, an adversarial loss is utilized to match the distributions of the translated and target image sets. However, this may create artifacts if two domains have different marginal distributions, for example, in uniform areas. In this work, we propose an unsupervised IIT method that preserves the uniform regions after the translation. The gradient adjustment loss, which is the L2 norm between the Sobel response of the target image and the adjusted Sobel response of the source images, is utilized. The proposed method is validated on the jellyfish-to-Haeckel dataset, which is prepared to demonstrate the mentioned problem, which contains images with different background distributions. We demonstrate that our method obtained a performance gain compared to the baseline method qualitatively and quantitatively, showing the effectiveness of the proposed method.
Submitted 2 September, 2020;
originally announced September 2020.
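
The gradient adjustment loss named in this abstract (an L2 norm between Sobel responses) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the source Sobel response is "adjusted", so the `scale` parameter below is a hypothetical stand-in, and the function names `sobel_magnitude` and `gradient_adjustment_loss` are invented for this example. Images are plain 2D lists of grayscale values.

```python
import math

# Standard 3x3 Sobel kernels for horizontal and vertical gradients.
KX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
KY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(img):
    """Sobel gradient magnitude at every interior pixel (no padding)."""
    h, w = len(img), len(img[0])
    out = []
    for i in range(1, h - 1):
        row = []
        for j in range(1, w - 1):
            gx = sum(KX[a][b] * img[i - 1 + a][j - 1 + b]
                     for a in range(3) for b in range(3))
            gy = sum(KY[a][b] * img[i - 1 + a][j - 1 + b]
                     for a in range(3) for b in range(3))
            row.append(math.hypot(gx, gy))
        out.append(row)
    return out


def gradient_adjustment_loss(translated, source, scale=1.0):
    """L2 norm between the Sobel response of the translated image and a
    scaled Sobel response of the source image. `scale` is a hypothetical
    placeholder for the paper's adjustment, which is not specified here."""
    g_t = sobel_magnitude(translated)
    g_s = sobel_magnitude(source)
    return math.sqrt(sum((a - scale * b) ** 2
                         for ra, rb in zip(g_t, g_s)
                         for a, b in zip(ra, rb)))
```

A uniform region has zero Sobel response, so a translation that introduces texture into a flat source area is penalized, which matches the abstract's goal of preserving uniform regions.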
-
Solving Rubik's Cube with a Robot Hand
Authors:
OpenAI,
Ilge Akkaya,
Marcin Andrychowicz,
Maciek Chociej,
Mateusz Litwin,
Bob McGrew,
Arthur Petron,
Alex Paino,
Matthias Plappert,
Glenn Powell,
Raphael Ribas,
Jonas Schneider,
Nikolas Tezak,
Jerry Tworek,
Peter Welinder,
Lilian Weng,
Qiming Yuan,
Wojciech Zaremba,
Lei Zhang
Abstract:
We demonstrate that models trained only in simulation can be used to solve a manipulation problem of unprecedented complexity on a real robot. This is made possible by two key components: a novel algorithm, which we call automatic domain randomization (ADR) and a robot platform built for machine learning. ADR automatically generates a distribution over randomized environments of ever-increasing difficulty. Control policies and vision state estimators trained with ADR exhibit vastly improved sim2real transfer. For control policies, memory-augmented models trained on an ADR-generated distribution of environments show clear signs of emergent meta-learning at test time. The combination of ADR with our custom robot platform allows us to solve a Rubik's cube with a humanoid robot hand, which involves both control and state estimation problems. Videos summarizing our results are available: https://openai.com/blog/solving-rubiks-cube/
Submitted 15 October, 2019;
originally announced October 2019.
-
Control Improvisation with Probabilistic Temporal Specifications
Authors:
Ilge Akkaya,
Daniel J. Fremont,
Rafael Valle,
Alexandre Donzé,
Edward A. Lee,
Sanjit A. Seshia
Abstract:
We consider the problem of generating randomized control sequences for complex networked systems typically actuated by human agents. Our approach leverages a concept known as control improvisation, which is based on a combination of data-driven learning and controller synthesis from formal specifications. We learn from existing data a generative model (for instance, an explicit-duration hidden Markov model, or EDHMM) and then supervise this model in order to guarantee that the generated sequences satisfy some desirable specifications given in Probabilistic Computation Tree Logic (PCTL). We present an implementation of our approach and apply it to the problem of mimicking the use of lighting appliances in a residential unit, with potential applications to home security and resource management. We present experimental results showing that our approach produces realistic control sequences, similar to recorded data based on human actuation, while satisfying suitable formal requirements.
Submitted 29 February, 2016; v1 submitted 6 November, 2015;
originally announced November 2015.
-
CCD UBVRI Photometry of the Galactic open clusters: Be 89, Ru 135, and Be 10
Authors:
Inci Akkaya,
William J. Schuster,
Raul Michel,
Carlos Chavarria-K,
Andre Moitinho,
Roberto Vazquez,
Yuksel Karatas
Abstract:
The fundamental parameters of reddening, metallicity, age, and distance are presented for the poorly studied open clusters Be 89, Ru 135, and Be 10, derived from their CCD UBVRI photometry. By fitting the appropriate isochrones to the observed sequences of the clusters in five different color-magnitude diagrams, the weighted averages of distance moduli and heliocentric distances ($V_0 - M_V$ (mag), $d$ (kpc)) are ($11.90 \pm 0.06$, $2.4 \pm 0.06$) for Be 89, ($9.58 \pm 0.07$, $0.81 \pm 0.03$) for Ru 135, and ($11.16 \pm 0.06$, $1.7 \pm 0.05$) for Be 10, and the weighted averages of the ages ($\log A$, $A$ (Gyr)) are ($9.58 \pm 0.06$, $3.8 \pm 0.6$) for Be 89, ($9.58 \pm 0.06$, $3.8 \pm 0.7$) for Ru 135, and ($9.06 \pm 0.05$, $1.08 \pm 0.08$) for Be 10.
Submitted 17 August, 2010;
originally announced August 2010.
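
As a quick check on how the distance moduli and distances quoted in this abstract relate, the standard distance-modulus relation can be worked through for Be 89. This is a sketch using the textbook formula for a reddening-corrected modulus, not a reproduction of the authors' isochrone fitting; the paper's values are weighted averages over five color-magnitude diagrams, so small differences from this direct conversion are expected for the other clusters.

```latex
\begin{align*}
V_0 - M_V &= 5\log_{10}\!\left(\frac{d}{10\,\mathrm{pc}}\right)
  \quad\Longrightarrow\quad
  d = 10^{\,(V_0 - M_V)/5 + 1}\ \mathrm{pc},\\
d_{\mathrm{Be\,89}} &= 10^{\,11.90/5 + 1}\ \mathrm{pc}
  = 10^{3.38}\ \mathrm{pc} \approx 2.4\ \mathrm{kpc},\\
A_{\mathrm{Be\,89}} &= 10^{9.58}\ \mathrm{yr} \approx 3.8\ \mathrm{Gyr}.
\end{align*}
```

Both results agree with the quoted values of 2.4 kpc and 3.8 Gyr for Be 89.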