
Showing 1–50 of 53 results for author: Tuzel, O

  1. arXiv:2405.13226  [pdf, other]

    cs.CL cs.LG

    Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum

    Authors: Hadi Pouransari, Chun-Liang Li, Jen-Hao Rick Chang, Pavan Kumar Anasosalu Vasu, Cem Koc, Vaishaal Shankar, Oncel Tuzel

    Abstract: Large language models (LLMs) are commonly trained on datasets consisting of fixed-length token sequences. These datasets are created by randomly concatenating documents of various lengths and then chunking them into sequences of a predetermined target length. However, this method of concatenation can lead to cross-document attention within a sequence, which is neither a desirable learning signal n…

    Submitted 21 May, 2024; originally announced May 2024.

  2. arXiv:2405.08911  [pdf, other]

    cs.CV cs.LG

    CLIP with Quality Captions: A Strong Pretraining for Vision Tasks

    Authors: Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri, Oncel Tuzel

    Abstract: CLIP models perform remarkably well on zero-shot classification and retrieval tasks. But recent studies have shown that learnt representations in CLIP are not well suited for dense prediction tasks like object detection, semantic segmentation or depth estimation. More recently, multi-stage training methods for CLIP models were introduced to mitigate the weak performance of CLIP on downstream tasks.…

    Submitted 14 May, 2024; originally announced May 2024.

  3. arXiv:2404.15653  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data

    Authors: Sachin Mehta, Maxwell Horton, Fartash Faghri, Mohammad Hossein Sekhavat, Mahyar Najibi, Mehrdad Farajtabar, Oncel Tuzel, Mohammad Rastegari

    Abstract: Contrastive learning has emerged as a transformative method for learning effective visual representations through the alignment of image and text embeddings. However, pairwise similarity computation in contrastive loss between image and text pairs poses computational challenges. This paper presents a novel weakly supervised pre-training of vision models on web-scale image-text data. The proposed m…

    Submitted 24 April, 2024; originally announced April 2024.

  4. arXiv:2312.09299  [pdf, other]

    cs.LG cs.CL cs.CV

    Weight subcloning: direct initialization of transformers using larger pretrained ones

    Authors: Mohammad Samragh, Mehrdad Farajtabar, Sachin Mehta, Raviteja Vemulapalli, Fartash Faghri, Devang Naik, Oncel Tuzel, Mohammad Rastegari

    Abstract: Training large transformer models from scratch for a target task requires lots of data and is computationally demanding. The usual practice of transfer learning overcomes this challenge by initializing the model with weights of a pretrained model of the same size and specification to increase the convergence and training speed. However, what if no pretrained model of the required size is available…

    Submitted 14 December, 2023; originally announced December 2023.

  5. arXiv:2311.18237  [pdf, other]

    cs.CV cs.LG

    Knowledge Transfer from Vision Foundation Models for Efficient Training of Small Task-specific Models

    Authors: Raviteja Vemulapalli, Hadi Pouransari, Fartash Faghri, Sachin Mehta, Mehrdad Farajtabar, Mohammad Rastegari, Oncel Tuzel

    Abstract: Vision Foundation Models (VFMs) pretrained on massive datasets exhibit impressive performance on various downstream tasks, especially with limited labeled target data. However, due to their high inference compute cost, these models cannot be deployed for many real-world applications. Motivated by this, we ask the following important question, "How can we leverage the knowledge from a large VFM to…

    Submitted 1 July, 2024; v1 submitted 29 November, 2023; originally announced November 2023.

    Comments: International Conference on Machine Learning, 2024

  6. arXiv:2311.18168  [pdf, other]

    cs.CV cs.LG eess.AS

    Probabilistic Speech-Driven 3D Facial Motion Synthesis: New Benchmarks, Methods, and Applications

    Authors: Karren D. Yang, Anurag Ranjan, Jen-Hao Rick Chang, Raviteja Vemulapalli, Oncel Tuzel

    Abstract: We consider the task of animating 3D facial geometry from speech signal. Existing works are primarily deterministic, focusing on learning a one-to-one mapping from speech signal to 3D face meshes on small datasets with limited speakers. While these models can achieve high-quality lip articulation for speakers in the training set, they are unable to capture the full and diverse distribution of 3D f…

    Submitted 29 November, 2023; originally announced November 2023.

  7. arXiv:2311.17910  [pdf, other]

    cs.CV cs.GR

    HUGS: Human Gaussian Splats

    Authors: Muhammed Kocabas, Jen-Hao Rick Chang, James Gabriel, Oncel Tuzel, Anurag Ranjan

    Abstract: Recent advances in neural rendering have improved both training and rendering times by orders of magnitude. While these methods demonstrate state-of-the-art quality and speed, they are designed for photogrammetry of static scenes and do not generalize well to freely moving humans in the environment. In this work, we introduce Human Gaussian Splats (HUGS) that represents an animatable human togethe…

    Submitted 29 November, 2023; originally announced November 2023.

  8. arXiv:2311.17049  [pdf, other]

    cs.CV cs.CL cs.LG

    MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training

    Authors: Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri, Raviteja Vemulapalli, Oncel Tuzel

    Abstract: Contrastive pretraining of image-text foundation models, such as CLIP, demonstrated excellent zero-shot performance and improved robustness on a wide range of downstream tasks. However, these models utilize large transformer-based encoders with significant memory and latency overhead which pose challenges for deployment on mobile devices. In this work, we introduce MobileCLIP -- a new family of ef…

    Submitted 1 April, 2024; v1 submitted 28 November, 2023; originally announced November 2023.

    Comments: CVPR 2024

  9. arXiv:2310.16226  [pdf, other]

    cs.CV cs.CL cs.LG

    TiC-CLIP: Continual Training of CLIP Models

    Authors: Saurabh Garg, Mehrdad Farajtabar, Hadi Pouransari, Raviteja Vemulapalli, Sachin Mehta, Oncel Tuzel, Vaishaal Shankar, Fartash Faghri

    Abstract: Keeping large foundation models up to date on latest data is inherently expensive. To avoid the prohibitive costs of constantly retraining, it is imperative to continually train these models. This problem is exacerbated by the lack of any large scale continual learning benchmarks or baselines. We introduce the first set of web-scale Time-Continual (TiC) benchmarks for training vision-language mode…

    Submitted 21 March, 2024; v1 submitted 24 October, 2023; originally announced October 2023.

    Comments: ICLR 2024

  10. arXiv:2310.15308  [pdf, other]

    cs.CV cs.LG

    SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding

    Authors: Haoxiang Wang, Pavan Kumar Anasosalu Vasu, Fartash Faghri, Raviteja Vemulapalli, Mehrdad Farajtabar, Sachin Mehta, Mohammad Rastegari, Oncel Tuzel, Hadi Pouransari

    Abstract: The landscape of publicly available vision foundation models (VFMs), such as CLIP and Segment Anything Model (SAM), is expanding rapidly. VFMs are endowed with distinct capabilities stemming from their pre-training objectives. For instance, CLIP excels in semantic understanding, while SAM specializes in spatial understanding for segmentation. In this work, we introduce a simple recipe to efficient…

    Submitted 10 June, 2024; v1 submitted 23 October, 2023; originally announced October 2023.

  11. arXiv:2310.15130  [pdf, other]

    cs.SD cs.CV eess.AS

    Novel-View Acoustic Synthesis from 3D Reconstructed Rooms

    Authors: Byeongjoo Ahn, Karren Yang, Brian Hamilton, Jonathan Sheaffer, Anurag Ranjan, Miguel Sarabia, Oncel Tuzel, Jen-Hao Rick Chang

    Abstract: We investigate the benefit of combining blind audio recordings with 3D scene information for novel-view acoustic synthesis. Given audio recordings from 2-4 microphones and the 3D geometry and material of a scene containing multiple unknown sound sources, we estimate the sound anywhere in the scene. We identify the main challenges of novel-view acoustic synthesis as sound source localization, separ…

    Submitted 23 October, 2023; originally announced October 2023.

  12. arXiv:2310.14108  [pdf, other]

    cs.LG cs.AI cs.CV

    CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement

    Authors: Mohammadreza Salehi, Mehrdad Farajtabar, Maxwell Horton, Fartash Faghri, Hadi Pouransari, Raviteja Vemulapalli, Oncel Tuzel, Ali Farhadi, Mohammad Rastegari, Sachin Mehta

    Abstract: Contrastive language image pretraining (CLIP) is a standard method for training vision-language models. While CLIP is scalable, promptable, and robust to distribution shifts on image classification tasks, it lacks object localization capabilities. This paper studies the following question: Can we augment CLIP training with task-specific vision models from model zoos to improve its visual represent…

    Submitted 21 October, 2023; originally announced October 2023.

  13. arXiv:2310.04564  [pdf, other]

    cs.LG cs.AI

    ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models

    Authors: Iman Mirzadeh, Keivan Alizadeh, Sachin Mehta, Carlo C Del Mundo, Oncel Tuzel, Golnoosh Samei, Mohammad Rastegari, Mehrdad Farajtabar

    Abstract: Large Language Models (LLMs) with billions of parameters have drastically transformed AI applications. However, their demanding computation during inference has raised significant challenges for deployment on resource-constrained devices. Despite recent trends favoring alternative activation functions such as GELU or SiLU, known for increased computation, this study strongly advocates for reinstat…

    Submitted 6 October, 2023; originally announced October 2023.

    Comments: preprint

  14. arXiv:2309.10707  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Corpus Synthesis for Zero-shot ASR domain Adaptation using Large Language Models

    Authors: Hsuan Su, Ting-Yao Hu, Hema Swetha Koppula, Raviteja Vemulapalli, Jen-Hao Rick Chang, Karren Yang, Gautam Varma Mantena, Oncel Tuzel

    Abstract: While Automatic Speech Recognition (ASR) systems are widely used in many real-world applications, they often do not generalize well to new domains and need to be finetuned on data from these domains. However, target-domain data usually are not readily available in many scenarios. In this paper, we propose a new strategy for adapting ASR models to new target domains without any text or speech from…

    Submitted 18 September, 2023; originally announced September 2023.

  15. arXiv:2306.07890  [pdf, other]

    cs.CV cs.LG

    VISION Datasets: A Benchmark for Vision-based InduStrial InspectiON

    Authors: Haoping Bai, Shancong Mou, Tatiana Likhomanenko, Ramazan Gokberk Cinbis, Oncel Tuzel, Ping Huang, Jiulong Shan, Jianjun Shi, Meng Cao

    Abstract: Despite progress in vision-based inspection algorithms, real-world industrial challenges -- specifically in data availability, quality, and complex production requirements -- often remain under-addressed. We introduce the VISION Datasets, a diverse collection of 14 industrial inspection datasets, uniquely poised to meet these challenges. Unlike previous datasets, VISION brings versatility to defec…

    Submitted 17 June, 2023; v1 submitted 13 June, 2023; originally announced June 2023.

  16. arXiv:2304.12390  [pdf, other]

    cs.CV cs.GR

    Pointersect: Neural Rendering with Cloud-Ray Intersection

    Authors: Jen-Hao Rick Chang, Wei-Yu Chen, Anurag Ranjan, Kwang Moo Yi, Oncel Tuzel

    Abstract: We propose a novel method that renders point clouds as if they are surfaces. The proposed method is differentiable and requires no scene-specific optimization. This unique capability enables, out-of-the-box, surface normal estimation, rendering room-scale point clouds, inverse rendering, and ray tracing with global illumination. Unlike existing work that focuses on converting point clouds to other…

    Submitted 24 April, 2023; originally announced April 2023.

    Comments: CVPR 2023

  17. arXiv:2303.15437  [pdf, other]

    cs.CV

    FaceLit: Neural 3D Relightable Faces

    Authors: Anurag Ranjan, Kwang Moo Yi, Jen-Hao Rick Chang, Oncel Tuzel

    Abstract: We propose a generative framework, FaceLit, capable of generating a 3D face that can be rendered at various user-defined lighting conditions and views, learned purely from 2D images in-the-wild without any manual annotation. Unlike existing works that require careful capture setup or human labor, we rely on off-the-shelf pose and illumination estimators. With these estimates, we incorporate the Ph…

    Submitted 27 March, 2023; originally announced March 2023.

    Comments: CVPR 2023

  18. arXiv:2303.14885  [pdf, other]

    eess.AS cs.LG cs.SD

    Text is All You Need: Personalizing ASR Models using Controllable Speech Synthesis

    Authors: Karren Yang, Ting-Yao Hu, Jen-Hao Rick Chang, Hema Swetha Koppula, Oncel Tuzel

    Abstract: Adapting generic speech recognition models to specific individuals is a challenging problem due to the scarcity of personalized data. Recent works have proposed boosting the amount of training data using personalized text-to-speech synthesis. Here, we ask two fundamental questions about this strategy: when is synthetic data effective for personalization, and why is it effective in those cases? To…

    Submitted 26 March, 2023; originally announced March 2023.

    Comments: ICASSP 2023

  19. arXiv:2303.14189  [pdf, other]

    cs.CV

    FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization

    Authors: Pavan Kumar Anasosalu Vasu, James Gabriel, Jeff Zhu, Oncel Tuzel, Anurag Ranjan

    Abstract: The recent amalgamation of transformer and convolutional designs has led to steady improvements in accuracy and efficiency of the models. In this work, we introduce FastViT, a hybrid vision transformer architecture that obtains the state-of-the-art latency-accuracy trade-off. To this end, we introduce a novel token mixing operator, RepMixer, a building block of FastViT, that uses structural repara…

    Submitted 17 August, 2023; v1 submitted 24 March, 2023; originally announced March 2023.

    Comments: ICCV 2023

  20. arXiv:2303.08983  [pdf, other]

    cs.CV cs.AI cs.LG

    Reinforce Data, Multiply Impact: Improved Model Accuracy and Robustness with Dataset Reinforcement

    Authors: Fartash Faghri, Hadi Pouransari, Sachin Mehta, Mehrdad Farajtabar, Ali Farhadi, Mohammad Rastegari, Oncel Tuzel

    Abstract: We propose Dataset Reinforcement, a strategy to improve a dataset once such that the accuracy of any model architecture trained on the reinforced dataset is improved at no additional training cost for users. We propose a Dataset Reinforcement strategy based on data augmentation and knowledge distillation. Our generic strategy is designed based on extensive analysis across CNN- and transformer-base…

    Submitted 22 September, 2023; v1 submitted 15 March, 2023; originally announced March 2023.

    Comments: Accepted at International Conference on Computer Vision (ICCV) 2023. v2: Camera-ready version with new Tables 9 and 10. v3: Correction to Table 7-Avg. column

  21. arXiv:2303.04766  [pdf, other]

    cs.CV cs.IR cs.LG

    FastFill: Efficient Compatible Model Update

    Authors: Florian Jaeckle, Fartash Faghri, Ali Farhadi, Oncel Tuzel, Hadi Pouransari

    Abstract: In many retrieval systems the original high dimensional data (e.g., images) is mapped to a lower dimensional feature through a learned embedding model. The task of retrieving the most similar data from a gallery set to a given query data is performed through a similarity comparison on features. When the embedding model is updated, it might produce features that are not comparable/compatible with f…

    Submitted 8 March, 2023; originally announced March 2023.

    Comments: To appear in The Eleventh International Conference on Learning Representations

  22. arXiv:2212.10553  [pdf, other]

    cs.CV cs.AI cs.LG

    RangeAugment: Efficient Online Augmentation with Range Learning

    Authors: Sachin Mehta, Saeid Naderiparizi, Fartash Faghri, Maxwell Horton, Lailin Chen, Ali Farhadi, Oncel Tuzel, Mohammad Rastegari

    Abstract: State-of-the-art automatic augmentation methods (e.g., AutoAugment and RandAugment) for visual recognition tasks diversify training data using a large set of augmentation operations. The range of magnitudes of many augmentation operations (e.g., brightness and contrast) is continuous. Therefore, to make search computationally tractable, these methods use fixed and manually-defined magnitude ranges…

    Submitted 20 December, 2022; originally announced December 2022.

    Comments: Technical report (22 pages including references and appendix)

  23. arXiv:2210.13567  [pdf, ps, other]

    cs.CV cs.LG cs.SD eess.AS

    I see what you hear: a vision-inspired method to localize words

    Authors: Mohammad Samragh, Arnav Kundu, Ting-Yao Hu, Minsik Cho, Aman Chadha, Ashish Shrivastava, Oncel Tuzel, Devang Naik

    Abstract: This paper explores the possibility of using visual object detection techniques for word localization in speech data. Object detection has been thoroughly studied in the contemporary literature for visual data. Noting that an audio can be interpreted as a 1-dimensional image, object localization techniques can be fundamentally useful for word localization. Building upon this idea, we propose a lig…

    Submitted 24 October, 2022; originally announced October 2022.

  24. arXiv:2210.03927  [pdf, other]

    cs.LG

    APE: Aligning Pretrained Encoders to Quickly Learn Aligned Multimodal Representations

    Authors: Elan Rosenfeld, Preetum Nakkiran, Hadi Pouransari, Oncel Tuzel, Fartash Faghri

    Abstract: Recent advances in learning aligned multimodal representations have been primarily driven by training large neural networks on massive, noisy paired-modality datasets. In this work, we ask whether it is possible to achieve similar results with substantially less training time and data. We achieve this by taking advantage of existing pretrained unimodal encoders and careful curation of alignment da…

    Submitted 8 October, 2022; originally announced October 2022.

  25. arXiv:2206.04040  [pdf, other]

    cs.CV

    MobileOne: An Improved One millisecond Mobile Backbone

    Authors: Pavan Kumar Anasosalu Vasu, James Gabriel, Jeff Zhu, Oncel Tuzel, Anurag Ranjan

    Abstract: Efficient neural network backbones for mobile devices are often optimized for metrics such as FLOPs or parameter count. However, these metrics may not correlate well with latency of the network when deployed on a mobile device. Therefore, we perform extensive analysis of different metrics by deploying several mobile-friendly networks on a mobile device. We identify and analyze architectural and op…

    Submitted 28 March, 2023; v1 submitted 8 June, 2022; originally announced June 2022.

    Comments: Accepted at CVPR 2023

  26. arXiv:2203.12575  [pdf, other]

    cs.CV

    NeuMan: Neural Human Radiance Field from a Single Video

    Authors: Wei Jiang, Kwang Moo Yi, Golnoosh Samei, Oncel Tuzel, Anurag Ranjan

    Abstract: Photorealistic rendering and reposing of humans is important for enabling augmented reality experiences. We propose a novel framework to reconstruct the human and the scene that can be rendered with novel human poses and views from just a single in-the-wild video. Given a video captured by a moving camera, we train two NeRF models: a human NeRF model and a scene NeRF model. To train these models,…

    Submitted 21 September, 2022; v1 submitted 23 March, 2022; originally announced March 2022.

  27. arXiv:2112.02805  [pdf, other]

    cs.CV

    Forward Compatible Training for Large-Scale Embedding Retrieval Systems

    Authors: Vivek Ramanujan, Pavan Kumar Anasosalu Vasu, Ali Farhadi, Oncel Tuzel, Hadi Pouransari

    Abstract: In visual retrieval systems, updating the embedding model requires recomputing features for every piece of data. This expensive process is referred to as backfilling. Recently, the idea of backward compatible training (BCT) was proposed. To avoid the cost of backfilling, BCT modifies training of the new model to make its representations compatible with those of the old model. However, BCT can sign…

    Submitted 29 March, 2022; v1 submitted 6 December, 2021; originally announced December 2021.

    Comments: 14 pages with appendix. In proceedings at the conference on Computer Vision and Pattern Recognition 2022

  28. arXiv:2110.11479  [pdf, other]

    eess.AS cs.LG cs.SD

    Synt++: Utilizing Imperfect Synthetic Data to Improve Speech Recognition

    Authors: Ting-Yao Hu, Mohammadreza Armandpour, Ashish Shrivastava, Jen-Hao Rick Chang, Hema Koppula, Oncel Tuzel

    Abstract: With recent advances in speech synthesis, synthetic data is becoming a viable alternative to real data for training speech recognition models. However, machine learning with synthetic data is not trivial due to the gap between the synthetic and the real data distributions. Synthetic datasets may contain artifacts that do not exist in real data such as structured noise, content errors, or unrealist…

    Submitted 21 October, 2021; originally announced October 2021.

  29. arXiv:2110.07040  [pdf, other]

    cs.CV cs.LG

    Data Incubation -- Synthesizing Missing Data for Handwriting Recognition

    Authors: Jen-Hao Rick Chang, Martin Bresler, Youssouf Chherawala, Adrien Delaye, Thomas Deselaers, Ryan Dixon, Oncel Tuzel

    Abstract: In this paper, we demonstrate how a generative model can be used to build a better recognizer through the control of content and style. We are building an online handwriting recognizer from a modest amount of training samples. By training our controllable handwriting synthesizer on the same data, we can synthesize handwriting with previously underrepresented content (e.g., URLs and email addresses…

    Submitted 13 October, 2021; originally announced October 2021.

  30. arXiv:2110.03860  [pdf, other]

    cs.CV cs.LG

    Token Pooling in Vision Transformers

    Authors: Dmitrii Marin, Jen-Hao Rick Chang, Anurag Ranjan, Anish Prabhu, Mohammad Rastegari, Oncel Tuzel

    Abstract: Despite the recent success in many applications, the high computational requirements of vision transformers limit their use in resource-constrained settings. While many existing methods improve the quadratic complexity of attention, in most vision transformers, self-attention is not the major computation bottleneck, e.g., more than 80% of the computation is spent on fully-connected layers. To impr…

    Submitted 11 October, 2021; v1 submitted 7 October, 2021; originally announced October 2021.

    Journal ref: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2023

  31. arXiv:2110.02891  [pdf, other]

    cs.LG cs.SD eess.AS

    Style Equalization: Unsupervised Learning of Controllable Generative Sequence Models

    Authors: Jen-Hao Rick Chang, Ashish Shrivastava, Hema Swetha Koppula, Xiaoshuai Zhang, Oncel Tuzel

    Abstract: Controllable generative sequence models with the capability to extract and replicate the style of specific examples enable many applications, including narrating audiobooks in different voices, auto-completing and auto-correcting written handwriting, and generating missing training samples for downstream recognition tasks. However, under an unsupervised-style setting, typical training algorithms f…

    Submitted 30 June, 2022; v1 submitted 6 October, 2021; originally announced October 2021.

    Comments: ICML 2022

  32. arXiv:2106.06129  [pdf, other]

    cs.CV cs.LG

    Instance-Level Task Parameters: A Robust Multi-task Weighting Framework

    Authors: Pavan Kumar Anasosalu Vasu, Shreyas Saxena, Oncel Tuzel

    Abstract: Recent works have shown that deep neural networks benefit from multi-task learning by learning a shared representation across several related tasks. However, performance of such systems depends on relative weighting between various losses involved during training. Prior works on loss weighting schemes assume that instances are equally easy or hard for all tasks. In order to break this assumption, w…

    Submitted 10 June, 2021; originally announced June 2021.

  33. arXiv:2011.01156  [pdf, other]

    cs.LG stat.ML

    SapAugment: Learning A Sample Adaptive Policy for Data Augmentation

    Authors: Ting-Yao Hu, Ashish Shrivastava, Jen-Hao Rick Chang, Hema Koppula, Stefan Braun, Kyuyeon Hwang, Ozlem Kalinli, Oncel Tuzel

    Abstract: Data augmentation methods usually apply the same augmentation (or a mix of them) to all the training samples. For example, to perturb data with noise, the noise is sampled from a Normal distribution with a fixed standard deviation, for all samples. We hypothesize that a hard sample with high training loss already provides strong training signal to update the model parameters and should be perturbe…

    Submitted 15 February, 2021; v1 submitted 2 November, 2020; originally announced November 2020.

    Comments: Accepted at ICASSP 2021

  34. arXiv:2011.01151  [pdf, other]

    cs.SD cs.LG eess.AS

    Optimize what matters: Training DNN-HMM Keyword Spotting Model Using End Metric

    Authors: Ashish Shrivastava, Arnav Kundu, Chandra Dhir, Devang Naik, Oncel Tuzel

    Abstract: Deep Neural Network--Hidden Markov Model (DNN-HMM) based methods have been successfully used for many always-on keyword spotting algorithms that detect a wake word to trigger a device. The DNN predicts the state probabilities of a given speech frame, while HMM decoder combines the DNN predictions of multiple speech frames to compute the keyword detection score. The DNN, in prior methods, is traine…

    Submitted 25 February, 2021; v1 submitted 2 November, 2020; originally announced November 2020.

    Comments: Accepted at ICASSP 2021

  35. arXiv:2007.04871  [pdf, other]

    cs.LG eess.SP stat.ML

    Subject-Aware Contrastive Learning for Biosignals

    Authors: Joseph Y. Cheng, Hanlin Goh, Kaan Dogrusoz, Oncel Tuzel, Erdrin Azemi

    Abstract: Datasets for biosignals, such as electroencephalogram (EEG) and electrocardiogram (ECG), often have noisy labels and a limited number of subjects (<100). To handle these challenges, we propose a self-supervised approach based on contrastive learning to model biosignals with a reduced reliance on labeled data and with fewer subjects. In this regime of limited labels and subjects, intersubject va…

    Submitted 30 June, 2020; originally announced July 2020.

  36. arXiv:2007.00051  [pdf, other]

    cs.LG stat.ML

    Extracurricular Learning: Knowledge Transfer Beyond Empirical Distribution

    Authors: Hadi Pouransari, Mojan Javaheripi, Vinay Sharma, Oncel Tuzel

    Abstract: Knowledge distillation has been used to transfer knowledge learned by a sophisticated model (teacher) to a simpler model (student). This technique is widely used to compress model complexity. However, in most applications the compressed student model suffers from an accuracy gap with its teacher. We propose extracurricular learning, a novel knowledge distillation method, that bridges this gap by (…

    Submitted 20 November, 2020; v1 submitted 30 June, 2020; originally announced July 2020.

  37. arXiv:2003.06227  [pdf, other]

    eess.AS cs.CV cs.IT cs.LG cs.SD

    Unsupervised Style and Content Separation by Minimizing Mutual Information for Speech Synthesis

    Authors: Ting-Yao Hu, Ashish Shrivastava, Oncel Tuzel, Chandra Dhir

    Abstract: We present a method to generate speech from input text and a style vector that is extracted from a reference speech signal in an unsupervised manner, i.e., no style annotation, such as speaker information, is required. Existing unsupervised methods, during training, generate speech by computing style from the corresponding ground truth sample and use a decoder to combine the style vector with the…

    Submitted 9 March, 2020; originally announced March 2020.

    Comments: Accepted at ICASSP 2020 (for presentation in a lecture session)

  38. arXiv:2001.02786  [pdf, other]

    cs.LG cs.NE

    Least squares binary quantization of neural networks

    Authors: Hadi Pouransari, Zhucheng Tu, Oncel Tuzel

    Abstract: Quantizing weights and activations of deep neural networks results in significant improvement in inference efficiency at the cost of lower accuracy. A source of the accuracy gap between full precision and quantized models is the quantization error. In this work, we focus on the binary quantization, in which values are mapped to -1 and 1. We provide a unified framework to analyze different scaling…

    Submitted 13 June, 2020; v1 submitted 8 January, 2020; originally announced January 2020.

  39. arXiv:1904.01649  [pdf, other]

    cs.CV

    MVX-Net: Multimodal VoxelNet for 3D Object Detection

    Authors: Vishwanath A. Sindagi, Yin Zhou, Oncel Tuzel

    Abstract: Many recent works on 3D object detection have focused on designing neural network architectures that can consume point cloud data. While these approaches demonstrate encouraging performance, they are typically based on a single modality and are unable to leverage information from other modalities, such as a camera. Although a few approaches fuse data from different modalities, these methods either…

    Submitted 2 April, 2019; originally announced April 2019.

    Comments: 7 pages

    Journal ref: International Conference on Robotics and Automation (ICRA), 2019

  40. arXiv:1812.02886  [pdf, other]

    cs.LG stat.ML

    Nonlinear Conjugate Gradients For Scaling Synchronous Distributed DNN Training

    Authors: Saurabh Adya, Vinay Palakkode, Oncel Tuzel

    Abstract: Nonlinear conjugate gradient (NLCG) based optimizers have shown superior loss convergence properties compared to gradient descent based optimizers for traditional optimization problems. However, in Deep Neural Network (DNN) training, the dominant optimization algorithm of choice is still Stochastic Gradient Descent (SGD) and its variants. In this work, we propose and evaluate the stochastic precon…

    Submitted 19 November, 2019; v1 submitted 6 December, 2018; originally announced December 2018.

    Comments: 10 pages

    MSC Class: I.2; G.1.6; G.4; D.1.3 ACM Class: I.2; G.1.6; G.4; D.1.3

  41. arXiv:1802.06806  [pdf, other]

    cs.CV cs.AI

    Divide, Denoise, and Defend against Adversarial Attacks

    Authors: Seyed-Mohsen Moosavi-Dezfooli, Ashish Shrivastava, Oncel Tuzel

    Abstract: Deep neural networks, although shown to be a successful class of machine learning algorithms, are known to be extremely unstable to adversarial perturbations. Improving the robustness of neural networks against these attacks is important, especially for security-critical applications. To defend against such attacks, we propose dividing the input image into multiple patches, denoising each patch in…

    Submitted 25 April, 2019; v1 submitted 19 February, 2018; originally announced February 2018.

  42. arXiv:1711.06396  [pdf, other]

    cs.CV

    VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection

    Authors: Yin Zhou, Oncel Tuzel

    Abstract: Accurate detection of objects in 3D point clouds is a central problem in many applications, such as autonomous navigation, housekeeping robots, and augmented/virtual reality. To interface a highly sparse LiDAR point cloud with a region proposal network (RPN), most existing efforts have focused on hand-crafted feature representations, for example, a bird's eye view projection. In this work, we remo…

    Submitted 16 November, 2017; originally announced November 2017.

  43. arXiv:1702.01478  [pdf, other

    cs.CV

    Attentional Network for Visual Object Detection

    Authors: Kota Hara, Ming-Yu Liu, Oncel Tuzel, Amir-massoud Farahmand

    Abstract: We propose augmenting deep neural networks with an attention mechanism for the visual object detection task. As perceiving a scene, humans have the capability of multiple fixation points, each attended to scene content at different locations and scales. However, such a mechanism is missing in the current state-of-the-art visual object detection methods. Inspired by the human vision system, we prop…

    Submitted 5 February, 2017; originally announced February 2017.

  44. arXiv:1612.07828  [pdf, other

    cs.CV cs.LG cs.NE

    Learning from Simulated and Unsupervised Images through Adversarial Training

    Authors: Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Josh Susskind, Wenda Wang, Russ Webb

    Abstract: With recent progress in graphics, it has become more tractable to train models on synthetic images, potentially avoiding the need for expensive annotations. However, learning from synthetic images may not achieve the desired performance due to a gap between synthetic and real image distributions. To reduce this gap, we propose Simulated+Unsupervised (S+U) learning, where the task is to learn a mod…

    Submitted 19 July, 2017; v1 submitted 22 December, 2016; originally announced December 2016.

    Comments: Accepted at CVPR 2017 for oral presentation

  45. arXiv:1606.07536  [pdf, other

    cs.CV

    Coupled Generative Adversarial Networks

    Authors: Ming-Yu Liu, Oncel Tuzel

    Abstract: We propose coupled generative adversarial network (CoGAN) for learning a joint distribution of multi-domain images. In contrast to the existing approaches, which require tuples of corresponding images in different domains in the training set, CoGAN can learn a joint distribution without any tuple of corresponding images. It can learn a joint distribution with just samples drawn from the marginal d…

    Submitted 20 September, 2016; v1 submitted 23 June, 2016; originally announced June 2016.

    Comments: To be published in NIPS 2016

  46. arXiv:1603.07235  [pdf, other

    cs.CV cs.LG

    Global-Local Face Upsampling Network

    Authors: Oncel Tuzel, Yuichi Taguchi, John R. Hershey

    Abstract: Face hallucination, which is the task of generating a high-resolution face image from a low-resolution input image, is a well-studied problem that is useful in widespread application areas. Face hallucination is particularly challenging when the input face resolution is very low (e.g., 10 x 12 pixels) and/or the image is captured in an uncontrolled setting with large pose and illumination variatio…

    Submitted 27 April, 2016; v1 submitted 23 March, 2016; originally announced March 2016.

  47. Robust Face Alignment Using a Mixture of Invariant Experts

    Authors: Oncel Tuzel, Tim K. Marks, Salil Tambe

    Abstract: Face alignment, which is the task of finding the locations of a set of facial landmark points in an image of a face, is useful in widespread application areas. Face alignment is particularly challenging when there are large variations in pose (in-plane and out-of-plane rotations) and facial expression. To address this issue, we propose a cascade in which each stage consists of a mixture of regress…

    Submitted 23 October, 2016; v1 submitted 13 November, 2015; originally announced November 2015.

    Comments: 17 pages, 6 figures

    Journal ref: Proceedings of 14th European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, October 11-14, 2016, pp 825-841

  48. arXiv:1511.04067  [pdf, other

    cs.CV

    Deep Gaussian Conditional Random Field Network: A Model-based Deep Network for Discriminative Denoising

    Authors: Raviteja Vemulapalli, Oncel Tuzel, Ming-Yu Liu

    Abstract: We propose a novel deep network architecture for image denoising based on a Gaussian Conditional Random Field (GCRF) model. In contrast to the existing discriminative denoising methods that train a separate model for each noise level, the proposed deep network explicitly models the input noise variance and hence is capable of handling a range of noise levels. Our deep network, which we refer to…

    Submitted 12 November, 2015; originally announced November 2015.

    Comments: 10 pages, 5 figures

  49. arXiv:1506.04723  [pdf, other

    cs.CV

    Layered Interpretation of Street View Images

    Authors: Ming-Yu Liu, Shuoxin Lin, Srikumar Ramalingam, Oncel Tuzel

    Abstract: We propose a layered street view model to encode both depth and semantic information on street view images for autonomous driving. Recently, stixels, stix-mantics, and tiered scene labeling methods have been proposed to model street view images. We propose a 4-layer street view model, a compact representation over the recently proposed stix-mantics model. Our layers encode semantic classes like gr…

    Submitted 29 July, 2015; v1 submitted 15 June, 2015; originally announced June 2015.

    Comments: The paper will be presented in the 2015 Robotics: Science and Systems Conference (RSS)

  50. arXiv:1503.02725  [pdf, other

    cs.CV

    Deep Hierarchical Parsing for Semantic Segmentation

    Authors: Abhishek Sharma, Oncel Tuzel, David W. Jacobs

    Abstract: This paper proposes a learning-based approach to scene parsing inspired by the deep Recursive Context Propagation Network (RCPN). RCPN is a deep feed-forward neural network that utilizes the contextual information from the entire image, through bottom-up followed by top-down context propagation via random binary parse trees. This improves the feature representation of every super-pixel in the imag…

    Submitted 30 March, 2015; v1 submitted 9 March, 2015; originally announced March 2015.

    Comments: IEEE CVPR 2015