Showing 1–27 of 27 results for author: Shih, K

  1. arXiv:2403.03218  [pdf, other]

    cs.LG cs.AI cs.CL cs.CY

    The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

    Authors: Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew B. Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Adam Khoja, Zhenqi Zhao, Ariel Herbert-Voss, Cort B. Breuer, et al. (32 additional authors not shown)

    Abstract: The White House Executive Order on Artificial Intelligence highlights the risks of large language models (LLMs) empowering malicious actors in developing biological, cyber, and chemical weapons. To measure these risks of malicious use, government institutions and major AI labs are developing evaluations for hazardous capabilities in LLMs. However, current evaluations are private, preventing furthe… ▽ More

    Submitted 15 May, 2024; v1 submitted 5 March, 2024; originally announced March 2024.

    Comments: See the project page at https://wmdp.ai

  2. arXiv:2308.04571  [pdf, other]

    cs.RO cs.CV cs.HC

    Optimizing Algorithms From Pairwise User Preferences

    Authors: Leonid Keselman, Katherine Shih, Martial Hebert, Aaron Steinfeld

    Abstract: Typical black-box optimization approaches in robotics focus on learning from metric scores. However, that is not always possible, as not all developers have ground truth available. Learning appropriate robot behavior in human-centric contexts often requires querying users, who typically cannot provide precise metric scores. Existing approaches leverage human feedback in an attempt to model an impl… ▽ More

    Submitted 8 August, 2023; originally announced August 2023.

    Comments: Accepted at IROS 2023

    ACM Class: I.2.9; H.1.2; I.2.8

  3. arXiv:2305.18211  [pdf]

    eess.SP cs.CV cs.LG

    WiFi-TCN: Temporal Convolution for Human Interaction Recognition based on WiFi signal

    Authors: Chih-Yang Lin, Chia-Yu Lin, Yu-Tso Liu, Timothy K. Shih

    Abstract: The utilization of Wi-Fi based human activity recognition has gained considerable interest in recent times, primarily owing to its applications in various domains such as healthcare for monitoring breath and heart rate, security, elderly care. These Wi-Fi-based methods exhibit several advantages over conventional state-of-the-art techniques that rely on cameras and sensors, including lower costs a… ▽ More

    Submitted 11 January, 2024; v1 submitted 21 May, 2023; originally announced May 2023.

    Comments: Paper is currently under review at IEEE Access

  4. arXiv:2303.07578  [pdf, ps, other]

    cs.SD cs.LG eess.AS

    VANI: Very-lightweight Accent-controllable TTS for Native and Non-native speakers with Identity Preservation

    Authors: Rohan Badlani, Akshit Arora, Subhankar Ghosh, Rafael Valle, Kevin J. Shih, João Felipe Santos, Boris Ginsburg, Bryan Catanzaro

    Abstract: We introduce VANI, a very lightweight multi-lingual accent controllable speech synthesis system. Our model builds upon disentanglement strategies proposed in RADMMM and supports explicit control of accent, language, speaker and fine-grained $F_0$ and energy features for speech synthesis. We utilize the Indic languages dataset, released for LIMMITS 2023 as part of ICASSP Signal Processing Grand Cha… ▽ More

    Submitted 13 March, 2023; originally announced March 2023.

    Comments: Presentation accepted at ICASSP 2023

  5. arXiv:2301.10335  [pdf, other]

    cs.SD cs.LG eess.AS

    Multilingual Multiaccented Multispeaker TTS with RADTTS

    Authors: Rohan Badlani, Rafael Valle, Kevin J. Shih, João Felipe Santos, Siddharth Gururani, Bryan Catanzaro

    Abstract: We work to create a multilingual speech synthesis system which can generate speech with the proper accent while retaining the characteristics of an individual voice. This is challenging to do because it is expensive to obtain bilingual training data in multiple languages, and the lack of such data results in strong correlations that entangle speaker, language, and accent, resulting in poor transfe… ▽ More

    Submitted 24 January, 2023; originally announced January 2023.

    Comments: 5 pages, submitted to ICASSP 2023

  6. arXiv:2210.01887  [pdf, other]

    cs.CV

    Collecting The Puzzle Pieces: Disentangled Self-Driven Human Pose Transfer by Permuting Textures

    Authors: Nannan Li, Kevin J. Shih, Bryan A. Plummer

    Abstract: Human pose transfer synthesizes new view(s) of a person for a given pose. Recent work achieves this via self-reconstruction, which disentangles a person's pose and texture information by breaking the person down into parts, then recombines them for reconstruction. However, part-level disentanglement preserves some pose information that can create unwanted artifacts. In this paper, we propose Pose… ▽ More

    Submitted 30 August, 2023; v1 submitted 4 October, 2022; originally announced October 2022.

    Comments: Accepted to ICCV 2023

  7. arXiv:2203.01786  [pdf, other]

    cs.SD cs.LG eess.AS

    Generative Modeling for Low Dimensional Speech Attributes with Neural Spline Flows

    Authors: Kevin J. Shih, Rafael Valle, Rohan Badlani, João Felipe Santos, Bryan Catanzaro

    Abstract: Despite recent advances in generative modeling for text-to-speech synthesis, these models do not yet have the same fine-grained adjustability of pitch-conditioned deterministic models such as FastPitch and FastSpeech2. Pitch information is not only low-dimensional, but also discontinuous, making it particularly difficult to model in a generative setting. Our work explores several techniques for ha… ▽ More

    Submitted 27 June, 2022; v1 submitted 3 March, 2022; originally announced March 2022.

    Comments: 22 pages, 11 figures, 3 tables

  8. arXiv:2108.10447  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    One TTS Alignment To Rule Them All

    Authors: Rohan Badlani, Adrian Łancucki, Kevin J. Shih, Rafael Valle, Wei Ping, Bryan Catanzaro

    Abstract: Speech-to-text alignment is a critical component of neural text-to-speech (TTS) models. Autoregressive TTS models typically use an attention mechanism to learn these alignments on-line. However, these alignments tend to be brittle and often fail to generalize to long utterances and out-of-domain text, leading to missing or repeating words. Most non-autoregressive end-to-end TTS models rely on durati… ▽ More

    Submitted 23 August, 2021; originally announced August 2021.

  9. arXiv:2106.12608  [pdf, other]

    cs.CL q-bio.QM

    Clinical Named Entity Recognition using Contextualized Token Representations

    Authors: Yichao Zhou, Chelsea Ju, J. Harry Caufield, Kevin Shih, Calvin Chen, Yizhou Sun, Kai-Wei Chang, Peipei Ping, Wei Wang

    Abstract: The clinical named entity recognition (CNER) task seeks to locate and classify clinical terminologies into predefined categories, such as diagnostic procedure, disease disorder, severity, medication, medication dosage, and sign symptom. CNER facilitates the study of side-effect on medications including identification of novel phenomena and human-focused information extraction. Existing approaches… ▽ More

    Submitted 23 June, 2021; originally announced June 2021.

    Comments: 1 figure, 6 tables

  10. arXiv:2005.05957  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis

    Authors: Rafael Valle, Kevin Shih, Ryan Prenger, Bryan Catanzaro

    Abstract: In this paper we propose Flowtron: an autoregressive flow-based generative network for text-to-speech synthesis with control over speech variation and style transfer. Flowtron borrows insights from IAF and revamps Tacotron in order to provide high-quality and expressive mel-spectrogram synthesis. Flowtron is optimized by maximizing the likelihood of the training data, which makes training simple a… ▽ More

    Submitted 16 July, 2020; v1 submitted 12 May, 2020; originally announced May 2020.

    Comments: 10 pages, 7 pictures

  11. arXiv:2001.09518  [pdf, other]

    cs.CV

    Unsupervised Disentanglement of Pose, Appearance and Background from Images and Videos

    Authors: Aysegul Dundar, Kevin J. Shih, Animesh Garg, Robert Pottorf, Andrew Tao, Bryan Catanzaro

    Abstract: Unsupervised landmark learning is the task of learning semantic keypoint-like representations without the use of expensive input keypoint-level annotations. A popular approach is to factorize an image into a pose and appearance data stream, then to reconstruct the image from the factorized components. The pose representation should capture a set of consistent and tightly localized landmarks in ord… ▽ More

    Submitted 26 January, 2020; originally announced January 2020.

  12. Light Field Synthesis by Training Deep Network in the Refocused Image Domain

    Authors: Chang-Le Liu, Kuang-Tsu Shih, Jiun-Woei Huang, Homer H. Chen

    Abstract: Light field imaging, which captures spatio-angular information of incident light on image sensor, enables many interesting applications like image refocusing and augmented reality. However, due to the limited sensor resolution, a trade-off exists between the spatial and angular resolution. To increase the angular resolution, view synthesis techniques have been adopted to generate new views from ex… ▽ More

    Submitted 28 April, 2020; v1 submitted 14 October, 2019; originally announced October 2019.

    Comments: Accepted to IEEE Transactions on Image Processing

  13. arXiv:1909.02749  [pdf, other]

    cs.CV cs.LG stat.ML

    Video Interpolation and Prediction with Unsupervised Landmarks

    Authors: Kevin J. Shih, Aysegul Dundar, Animesh Garg, Robert Pottorf, Andrew Tao, Bryan Catanzaro

    Abstract: Prediction and interpolation for long-range video data involves the complex task of modeling motion trajectories for each visible object, occlusions and dis-occlusions, as well as appearance changes due to viewpoint and lighting. Optical flow based techniques generalize but are suitable only for short temporal ranges. Many methods opt to project the video frames to a low dimensional latent space,… ▽ More

    Submitted 6 September, 2019; originally announced September 2019.

    Comments: Technical Report

  14. arXiv:1906.05928  [pdf, other]

    cs.CV

    Unsupervised Video Interpolation Using Cycle Consistency

    Authors: Fitsum A. Reda, Deqing Sun, Aysegul Dundar, Mohammad Shoeybi, Guilin Liu, Kevin J. Shih, Andrew Tao, Jan Kautz, Bryan Catanzaro

    Abstract: Learning to synthesize high frame rate videos via interpolation requires large quantities of high frame rate training videos, which, however, are scarce, especially at high resolutions. Here, we propose unsupervised techniques to synthesize high frame rate videos directly from low frame rate videos using cycle consistency. For a triplet of consecutive frames, we optimize models to minimize the dis… ▽ More

    Submitted 27 March, 2021; v1 submitted 13 June, 2019; originally announced June 2019.

    Comments: Published in ICCV 2019. Codes are available at https://github.com/NVIDIA/unsupervised-video-interpolation. Project website https://nv-adlr.github.io/publication/2019-UnsupervisedVideoInterpolation

  15. arXiv:1903.02728  [pdf, other]

    cs.CV

    Graphical Contrastive Losses for Scene Graph Parsing

    Authors: Ji Zhang, Kevin J. Shih, Ahmed Elgammal, Andrew Tao, Bryan Catanzaro

    Abstract: Most scene graph parsers use a two-stage pipeline to detect visual relationships: the first stage detects entities, and the second predicts the predicate for each entity pair using a softmax distribution. We find that such pipelines, trained with only a cross entropy loss over predicate classes, suffer from two common errors. The first, Entity Instance Confusion, occurs when the model confuses mul… ▽ More

    Submitted 16 August, 2019; v1 submitted 7 March, 2019; originally announced March 2019.

  16. arXiv:1812.01593  [pdf, other]

    cs.CV cs.AI cs.MM cs.RO

    Improving Semantic Segmentation via Video Propagation and Label Relaxation

    Authors: Yi Zhu, Karan Sapra, Fitsum A. Reda, Kevin J. Shih, Shawn Newsam, Andrew Tao, Bryan Catanzaro

    Abstract: Semantic segmentation requires large amounts of pixel-wise annotations to learn accurate models. In this paper, we present a video prediction-based methodology to scale up training sets by synthesizing new training samples in order to improve the accuracy of semantic segmentation networks. We exploit video prediction models' ability to predict future frames in order to also predict future labels.… ▽ More

    Submitted 2 July, 2019; v1 submitted 4 December, 2018; originally announced December 2018.

    Comments: CVPR 2019 Oral. Code link: https://github.com/NVIDIA/semantic-segmentation. YouTube link: https://www.youtube.com/watch?v=aEbXjGZDZSQ

  17. arXiv:1811.11718  [pdf, other]

    cs.CV

    Partial Convolution based Padding

    Authors: Guilin Liu, Kevin J. Shih, Ting-Chun Wang, Fitsum A. Reda, Karan Sapra, Zhiding Yu, Andrew Tao, Bryan Catanzaro

    Abstract: In this paper, we present a simple yet effective padding scheme that can be used as a drop-in module for existing convolutional neural networks. We call it partial convolution based padding, with the intuition that the padded region can be treated as holes and the original input as non-holes. Specifically, during the convolution operation, the convolution results are re-weighted near image borders… ▽ More

    Submitted 28 November, 2018; originally announced November 2018.

    Comments: 11 pages; code is available at https://github.com/NVIDIA/partialconv

  18. arXiv:1811.09543  [pdf, other]

    cs.CV

    An Interpretable Model for Scene Graph Generation

    Authors: Ji Zhang, Kevin Shih, Andrew Tao, Bryan Catanzaro, Ahmed Elgammal

    Abstract: We propose an efficient and interpretable scene graph generator. We consider three types of features: visual, spatial and semantic, and we use a late fusion strategy such that each feature's contribution can be explicitly investigated. We study the key factors about these features that have the most impact on the performance, and also visualize the learned visual features for relationships and inv… ▽ More

    Submitted 21 November, 2018; originally announced November 2018.

    Comments: arXiv admin note: substantial text overlap with arXiv:1811.00662

  19. Revisiting Image-Language Networks for Open-ended Phrase Detection

    Authors: Bryan A. Plummer, Kevin J. Shih, Yichen Li, Ke Xu, Svetlana Lazebnik, Stan Sclaroff, Kate Saenko

    Abstract: Most existing work that grounds natural language phrases in images starts with the assumption that the phrase in question is relevant to the image. In this paper we address a more realistic version of the natural language grounding task where we must both identify whether the phrase is relevant to an image and localize the phrase. This can also be viewed as a generalization of object detection to… ▽ More

    Submitted 12 October, 2020; v1 submitted 17 November, 2018; originally announced November 2018.

    Comments: Accepted to TPAMI

  20. arXiv:1811.00684  [pdf, other]

    cs.CV

    SDCNet: Video Prediction Using Spatially-Displaced Convolution

    Authors: Fitsum A. Reda, Guilin Liu, Kevin J. Shih, Robert Kirby, Jon Barker, David Tarjan, Andrew Tao, Bryan Catanzaro

    Abstract: We present an approach for high-resolution video frame prediction by conditioning on both past frames and past optical flows. Previous approaches rely on resampling past frames, guided by a learned future optical flow, or on direct generation of pixels. Resampling based on flow is insufficient because it cannot deal with disocclusions. Generative models currently lead to blurry results. Recent app… ▽ More

    Submitted 27 March, 2021; v1 submitted 1 November, 2018; originally announced November 2018.

    Comments: Published in ECCV 2018. Codes available at https://github.com/NVIDIA/semantic-segmentation/tree/sdcnet/sdcnet. Project page available at https://nv-adlr.github.io/publication/2018-SDCNet

  21. arXiv:1811.00662  [pdf, other]

    cs.CV

    Introduction to the 1st Place Winning Model of OpenImages Relationship Detection Challenge

    Authors: Ji Zhang, Kevin Shih, Andrew Tao, Bryan Catanzaro, Ahmed Elgammal

    Abstract: This article describes the model we built that achieved 1st place in the OpenImage Visual Relationship Detection Challenge on Kaggle. Three key factors contribute the most to our success: 1) language bias is a powerful baseline for this task. We build the empirical distribution $P(predicate|subject,object)$ in the training set and directly use that in testing. This baseline achieved the 2nd place… ▽ More

    Submitted 7 November, 2018; v1 submitted 1 November, 2018; originally announced November 2018.

  22. arXiv:1804.07723  [pdf, other]

    cs.CV

    Image Inpainting for Irregular Holes Using Partial Convolutions

    Authors: Guilin Liu, Fitsum A. Reda, Kevin J. Shih, Ting-Chun Wang, Andrew Tao, Bryan Catanzaro

    Abstract: Existing deep learning based image inpainting methods use a standard convolutional network over the corrupted image, using convolutional filter responses conditioned on both valid pixels as well as the substitute values in the masked holes (typically the mean value). This often leads to artifacts such as color discrepancy and blurriness. Post-processing is usually used to reduce such artifacts, bu… ▽ More

    Submitted 15 December, 2018; v1 submitted 20 April, 2018; originally announced April 2018.

    Comments: Update: camera-ready; L1 loss is size-averaged; code of partial conv layer: https://github.com/NVIDIA/partialconv. Published at ECCV 2018

  23. arXiv:1712.03463  [pdf, other]

    cs.CL

    Learning Interpretable Spatial Operations in a Rich 3D Blocks World

    Authors: Yonatan Bisk, Kevin J. Shih, Yejin Choi, Daniel Marcu

    Abstract: In this paper, we study the problem of mapping natural language instructions to complex spatial actions in a 3D blocks world. We first introduce a new dataset that pairs complex 3D spatial operations to rich natural language descriptions that require complex spatial and pragmatic interpretations such as "mirroring", "twisting", and "balancing". This dataset, built on the simulation environment of… ▽ More

    Submitted 24 December, 2017; v1 submitted 9 December, 2017; originally announced December 2017.

    Comments: AAAI 2018

  24. arXiv:1704.00260  [pdf, other]

    cs.CV cs.AI cs.LG cs.NE stat.ML

    Aligned Image-Word Representations Improve Inductive Transfer Across Vision-Language Tasks

    Authors: Tanmay Gupta, Kevin Shih, Saurabh Singh, Derek Hoiem

    Abstract: An important goal of computer vision is to build systems that learn visual representations over time that can be applied to many tasks. In this paper, we investigate a vision-language embedding as a core representation and show that it leads to better cross-task transfer than standard multi-task learning. In particular, the task of visual recognition is aligned to the task of visual question answe… ▽ More

    Submitted 16 October, 2017; v1 submitted 2 April, 2017; originally announced April 2017.

    Comments: Accepted in ICCV 2017. The arxiv version has an extra analysis on correlation with human attention

  25. arXiv:1511.07394  [pdf, other]

    cs.CV

    Where To Look: Focus Regions for Visual Question Answering

    Authors: Kevin J. Shih, Saurabh Singh, Derek Hoiem

    Abstract: We present a method that learns to answer visual questions by selecting image regions relevant to the text-based query. Our method exhibits significant improvements in answering questions such as "what color," where it is necessary to evaluate a specific location, and "what room," where it selectively identifies informative image regions. Our model is tested on the VQA dataset which is the largest… ▽ More

    Submitted 10 January, 2016; v1 submitted 23 November, 2015; originally announced November 2015.

    Comments: Submitted to CVPR2016

  26. arXiv:1507.06332  [pdf, other]

    cs.CV

    Part Localization using Multi-Proposal Consensus for Fine-Grained Categorization

    Authors: Kevin J. Shih, Arun Mallya, Saurabh Singh, Derek Hoiem

    Abstract: We present a simple deep learning framework to simultaneously predict keypoint locations and their respective visibilities and use those to achieve state-of-the-art performance for fine-grained classification. We show that by conditioning the predictions on object proposals with sufficient image support, our method can do well without complicated spatial reasoning. Instead, inference methods with… ▽ More

    Submitted 22 July, 2015; originally announced July 2015.

    Comments: BMVC 2015

  27. arXiv:1411.5307  [pdf, other]

    cs.IR cs.CV

    Efficient Media Retrieval from Non-Cooperative Queries

    Authors: Kevin Shih, Wei Di, Vignesh Jagadeesh, Robinson Piramuthu

    Abstract: Text is ubiquitous in the artificial world and easily attainable when it comes to book title and author names. Using the images from the book cover set from the Stanford Mobile Visual Search dataset and additional book covers and metadata from openlibrary.org, we construct a large scale book cover retrieval dataset, complete with 100K distractor covers and title and author strings for each. Becaus… ▽ More

    Submitted 19 November, 2014; originally announced November 2014.

    Comments: 8 pages, 9 figures, 1 table