Showing 1–20 of 20 results for author: Ka, D

  1. arXiv:2403.05530

    cs.CL cs.AI

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Authors: Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, Soroosh Mariooryad, Yifan Ding, Xinyang Geng, Fred Alcober, Roy Frostig, Mark Omernick, Lexi Walker, Cosmin Paduraru, Christina Sorokin, Andrea Tacchetti, Colin Gaffney, Samira Daruki, Olcan Sercinoglu, Zach Gleicher, Juliette Love, et al. (1092 additional authors not shown)

    Abstract: In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February…

    Submitted 14 June, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

  2. arXiv:2212.03232

    cs.LG cs.AI stat.ML

    Learning the joint distribution of two sequences using little or no paired data

    Authors: Soroosh Mariooryad, Matt Shannon, Siyuan Ma, Tom Bagby, David Kao, Daisy Stanton, Eric Battenberg, RJ Skerry-Ryan

    Abstract: We present a noisy channel generative model of two sequences, for example text and speech, which enables uncovering the association between the two modalities when limited paired data is available. To address the intractability of the exact model under a realistic data setup, we propose a variational inference approximation. To train this variational model with categorical data, we propose a KL en…

    Submitted 6 December, 2022; originally announced December 2022.

  3. Audio Matters Too: How Audial Avatar Customization Enhances Visual Avatar Customization

    Authors: Dominic Kao, Rabindra Ratan, Christos Mousas, Amogh Joshi, Edward F. Melcer

    Abstract: Avatar customization is known to positively affect crucial outcomes in numerous domains. However, it is unknown whether audial customization can confer the same benefits as visual customization. We conducted a preregistered 2 x 2 (visual choice vs. visual assignment x audial choice vs. audial assignment) study in a Java programming game. Participants with visual choice experienced higher avatar id…

    Submitted 10 February, 2022; originally announced February 2022.

    Comments: 27 pages

  4. arXiv:2111.05095

    cs.SD cs.CL cs.LG eess.AS

    Speaker Generation

    Authors: Daisy Stanton, Matt Shannon, Soroosh Mariooryad, RJ Skerry-Ryan, Eric Battenberg, Tom Bagby, David Kao

    Abstract: This work explores the task of synthesizing speech in nonexistent human-sounding voices. We call this task "speaker generation", and present TacoSpawn, a system that performs competitively at this task. TacoSpawn is a recurrent attention-based text-to-speech model that learns a distribution over a speaker embedding space, which enables sampling of novel and diverse speakers. Our method is easy to…

    Submitted 7 November, 2021; originally announced November 2021.

    Comments: 12 pages, 3 figures, 4 tables, appendix with 2 tables

    ACM Class: I.2.7; G.3

  5. arXiv:2107.03554

    cs.CV cs.CY

    Automated Object Behavioral Feature Extraction for Potential Risk Analysis based on Video Sensor

    Authors: Byeongjoon Noh, Dongho Ka, Wonjun Noh, Hwasoo Yeo

    Abstract: Pedestrians are exposed to risk of death or serious injuries on roads, especially unsignalized crosswalks, for a variety of reasons. To date, an extensive variety of studies have reported on vision based traffic safety system. However, many studies required manual inspection of the volumes of traffic video to reliably obtain traffic related objects behavioral factors. In this paper, we propose an…

    Submitted 25 October, 2021; v1 submitted 7 July, 2021; originally announced July 2021.

    Comments: 6 pages, 9 figures

  6. arXiv:2105.02582

    cs.CV

    Vision based Pedestrian Potential Risk Analysis based on Automated Behavior Feature Extraction for Smart and Safe City

    Authors: Byeongjoon Noh, Dongho Ka, David Lee, Hwasoo Yeo

    Abstract: Despite recent advances in vehicle safety technologies, road traffic accidents still pose a severe threat to human lives and have become a leading cause of premature deaths. In particular, crosswalks present a major threat to pedestrians, but we lack dense behavioral data to investigate the risks they face. Therefore, we propose a comprehensive analytical model for pedestrian potential risk using…

    Submitted 27 May, 2021; v1 submitted 6 May, 2021; originally announced May 2021.

    Comments: 26 pages, 15 figures, 5 tables

  7. arXiv:2010.08029

    cs.LG stat.ML

    Non-saturating GAN training as divergence minimization

    Authors: Matt Shannon, Ben Poole, Soroosh Mariooryad, Tom Bagby, Eric Battenberg, David Kao, Daisy Stanton, RJ Skerry-Ryan

    Abstract: Non-saturating generative adversarial network (GAN) training is widely used and has continued to obtain groundbreaking results. However so far this approach has lacked strong theoretical justification, in contrast to alternatives such as f-GANs and Wasserstein GANs which are motivated in terms of approximate divergence minimization. In this paper we show that non-saturating GAN training does in fa…

    Submitted 15 October, 2020; originally announced October 2020.

  8. arXiv:2009.10868

    cs.CV

    A Real-Time Predictive Pedestrian Collision Warning Service for Cooperative Intelligent Transportation Systems Using 3D Pose Estimation

    Authors: Ue-Hwan Kim, Dongho Ka, Hwasoo Yeo, Jong-Hwan Kim

    Abstract: Minimizing traffic accidents between vehicles and pedestrians is one of the primary research goals in intelligent transportation systems. To achieve the goal, pedestrian orientation recognition and prediction of pedestrian's crossing or not-crossing intention play a central role. Contemporary approaches do not guarantee satisfactory performance due to limited field-of-view, lack of generalization,…

    Submitted 21 February, 2022; v1 submitted 22 September, 2020; originally announced September 2020.

    Comments: 12 pages, 8 figures, 4 tables

  9. arXiv:2007.04495

    cs.HC

    Hack.VR: A Programming Game in Virtual Reality

    Authors: Dominic Kao, Christos Mousas, Alejandra J. Magana, D. Fox Harrell, Rabindra Ratan, Edward F. Melcer, Brett Sherrick, Paul Parsons, Dmitri A. Gusev

    Abstract: In this article we describe Hack.VR, an object-oriented programming game in virtual reality. Hack.VR uses a VR programming language in which nodes represent functions and node connections represent data flow. Using this programming framework, players reprogram VR objects such as elevators, robots, and switches. Hack.VR has been designed to be highly interactable both physically and semantically.

    Submitted 23 November, 2020; v1 submitted 8 July, 2020; originally announced July 2020.

  10. arXiv:2006.03519

    cs.HC

    Exploring Help Facilities in Game-Making Software

    Authors: Dominic Kao

    Abstract: Help facilities have been crucial in helping users learn about software for decades. But despite widespread prevalence of game engines and game editors that ship with many of today's most popular games, there is a lack of empirical evidence on how help facilities impact game-making. For instance, certain types of help facilities may help users more than others. To better understand help facilities…

    Submitted 5 June, 2020; originally announced June 2020.

  11. CheXplain: Enabling Physicians to Explore and Understand Data-Driven, AI-Enabled Medical Imaging Analysis

    Authors: Yao Xie, Melody Chen, David Kao, Ge Gao, Xiang 'Anthony' Chen

    Abstract: The recent development of data-driven AI promises to automate medical diagnosis; however, most AI functions as 'black boxes' to physicians with limited computational knowledge. Using medical imaging as a point of departure, we conducted three iterations of design activities to formulate CheXplain---a system that enables physicians to explore and understand AI-enabled chest X-ray analysis: (1) a pa…

    Submitted 19 January, 2020; v1 submitted 15 January, 2020; originally announced January 2020.

    Comments: 10 pages, 5 figures

    ACM Class: H.5.m

  12. arXiv:1910.10288

    cs.CL cs.LG cs.SD eess.AS

    Location-Relative Attention Mechanisms For Robust Long-Form Speech Synthesis

    Authors: Eric Battenberg, RJ Skerry-Ryan, Soroosh Mariooryad, Daisy Stanton, David Kao, Matt Shannon, Tom Bagby

    Abstract: Despite the ability to produce human-level speech for in-domain text, attention-based end-to-end text-to-speech (TTS) systems suffer from text alignment failures that increase in frequency for out-of-domain text. We show that these failures can be addressed using simple location-relative attention mechanisms that do away with content-based query/key comparisons. We compare two families of attentio…

    Submitted 22 April, 2020; v1 submitted 22 October, 2019; originally announced October 2019.

    Comments: Accepted to ICASSP 2020

  13. arXiv:1910.01709

    cs.CL cs.LG cs.SD eess.AS

    Semi-Supervised Generative Modeling for Controllable Speech Synthesis

    Authors: Raza Habib, Soroosh Mariooryad, Matt Shannon, Eric Battenberg, RJ Skerry-Ryan, Daisy Stanton, David Kao, Tom Bagby

    Abstract: We present a novel generative model that combines state-of-the-art neural text-to-speech (TTS) with semi-supervised probabilistic latent variable models. By providing partial supervision to some of the latent variables, we are able to force them to take on consistent and interpretable purposes, which previously hasn't been possible with purely unsupervised TTS models. We demonstrate that our model…

    Submitted 3 October, 2019; originally announced October 2019.

  14. arXiv:1906.03402

    cs.CL cs.LG cs.SD eess.AS

    Effective Use of Variational Embedding Capacity in Expressive End-to-End Speech Synthesis

    Authors: Eric Battenberg, Soroosh Mariooryad, Daisy Stanton, RJ Skerry-Ryan, Matt Shannon, David Kao, Tom Bagby

    Abstract: Recent work has explored sequence-to-sequence latent variable models for expressive speech synthesis (supporting control and transfer of prosody and style), but has not presented a coherent framework for understanding the trade-offs between the competing methods. In this paper, we propose embedding capacity (the amount of information the embedding contains about the data) as a unified method of an…

    Submitted 25 October, 2019; v1 submitted 8 June, 2019; originally announced June 2019.

    Comments: Submitted to ICLR 2020

  15. arXiv:1504.06018

    cs.IT

    Blind Index Coding

    Authors: David T. H. Kao, Mohammad Ali Maddah-Ali, A. Salman Avestimehr

    Abstract: We introduce the blind index coding (BIC) problem, in which a single sender communicates distinct messages to multiple users over a shared channel. Each user has partial knowledge of each message as side information. However, unlike classic index coding, in BIC, the sender is uncertain of what side information is available to each user. In particular, the sender only knows the amount of bits in ea…

    Submitted 1 September, 2015; v1 submitted 22 April, 2015; originally announced April 2015.

    Comments: Parts of this paper were presented at ISIT 2015 and ICC 2015

  16. arXiv:1504.04797

    cs.IT

    Rover-to-Orbiter Communication in Mars: Taking Advantage of the Varying Topology

    Authors: Songze Li, David T. H. Kao, A. Salman Avestimehr

    Abstract: In this paper, we study the communication problem from rovers on Mars' surface to Mars-orbiting satellites. We first justify that, to a good extent, the rover-to-orbiter communication problem can be modelled as communication over a $2 \times 2$ X-channel with the network topology varying over time. For such a fading X-channel where transmitters are only aware of the time-varying topology but not t…

    Submitted 10 December, 2015; v1 submitted 19 April, 2015; originally announced April 2015.

    Comments: 13 pages, 6 figures. Accepted by IEEE Transactions on Communications

  17. arXiv:1405.1091

    cs.IT

    Linear Degrees of Freedom of the MIMO X-Channel with Delayed CSIT

    Authors: David T. H. Kao, A. Salman Avestimehr

    Abstract: We study the degrees of freedom (DoF) of the multiple-input multiple-output X-channel (MIMO XC) with delayed channel state information at the transmitters (delayed CSIT), assuming linear coding strategies at the transmitters. We present two results: 1) the linear sum DoF for MIMO XC with general antenna configurations, and 2) the linear DoF region for MIMO XC with symmetric antennas. The converse…

    Submitted 5 May, 2014; originally announced May 2014.

    Comments: to be presented in part at ISIT 2014

  18. arXiv:1305.3934

    cs.IT

    An Upper Bound on the Capacity of Vector Dirty Paper with Unknown Spin and Stretch

    Authors: David T. H. Kao, Ashutosh Sabharwal

    Abstract: Dirty paper codes are a powerful tool for combating known interference. However, there is a significant difference between knowing the transmitted interference sequence and knowing the received interference sequence, especially when the channel modifying the interference is uncertain. We present an upper bound on the capacity of a compound vector dirty paper channel where although an additive Gaus…

    Submitted 16 May, 2013; originally announced May 2013.

    Comments: to be presented at ISIT 2013

  19. arXiv:1110.0886

    cs.IT

    Two-User Interference Channels with Local Views: On Capacity Regions of TDM-Dominating Policies

    Authors: David T. -H. Kao, Ashutosh Sabharwal

    Abstract: We study the capacity regions of two-user interference channels where transmitters base their transmission schemes on local views of the channel state. Under the local view model, each transmitter knows only a subset of the four channel gains, which may be mismatched from the other transmitter. We consider a set of seven local views, and find that for five out of the seven local views, TDM is su…

    Submitted 29 July, 2012; v1 submitted 4 October, 2011; originally announced October 2011.

    Comments: revised 22 Jun, including updated title

  20. arXiv:0906.0557

    cs.NI cs.PF

    An Axiomatic Theory of Fairness in Network Resource Allocation

    Authors: Tian Lan, David Kao, Mung Chiang, Ashutosh Sabharwal

    Abstract: We present a set of five axioms for fairness measures in resource allocation. A family of fairness measures satisfying the axioms is constructed. Well-known notions such as alpha-fairness, Jain's index, and entropy are shown to be special cases. Properties of fairness measures satisfying the axioms are proven, including Schur-concavity. Among the engineering implications is a generalized Jain's…

    Submitted 7 October, 2009; v1 submitted 2 June, 2009; originally announced June 2009.