Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Authors: Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, Soroosh Mariooryad, Yifan Ding, Xinyang Geng, Fred Alcober, Roy Frostig, Mark Omernick, Lexi Walker, Cosmin Paduraru, Christina Sorokin, Andrea Tacchetti, Colin Gaffney, Samira Daruki, Olcan Sercinoglu, Zach Gleicher, Juliette Love, et al. (1092 additional authors not shown)

Abstract: In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.

Submitted 14 June, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

arXiv:2212.03232 [pdf, other]

Learning the joint distribution of two sequences using little or no paired data

Authors: Soroosh Mariooryad, Matt Shannon, Siyuan Ma, Tom Bagby, David Kao, Daisy Stanton, Eric Battenberg, RJ Skerry-Ryan

Abstract: We present a noisy channel generative model of two sequences, for example text and speech, which enables uncovering the association between the two modalities when limited paired data is available. To address the intractability of the exact model under a realistic data setup, we propose a variational inference approximation. To train this variational model with categorical data, we propose a KL encoder loss approach which has connections to the wake-sleep algorithm. Identifying the joint or conditional distributions by only observing unpaired samples from the marginals is only possible under certain conditions in the data distribution and we discuss under what type of conditional independence assumptions that might be achieved, which guides the architecture designs. Experimental results show that even tiny amount of paired data (5 minutes) is sufficient to learn to relate the two modalities (graphemes and phonemes here) when a massive amount of unpaired data is available, paving the path to adopting this principled approach for all seq2seq models in low data resource regimes.

Submitted 6 December, 2022; originally announced December 2022.

arXiv:2202.05315 [pdf, other]

doi 10.1145/3491102.3501848

Audio Matters Too: How Audial Avatar Customization Enhances Visual Avatar Customization

Authors: Dominic Kao, Rabindra Ratan, Christos Mousas, Amogh Joshi, Edward F. Melcer

Abstract: Avatar customization is known to positively affect crucial outcomes in numerous domains. However, it is unknown whether audial customization can confer the same benefits as visual customization. We conducted a preregistered 2 x 2 (visual choice vs. visual assignment x audial choice vs. audial assignment) study in a Java programming game. Participants with visual choice experienced higher avatar identification and autonomy. Participants with audial choice experienced higher avatar identification and autonomy, but only within the group of participants who had visual choice available. Visual choice led to an increase in time spent, and indirectly led to increases in intrinsic motivation, immersion, time spent, future play motivation, and likelihood of game recommendation. Audial choice moderated the majority of these effects. Our results suggest that audial customization plays an important enhancing role vis-à-vis visual customization. However, audial customization appears to have a weaker effect compared to visual customization. We discuss the implications for avatar customization more generally across digital applications.

Submitted 10 February, 2022; originally announced February 2022.

Comments: 27 pages

arXiv:2111.05095 [pdf, other]

Speaker Generation

Authors: Daisy Stanton, Matt Shannon, Soroosh Mariooryad, RJ Skerry-Ryan, Eric Battenberg, Tom Bagby, David Kao

Abstract: This work explores the task of synthesizing speech in nonexistent human-sounding voices. We call this task "speaker generation", and present TacoSpawn, a system that performs competitively at this task. TacoSpawn is a recurrent attention-based text-to-speech model that learns a distribution over a speaker embedding space, which enables sampling of novel and diverse speakers. Our method is easy to implement, and does not require transfer learning from speaker ID systems. We present objective and subjective metrics for evaluating performance on this task, and demonstrate that our proposed objective metrics correlate with human perception of speaker similarity. Audio samples are available on our demo page.

Submitted 7 November, 2021; originally announced November 2021.

Comments: 12 pages, 3 figures, 4 tables, appendix with 2 tables

ACM Class: I.2.7; G.3

arXiv:2107.03554 [pdf]

Automated Object Behavioral Feature Extraction for Potential Risk Analysis based on Video Sensor

Authors: Byeongjoon Noh, Dongho Ka, Wonjun Noh, Hwasoo Yeo

Abstract: Pedestrians are exposed to risk of death or serious injuries on roads, especially unsignalized crosswalks, for a variety of reasons. To date, an extensive variety of studies have reported on vision based traffic safety system. However, many studies required manual inspection of the volumes of traffic video to reliably obtain traffic related objects behavioral factors. In this paper, we propose an automated and simpler system for effectively extracting object behavioral features from video sensors deployed on the road. We conduct basic statistical analysis on these features, and show how they can be useful for monitoring the traffic behavior on the road. We confirm the feasibility of the proposed system by applying our prototype to two unsignalized crosswalks in Osan city, South Korea. To conclude, we compare behaviors of vehicles and pedestrians in those two areas by simple statistical analysis. This study demonstrates the potential for a network of connected video sensors to provide actionable data for smart cities to improve pedestrian safety in dangerous road environments.

Submitted 25 October, 2021; v1 submitted 7 July, 2021; originally announced July 2021.

Comments: 6 pages, 9 figures

arXiv:2105.02582 [pdf]

Vision based Pedestrian Potential Risk Analysis based on Automated Behavior Feature Extraction for Smart and Safe City

Authors: Byeongjoon Noh, Dongho Ka, David Lee, Hwasoo Yeo

Abstract: Despite recent advances in vehicle safety technologies, road traffic accidents still pose a severe threat to human lives and have become a leading cause of premature deaths. In particular, crosswalks present a major threat to pedestrians, but we lack dense behavioral data to investigate the risks they face. Therefore, we propose a comprehensive analytical model for pedestrian potential risk using video footage gathered by road security cameras deployed at such crossings. The proposed system automatically detects vehicles and pedestrians, calculates trajectories by frames, and extracts behavioral features affecting the likelihood of potentially dangerous scenes between these objects. Finally, we design a data cube model by using the large amount of the extracted features accumulated in a data warehouse to perform multidimensional analysis for potential risk scenes with levels of abstraction, but this is beyond the scope of this paper, and will be detailed in a future study. In our experiment, we focused on extracting the various behavioral features from multiple crosswalks, and visualizing and interpreting their behaviors and relationships among them by camera location to show how they may or may not contribute to potential risk. We validated feasibility and applicability by applying it in multiple crosswalks in Osan city, Korea.

Submitted 27 May, 2021; v1 submitted 6 May, 2021; originally announced May 2021.

Comments: 26 pages, 15 figures, 5 tables

arXiv:2010.08029 [pdf, other]

Non-saturating GAN training as divergence minimization

Authors: Matt Shannon, Ben Poole, Soroosh Mariooryad, Tom Bagby, Eric Battenberg, David Kao, Daisy Stanton, RJ Skerry-Ryan

Abstract: Non-saturating generative adversarial network (GAN) training is widely used and has continued to obtain groundbreaking results. However so far this approach has lacked strong theoretical justification, in contrast to alternatives such as f-GANs and Wasserstein GANs which are motivated in terms of approximate divergence minimization. In this paper we show that non-saturating GAN training does in fact approximately minimize a particular f-divergence. We develop general theoretical tools to compare and classify f-divergences and use these to show that the new f-divergence is qualitatively similar to reverse KL. These results help to explain the high sample quality but poor diversity often observed empirically when using this scheme.

Submitted 15 October, 2020; originally announced October 2020.

arXiv:2009.10868 [pdf, other]

A Real-Time Predictive Pedestrian Collision Warning Service for Cooperative Intelligent Transportation Systems Using 3D Pose Estimation

Authors: Ue-Hwan Kim, Dongho Ka, Hwasoo Yeo, Jong-Hwan Kim

Abstract: Minimizing traffic accidents between vehicles and pedestrians is one of the primary research goals in intelligent transportation systems. To achieve the goal, pedestrian orientation recognition and prediction of pedestrian's crossing or not-crossing intention play a central role. Contemporary approaches do not guarantee satisfactory performance due to limited field-of-view, lack of generalization, and high computational complexity. To overcome these limitations, we propose a real-time predictive pedestrian collision warning service (P2CWS) for two tasks: pedestrian orientation recognition (100.53 FPS) and intention prediction (35.76 FPS). Our framework obtains satisfying generalization over multiple sites because of the proposed site-independent features. At the center of the feature extraction lies 3D pose estimation. The 3D pose analysis enables robust and accurate recognition of pedestrian orientations and prediction of intentions over multiple sites. The proposed vision framework realizes 89.3% accuracy in the behavior recognition task on the TUD dataset without any training process and 91.28% accuracy in intention prediction on our dataset achieving new state-of-the-art performance. To contribute to the corresponding research community, we make our source codes public which are available at https://github.com/Uehwan/VisionForPedestrian

Submitted 21 February, 2022; v1 submitted 22 September, 2020; originally announced September 2020.

Comments: 12 pages, 8 figures, 4 tables

arXiv:2007.04495 [pdf, other]

Hack.VR: A Programming Game in Virtual Reality

Authors: Dominic Kao, Christos Mousas, Alejandra J. Magana, D. Fox Harrell, Rabindra Ratan, Edward F. Melcer, Brett Sherrick, Paul Parsons, Dmitri A. Gusev

Abstract: In this article we describe Hack.VR, an object-oriented programming game in virtual reality. Hack.VR uses a VR programming language in which nodes represent functions and node connections represent data flow. Using this programming framework, players reprogram VR objects such as elevators, robots, and switches. Hack.VR has been designed to be highly interactable both physically and semantically.

Submitted 23 November, 2020; v1 submitted 8 July, 2020; originally announced July 2020.

arXiv:2006.03519 [pdf, other]

Exploring Help Facilities in Game-Making Software

Authors: Dominic Kao

Abstract: Help facilities have been crucial in helping users learn about software for decades. But despite widespread prevalence of game engines and game editors that ship with many of today's most popular games, there is a lack of empirical evidence on how help facilities impact game-making. For instance, certain types of help facilities may help users more than others. To better understand help facilities, we created game-making software that allowed us to systematically vary the type of help available. We then ran a study of 1646 participants that compared six help facility conditions: 1) Text Help, 2) Interactive Help, 3) Intelligent Agent Help, 4) Video Help, 5) All Help, and 6) No Help. Each participant created their own first-person shooter game level using our game-making software with a randomly assigned help facility condition. Results indicate that Interactive Help has a greater positive impact on time spent, controls learnability, learning motivation, total editor activity, and game level quality. Video Help is a close second across these same measures.

Submitted 5 June, 2020; originally announced June 2020.

arXiv:2001.05149 [pdf, other]

doi 10.1145/3313831.3376807

CheXplain: Enabling Physicians to Explore and Understand Data-Driven, AI-Enabled Medical Imaging Analysis

Authors: Yao Xie, Melody Chen, David Kao, Ge Gao, Xiang 'Anthony' Chen

Abstract: The recent development of data-driven AI promises to automate medical diagnosis; however, most AI functions as 'black boxes' to physicians with limited computational knowledge. Using medical imaging as a point of departure, we conducted three iterations of design activities to formulate CheXplain---a system that enables physicians to explore and understand AI-enabled chest X-ray analysis: (1) a paired survey between referring physicians and radiologists reveals whether, when, and what kinds of explanations are needed; (2) a low-fidelity prototype co-designed with three physicians formulates eight key features; and (3) a high-fidelity prototype evaluated by another six physicians provides detailed summative insights on how each feature enables the exploration and understanding of AI. We summarize by discussing recommendations for future work to design and implement explainable medical AI systems that encompass four recurring themes: motivation, constraint, explanation, and justification.

Submitted 19 January, 2020; v1 submitted 15 January, 2020; originally announced January 2020.

Comments: 10 pages, 5 figures

ACM Class: H.5.m

arXiv:1910.10288 [pdf, other]

Location-Relative Attention Mechanisms For Robust Long-Form Speech Synthesis

Authors: Eric Battenberg, RJ Skerry-Ryan, Soroosh Mariooryad, Daisy Stanton, David Kao, Matt Shannon, Tom Bagby

Abstract: Despite the ability to produce human-level speech for in-domain text, attention-based end-to-end text-to-speech (TTS) systems suffer from text alignment failures that increase in frequency for out-of-domain text. We show that these failures can be addressed using simple location-relative attention mechanisms that do away with content-based query/key comparisons. We compare two families of attention mechanisms: location-relative GMM-based mechanisms and additive energy-based mechanisms. We suggest simple modifications to GMM-based attention that allow it to align quickly and consistently during training, and introduce a new location-relative attention mechanism to the additive energy-based family, called Dynamic Convolution Attention (DCA). We compare the various mechanisms in terms of alignment speed and consistency during training, naturalness, and ability to generalize to long utterances, and conclude that GMM attention and DCA can generalize to very long utterances, while preserving naturalness for shorter, in-domain utterances.

Submitted 22 April, 2020; v1 submitted 22 October, 2019; originally announced October 2019.

Comments: Accepted to ICASSP 2020

arXiv:1910.01709 [pdf, other]

Semi-Supervised Generative Modeling for Controllable Speech Synthesis

Authors: Raza Habib, Soroosh Mariooryad, Matt Shannon, Eric Battenberg, RJ Skerry-Ryan, Daisy Stanton, David Kao, Tom Bagby

Abstract: We present a novel generative model that combines state-of-the-art neural text-to-speech (TTS) with semi-supervised probabilistic latent variable models. By providing partial supervision to some of the latent variables, we are able to force them to take on consistent and interpretable purposes, which previously hasn't been possible with purely unsupervised TTS models. We demonstrate that our model is able to reliably discover and control important but rarely labelled attributes of speech, such as affect and speaking rate, with as little as 1% (30 minutes) supervision. Even at such low supervision levels we do not observe a degradation of synthesis quality compared to a state-of-the-art baseline. Audio samples are available on the web.

Submitted 3 October, 2019; originally announced October 2019.

arXiv:1906.03402 [pdf, other]

Effective Use of Variational Embedding Capacity in Expressive End-to-End Speech Synthesis

Authors: Eric Battenberg, Soroosh Mariooryad, Daisy Stanton, RJ Skerry-Ryan, Matt Shannon, David Kao, Tom Bagby

Abstract: Recent work has explored sequence-to-sequence latent variable models for expressive speech synthesis (supporting control and transfer of prosody and style), but has not presented a coherent framework for understanding the trade-offs between the competing methods. In this paper, we propose embedding capacity (the amount of information the embedding contains about the data) as a unified method of analyzing the behavior of latent variable models of speech, comparing existing heuristic (non-variational) methods to variational methods that are able to explicitly constrain capacity using an upper bound on representational mutual information. In our proposed model (Capacitron), we show that by adding conditional dependencies to the variational posterior such that it matches the form of the true posterior, the same model can be used for high-precision prosody transfer, text-agnostic style transfer, and generation of natural-sounding prior samples. For multi-speaker models, Capacitron is able to preserve target speaker identity during inter-speaker prosody transfer and when drawing samples from the latent prior. Lastly, we introduce a method for decomposing embedding capacity hierarchically across two sets of latents, allowing a portion of the latent variability to be specified and the remaining variability sampled from a learned prior. Audio examples are available on the web.

Submitted 25 October, 2019; v1 submitted 8 June, 2019; originally announced June 2019.

Comments: Submitted to ICLR 2020

arXiv:1504.06018 [pdf, other]

Blind Index Coding

Authors: David T. H. Kao, Mohammad Ali Maddah-Ali, A. Salman Avestimehr

Abstract: We introduce the blind index coding (BIC) problem, in which a single sender communicates distinct messages to multiple users over a shared channel. Each user has partial knowledge of each message as side information. However, unlike classic index coding, in BIC, the sender is uncertain of what side information is available to each user. In particular, the sender only knows the amount of bits in each user's side information but not its content. This problem can arise naturally in caching and wireless networks. In order to blindly exploit side information in the BIC problem, we develop a hybrid coding scheme that XORs uncoded bits of a subset of messages with random combinations of bits from other messages. This scheme allows us to strike the right balance between maximizing the transmission rate to each user and minimizing the interference leakage to others. We also develop a general outer bound, which relies on a strong data processing inequality to effectively capture the senders uncertainty about the users' side information. Additionally, we consider the case where communication takes place over a shared wireless medium, modeled by an erasure broadcast channel, and show that surprisingly, combining repetition coding with hybrid coding improves the achievable rate region and outperforms alternative strategies of coping with channel erasure and while blindly exploiting side information.

Submitted 1 September, 2015; v1 submitted 22 April, 2015; originally announced April 2015.

Comments: Parts of this paper were presented at ISIT 2015 and ICC 2015

arXiv:1504.04797 [pdf, other]

Rover-to-Orbiter Communication in Mars: Taking Advantage of the Varying Topology

Authors: Songze Li, David T. H. Kao, A. Salman Avestimehr

Abstract: In this paper, we study the communication problem from rovers on Mars' surface to Mars-orbiting satellites. We first justify that, to a good extent, the rover-to-orbiter communication problem can be modelled as communication over a $2 \times 2$ X-channel with the network topology varying over time. For such a fading X-channel where transmitters are only aware of the time-varying topology but not the time-varying channel state (i.e., no CSIT), we propose coding strategies that code across topologies, and develop upper bounds on the sum degrees-of-freedom (DoF) that is shown to be tight under certain pattern of the topology variation. Furthermore we demonstrate that the proposed scheme approximately achieves the ergodic sum-capacity of the network. Using the proposed coding scheme, we numerically evaluate the ergodic rate gain over a time-division-multiple-access (TDMA) scheme for Rayleigh and Rice fading channels. We also numerically demonstrate that with practical orbital parameters, a 9.6% DoF gain, as well as more than 11.6% throughput gain can be achieved for a rover-to-orbiter communication network.

Submitted 10 December, 2015; v1 submitted 19 April, 2015; originally announced April 2015.

Comments: 13 pages, 6 figures. Accepted by IEEE Transactions on Communications

arXiv:1405.1091 [pdf, other]

Linear Degrees of Freedom of the MIMO X-Channel with Delayed CSIT

Authors: David T. H. Kao, A. Salman Avestimehr

Abstract: We study the degrees of freedom (DoF) of the multiple-input multiple-output X-channel (MIMO XC) with delayed channel state information at the transmitters (delayed CSIT), assuming linear coding strategies at the transmitters. We present two results: 1) the linear sum DoF for MIMO XC with general antenna configurations, and 2) the linear DoF region for MIMO XC with symmetric antennas. The converse for each result is based on developing a novel rank-ratio inequality that characterizes the maximum ratio between the dimensions of received linear subspaces at the two multiple-antenna receivers. The achievability of the linear sum DoF is based on a three-phase strategy, in which during the first two phases only the transmitter with fewer antennas exploits delayed CSIT in order to minimize the dimension of its signal at the unintended receiver. During Phase 3, both transmitters use delayed CSIT to send linear combinations of past transmissions such that each receiver receives a superposition of desired message data and known interference, thus simultaneously serving both receivers. We also derive other linear DoF outer bounds for the MIMO XC that, in addition to the outer bounds from the sum DoF converse and the proposed transmission strategy, allow us to characterize the linear DoF region for symmetric antenna configurations.

Submitted 5 May, 2014; originally announced May 2014.

Comments: to be presented in part at ISIT 2014

arXiv:1305.3934 [pdf, other]

An Upper Bound on the Capacity of Vector Dirty Paper with Unknown Spin and Stretch

Authors: David T. H. Kao, Ashutosh Sabharwal

Abstract: Dirty paper codes are a powerful tool for combating known interference. However, there is a significant difference between knowing the transmitted interference sequence and knowing the received interference sequence, especially when the channel modifying the interference is uncertain. We present an upper bound on the capacity of a compound vector dirty paper channel where although an additive Gaussian sequence is known to the transmitter, the channel matrix between the interferer and receiver is uncertain but known to lie within a bounded set. Our bound is tighter than previous bounds in the low-SIR regime for the scalar version of the compound dirty paper channel and employs a construction that focuses on the relationship between the dimension of the message-bearing signal and the dimension of the additive state sequence. Additionally, a bound on the high-SNR behavior of the system is established.

Submitted 16 May, 2013; originally announced May 2013.

Comments: to be presented at ISIT 2013

arXiv:1110.0886 [pdf, other]

Two-User Interference Channels with Local Views: On Capacity Regions of TDM-Dominating Policies

Authors: David T. -H. Kao, Ashutosh Sabharwal

Abstract: We study the capacity regions of two-user interference channels where transmitters base their transmission schemes on local views of the channel state. Under the local view model, each transmitter knows only a subset of the four channel gains, which may be mismatched from the other transmitter. We consider a set of seven local views, and find that for five out of the seven local views, TDM is sufficient to achieve the qualified notion of capacity region for the linear deterministic interference channel which approximates the Gaussian interference channel. For these five local views, the qualified capacity result implies that no policy can achieve a rate point outside the TDM region without inducing a corner case of sub-TDM performance in another channel state. The common trait shared by the two remaining local views - those with the potential to outperform TDM - is transmitter knowledge of the outgoing interference link accompanied by some common knowledge of state, emphasizing their importance in creating opportunities to coordinate usage of more advanced schemes. Our conclusions are extended to bounded gap characterizations of the capacity region for the Gaussian interference channel.

Submitted 29 July, 2012; v1 submitted 4 October, 2011; originally announced October 2011.

Comments: revised 22 Jun, including updated title

arXiv:0906.0557 [pdf, ps, other]

An Axiomatic Theory of Fairness in Network Resource Allocation

Authors: Tian Lan, David Kao, Mung Chiang, Ashutosh Sabharwal

Abstract: We present a set of five axioms for fairness measures in resource allocation. A family of fairness measures satisfying the axioms is constructed. Well-known notions such as alpha-fairness, Jain's index, and entropy are shown to be special cases. Properties of fairness measures satisfying the axioms are proven, including Schur-concavity. Among the engineering implications is a generalized Jain's index that tunes the resolution of the fairness measure, a new understanding of alpha-fair utility functions, and an interpretation of "larger alpha is more fair". We also construct an alternative set of four axioms to capture efficiency objectives and feasibility constraints.

Submitted 7 October, 2009; v1 submitted 2 June, 2009; originally announced June 2009.

Showing 1–20 of 20 results for author: Ka, D