Showing 1–14 of 14 results for author: Akbari, H

Searching in archive cs.
  1. arXiv:2312.14125  [pdf, other]

    cs.CV cs.AI

    VideoPoet: A Large Language Model for Zero-Shot Video Generation

    Authors: Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, Krishna Somandepalli, Hassan Akbari, Yair Alon, Yong Cheng, Josh Dillon, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, Mikhail Sirotenko, Kihyuk Sohn, Xuan Yang, Hartwig Adam, et al. (6 additional authors not shown)

    Abstract: We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs -- including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and tas…

    Submitted 4 June, 2024; v1 submitted 21 December, 2023; originally announced December 2023.

    Comments: To appear at ICML 2024; Project page: http://sites.research.google/videopoet/

  2. arXiv:2305.06324  [pdf, other]

    cs.CV cs.AI cs.LG cs.MM eess.IV

    Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception

    Authors: Hassan Akbari, Dan Kondratyuk, Yin Cui, Rachel Hornung, Huisheng Wang, Hartwig Adam

    Abstract: We present Integrated Multimodal Perception (IMP), a simple and scalable multimodal multi-task training and modeling approach. IMP integrates multimodal inputs including image, video, text, and audio into a single Transformer encoder with minimal modality-specific components. IMP makes use of a novel design that combines Alternating Gradient Descent (AGD) and Mixture-of-Experts (MoE) for efficient…

    Submitted 11 December, 2023; v1 submitted 10 May, 2023; originally announced May 2023.

  3. arXiv:2211.02077  [pdf, other]

    cs.CV cs.AI cs.LG cs.MM

    Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization

    Authors: Junru Wu, Yi Liang, Feng Han, Hassan Akbari, Zhangyang Wang, Cong Yu

    Abstract: Self-supervised pre-training recently demonstrates success on large-scale multimodal data, and state-of-the-art contrastive learning methods often enforce the feature consistency from cross-modality inputs, such as video/audio or video/text pairs. Despite its convenience to formulate and leverage in practice, such cross-modality alignment (CMA) is only a weak and noisy supervision, since two modal…

    Submitted 3 November, 2022; originally announced November 2022.

    Comments: Accepted at NeurIPS 2022

  4. arXiv:2209.06794  [pdf, other]

    cs.CV cs.CL

    PaLI: A Jointly-Scaled Multilingual Language-Image Model

    Authors: Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carlos Riquelme, Andreas Steiner, et al. (4 additional authors not shown)

    Abstract: Effective scaling and a flexible task interface enable large language models to excel at many tasks. We present PaLI (Pathways Language and Image model), a model that extends this approach to the joint modeling of language and vision. PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks, in many languages. To train PaL…

    Submitted 5 June, 2023; v1 submitted 14 September, 2022; originally announced September 2022.

    Comments: ICLR 2023 (Notable-top-5%)

  5. arXiv:2208.14445  [pdf]

    q-bio.QM cs.CV eess.IV

    Artificial intelligence-based locoregional markers of brain peritumoral microenvironment

    Authors: Zahra Riahi Samani, Drew Parker, Hamed Akbari, Spyridon Bakas, Ronald L. Wolf, Steven Brem, Ragini Verma

    Abstract: In malignant primary brain tumors, cancer cells infiltrate into the peritumoral brain structures which results in inevitable recurrence. Quantitative assessment of infiltrative heterogeneity in the peritumoral region, the area where biopsy or resection can be hazardous, is important for clinical decision making. Previous work on characterizing the infiltrative heterogeneity in the peritumoral regi…

    Submitted 29 August, 2022; originally announced August 2022.

  6. arXiv:2112.06979  [pdf, other]

    eess.IV cs.CV

    The Brain Tumor Sequence Registration (BraTS-Reg) Challenge: Establishing Correspondence Between Pre-Operative and Follow-up MRI Scans of Diffuse Glioma Patients

    Authors: Bhakti Baheti, Satrajit Chakrabarty, Hamed Akbari, Michel Bilello, Benedikt Wiestler, Julian Schwarting, Evan Calabrese, Jeffrey Rudie, Syed Abidi, Mina Mousa, Javier Villanueva-Meyer, Brandon K. K. Fields, Florian Kofler, Russell Takeshi Shinohara, Juan Eugenio Iglesias, Tony C. W. Mok, Albert C. S. Chung, Marek Wodzinski, Artur Jurgas, Niccolo Marini, Manfredo Atzori, Henning Muller, Christoph Grobroehmer, Hanna Siebert, Lasse Hansen, et al. (48 additional authors not shown)

    Abstract: Registration of longitudinal brain MRI scans containing pathologies is challenging due to dramatic changes in tissue appearance. Although there has been progress in developing general-purpose medical image registration techniques, they have not yet attained the requisite precision and reliability for this task, highlighting its inherent complexity. Here we describe the Brain Tumor Sequence Registr…

    Submitted 17 April, 2024; v1 submitted 13 December, 2021; originally announced December 2021.

  7. arXiv:2104.11178  [pdf, other]

    cs.CV cs.AI cs.LG cs.MM eess.IV

    VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

    Authors: Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, Boqing Gong

    Abstract: We present a framework for learning multimodal representations from unlabeled data using convolution-free Transformer architectures. Specifically, our Video-Audio-Text Transformer (VATT) takes raw signals as inputs and extracts multimodal representations that are rich enough to benefit a variety of downstream tasks. We train VATT end-to-end from scratch using multimodal contrastive losses and eval…

    Submitted 6 December, 2021; v1 submitted 22 April, 2021; originally announced April 2021.

    Comments: Published in the 35th Conference on Neural Information Processing Systems (NeurIPS 2021)

  8. arXiv:2104.08145  [pdf, other]

    cs.CL cs.AI cs.LG

    KI-BERT: Infusing Knowledge Context for Better Language and Domain Understanding

    Authors: Keyur Faldu, Amit Sheth, Prashant Kikani, Hemang Akbari

    Abstract: Contextualized entity representations learned by state-of-the-art transformer-based language models (TLMs) like BERT, GPT, T5, etc., leverage the attention mechanism to learn the data context from training data corpus. However, these models do not use the knowledge context. Knowledge context can be understood as semantics about entities and their relationship with neighboring entities in knowledge…

    Submitted 3 September, 2021; v1 submitted 9 April, 2021; originally announced April 2021.

    Comments: 10 pages, 4 figures, 4 tables

  9. arXiv:2011.09530  [pdf, other]

    cs.CV cs.AI eess.IV

    Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language

    Authors: Hassan Akbari, Hamid Palangi, Jianwei Yang, Sudha Rao, Asli Celikyilmaz, Roland Fernandez, Paul Smolensky, Jianfeng Gao, Shih-Fu Chang

    Abstract: Neuro-symbolic representations have proved effective in learning structure information in vision and language. In this paper, we propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning. Our approach uses a dictionary learning-based method of learning relations between videos and their paired text descriptions. We refer to these relations as rel…

    Submitted 18 November, 2020; originally announced November 2020.

  10. arXiv:2008.07628  [pdf, other]

    eess.IV cs.CV

    A Deep Network for Joint Registration and Reconstruction of Images with Pathologies

    Authors: Xu Han, Zhengyang Shen, Zhenlin Xu, Spyridon Bakas, Hamed Akbari, Michel Bilello, Christos Davatzikos, Marc Niethammer

    Abstract: Registration of images with pathologies is challenging due to tissue appearance changes and missing correspondences caused by the pathologies. Moreover, mass effects as observed for brain tumors may displace tissue, creating larger deformations over time than what is observed in a healthy brain. Deep learning models have successfully been applied to image registration to offer dramatic speed up an…

    Submitted 17 August, 2020; originally announced August 2020.

  11. arXiv:1811.11683  [pdf, other]

    cs.CV cs.CL cs.LG eess.IV

    Multi-level Multimodal Common Semantic Space for Image-Phrase Grounding

    Authors: Hassan Akbari, Svebor Karaman, Surabhi Bhargava, Brian Chen, Carl Vondrick, Shih-Fu Chang

    Abstract: We address the problem of phrase grounding by learning a multi-level common semantic space shared by the textual and visual modalities. We exploit multiple levels of feature maps of a Deep Convolutional Neural Network, as well as contextualized word and sentence embeddings extracted from a character-based language model. Following dedicated non-linear mappings for visual features at each level, wo…

    Submitted 29 May, 2019; v1 submitted 28 November, 2018; originally announced November 2018.

    Comments: Accepted in CVPR 2019

  12. arXiv:1710.09798  [pdf, other]

    cs.CV eess.AS eess.IV

    Lip2AudSpec: Speech reconstruction from silent lip movements video

    Authors: Hassan Akbari, Himani Arora, Liangliang Cao, Nima Mesgarani

    Abstract: In this study, we propose a deep neural network for reconstructing intelligible speech from silent lip movement videos. We use auditory spectrogram as spectral representation of speech and its corresponding sound generation method resulting in a more natural sounding reconstructed speech. Our proposed network consists of an autoencoder to extract bottleneck features from the auditory spectrogram w…

    Submitted 26 October, 2017; originally announced October 2017.

  13. arXiv:1412.7018  [pdf, other]

    cs.DC

    Discrete Load Balancing in Heterogeneous Networks with a Focus on Second-Order Diffusion

    Authors: Hoda Akbari, Petra Berenbrink, Robert Elsässer, Dominik Kaaser

    Abstract: In this paper we consider a wide class of discrete diffusion load balancing algorithms. The problem is defined as follows. We are given an interconnection network and a number of load items, which are arbitrarily distributed among the nodes of the network. The goal is to redistribute the load in iterative discrete steps such that at the end each node has (almost) the same number of items. In diffu…

    Submitted 22 December, 2014; originally announced December 2014.

    Comments: Full version of paper submitted to ICDCS 2015

  14. arXiv:1407.1395  [pdf]

    cs.IT

    CB-REFIM: A Practical Coordinated Beamforming in Multicell Networks

    Authors: Mohammad Hossein Akbari, Vahid Tabataba Vakili

    Abstract: Performance of multicell systems is inevitably limited by interference and available resources. Although intercell interference can be mitigated by Base Station (BS) Coordination, the demand on inter-BS information exchange and computational complexity grows rapidly with the number of cells, subcarriers, and users. On the other hand, some of the existing coordination beamforming methods need compu…

    Submitted 9 July, 2014; v1 submitted 5 July, 2014; originally announced July 2014.

    Comments: 20 pages, 8 figures, to appear in IET Communication