-
Item-Language Model for Conversational Recommendation
Authors:
Li Yang,
Anushya Subbiah,
Hardik Patel,
Judith Yue Li,
Yanwei Song,
Reza Mirghaderi,
Vikram Aggarwal
Abstract:
Large-language Models (LLMs) have been extremely successful at tasks like complex dialogue understanding, reasoning and coding due to their emergent abilities. These emergent abilities have been extended with multi-modality to include image, audio, and video capabilities. Recommender systems, on the other hand, have been critical for information seeking and item discovery needs. Recently, there ha…
▽ More
Large-language Models (LLMs) have been extremely successful at tasks like complex dialogue understanding, reasoning and coding due to their emergent abilities. These emergent abilities have been extended with multi-modality to include image, audio, and video capabilities. Recommender systems, on the other hand, have been critical for information seeking and item discovery needs. Recently, there have been attempts to apply LLMs for recommendations. One difficulty of current attempts is that the underlying LLM is usually not trained on the recommender system data, which largely contains user interaction signals and is often not publicly available. Another difficulty is user interaction signals often have a different pattern from natural language text, and it is currently unclear if the LLM training setup can learn more non-trivial knowledge from interaction signals compared with traditional recommender system methods. Finally, it is difficult to train multiple LLMs for different use-cases, and to retain the original language and reasoning abilities when learning from recommender system data. To address these three limitations, we propose an Item-Language Model (ILM), which is composed of an item encoder to produce text-aligned item representations that encode user interaction signals, and a frozen LLM that can understand those item representations with preserved pretrained knowledge. We conduct extensive experiments which demonstrate both the importance of the language-alignment and of user interaction knowledge in the item encoder.
△ Less
Submitted 4 June, 2024;
originally announced June 2024.
-
Towards Authentic Face Restoration with Iterative Diffusion Models and Beyond
Authors:
Yang Zhao,
Tingbo Hou,
Yu-Chuan Su,
Xuhui Jia. Yandong Li,
Matthias Grundmann
Abstract:
An authentic face restoration system is becoming increasingly demanding in many computer vision applications, e.g., image enhancement, video communication, and taking portrait. Most of the advanced face restoration models can recover high-quality faces from low-quality ones but usually fail to faithfully generate realistic and high-frequency details that are favored by users. To achieve authentic…
▽ More
An authentic face restoration system is becoming increasingly demanding in many computer vision applications, e.g., image enhancement, video communication, and taking portrait. Most of the advanced face restoration models can recover high-quality faces from low-quality ones but usually fail to faithfully generate realistic and high-frequency details that are favored by users. To achieve authentic restoration, we propose $\textbf{IDM}$, an $\textbf{I}$teratively learned face restoration system based on denoising $\textbf{D}$iffusion $\textbf{M}$odels (DDMs). We define the criterion of an authentic face restoration system, and argue that denoising diffusion models are naturally endowed with this property from two aspects: intrinsic iterative refinement and extrinsic iterative enhancement. Intrinsic learning can preserve the content well and gradually refine the high-quality details, while extrinsic enhancement helps clean the data and improve the restoration task one step further. We demonstrate superior performance on blind face restoration tasks. Beyond restoration, we find the authentically cleaned data by the proposed restoration system is also helpful to image generation tasks in terms of training stabilization and sample quality. Without modifying the models, we achieve better quality than state-of-the-art on FFHQ and ImageNet generation using either GANs or diffusion models.
△ Less
Submitted 18 July, 2023;
originally announced July 2023.
-
V2Meow: Meowing to the Visual Beat via Video-to-Music Generation
Authors:
Kun Su,
Judith Yue Li,
Qingqing Huang,
Dima Kuzmin,
Joonseok Lee,
Chris Donahue,
Fei Sha,
Aren Jansen,
Yu Wang,
Mauro Verzetti,
Timo I. Denk
Abstract:
Video-to-music generation demands both a temporally localized high-quality listening experience and globally aligned video-acoustic signatures. While recent music generation models excel at the former through advanced audio codecs, the exploration of video-acoustic signatures has been confined to specific visual scenarios. In contrast, our research confronts the challenge of learning globally alig…
▽ More
Video-to-music generation demands both a temporally localized high-quality listening experience and globally aligned video-acoustic signatures. While recent music generation models excel at the former through advanced audio codecs, the exploration of video-acoustic signatures has been confined to specific visual scenarios. In contrast, our research confronts the challenge of learning globally aligned signatures between video and music directly from paired music and videos, without explicitly modeling domain-specific rhythmic or semantic relationships. We propose V2Meow, a video-to-music generation system capable of producing high-quality music audio for a diverse range of video input types using a multi-stage autoregressive model. Trained on 5k hours of music audio clips paired with video frames mined from in-the-wild music videos, V2Meow is competitive with previous domain-specific models when evaluated in a zero-shot manner. It synthesizes high-fidelity music audio waveforms solely by conditioning on pre-trained general-purpose visual features extracted from video frames, with optional style control via text prompts. Through both qualitative and quantitative evaluations, we demonstrate that our model outperforms various existing music generation systems in terms of visual-audio correspondence and audio quality. Music samples are available at tinyurl.com/v2meow.
△ Less
Submitted 22 February, 2024; v1 submitted 11 May, 2023;
originally announced May 2023.
-
Multi-Task End-to-End Training Improves Conversational Recommendation
Authors:
Naveen Ram,
Dima Kuzmin,
Ellie Ka In Chio,
Moustafa Farid Alzantot,
Santiago Ontanon,
Ambarish Jash,
Judith Yue Li
Abstract:
In this paper, we analyze the performance of a multitask end-to-end transformer model on the task of conversational recommendations, which aim to provide recommendations based on a user's explicit preferences expressed in dialogue. While previous works in this area adopt complex multi-component approaches where the dialogue management and entity recommendation tasks are handled by separate compone…
▽ More
In this paper, we analyze the performance of a multitask end-to-end transformer model on the task of conversational recommendations, which aim to provide recommendations based on a user's explicit preferences expressed in dialogue. While previous works in this area adopt complex multi-component approaches where the dialogue management and entity recommendation tasks are handled by separate components, we show that a unified transformer model, based on the T5 text-to-text transformer model, can perform competitively in both recommending relevant items and generating conversation dialogue. We fine-tune our model on the ReDIAL conversational movie recommendation dataset, and create additional training tasks derived from MovieLens (such as the prediction of movie attributes and related movies based on an input movie), in a multitask learning setting. Using a series of probe studies, we demonstrate that the learned knowledge in the additional tasks is transferred to the conversational setting, where each task leads to a 9%-52% increase in its related probe score.
△ Less
Submitted 8 May, 2023;
originally announced May 2023.
-
MAQA: A Multimodal QA Benchmark for Negation
Authors:
Judith Yue Li,
Aren Jansen,
Qingqing Huang,
Joonseok Lee,
Ravi Ganti,
Dima Kuzmin
Abstract:
Multimodal learning can benefit from the representation power of pretrained Large Language Models (LLMs). However, state-of-the-art transformer based LLMs often ignore negations in natural language and there is no existing benchmark to quantitatively evaluate whether multimodal transformers inherit this weakness. In this study, we present a new multimodal question answering (QA) benchmark adapted…
▽ More
Multimodal learning can benefit from the representation power of pretrained Large Language Models (LLMs). However, state-of-the-art transformer based LLMs often ignore negations in natural language and there is no existing benchmark to quantitatively evaluate whether multimodal transformers inherit this weakness. In this study, we present a new multimodal question answering (QA) benchmark adapted from labeled music videos in AudioSet (Gemmeke et al., 2017) with the goal of systematically evaluating if multimodal transformers can perform complex reasoning to recognize new concepts as negation of previously learned concepts. We show that with standard fine-tuning approach multimodal transformers are still incapable of correctly interpreting negation irrespective of model size. However, our experiments demonstrate that augmenting the original training task distributions with negated QA examples allow the model to reliably reason with negation. To do this, we describe a novel data generation procedure that prompts the 540B-parameter PaLM model to automatically generate negated QA examples as compositions of easily accessible video tags. The generated examples contain more natural linguistic patterns and the gains compared to template-based task augmentation approach are significant.
△ Less
Submitted 9 January, 2023;
originally announced January 2023.
-
On Generalization and Regularization via Wasserstein Distributionally Robust Optimization
Authors:
Qinyu Wu,
Jonathan Yu-Meng Li,
Tiantian Mao
Abstract:
Wasserstein distributionally robust optimization (DRO) has found success in operations research and machine learning applications as a powerful means to obtain solutions with favourable out-of-sample performances. Two compelling explanations for the success are the generalization bounds derived from Wasserstein DRO and the equivalency between Wasserstein DRO and the regularization scheme commonly…
▽ More
Wasserstein distributionally robust optimization (DRO) has found success in operations research and machine learning applications as a powerful means to obtain solutions with favourable out-of-sample performances. Two compelling explanations for the success are the generalization bounds derived from Wasserstein DRO and the equivalency between Wasserstein DRO and the regularization scheme commonly applied in machine learning. Existing results on generalization bounds and the equivalency to regularization are largely limited to the setting where the Wasserstein ball is of a certain type and the decision criterion takes certain forms of an expected function. In this paper, we show that by focusing on Wasserstein DRO problems with affine decision rules, it is possible to obtain generalization bounds and the equivalency to regularization in a significantly broader setting where the Wasserstein ball can be of a general type and the decision criterion can be a general measure of risk, i.e., nonlinear in distributions. This allows for accommodating many important classification, regression, and risk minimization applications that have not been addressed to date using Wasserstein DRO. Our results are strong in that the generalization bounds do not suffer from the curse of dimensionality and the equivalency to regularization is exact. As a byproduct, our regularization results broaden considerably the class of Wasserstein DRO models that can be solved efficiently via regularization formulations.
△ Less
Submitted 12 December, 2022;
originally announced December 2022.
-
MuLan: A Joint Embedding of Music Audio and Natural Language
Authors:
Qingqing Huang,
Aren Jansen,
Joonseok Lee,
Ravi Ganti,
Judith Yue Li,
Daniel P. W. Ellis
Abstract:
Music tagging and content-based retrieval systems have traditionally been constructed using pre-defined ontologies covering a rigid set of music attributes or text queries. This paper presents MuLan: a first attempt at a new generation of acoustic models that link music audio directly to unconstrained natural language music descriptions. MuLan takes the form of a two-tower, joint audio-text embedd…
▽ More
Music tagging and content-based retrieval systems have traditionally been constructed using pre-defined ontologies covering a rigid set of music attributes or text queries. This paper presents MuLan: a first attempt at a new generation of acoustic models that link music audio directly to unconstrained natural language music descriptions. MuLan takes the form of a two-tower, joint audio-text embedding model trained using 44 million music recordings (370K hours) and weakly-associated, free-form text annotations. Through its compatibility with a wide range of music genres and text styles (including conventional music tags), the resulting audio-text representation subsumes existing ontologies while graduating to true zero-shot functionalities. We demonstrate the versatility of the MuLan embeddings with a range of experiments including transfer learning, zero-shot music tagging, language understanding in the music domain, and cross-modal retrieval applications.
△ Less
Submitted 25 August, 2022;
originally announced August 2022.
-
WaveCorr: Correlation-savvy Deep Reinforcement Learning for Portfolio Management
Authors:
Saeed Marzban,
Erick Delage,
Jonathan Yumeng Li,
Jeremie Desgagne-Bouchard,
Carl Dussault
Abstract:
The problem of portfolio management represents an important and challenging class of dynamic decision making problems, where rebalancing decisions need to be made over time with the consideration of many factors such as investors preferences, trading environments, and market conditions. In this paper, we present a new portfolio policy network architecture for deep reinforcement learning (DRL)that…
▽ More
The problem of portfolio management represents an important and challenging class of dynamic decision making problems, where rebalancing decisions need to be made over time with the consideration of many factors such as investors preferences, trading environments, and market conditions. In this paper, we present a new portfolio policy network architecture for deep reinforcement learning (DRL)that can exploit more effectively cross-asset dependency information and achieve better performance than state-of-the-art architectures. In particular, we introduce a new property, referred to as \textit{asset permutation invariance}, for portfolio policy networks that exploit multi-asset time series data, and design the first portfolio policy network, named WaveCorr, that preserves this invariance property when treating asset correlation information. At the core of our design is an innovative permutation invariant correlation processing layer. An extensive set of experiments are conducted using data from both Canadian (TSX) and American stock markets (S&P 500), and WaveCorr consistently outperforms other architectures with an impressive 3%-25% absolute improvement in terms of average annual return, and up to more than 200% relative improvement in average Sharpe ratio. We also measured an improvement of a factor of up to 5 in the stability of performance under random choices of initial asset ordering and weights. The stability of the network has been found as particularly valuable by our industrial partner.
△ Less
Submitted 28 September, 2021; v1 submitted 14 September, 2021;
originally announced September 2021.
-
Deep Reinforcement Learning for Equal Risk Pricing and Hedging under Dynamic Expectile Risk Measures
Authors:
Saeed Marzban,
Erick Delage,
Jonathan Yumeng Li
Abstract:
Recently equal risk pricing, a framework for fair derivative pricing, was extended to consider dynamic risk measures. However, all current implementations either employ a static risk measure that violates time consistency, or are based on traditional dynamic programming solution schemes that are impracticable in problems with a large number of underlying assets (due to the curse of dimensionality)…
▽ More
Recently equal risk pricing, a framework for fair derivative pricing, was extended to consider dynamic risk measures. However, all current implementations either employ a static risk measure that violates time consistency, or are based on traditional dynamic programming solution schemes that are impracticable in problems with a large number of underlying assets (due to the curse of dimensionality) or with incomplete asset dynamics information. In this paper, we extend for the first time a famous off-policy deterministic actor-critic deep reinforcement learning (ACRL) algorithm to the problem of solving a risk averse Markov decision process that models risk using a time consistent recursive expectile risk measure. This new ACRL algorithm allows us to identify high quality time consistent hedging policies (and equal risk prices) for options, such as basket options, that cannot be handled using traditional methods, or in context where only historical trajectories of the underlying assets are available. Our numerical experiments, which involve both a simple vanilla option and a more exotic basket option, confirm that the new ACRL algorithm can produce 1) in simple environments, nearly optimal hedging policies, and highly accurate prices, simultaneously for a range of maturities 2) in complex environments, good quality policies and prices using reasonable amount of computing resources; and 3) overall, hedging strategies that actually outperform the strategies produced using static risk measures when the risk is evaluated at later points of time.
△ Less
Submitted 8 September, 2021;
originally announced September 2021.
-
Accelerating Training using Tensor Decomposition
Authors:
Mostafa Elhoushi,
Ye Henry Tian,
Zihao Chen,
Farhan Shafiq,
Joey Yiwei Li
Abstract:
Tensor decomposition is one of the well-known approaches to reduce the latency time and number of parameters of a pre-trained model. However, in this paper, we propose an approach to use tensor decomposition to reduce training time of training a model from scratch. In our approach, we train the model from scratch (i.e., randomly initialized weights) with its original architecture for a small numbe…
▽ More
Tensor decomposition is one of the well-known approaches to reduce the latency time and number of parameters of a pre-trained model. However, in this paper, we propose an approach to use tensor decomposition to reduce training time of training a model from scratch. In our approach, we train the model from scratch (i.e., randomly initialized weights) with its original architecture for a small number of epochs, then the model is decomposed, and then continue training the decomposed model till the end. There is an optional step in our approach to convert the decomposed architecture back to the original architecture. We present results of using this approach on both CIFAR10 and Imagenet datasets, and show that there can be upto 2x speed up in training time with accuracy drop of upto 1.5% only, and in other cases no accuracy drop. This training acceleration approach is independent of hardware and is expected to have similar speed ups on both CPU and GPU platforms.
△ Less
Submitted 10 September, 2019;
originally announced September 2019.
-
DeepShift: Towards Multiplication-Less Neural Networks
Authors:
Mostafa Elhoushi,
Zihao Chen,
Farhan Shafiq,
Ye Henry Tian,
Joey Yiwei Li
Abstract:
The high computation, memory, and power budgets of inferring convolutional neural networks (CNNs) are major bottlenecks of model deployment to edge computing platforms, e.g., mobile devices and IoT. Moreover, training CNNs is time and energy-intensive even on high-grade servers. Convolution layers and fully connected layers, because of their intense use of multiplications, are the dominant contrib…
▽ More
The high computation, memory, and power budgets of inferring convolutional neural networks (CNNs) are major bottlenecks of model deployment to edge computing platforms, e.g., mobile devices and IoT. Moreover, training CNNs is time and energy-intensive even on high-grade servers. Convolution layers and fully connected layers, because of their intense use of multiplications, are the dominant contributor to this computation budget.
We propose to alleviate this problem by introducing two new operations: convolutional shifts and fully-connected shifts which replace multiplications with bitwise shift and sign flip** during both training and inference. During inference, both approaches require only 5 bits (or less) to represent the weights. This family of neural network architectures (that use convolutional shifts and fully connected shifts) is referred to as DeepShift models. We propose two methods to train DeepShift models: DeepShift-Q which trains regular weights constrained to powers of 2, and DeepShift-PS that trains the values of the shifts and sign flips directly.
Very close accuracy, and in some cases higher accuracy, to baselines are achieved. Converting pre-trained 32-bit floating-point baseline models of ResNet18, ResNet50, VGG16, and GoogleNet to DeepShift and training them for 15 to 30 epochs, resulted in Top-1/Top-5 accuracies higher than that of the original model.
Last but not least, we implemented the convolutional shifts and fully connected shift GPU kernels and showed a reduction in latency time of 25% when inferring ResNet18 compared to unoptimized multiplication-based GPU kernels. The code can be found at https://github.com/mostafaelhoushi/DeepShift.
△ Less
Submitted 7 July, 2021; v1 submitted 30 May, 2019;
originally announced May 2019.
-
Semi-orthogonal Non-negative Matrix Factorization with an Application in Text Mining
Authors:
Jack Yutong Li,
Ruoqing Zhu,
Annie Qu,
Han Ye,
Zhankun Sun
Abstract:
Emergency Department (ED) crowding is a worldwide issue that affects the efficiency of hospital management and the quality of patient care. This occurs when the request for an admit ward-bed to receive a patient is delayed until an admission decision is made by a doctor. To reduce the overcrowding and waiting time of ED, we build a classifier to predict the disposition of patients using manually-t…
▽ More
Emergency Department (ED) crowding is a worldwide issue that affects the efficiency of hospital management and the quality of patient care. This occurs when the request for an admit ward-bed to receive a patient is delayed until an admission decision is made by a doctor. To reduce the overcrowding and waiting time of ED, we build a classifier to predict the disposition of patients using manually-typed nurse notes collected during triage, thereby allowing hospital staff to begin necessary preparation beforehand. However, these triage notes involve high dimensional, noisy, and also sparse text data which makes model fitting and interpretation difficult. To address this issue, we propose the semi-orthogonal non-negative matrix factorization (SONMF) for both continuous and binary design matrices to first bi-cluster the patients and words into a reduced number of topics. The subjects can then be interpreted as a non-subtractive linear combination of orthogonal basis topic vectors. These generated topic vectors provide the hospital with a direct understanding of the cause of admission. We show that by using a transformation of basis, the classification accuracy can be further increased compared to the conventional bag-of-words model and alternative matrix factorization approaches. Through simulated data experiments, we also demonstrate that the proposed method outperforms other non-negative matrix factorization (NMF) methods in terms of factorization accuracy, rate of convergence, and degree of orthogonality.
△ Less
Submitted 4 July, 2019; v1 submitted 6 May, 2018;
originally announced May 2018.