-
Solving Token Gradient Conflict in Mixture-of-Experts for Large Vision-Language Model
Authors:
Longrong Yang,
Dong Sheng,
Chaoxiang Cai,
Fan Yang,
Size Li,
Di Zhang,
Xi Li
Abstract:
The Mixture-of-Experts (MoE) has gained increasing attention in the study of Large Vision-Language Models (LVLMs). It uses a sparse model to replace the dense model, achieving comparable performance while activating fewer parameters during inference, thus significantly reducing the inference cost. Existing MoE methods in LVLMs encourage different experts to handle different tokens, and thus they employ a router to predict the routing for each token. However, the predictions are based solely on sample features and do not truly reveal the optimization direction of tokens. This can lead to severe optimization conflicts between different tokens within an expert. To address this problem, this paper proposes a novel method based on token-level gradient analysis. Specifically, we first use token-level gradients to identify conflicting tokens in experts. Then, we add a specialized loss tailored to eliminate conflicts among tokens within each expert. Our method can serve as a plug-in for diverse Large Vision-Language Models, and extensive experimental results demonstrate the effectiveness of our method. The code will be publicly available at https://github.com/longrongyang/STGC.
Submitted 28 June, 2024;
originally announced June 2024.
-
Can Foundation Models Reliably Identify Spatial Hazards? A Case Study on Curb Segmentation
Authors:
Diwei Sheng,
Giles Hamilton-Fletcher,
Mahya Beheshti,
Chen Feng,
John-Ross Rizzo
Abstract:
Curbs serve as vital borders that delineate safe pedestrian zones from potential vehicular traffic hazards. Curbs also represent a primary spatial hazard during dynamic navigation with significant stumbling potential. Such vulnerabilities are particularly exacerbated for persons with blindness and low vision (PBLV). Accurate visual-based discrimination of curbs is paramount for assistive technologies that aid PBLV with safe navigation in urban environments. Herein, we investigate the efficacy of curb segmentation for foundation models. We introduce the largest curb segmentation dataset to-date to benchmark leading foundation models. Our results show that state-of-the-art foundation models face significant challenges in curb segmentation. This is due to their high false-positive rates (up to 95%) with poor performance distinguishing curbs from curb-like objects or non-curb areas, such as sidewalks. In addition, the best-performing model averaged a 3.70-second inference time, underscoring problems in providing real-time assistance. In response, we propose solutions including filtered bounding box selections to achieve more accurate curb segmentation. Overall, despite the immediate flexibility of foundation models, their application for practical assistive technology applications still requires refinement. This research highlights the critical need for specialized datasets and tailored model training to address navigation challenges for PBLV and underscores implicit weaknesses in foundation models.
Submitted 11 June, 2024;
originally announced June 2024.
-
NYC-Indoor-VPR: A Long-Term Indoor Visual Place Recognition Dataset with Semi-Automatic Annotation
Authors:
Diwei Sheng,
Anbang Yang,
John-Ross Rizzo,
Chen Feng
Abstract:
Visual Place Recognition (VPR) in indoor environments is beneficial to humans and robots for better localization and navigation. It is challenging due to appearance changes at various frequencies, and difficulties of obtaining ground truth metric trajectories for training and evaluation. This paper introduces the NYC-Indoor-VPR dataset, a unique and rich collection of over 36,000 images compiled from 13 distinct crowded scenes in New York City taken under varying lighting conditions with appearance changes. Each scene has multiple revisits across a year. To establish the ground truth for VPR, we propose a semiautomatic annotation approach that computes the positional information of each image. Our method specifically takes pairs of videos as input and yields matched pairs of images along with their estimated relative locations. The accuracy of this matching is refined by human annotators, who utilize our annotation software to correlate the selected keyframes. Finally, we present a benchmark evaluation of several state-of-the-art VPR algorithms using our annotated dataset, revealing its challenge and thus value for VPR research.
Submitted 30 March, 2024;
originally announced April 2024.
-
Towards More Unified In-context Visual Understanding
Authors:
Dianmo Sheng,
Dongdong Chen,
Zhentao Tan,
Qiankun Liu,
Qi Chu,
Jianmin Bao,
Tao Gong,
Bin Liu,
Shengwei Xu,
Nenghai Yu
Abstract:
The rapid advancement of large language models (LLMs) has accelerated the emergence of in-context learning (ICL) as a cutting-edge approach in the natural language processing domain. Recently, ICL has been employed in visual understanding tasks, such as semantic segmentation and image captioning, yielding promising results. However, existing visual ICL frameworks cannot produce content across multiple modalities, which limits their potential usage scenarios. To address this issue, we present a new ICL framework for visual understanding with multi-modal output enabled. First, we quantize and embed both text and visual prompts into a unified representational space, structured as interleaved in-context sequences. Then a decoder-only sparse transformer architecture is employed to perform generative modeling on them, facilitating in-context learning. Thanks to this design, the model is capable of handling in-context vision understanding tasks with multimodal output in a unified pipeline. Experimental results demonstrate that our model achieves competitive performance compared with specialized models and previous ICL baselines. Overall, our research takes a further step toward unified multimodal in-context learning.
Submitted 16 March, 2024; v1 submitted 5 December, 2023;
originally announced December 2023.
-
DFlow: Efficient Dataflow-based Invocation Workflow Execution for Function-as-a-Service
Authors:
Xiaoxiang Shi,
Chao Li,
Zijun Li,
Zihan Liu,
Dianmo Sheng,
Quan Chen,
Jingwen Leng,
Minyi Guo
Abstract:
Serverless computing is becoming increasingly popular due to its ease of use and fine-grained billing. These features make it appealing for stateful applications or serverless workflows. However, current serverless workflow systems utilize a controlflow-based invocation pattern to invoke functions. In this execution pattern, the function invocation depends on the state of the function: a function can only begin executing once all its precursor functions have completed. As a result, this pattern may lead to longer end-to-end execution time. We design and implement DFlow, a novel dataflow-based serverless workflow system that achieves high performance for serverless workflows. DFlow introduces a distributed scheduler (DScheduler) that uses the dataflow-based invocation pattern to invoke functions. In this pattern, the function invocation depends on the data dependency between functions, so a function can start to execute even while its precursor functions are still running. DFlow further features a distributed store (DStore) that utilizes effective fine-grained optimization techniques to eliminate function interaction, thereby enabling efficient data exchange. With the support of DScheduler and DStore, DFlow achieves an average improvement of 60% over CFlow, 40% over FaaSFlow, 25% over FaaSFlowRedis, and 40% over KNIX on 99%-ile latency. Further, it improves network bandwidth utilization by 2x-4x over CFlow and 1.5x-3x over FaaSFlow, FaaSFlowRedis and KNIX. DFlow effectively reduces the cold startup latency, achieving an average improvement of 5.6x over CFlow and 1.1x over FaaSFlow.
Submitted 4 July, 2023; v1 submitted 19 June, 2023;
originally announced June 2023.
-
X-Paste: Revisiting Scalable Copy-Paste for Instance Segmentation using CLIP and StableDiffusion
Authors:
Hanqing Zhao,
Dianmo Sheng,
Jianmin Bao,
Dongdong Chen,
Dong Chen,
Fang Wen,
Lu Yuan,
Ce Liu,
Wenbo Zhou,
Qi Chu,
Weiming Zhang,
Nenghai Yu
Abstract:
Copy-Paste is a simple and effective data augmentation strategy for instance segmentation. By randomly pasting object instances onto new background images, it creates new training data for free and significantly boosts the segmentation performance, especially for rare object categories. Although diverse, high-quality object instances used in Copy-Paste result in more performance gain, previous works utilize object instances either from human-annotated instance segmentation datasets or rendered from 3D object models, and both approaches are too expensive to scale up to obtain good diversity. In this paper, we revisit Copy-Paste at scale with the power of newly emerged zero-shot recognition models (e.g., CLIP) and text2image models (e.g., StableDiffusion). We demonstrate for the first time that using a text2image model to generate images or a zero-shot recognition model to filter noisily crawled images for different object categories is a feasible way to make Copy-Paste truly scalable. To make such success happen, we design a data acquisition and processing framework, dubbed "X-Paste", upon which a systematic study is conducted. On the LVIS dataset, X-Paste provides impressive improvements over the strong baseline CenterNet2 with Swin-L as the backbone. Specifically, it achieves +2.6 box AP and +2.1 mask AP gains on all classes and even more significant gains with +6.8 box AP, +6.5 mask AP on long-tail classes. Our code and models are available at https://github.com/yoctta/XPaste.
Submitted 31 May, 2023; v1 submitted 7 December, 2022;
originally announced December 2022.
-
i-Razor: A Differentiable Neural Input Razor for Feature Selection and Dimension Search in DNN-Based Recommender Systems
Authors:
Yao Yao,
Bin Liu,
Haoxun He,
Dakui Sheng,
Ke Wang,
Li Xiao,
Huanhuan Cao
Abstract:
Input features play a crucial role in DNN-based recommender systems with thousands of categorical and continuous fields from users, items, contexts, and interactions. Noisy features and inappropriate embedding dimension assignments can deteriorate the performance of recommender systems and introduce unnecessary complexity in model training and online serving. Optimizing the input configuration of DNN models, including feature selection and embedding dimension assignment, has become one of the essential topics in feature engineering. However, in existing industrial practices, feature selection and dimension search are optimized sequentially, i.e., feature selection is performed first, followed by dimension search to determine the optimal dimension size for each selected feature. Such a sequential optimization mechanism increases training costs and risks generating suboptimal input configurations. To address this problem, we propose a differentiable neural input razor (i-Razor) that enables joint optimization of feature selection and dimension search. Concretely, we introduce an end-to-end differentiable model to learn the relative importance of different embedding regions of each feature. Furthermore, a flexible pruning algorithm is proposed to achieve feature filtering and dimension derivation simultaneously. Extensive experiments on two large-scale public datasets in the Click-Through-Rate (CTR) prediction task demonstrate the efficacy and superiority of i-Razor in balancing model complexity and performance.
Submitted 11 November, 2023; v1 submitted 1 April, 2022;
originally announced April 2022.
-
NYU-VPR: Long-Term Visual Place Recognition Benchmark with View Direction and Data Anonymization Influences
Authors:
Diwei Sheng,
Yuxiang Chai,
Xinru Li,
Chen Feng,
Jianzhe Lin,
Claudio Silva,
John-Ross Rizzo
Abstract:
Visual place recognition (VPR) is critical not only in localization and mapping for autonomous driving vehicles, but also in assistive navigation for the visually impaired population. To enable a long-term VPR system on a large scale, several challenges need to be addressed. First, different applications could require different image view directions, such as front views for self-driving cars and side views for people with low vision. Second, VPR in metropolitan scenes can often cause privacy concerns due to the imaging of pedestrian and vehicle identity information, calling for data anonymization before VPR queries and database construction. Both factors could lead to VPR performance variations that are not yet well understood. To study their influences, we present the NYU-VPR dataset that contains more than 200,000 images over a 2km by 2km area near the New York University campus, taken within the whole year of 2016. We present benchmark results on several popular VPR algorithms showing that side views are significantly more challenging for current VPR methods while the influence of data anonymization is almost negligible, together with our hypothetical explanations and in-depth analysis.
Submitted 25 July, 2022; v1 submitted 17 October, 2021;
originally announced October 2021.
-
Deep neural network-based classification model for Sentiment Analysis
Authors:
Donghang Pan,
Jingling Yuan,
Lin Li,
Deming Sheng
Abstract:
The growing prosperity of social networks has brought great challenges to the sentimental tendency mining of users. As more and more researchers pay attention to the sentimental tendency of online users, rich research results have been obtained based on the sentiment classification of explicit texts. However, research on the implicit sentiment of users is still in its infancy. To address the difficulty of implicit sentiment classification, this paper studies implicit sentiment classification models based on deep neural networks. Classification models based on DNN, LSTM, Bi-LSTM and CNN were established to judge the tendency of the user's implicit sentiment text. Based on the Bi-LSTM model, a classification model with a word-level attention mechanism is studied. The experimental results on the public dataset show that the established LSTM-series classification models and the CNN classification model achieve good sentiment classification performance, significantly better than the DNN model. The Bi-LSTM-based attention mechanism classification model obtained the optimal R value in the positive category identification.
Submitted 3 July, 2019;
originally announced July 2019.
-
Convolutional neural networks with fractional order gradient method
Authors:
Dian Sheng,
Yiheng Wei,
Yuquan Chen,
Yong Wang
Abstract:
This paper proposes a fractional order gradient method for the backward propagation of convolutional neural networks. To overcome the problem that the fractional order gradient method cannot converge to the real extreme point, a simplified fractional order gradient method is designed based on Caputo's definition. The parameters within layers are updated by the designed gradient method, but the propagations between layers still use integer order gradients, and thus the complicated derivatives of composite functions are avoided and the chain rule is kept. By connecting all layers in series and adding loss functions, the proposed convolutional neural networks can be trained smoothly according to various tasks. Finally, some practical experiments are carried out to demonstrate fast convergence, high accuracy and the ability to escape local optimal points.
Submitted 16 September, 2019; v1 submitted 13 May, 2019;
originally announced May 2019.
-
A Feature Learning Siamese Model for Intelligent Control of the Dynamic Range Compressor
Authors:
Di Sheng,
György Fazekas
Abstract:
In this paper, a siamese DNN model is proposed to learn the characteristics of the audio dynamic range compressor (DRC). This facilitates an intelligent control system that uses audio examples to configure the DRC, a widely used non-linear audio signal conditioning technique in the areas of music production, speech communication and broadcasting. Several alternative siamese DNN architectures are proposed to learn feature embeddings that can characterise subtle effects due to dynamic range compression. These models are compared with each other as well as with handcrafted features proposed in previous work. An evaluation of the relations between the hyperparameters of the DNN and the DRC parameters is also provided. The best model is able to produce a universal feature embedding that is capable of predicting multiple DRC parameters simultaneously, which is a significant improvement over our previous research. The feature embedding shows better performance than handcrafted audio features when predicting DRC parameters for both mono-instrument audio loops and polyphonic music pieces.
Submitted 1 May, 2019;
originally announced May 2019.