Search | arXiv e-print repository

Train Faster, Perform Better: Modular Adaptive Training in Over-Parameterized Models

Authors: Yubin Shi, Yixuan Chen, Mingzhi Dong, Xiaochen Yang, Dongsheng Li, Yujiang Wang, Robert P. Dick, Qin Lv, Yingying Zhao, Fan Yang, Tun Lu, Ning Gu, Li Shang

Abstract: Despite their prevalence in deep-learning communities, over-parameterized models convey high demands of computational costs for proper training. This work studies the fine-grained, modular-level learning dynamics of over-parameterized models to attain a more efficient and fruitful training strategy. Empirical evidence reveals that when scaling down into network modules, such as heads in self-atten… ▽ More Despite their prevalence in deep-learning communities, over-parameterized models convey high demands of computational costs for proper training. This work studies the fine-grained, modular-level learning dynamics of over-parameterized models to attain a more efficient and fruitful training strategy. Empirical evidence reveals that when scaling down into network modules, such as heads in self-attention models, we can observe varying learning patterns implicitly associated with each module's trainability. To describe such modular-level learning capabilities, we introduce a novel concept dubbed modular neural tangent kernel (mNTK), and we demonstrate that the quality of a module's learning is tightly associated with its mNTK's principal eigenvalue $λ_{\max}$. A large $λ_{\max}$ indicates that the module learns features with better convergence, while those miniature ones may impact generalization negatively. Inspired by the discovery, we propose a novel training strategy termed Modular Adaptive Training (MAT) to update those modules with their $λ_{\max}$ exceeding a dynamic threshold selectively, concentrating the model on learning common features and ignoring those inconsistent ones. Unlike most existing training schemes with a complete BP cycle across all network modules, MAT can significantly save computations by its partially-updating strategy and can further improve performance. Experiments show that MAT nearly halves the computational cost of model training and outperforms the accuracy of baselines. △ Less

Submitted 13 May, 2024; originally announced May 2024.

Comments: Accepted at NeurIPS 2023

arXiv:2311.18251 [pdf, other]

Can Large Language Models Be Good Companions? An LLM-Based Eyewear System with Conversational Common Ground

Authors: Zhenyu Xu, Hailin Xu, Zhouyang Lu, Yingying Zhao, Rui Zhu, Yujiang Wang, Mingzhi Dong, Yuhu Chang, Qin Lv, Robert P. Dick, Fan Yang, Tun Lu, Ning Gu, Li Shang

Abstract: Develo** chatbots as personal companions has long been a goal of artificial intelligence researchers. Recent advances in Large Language Models (LLMs) have delivered a practical solution for endowing chatbots with anthropomorphic language capabilities. However, it takes more than LLMs to enable chatbots that can act as companions. Humans use their understanding of individual personalities to driv… ▽ More Develo** chatbots as personal companions has long been a goal of artificial intelligence researchers. Recent advances in Large Language Models (LLMs) have delivered a practical solution for endowing chatbots with anthropomorphic language capabilities. However, it takes more than LLMs to enable chatbots that can act as companions. Humans use their understanding of individual personalities to drive conversations. Chatbots also require this capability to enable human-like companionship. They should act based on personalized, real-time, and time-evolving knowledge of their owner. We define such essential knowledge as the \textit{common ground} between chatbots and their owners, and we propose to build a common-ground-aware dialogue system from an LLM-based module, named \textit{OS-1}, to enable chatbot companionship. Hosted by eyewear, OS-1 can sense the visual and audio signals the user receives and extract real-time contextual semantics. Those semantics are categorized and recorded to formulate historical contexts from which the user's profile is distilled and evolves over time, i.e., OS-1 gradually learns about its user. OS-1 combines knowledge from real-time semantics, historical contexts, and user-specific profiles to produce a common-ground-aware prompt input into the LLM module. The LLM's output is converted to audio, spoken to the wearer when appropriate.We conduct laboratory and in-field studies to assess OS-1's ability to build common ground between the chatbot and its user. The technical feasibility and capabilities of the system are also evaluated. OS-1, with its common-ground awareness, can significantly improve user satisfaction and potentially lead to downstream tasks such as personal emotional support and assistance. △ Less

Submitted 29 November, 2023; originally announced November 2023.

Comments: 36 pages, 25 figures, Under review at ACM IMWUT

arXiv:2311.12997 [pdf, other]

Compositional Capabilities of Autoregressive Transformers: A Study on Synthetic, Interpretable Tasks

Authors: Rahul Ramesh, Ekdeep Singh Lubana, Mikail Khona, Robert P. Dick, Hidenori Tanaka

Abstract: Transformers trained on huge text corpora exhibit a remarkable set of capabilities, e.g., performing basic arithmetic. Given the inherent compositional nature of language, one can expect the model to learn to compose these capabilities, potentially yielding a combinatorial explosion of what operations it can perform on an input. Motivated by the above, we train autoregressive Transformer models on… ▽ More Transformers trained on huge text corpora exhibit a remarkable set of capabilities, e.g., performing basic arithmetic. Given the inherent compositional nature of language, one can expect the model to learn to compose these capabilities, potentially yielding a combinatorial explosion of what operations it can perform on an input. Motivated by the above, we train autoregressive Transformer models on a synthetic data-generating process that involves compositions of a set of well-defined monolithic capabilities. Through a series of extensive and systematic experiments on this data-generating process, we show that: (1) autoregressive Transformers can learn compositional structures from small amounts of training data and generalize to exponentially or even combinatorially many functions; (2) generating intermediate outputs when composing functions is more effective for generalizing to new, unseen compositions than not generating any intermediate outputs (3) biases in the order of the compositions in the training data result in Transformers that fail to compose some combinations of functions; and (4) the attention layers select which capability to apply while the feed-forward layers execute the selected capability. △ Less

Submitted 5 February, 2024; v1 submitted 21 November, 2023; originally announced November 2023.

arXiv:2311.12786 [pdf, other]

Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks

Authors: Samyak Jain, Robert Kirk, Ekdeep Singh Lubana, Robert P. Dick, Hidenori Tanaka, Edward Grefenstette, Tim Rocktäschel, David Scott Krueger

Abstract: Fine-tuning large pre-trained models has become the de facto strategy for develo** both task-specific and general-purpose machine learning systems, including develo** models that are safe to deploy. Despite its clear importance, there has been minimal work that explains how fine-tuning alters the underlying capabilities learned by a model during pretraining: does fine-tuning yield entirely nov… ▽ More Fine-tuning large pre-trained models has become the de facto strategy for develo** both task-specific and general-purpose machine learning systems, including develo** models that are safe to deploy. Despite its clear importance, there has been minimal work that explains how fine-tuning alters the underlying capabilities learned by a model during pretraining: does fine-tuning yield entirely novel capabilities or does it just modulate existing ones? We address this question empirically in synthetic, controlled settings where we can use mechanistic interpretability tools (e.g., network pruning and probing) to understand how the model's underlying capabilities are changing. We perform an extensive analysis of the effects of fine-tuning in these settings, and show that: (i) fine-tuning rarely alters the underlying model capabilities; (ii) a minimal transformation, which we call a 'wrapper', is typically learned on top of the underlying model capabilities, creating the illusion that they have been modified; and (iii) further fine-tuning on a task where such hidden capabilities are relevant leads to sample-efficient 'revival' of the capability, i.e., the model begins reusing these capability after only a few gradient steps. This indicates that practitioners can unintentionally remove a model's safety wrapper merely by fine-tuning it on a, e.g., superficially unrelated, downstream task. We additionally perform analysis on language models trained on the TinyStories dataset to support our claims in a more realistic setup. △ Less

Submitted 21 November, 2023; originally announced November 2023.

arXiv:2310.17639 [pdf, other]

In-Context Learning Dynamics with Random Binary Sequences

Authors: Eric J. Bigelow, Ekdeep Singh Lubana, Robert P. Dick, Hidenori Tanaka, Tomer D. Ullman

Abstract: Large language models (LLMs) trained on huge corpora of text datasets demonstrate intriguing capabilities, achieving state-of-the-art performance on tasks they were not explicitly trained for. The precise nature of LLM capabilities is often mysterious, and different prompts can elicit different capabilities through in-context learning. We propose a framework that enables us to analyze in-context l… ▽ More Large language models (LLMs) trained on huge corpora of text datasets demonstrate intriguing capabilities, achieving state-of-the-art performance on tasks they were not explicitly trained for. The precise nature of LLM capabilities is often mysterious, and different prompts can elicit different capabilities through in-context learning. We propose a framework that enables us to analyze in-context learning dynamics to understand latent concepts underlying LLMs' behavioral patterns. This provides a more nuanced understanding than success-or-failure evaluation benchmarks, but does not require observing internal activations as a mechanistic interpretation of circuits would. Inspired by the cognitive science of human randomness perception, we use random binary sequences as context and study dynamics of in-context learning by manipulating properties of context data, such as sequence length. In the latest GPT-3.5+ models, we find emergent abilities to generate seemingly random numbers and learn basic formal languages, with striking in-context learning dynamics where model outputs transition sharply from seemingly random behaviors to deterministic repetition. △ Less

Submitted 15 April, 2024; v1 submitted 26 October, 2023; originally announced October 2023.

arXiv:2310.09336 [pdf, other]

Compositional Abilities Emerge Multiplicatively: Exploring Diffusion Models on a Synthetic Task

Authors: Maya Okawa, Ekdeep Singh Lubana, Robert P. Dick, Hidenori Tanaka

Abstract: Modern generative models exhibit unprecedented capabilities to generate extremely realistic data. However, given the inherent compositionality of the real world, reliable use of these models in practical applications requires that they exhibit the capability to compose a novel set of concepts to generate outputs not seen in the training data set. Prior work demonstrates that recent diffusion model… ▽ More Modern generative models exhibit unprecedented capabilities to generate extremely realistic data. However, given the inherent compositionality of the real world, reliable use of these models in practical applications requires that they exhibit the capability to compose a novel set of concepts to generate outputs not seen in the training data set. Prior work demonstrates that recent diffusion models do exhibit intriguing compositional generalization abilities, but also fail unpredictably. Motivated by this, we perform a controlled study for understanding compositional generalization in conditional diffusion models in a synthetic setting, varying different attributes of the training data and measuring the model's ability to generate samples out-of-distribution. Our results show: (i) the order in which the ability to generate samples from a concept and compose them emerges is governed by the structure of the underlying data-generating process; (ii) performance on compositional tasks exhibits a sudden "emergence" due to multiplicative reliance on the performance of constituent tasks, partially explaining emergent phenomena seen in generative models; and (iii) composing concepts with lower frequency in the training data to generate out-of-distribution samples requires considerably more optimization steps compared to generating in-distribution samples. Overall, our study lays a foundation for understanding capabilities and compositionality in generative models from a data-centric perspective. △ Less

Submitted 16 February, 2024; v1 submitted 13 October, 2023; originally announced October 2023.

Comments: 37th Conference on Neural Information Processing Systems (NeurIPS)

arXiv:2211.08422 [pdf, other]

Mechanistic Mode Connectivity

Authors: Ekdeep Singh Lubana, Eric J. Bigelow, Robert P. Dick, David Krueger, Hidenori Tanaka

Abstract: We study neural network loss landscapes through the lens of mode connectivity, the observation that minimizers of neural networks retrieved via training on a dataset are connected via simple paths of low loss. Specifically, we ask the following question: are minimizers that rely on different mechanisms for making their predictions connected via simple paths of low loss? We provide a definition of… ▽ More We study neural network loss landscapes through the lens of mode connectivity, the observation that minimizers of neural networks retrieved via training on a dataset are connected via simple paths of low loss. Specifically, we ask the following question: are minimizers that rely on different mechanisms for making their predictions connected via simple paths of low loss? We provide a definition of mechanistic similarity as shared invariances to input transformations and demonstrate that lack of linear connectivity between two models implies they use dissimilar mechanisms for making their predictions. Relevant to practice, this result helps us demonstrate that naive fine-tuning on a downstream dataset can fail to alter a model's mechanisms, e.g., fine-tuning can fail to eliminate a model's reliance on spurious attributes. Our analysis also motivates a method for targeted alteration of a model's mechanisms, named connectivity-based fine-tuning (CBFT), which we analyze using several synthetic datasets for the task of reducing a model's reliance on spurious attributes. △ Less

Submitted 1 June, 2023; v1 submitted 15 November, 2022; originally announced November 2022.

Comments: ICML, 2023

arXiv:2205.11506 [pdf, other]

Orchestra: Unsupervised Federated Learning via Globally Consistent Clustering

Authors: Ekdeep Singh Lubana, Chi Ian Tang, Fahim Kawsar, Robert P. Dick, Akhil Mathur

Abstract: Federated learning is generally used in tasks where labels are readily available (e.g., next word prediction). Relaxing this constraint requires design of unsupervised learning techniques that can support desirable properties for federated training: robustness to statistical/systems heterogeneity, scalability with number of participants, and communication efficiency. Prior work on this topic has f… ▽ More Federated learning is generally used in tasks where labels are readily available (e.g., next word prediction). Relaxing this constraint requires design of unsupervised learning techniques that can support desirable properties for federated training: robustness to statistical/systems heterogeneity, scalability with number of participants, and communication efficiency. Prior work on this topic has focused on directly extending centralized self-supervised learning techniques, which are not designed to have the properties listed above. To address this situation, we propose Orchestra, a novel unsupervised federated learning technique that exploits the federation's hierarchy to orchestrate a distributed clustering task and enforce a globally consistent partitioning of clients' data into discriminable clusters. We show the algorithmic pipeline in Orchestra guarantees good generalization performance under a linear probe, allowing it to outperform alternative techniques in a broad range of conditions, including variation in heterogeneity, number of clients, participation ratio, and local epochs. △ Less

Submitted 11 June, 2022; v1 submitted 23 May, 2022; originally announced May 2022.

Comments: Camera-ready ICML, 2022

arXiv:2201.09933 [pdf, other]

doi 10.1145/3517250

Do Smart Glasses Dream of Sentimental Visions? Deep Emotionship Analysis for Eyewear Devices

Authors: Yingying Zhao, Yuhu Chang, Yutian Lu, Yujiang Wang, Mingzhi Dong, Qin Lv, Robert P. Dick, Fan Yang, Tun Lu, Ning Gu, Li Shang

Abstract: Emotion recognition in smart eyewear devices is highly valuable but challenging. One key limitation of previous works is that the expression-related information like facial or eye images is considered as the only emotional evidence. However, emotional status is not isolated; it is tightly associated with people's visual perceptions, especially those sentimental ones. However, little work has exami… ▽ More Emotion recognition in smart eyewear devices is highly valuable but challenging. One key limitation of previous works is that the expression-related information like facial or eye images is considered as the only emotional evidence. However, emotional status is not isolated; it is tightly associated with people's visual perceptions, especially those sentimental ones. However, little work has examined such associations to better illustrate the cause of different emotions. In this paper, we study the emotionship analysis problem in eyewear systems, an ambitious task that requires not only classifying the user's emotions but also semantically understanding the potential cause of such emotions. To this end, we devise EMOShip, a deep-learning-based eyewear system that can automatically detect the wearer's emotional status and simultaneously analyze its associations with semantic-level visual perceptions. Experimental studies with 20 participants demonstrate that, thanks to the emotionship awareness, EMOShip not only achieves superior emotion recognition accuracy over existing methods (80.2% vs. 69.4%), but also provides a valuable understanding of the cause of emotions. Pilot studies with 20 participants further motivate the potential use of EMOShip to empower emotion-aware applications, such as emotionship self-reflection and emotionship life-logging. △ Less

Submitted 19 April, 2022; v1 submitted 24 January, 2022; originally announced January 2022.

Comments: The EMO-Film dataset is available at: https://github.com/MemX-Research/EMOShip

Journal ref: Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT), Volume 6, Issue 1, Article 38. March 2022

arXiv:2106.05956 [pdf, other]

Beyond BatchNorm: Towards a Unified Understanding of Normalization in Deep Learning

Authors: Ekdeep Singh Lubana, Robert P. Dick, Hidenori Tanaka

Abstract: Inspired by BatchNorm, there has been an explosion of normalization layers in deep learning. Recent works have identified a multitude of beneficial properties in BatchNorm to explain its success. However, given the pursuit of alternative normalization layers, these properties need to be generalized so that any given layer's success/failure can be accurately predicted. In this work, we take a first… ▽ More Inspired by BatchNorm, there has been an explosion of normalization layers in deep learning. Recent works have identified a multitude of beneficial properties in BatchNorm to explain its success. However, given the pursuit of alternative normalization layers, these properties need to be generalized so that any given layer's success/failure can be accurately predicted. In this work, we take a first step towards this goal by extending known properties of BatchNorm in randomly initialized deep neural networks (DNNs) to several recently proposed normalization layers. Our primary findings follow: (i) similar to BatchNorm, activations-based normalization layers can prevent exponential growth of activations in ResNets, but parametric techniques require explicit remedies; (ii) use of GroupNorm can ensure an informative forward propagation, with different samples being assigned dissimilar activations, but increasing group size results in increasingly indistinguishable activations for different samples, explaining slow convergence speed in models with LayerNorm; and (iii) small group sizes result in large gradient norm in earlier layers, hence explaining training instability issues in Instance Normalization and illustrating a speed-stability tradeoff in GroupNorm. Overall, our analysis reveals a unified set of mechanisms that underpin the success of normalization methods in deep learning, providing us with a compass to systematically explore the vast design space of DNN normalization layers. △ Less

Submitted 26 October, 2021; v1 submitted 10 June, 2021; originally announced June 2021.

Comments: Accepted at NeurIPS 2021

arXiv:2105.00916 [pdf, other]

doi 10.1145/3463509

MemX: An Attention-Aware Smart Eyewear System for Personalized Moment Auto-capture

Authors: Yuhu Chang, Yingying Zhao, Mingzhi Dong, Yujiang Wang, Yutian Lu, Qin Lv, Robert P. Dick, Tun Lu, Ning Gu, Li Shang

Abstract: This work presents MemX: a biologically-inspired attention-aware eyewear system developed with the goal of pursuing the long-awaited vision of a personalized visual Memex. MemX captures human visual attention on the fly, analyzes the salient visual content, and records moments of personal interest in the form of compact video snippets. Accurate attentive scene detection and analysis on resource-co… ▽ More This work presents MemX: a biologically-inspired attention-aware eyewear system developed with the goal of pursuing the long-awaited vision of a personalized visual Memex. MemX captures human visual attention on the fly, analyzes the salient visual content, and records moments of personal interest in the form of compact video snippets. Accurate attentive scene detection and analysis on resource-constrained platforms is challenging because these tasks are computation and energy intensive. We propose a new temporal visual attention network that unifies human visual attention tracking and salient visual content analysis. Attention tracking focuses computation-intensive video analysis on salient regions, while video analysis makes human attention detection and tracking more accurate. Using the YouTube-VIS dataset and 30 participants, we experimentally show that MemX significantly improves the attention tracking accuracy over the eye-tracking-alone method, while maintaining high system energy efficiency. We have also conducted 11 in-field pilot studies across a range of daily usage scenarios, which demonstrate the feasibility and potential benefits of MemX. △ Less

Submitted 9 October, 2021; v1 submitted 3 May, 2021; originally announced May 2021.

Comments: Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT)

Journal ref: Proc. ACM Interact. Mob. Wearable Ubiquitous Technol., Volume 5 Issue 2, Article 56. June 2021

arXiv:2104.04443 [pdf, other]

doi 10.1109/TMM.2021.3076612

A Reinforcement-Learning-Based Energy-Efficient Framework for Multi-Task Video Analytics Pipeline

Authors: Yingying Zhao, Mingzhi Dong, Yujiang Wang, Da Feng, Qin Lv, Robert P. Dick, Dongsheng Li, Tun Lu, Ning Gu, Li Shang

Abstract: Deep-learning-based video processing has yielded transformative results in recent years. However, the video analytics pipeline is energy-intensive due to high data rates and reliance on complex inference algorithms, which limits its adoption in energy-constrained applications. Motivated by the observation of high and variable spatial redundancy and temporal dynamics in video data streams, we desig… ▽ More Deep-learning-based video processing has yielded transformative results in recent years. However, the video analytics pipeline is energy-intensive due to high data rates and reliance on complex inference algorithms, which limits its adoption in energy-constrained applications. Motivated by the observation of high and variable spatial redundancy and temporal dynamics in video data streams, we design and evaluate an adaptive-resolution optimization framework to minimize the energy use of multi-task video analytics pipelines. Instead of heuristically tuning the input data resolution of individual tasks, our framework utilizes deep reinforcement learning to dynamically govern the input resolution and computation of the entire video analytics pipeline. By monitoring the impact of varying resolution on the quality of high-dimensional video analytics features, hence the accuracy of video analytics results, the proposed end-to-end optimization framework learns the best non-myopic policy for dynamically controlling the resolution of input video streams to globally optimize energy efficiency. Governed by reinforcement learning, optical flow is incorporated into the framework to minimize unnecessary spatio-temporal redundancy that leads to re-computation, while preserving accuracy. The proposed framework is applied to video instance segmentation which is one of the most challenging computer vision tasks, and achieves better energy efficiency than all baseline methods of similar accuracy on the YouTube-VIS dataset. △ Less

Submitted 2 May, 2021; v1 submitted 9 April, 2021; originally announced April 2021.

Comments: IEEE Transactions on Multimedia

arXiv:2102.09332 [pdf, other]

doi 10.1109/TIM.2021.3104415

HVAQ: A High-Resolution Vision-Based Air Quality Dataset

Authors: Zuohui Chen, Tony Zhang, Zhuangzhi Chen, Yun Xiang, Qi Xuan, Robert P. Dick

Abstract: Air pollutants, such as particulate matter, negatively impact human health. Most existing pollution monitoring techniques use stationary sensors, which are typically sparsely deployed. However, real-world pollution distributions vary rapidly with position and the visual effects of air pollution can be used to estimate concentration, potentially at high spatial resolution. Accurate pollution monito… ▽ More Air pollutants, such as particulate matter, negatively impact human health. Most existing pollution monitoring techniques use stationary sensors, which are typically sparsely deployed. However, real-world pollution distributions vary rapidly with position and the visual effects of air pollution can be used to estimate concentration, potentially at high spatial resolution. Accurate pollution monitoring requires either densely deployed conventional point sensors, at-a-distance vision-based pollution monitoring, or a combination of both. The main contribution of this paper is that to the best of our knowledge, it is the first publicly available, high temporal and spatial resolution air quality dataset containing simultaneous point sensor measurements and corresponding images. The dataset enables, for the first time, high spatial resolution evaluation of image-based air pollution estimation algorithms. It contains PM2.5, PM10, temperature, and humidity data. We evaluate several state-of-art vision-based PM concentration estimation algorithms on our dataset and quantify the increase in accuracy resulting from higher point sensor density and the use of images. It is our intent and belief that this dataset can enable advances by other research teams working on air quality estimation. Our dataset is available at https://github.com/implicitDeclaration/HVAQ-dataset/tree/master. △ Less

Submitted 16 October, 2021; v1 submitted 18 February, 2021; originally announced February 2021.

arXiv:2102.02805 [pdf, other]

How do Quadratic Regularizers Prevent Catastrophic Forgetting: The Role of Interpolation

Authors: Ekdeep Singh Lubana, Puja Trivedi, Danai Koutra, Robert P. Dick

Abstract: Catastrophic forgetting undermines the effectiveness of deep neural networks (DNNs) in scenarios such as continual learning and lifelong learning. While several methods have been proposed to tackle this problem, there is limited work explaining why these methods work well. This paper has the goal of better explaining a popularly used technique for avoiding catastrophic forgetting: quadratic regula… ▽ More Catastrophic forgetting undermines the effectiveness of deep neural networks (DNNs) in scenarios such as continual learning and lifelong learning. While several methods have been proposed to tackle this problem, there is limited work explaining why these methods work well. This paper has the goal of better explaining a popularly used technique for avoiding catastrophic forgetting: quadratic regularization. We show that quadratic regularizers prevent forgetting of past tasks by interpolating current and previous values of model parameters at every training iteration. Over multiple training iterations, this interpolation operation reduces the learning rates of more important model parameters, thereby minimizing their movement. Our analysis also reveals two drawbacks of quadratic regularization: (a) dependence of parameter interpolation on training hyperparameters, which often leads to training instability and (b) assignment of lower importance to deeper layers, which are generally the place forgetting occurs in DNNs. Via a simple modification to the order of operations, we show these drawbacks can be easily avoided, resulting in 6.2\% higher average accuracy at 4.5\% lower average forgetting. We confirm the robustness of our results by training over 2000 models in different settings. Code available at \url{https://github.com/EkdeepSLubana/QRforgetting} △ Less

Submitted 12 August, 2022; v1 submitted 4 February, 2021; originally announced February 2021.

Comments: Camera-ready for Conference on Lifelong Learning Agents (CoLLAs), 2022

arXiv:2009.11839 [pdf, other]

A Gradient Flow Framework For Analyzing Network Pruning

Authors: Ekdeep Singh Lubana, Robert P. Dick

Abstract: Recent network pruning methods focus on pruning models early-on in training. To estimate the impact of removing a parameter, these methods use importance measures that were originally designed to prune trained models. Despite lacking justification for their use early-on in training, such measures result in surprisingly low accuracy loss. To better explain this behavior, we develop a general framew… ▽ More Recent network pruning methods focus on pruning models early-on in training. To estimate the impact of removing a parameter, these methods use importance measures that were originally designed to prune trained models. Despite lacking justification for their use early-on in training, such measures result in surprisingly low accuracy loss. To better explain this behavior, we develop a general framework that uses gradient flow to unify state-of-the-art importance measures through the norm of model parameters. We use this framework to determine the relationship between pruning measures and evolution of model parameters, establishing several results related to pruning models early-on in training: (i) magnitude-based pruning removes parameters that contribute least to reduction in loss, resulting in models that converge faster than magnitude-agnostic methods; (ii) loss-preservation based pruning preserves first-order model evolution dynamics and is therefore appropriate for pruning minimally trained models; and (iii) gradient-norm based pruning affects second-order model evolution dynamics, such that increasing gradient norm via pruning can produce poorly performing models. We validate our claims on several VGG-13, MobileNet-V1, and ResNet-56 models trained on CIFAR-10/CIFAR-100. Code available at https://github.com/EkdeepSLubana/flowandprune. △ Less

Submitted 23 September, 2021; v1 submitted 24 September, 2020; originally announced September 2020.

Comments: Accepted at ICLR, 2021

arXiv:2009.05014 [pdf, other]

OrthoReg: Robust Network Pruning Using Orthonormality Regularization

Authors: Ekdeep Singh Lubana, Puja Trivedi, Conrad Hougen, Robert P. Dick, Alfred O. Hero

Abstract: Network pruning in Convolutional Neural Networks (CNNs) has been extensively investigated in recent years. To determine the impact of pruning a group of filters on a network's accuracy, state-of-the-art pruning methods consistently assume filters of a CNN are independent. This allows the importance of a group of filters to be estimated as the sum of importances of individual filters. However, over… ▽ More Network pruning in Convolutional Neural Networks (CNNs) has been extensively investigated in recent years. To determine the impact of pruning a group of filters on a network's accuracy, state-of-the-art pruning methods consistently assume filters of a CNN are independent. This allows the importance of a group of filters to be estimated as the sum of importances of individual filters. However, overparameterization in modern networks results in highly correlated filters that invalidate this assumption, thereby resulting in incorrect importance estimates. To address this issue, we propose OrthoReg, a principled regularization strategy that enforces orthonormality on a network's filters to reduce inter-filter correlation, thereby allowing reliable, efficient determination of group importance estimates, improved trainability of pruned networks, and efficient, simultaneous pruning of large groups of filters. When used for iterative pruning on VGG-13, MobileNet-V1, and ResNet-34, OrthoReg consistently outperforms five baseline techniques, including the state-of-the-art, on CIFAR-100 and Tiny-ImageNet. For the recently proposed Early-Bird Ticket hypothesis, which claims networks become amenable to pruning early-on in training and can be pruned after a few epochs to minimize training expenditure, we find OrthoReg significantly outperforms prior work. Code available at https://github.com/EkdeepSLubana/OrthoReg. △ Less

Submitted 10 September, 2020; originally announced September 2020.

arXiv:1403.5871 [pdf, other]

The Mason Test: A Defense Against Sybil Attacks in Wireless Networks Without Trusted Authorities

Authors: Yue Liu, David R. Bild, Robert P. Dick, Z. Morley Mao, Dan S. Wallach

Abstract: Wireless networks are vulnerable to Sybil attacks, in which a malicious node poses as many identities in order to gain disproportionate influence. Many defenses based on spatial variability of wireless channels exist, but depend either on detailed, multi-tap channel estimation - something not exposed on commodity 802.11 devices - or valid RSSI observations from multiple trusted sources, e.g., corp… ▽ More Wireless networks are vulnerable to Sybil attacks, in which a malicious node poses as many identities in order to gain disproportionate influence. Many defenses based on spatial variability of wireless channels exist, but depend either on detailed, multi-tap channel estimation - something not exposed on commodity 802.11 devices - or valid RSSI observations from multiple trusted sources, e.g., corporate access points - something not directly available in ad hoc and delay-tolerant networks with potentially malicious neighbors. We extend these techniques to be practical for wireless ad hoc networks of commodity 802.11 devices. Specifically, we propose two efficient methods for separating the valid RSSI observations of behaving nodes from those falsified by malicious participants. Further, we note that prior signalprint methods are easily defeated by mobile attackers and develop an appropriate challenge-response defense. Finally, we present the Mason test, the first implementation of these techniques for ad hoc and delay-tolerant networks of commodity 802.11 devices. We illustrate its performance in several real-world scenarios. △ Less

Submitted 24 March, 2014; originally announced March 2014.

arXiv:1403.4677 [pdf, other]

Performance Analysis of Location Profile Routing

Authors: David R. Bild, Yue Liu, Robert P. Dick, Z. Morley Mao, Dan S. Wallach

Abstract: We propose using the predictability of human motion to eliminate the overhead of distributed location services in human-carried MANETs, dubbing the technique location profile routing. This method outperforms the Geographic Hashing Location Service when nodes change locations 2x more frequently than they initiate connections (e.g., start new TCP streams), as in applications like text- and instant-m… ▽ More We propose using the predictability of human motion to eliminate the overhead of distributed location services in human-carried MANETs, dubbing the technique location profile routing. This method outperforms the Geographic Hashing Location Service when nodes change locations 2x more frequently than they initiate connections (e.g., start new TCP streams), as in applications like text- and instant-messaging. Prior characterizations of human mobility are used to show that location profile routing achieves a 93% delivery ratio with a 1.75x first-packet latency increase relative to an oracle location service. △ Less

Submitted 18 March, 2014; originally announced March 2014.

arXiv:1402.2671 [pdf, other]

doi 10.1145/2700060

Aggregate Characterization of User Behavior in Twitter and Analysis of the Retweet Graph

Authors: David R. Bild, Yue Liu, Robert P. Dick, Z. Morley Mao, Dan S. Wallach

Abstract: Most previous analysis of Twitter user behavior is focused on individual information cascades and the social followers graph. We instead study aggregate user behavior and the retweet graph with a focus on quantitative descriptions. We find that the lifetime tweet distribution is a type-II discrete Weibull stemming from a power law hazard function, the tweet rate distribution, although asymptotical… ▽ More Most previous analysis of Twitter user behavior is focused on individual information cascades and the social followers graph. We instead study aggregate user behavior and the retweet graph with a focus on quantitative descriptions. We find that the lifetime tweet distribution is a type-II discrete Weibull stemming from a power law hazard function, the tweet rate distribution, although asymptotically power law, exhibits a lognormal cutoff over finite sample intervals, and the inter-tweet interval distribution is power law with exponential cutoff. The retweet graph is small-world and scale-free, like the social graph, but is less disassortative and has much stronger clustering. These differences are consistent with it better capturing the real-world social relationships of and trust between users. Beyond just understanding and modeling human communication patterns and social networks, applications for alternative, decentralized microblogging systems-both predicting real-word performance and detecting spam-are discussed. △ Less

Submitted 11 February, 2014; originally announced February 2014.

Comments: 17 pages, 21 figures

Journal ref: ACM Trans. Internet Technol. 15, 1, Article 4 (February 2015), 24 pages

Showing 1–19 of 19 results for author: Dick, R P