Search | arXiv e-print repository

Adaptive Computation Modules: Granular Conditional Computation For Efficient Inference

Authors: Bartosz Wójcik, Alessio Devoto, Karol Pustelnik, Pasquale Minervini, Simone Scardapane

Abstract: The computational cost of transformer models makes them inefficient in low-latency or low-power applications. While techniques such as quantization or linear attention can reduce the computational load, they may incur a reduction in accuracy. In addition, globally reducing the cost for all inputs may be sub-optimal. We observe that for each layer, the full width of the layer may be needed only for a small subset of tokens inside a batch and that the "effective" width needed to process a token can vary from layer to layer. Motivated by this observation, we introduce the Adaptive Computation Module (ACM), a generic module that dynamically adapts its computational load to match the estimated difficulty of the input on a per-token basis. An ACM consists of a sequence of learners that progressively refine the output of their preceding counterparts. An additional gating mechanism determines the optimal number of learners to execute for each token. We also describe a distillation technique to replace any pre-trained model with an "ACMized" variant. The distillation phase is designed to be highly parallelizable across layers while being simple to plug-and-play into existing networks. Our evaluation of transformer models in computer vision and speech recognition demonstrates that substituting layers with ACMs significantly reduces inference costs without degrading the downstream accuracy for a wide interval of user-defined budgets.

Submitted 15 December, 2023; originally announced December 2023.

arXiv:2310.04361 [pdf, other]

Exploiting Activation Sparsity with Dense to Dynamic-k Mixture-of-Experts Conversion

Authors: Filip Szatkowski, Bartosz Wójcik, Mikołaj Piórczyński, Simone Scardapane

Abstract: Transformer models can face practical limitations due to their high computational requirements. At the same time, such models exhibit significant activation sparsity, which can be leveraged to reduce the inference cost by converting parts of the network into equivalent Mixture-of-Experts (MoE) layers. Despite the crucial role played by activation sparsity, its impact on this process remains unexplored. In particular, we show that the efficiency of the conversion can be significantly enhanced by a proper regularization of the activation sparsity of the base model. Moreover, motivated by the high variance of the number of activated neurons for different inputs, we introduce a more effective dynamic-k expert selection rule that adjusts the number of executed experts on a per-token basis. Finally, we extend this approach to multi-head attention projections, which results in additional savings compared to only converting the FFN blocks. The proposed method, Dense to Dynamic-$k$ Mixture-of-Experts (D2DMoE), outperforms existing approaches on common NLP and vision tasks, allowing us to save up to 60% of inference cost without significantly affecting model performance.

Submitted 7 June, 2024; v1 submitted 6 October, 2023; originally announced October 2023.

arXiv:2309.12033 [pdf, other]

Face Identity-Aware Disentanglement in StyleGAN

Authors: Adrian Suwała, Bartosz Wójcik, Magdalena Proszewska, Jacek Tabor, Przemysław Spurek, Marek Śmieja

Abstract: Conditional GANs are frequently used for manipulating the attributes of face images, such as expression, hairstyle, pose, or age. Even though the state-of-the-art models successfully modify the requested attributes, they simultaneously modify other important characteristics of the image, such as a person's identity. In this paper, we focus on solving this problem by introducing PluGeN4Faces, a plugin to StyleGAN, which explicitly disentangles face attributes from a person's identity. Our key idea is to perform training on images retrieved from movie frames, where a given person appears in various poses and with different attributes. By applying a type of contrastive loss, we encourage the model to group images of the same person in similar regions of latent space. Our experiments demonstrate that the modifications of face attributes performed by PluGeN4Faces are significantly less invasive on the remaining characteristics of the image than in the existing state-of-the-art models.

Submitted 21 September, 2023; originally announced September 2023.

arXiv:2210.05282 [pdf, other]

doi 10.1109/ACCESS.2022.3212918

Computer Vision based inspection on post-earthquake with UAV synthetic dataset

Authors: Mateusz Żarski, Bartosz Wójcik, Jarosław A. Miszczak, Bartłomiej Blachowski, Mariusz Ostrowski

Abstract: The area affected by an earthquake is vast and often difficult to cover entirely, and the earthquake itself is a sudden event that causes multiple defects simultaneously, which cannot be effectively traced using traditional, manual methods. This article presents an innovative approach to the problem of detecting damage after sudden events by using an interconnected set of deep machine learning models organized in a single pipeline, allowing for easy modification and seamless swapping of models. Models in the pipeline were trained with a synthetic dataset and were adapted to be further evaluated and used with unmanned aerial vehicles (UAVs) in real-world conditions. Thanks to the methods presented in the article, it is possible to obtain high accuracy in detecting building defects, segmenting constructions into their components and estimating their technical condition based on a single drone flight.

Submitted 11 October, 2022; originally announced October 2022.

Comments: 15 pages, 8 figures, published version, software available from https://github.com/MatZar01/IC_SHM_P2

Journal ref: IEEE Access, Vol. 10 (2022), pp. 108134-108144

arXiv:2206.13923 [pdf, other]

SLOVA: Uncertainty Estimation Using Single Label One-Vs-All Classifier

Authors: Bartosz Wójcik, Jacek Grela, Marek Śmieja, Krzysztof Misztal, Jacek Tabor

Abstract: Deep neural networks present impressive performance, yet they cannot reliably estimate their predictive confidence, limiting their applicability in high-risk domains. We show that applying a multi-label one-vs-all loss reveals classification ambiguity and reduces model overconfidence. The introduced SLOVA (Single Label One-Vs-All) model redefines typical one-vs-all predictive probabilities to a single label situation, where only one class is the correct answer. The proposed classifier is confident only if a single class has a high probability and other probabilities are negligible. Unlike the typical softmax function, SLOVA naturally detects out-of-distribution samples if the probabilities of all other classes are small. The model is additionally fine-tuned with exponential calibration, which allows us to precisely align the confidence score with model accuracy. We verify our approach on three tasks. First, we demonstrate that SLOVA is competitive with the state-of-the-art on in-distribution calibration. Second, the performance of SLOVA is robust under dataset shifts. Finally, our approach performs extremely well in the detection of out-of-distribution samples. Consequently, SLOVA is a tool that can be used in various applications where uncertainty modeling is required.

Submitted 28 June, 2022; originally announced June 2022.

arXiv:2206.07996 [pdf, other]

Continual Learning with Guarantees via Weight Interval Constraints

Authors: Maciej Wołczyk, Karol J. Piczak, Bartosz Wójcik, Łukasz Pustelnik, Paweł Morawiecki, Jacek Tabor, Tomasz Trzciński, Przemysław Spurek

Abstract: We introduce a new training paradigm that enforces interval constraints on neural network parameter space to control forgetting. Contemporary Continual Learning (CL) methods focus on training neural networks efficiently from a stream of data, while reducing the negative impact of catastrophic forgetting, yet they do not provide any firm guarantees that network performance will not deteriorate uncontrollably over time. In this work, we show how to put bounds on forgetting by reformulating continual learning of a model as a continual contraction of its parameter space. To that end, we propose Hyperrectangle Training, a new training methodology where each task is represented by a hyperrectangle in the parameter space, fully contained in the hyperrectangles of the previous tasks. This formulation reduces the NP-hard CL problem back to polynomial time while providing full resilience against forgetting. We validate our claim by developing the InterContiNet (Interval Continual Learning) algorithm, which leverages interval arithmetic to effectively model parameter regions as hyperrectangles. Through experimental results, we show that our approach performs well in a continual learning setup without storing data from previous tasks.

Submitted 16 June, 2022; originally announced June 2022.

Comments: Short presentation at ICML 2022

arXiv:2106.10944 [pdf, other]

doi 10.24425/bpasts.2023.147340

Hard hat wearing detection based on head keypoint localization

Authors: Bartosz Wójcik, Mateusz Żarski, Kamil Książek, Jarosław Adam Miszczak, Mirosław Jan Skibniewski

Abstract: In recent years, a lot of attention has been paid to deep learning methods in the context of vision-based construction site safety systems, especially regarding personal protective equipment. However, despite all this attention, there is still no reliable way to establish the relationship between workers and their hard hats. To address this problem, this article proposes a combination of deep learning, object detection and head keypoint localization with simple rule-based reasoning. In tests, this solution surpassed the previous methods based on the relative bounding box position of different instances, as well as direct detection of hard hat wearers and non-wearers. The results show that the conjunction of novel deep learning methods with humanly-interpretable rule-based systems can result in a solution that is both reliable and can successfully mimic manual, on-site supervision. This work is the next step in the development of fully autonomous construction site safety systems and shows that there is still room for improvement in this area.

Submitted 24 June, 2022; v1 submitted 21 June, 2021; originally announced June 2021.

Comments: 17 pages, 9 figures and 9 tables

Journal ref: Bull. Pol. Acad. Sci. Tech. Sci. Vol. 71, No. 6, pp. e147340 (2023)

arXiv:2106.05409 [pdf, other]

Zero Time Waste: Recycling Predictions in Early Exit Neural Networks

Authors: Maciej Wołczyk, Bartosz Wójcik, Klaudia Bałazy, Igor Podolak, Jacek Tabor, Marek Śmieja, Tomasz Trzciński

Abstract: The problem of reducing processing time of large deep learning models is a fundamental challenge in many real-world applications. Early exit methods strive towards this goal by attaching additional Internal Classifiers (ICs) to intermediate layers of a neural network. ICs can quickly return predictions for easy examples and, as a result, reduce the average inference time of the whole model. However, if a particular IC does not decide to return an answer early, its predictions are discarded, with its computations effectively being wasted. To solve this issue, we introduce Zero Time Waste (ZTW), a novel approach in which each IC reuses predictions returned by its predecessors by (1) adding direct connections between ICs and (2) combining previous outputs in an ensemble-like manner. We conduct extensive experiments across various datasets and architectures to demonstrate that ZTW achieves a significantly better accuracy vs. inference time trade-off than other recently proposed early exit methods.

Submitted 5 December, 2021; v1 submitted 9 June, 2021; originally announced June 2021.

Comments: Accepted at NeurIPS 2021
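
The early-exit mechanism that the abstract above starts from can be sketched as follows. This is only an illustration of the plain baseline (return the first sufficiently confident internal classifier), not the ZTW method itself: it does not recycle earlier predictions. The function names, the use of the maximum softmax probability as the confidence score, and the `threshold` value are all assumptions made for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(heads, threshold=0.9):
    """Run classifier heads in depth order and stop at the first confident one.

    `heads` is a list of zero-argument callables (hypothetical stand-ins for
    internal classifiers); each returns a list of logits. The last head plays
    the role of the final classifier and always answers.
    Returns (predicted_class, index_of_head_that_answered).
    """
    if not heads:
        raise ValueError("at least one classifier head is required")
    last = len(heads) - 1
    for i, head in enumerate(heads):
        probs = softmax(head())
        confidence = max(probs)
        # Exit early when confident enough; deeper heads are never evaluated.
        if confidence >= threshold or i == last:
            return probs.index(confidence), i
```

Because the heads are callables, the heads after the exit point are never invoked, which is where the inference-time saving comes from in this sketch.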

arXiv:2006.10013 [pdf, other]

Adversarial Examples Detection and Analysis with Layer-wise Autoencoders

Authors: Bartosz Wójcik, Paweł Morawiecki, Marek Śmieja, Tomasz Krzyżek, Przemysław Spurek, Jacek Tabor

Abstract: We present a mechanism for detecting adversarial examples based on data representations taken from the hidden layers of the target network. For this purpose, we train individual autoencoders at intermediate layers of the target network. This allows us to describe the manifold of true data and, in consequence, decide whether a given example has the same characteristics as true data. It also gives us insight into the behavior of adversarial examples and their flow through the layers of a deep neural network. Experimental results show that our method outperforms the state of the art in supervised and unsupervised settings.

Submitted 17 June, 2020; originally announced June 2020.

arXiv:2004.12337 [pdf, other]

doi 10.1016/j.softx.2021.100893

KrakN: Transfer Learning framework for thin crack detection in infrastructure maintenance

Authors: Mateusz Żarski, Bartosz Wójcik, Jarosław Adam Miszczak

Abstract: Monitoring the technical condition of infrastructure is a crucial element of its maintenance. Currently applied methods are outdated, labour-intensive and inaccurate. At the same time, the latest methods using Artificial Intelligence techniques are severely limited in their application due to two main factors: labour-intensive gathering of new datasets and high demand for computing power. We propose to use a custom-made framework, KrakN, to overcome these limiting factors. It enables the development of unique infrastructure defect detectors on digital images, achieving accuracy above 90%. The framework supports semi-automatic creation of new datasets and has modest computing power requirements. It is implemented in the form of a ready-to-use software package openly distributed to the public. Thus, it can be used to immediately implement the methods proposed in this paper in the process of infrastructure management by government units, regardless of their financial capabilities.

Submitted 11 October, 2020; v1 submitted 26 April, 2020; originally announced April 2020.

Comments: 23 pages, 15 figures and flowcharts, software available at https://github.com/MatZar01/KrakN, https://doi.org/10.5281/zenodo.3764697, and https://doi.org/10.5281/zenodo.3755452, dataset available from https://doi.org/10.5281/zenodo.3759845

Journal ref: SoftwareX, Vol 16, 100893 (2021)

arXiv:2004.08172 [pdf, other]

Finding the Optimal Network Depth in Classification Tasks

Authors: Bartosz Wójcik, Maciej Wołczyk, Klaudia Bałazy, Jacek Tabor

Abstract: We develop a fast end-to-end method for training lightweight neural networks using multiple classifier heads. By allowing the model to determine the importance of each head and rewarding the choice of a single shallow classifier, we are able to detect and remove unneeded components of the network. This operation, which can be seen as finding the optimal depth of the model, significantly reduces the number of parameters and accelerates inference across different hardware processing units, which is not the case for many standard pruning methods. We show the performance of our method on multiple network architectures and datasets, analyze its optimization properties, and conduct ablation studies.

Submitted 17 April, 2020; originally announced April 2020.

arXiv:1905.12947 [pdf, other]

One-element Batch Training by Moving Window

Authors: Przemysław Spurek, Szymon Knop, Jacek Tabor, Igor Podolak, Bartosz Wójcik

Abstract: Several deep models, especially generative ones, compare the samples from two distributions (e.g. WAE-like autoencoder models, set-processing deep networks, etc.) in their cost functions. With these methods, one cannot train the model directly on small batches (in the extreme case, a single element), because samples must be compared. We propose a generic approach to training such models using one-element mini-batches. The idea is based on splitting the batch in latent space into parts: previous, i.e. historical, elements used for latent space distribution matching, and the current ones, used both for latent distribution computation and the minimization process. Due to the smaller memory requirements, this allows training networks on higher-resolution images than in the classical approach.

Submitted 31 May, 2019; v1 submitted 30 May, 2019; originally announced May 2019.

arXiv:1902.07656 [pdf, other]

doi 10.4467/20838476SI.18.004.10409

LOSSGRAD: automatic learning rate in gradient descent

Authors: Bartosz Wójcik, Łukasz Maziarka, Jacek Tabor

Abstract: In this paper, we propose a simple, fast and easy-to-implement algorithm, LOSSGRAD (locally optimal step-size in gradient descent), which automatically modifies the step-size in gradient descent during neural network training. Given a function $f$, a point $x$, and the gradient $\nabla_x f$ of $f$, we aim to find the step-size $h$ which is (locally) optimal, i.e. satisfies: $$ h = \operatorname*{arg\,min}_{t \geq 0} f(x - t \nabla_x f). $$ Making use of quadratic approximation, we show that the algorithm satisfies the above assumption. We experimentally show that our method is insensitive to the choice of initial learning rate while achieving results comparable to other methods.

Submitted 20 February, 2019; originally announced February 2019.

Comments: TFML 2019

Journal ref: Schedae Informaticae, 2018, Volume 27
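
The step-size criterion in the abstract above, h = argmin over t >= 0 of f(x - t * grad f), can be illustrated with a one-dimensional quadratic fit along the negative gradient. The sketch below is a generic quadratic-interpolation line search written for this listing, not the LOSSGRAD update rule from the paper; the function name, the trial step `h`, and the fallback of doubling the step when the fitted curvature is not positive are all assumptions.

```python
def quadratic_step_size(f, x, grad, h):
    """Estimate argmin_{t >= 0} f(x - t * grad) from a quadratic fit.

    Uses three pieces of information along the ray x - t * grad:
    the value at t = 0, the slope at t = 0, and the value at the trial step t = h.
    `x` and `grad` are lists of floats; `h` must be positive.
    """
    if h <= 0.0:
        raise ValueError("trial step h must be positive")
    phi0 = f(x)
    # Directional derivative of t -> f(x - t * grad) at t = 0 is -||grad||^2.
    slope0 = -sum(g * g for g in grad)
    if slope0 == 0.0:
        return 0.0  # zero gradient: already at a stationary point
    trial_point = [xi - h * gi for xi, gi in zip(x, grad)]
    phi_h = f(trial_point)
    # Fit phi(t) ~ phi0 + slope0 * t + curvature * t^2 through the trial value.
    curvature = (phi_h - phi0 - slope0 * h) / (h * h)
    if curvature <= 0.0:
        # The fitted model has no minimum along the ray; as an assumed
        # fallback, suggest a larger trial step instead.
        return 2.0 * h
    return -slope0 / (2.0 * curvature)
```

For an exactly quadratic objective such as f(x) = x_1^2 + x_2^2, the fit is exact and the returned step lands on the minimizer along the gradient direction regardless of the trial step chosen.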

arXiv:1306.6294 [pdf, other]

Learning Trajectory Preferences for Manipulators via Iterative Improvement

Authors: Ashesh Jain, Brian Wojcik, Thorsten Joachims, Ashutosh Saxena

Abstract: We consider the problem of learning good trajectories for manipulation tasks. This is challenging because the criterion defining a good trajectory varies with users, tasks and environments. In this paper, we propose a co-active online learning framework for teaching robots the preferences of their users for object manipulation tasks. The key novelty of our approach lies in the type of feedback expected from the user: the human user does not need to demonstrate optimal trajectories as training data, but merely needs to iteratively provide trajectories that slightly improve over the trajectory currently proposed by the system. We argue that this co-active preference feedback can be more easily elicited from the user than demonstrations of optimal trajectories, which are often challenging and non-intuitive to provide on high degrees of freedom manipulators. Nevertheless, theoretical regret bounds of our algorithm match the asymptotic rates of optimal trajectory algorithms. We demonstrate the generalizability of our algorithm on a variety of grocery checkout tasks, for which the preferences were influenced not only by the object being manipulated but also by the surrounding environment. For more details and a demonstration video, visit: http://pr.cs.cornell.edu/coactive

Submitted 5 November, 2013; v1 submitted 26 June, 2013; originally announced June 2013.

Comments: 9 pages. To appear in NIPS 2013

Showing 1–14 of 14 results for author: Wojcik, B