Search | arXiv e-print repository

arXiv:2012.08482 [pdf, other]

Learning Aggregation Functions

Authors: Giovanni Pellegrini, Alessandro Tibo, Paolo Frasconi, Andrea Passerini, Manfred Jaeger

Abstract: Learning on sets is increasingly gaining attention in the machine learning community, due to its widespread applicability. Typically, representations over sets are computed by using fixed aggregation functions such as sum or maximum. However, recent results showed that universal function representation by sum- (or max-) decomposition requires either highly discontinuous (and thus poorly learnable)… ▽ More Learning on sets is increasingly gaining attention in the machine learning community, due to its widespread applicability. Typically, representations over sets are computed by using fixed aggregation functions such as sum or maximum. However, recent results showed that universal function representation by sum- (or max-) decomposition requires either highly discontinuous (and thus poorly learnable) map**s, or a latent dimension equal to the maximum number of elements in the set. To mitigate this problem, we introduce a learnable aggregation function (LAF) for sets of arbitrary cardinality. LAF can approximate several extensively used aggregators (such as average, sum, maximum) as well as more complex functions (e.g., variance and skewness). We report experiments on semi-synthetic and real data showing that LAF outperforms state-of-the-art sum- (max-) decomposition architectures such as DeepSets and library-based architectures like Principal Neighborhood Aggregation, and can be effectively combined with attention-based architectures. △ Less

Submitted 3 June, 2021; v1 submitted 15 December, 2020; originally announced December 2020.

Comments: Extended version (with proof appendix) of paper that is to appear in Proceedings of IJCAI 2021

arXiv:2006.16370 [pdf, other]

doi 10.1109/JBHI.2020.3005016

Classification of cancer pathology reports: a large-scale comparative study

Authors: Stefano Martina, Leonardo Ventura, Paolo Frasconi

Abstract: We report about the application of state-of-the-art deep learning techniques to the automatic and interpretable assignment of ICD-O3 topography and morphology codes to free-text cancer reports. We present results on a large dataset (more than 80 000 labeled and 1 500 000 unlabeled anonymized reports written in Italian and collected from hospitals in Tuscany over more than a decade) and with a larg… ▽ More We report about the application of state-of-the-art deep learning techniques to the automatic and interpretable assignment of ICD-O3 topography and morphology codes to free-text cancer reports. We present results on a large dataset (more than 80 000 labeled and 1 500 000 unlabeled anonymized reports written in Italian and collected from hospitals in Tuscany over more than a decade) and with a large number of classes (134 morphological classes and 61 topographical classes). We compare alternative architectures in terms of prediction accuracy and interpretability and show that our best model achieves a multiclass accuracy of 90.3% on topography site assignment and 84.8% on morphology type assignment. We found that in this context hierarchical models are not better than flat models and that an element-wise maximum aggregator is slightly better than attentive models on site classification. Moreover, the maximum aggregator offers a way to interpret the classification process. △ Less

Submitted 29 June, 2020; originally announced June 2020.

Comments: 10 pages, 6 figures, 3 tables, accepted for publication in IEEE Journal of Biomedical and Health Informatics (J-BHI)

ACM Class: I.2.6; I.2.7; J.3

Journal ref: IEEE Journal of Biomedical and Health Informatics 24 (11), 3085-3094 (2020)

arXiv:1910.08525 [pdf, other]

MARTHE: Scheduling the Learning Rate Via Online Hypergradients

Authors: Michele Donini, Luca Franceschi, Massimiliano Pontil, Orchid Majumder, Paolo Frasconi

Abstract: We study the problem of fitting task-specific learning rate schedules from the perspective of hyperparameter optimization, aiming at good generalization. We describe the structure of the gradient of a validation error w.r.t. the learning rate schedule -- the hypergradient. Based on this, we introduce MARTHE, a novel online algorithm guided by cheap approximations of the hypergradient that uses pas… ▽ More We study the problem of fitting task-specific learning rate schedules from the perspective of hyperparameter optimization, aiming at good generalization. We describe the structure of the gradient of a validation error w.r.t. the learning rate schedule -- the hypergradient. Based on this, we introduce MARTHE, a novel online algorithm guided by cheap approximations of the hypergradient that uses past information from the optimization trajectory to simulate future behaviour. It interpolates between two recent techniques, RTHO (Franceschi et al., 2017) and HD (Baydin et al. 2018), and is able to produce learning rate schedules that are more stable leading to models that generalize better. △ Less

Submitted 17 May, 2020; v1 submitted 18 October, 2019; originally announced October 2019.

Comments: IJCAI 2020. Larger images. Code available at https://github.com/awslabs/adatune

arXiv:1810.11514 [pdf, other]

Learning and Interpreting Multi-Multi-Instance Learning Networks

Authors: Alessandro Tibo, Manfred Jaeger, Paolo Frasconi

Abstract: We introduce an extension of the multi-instance learning problem where examples are organized as nested bags of instances (e.g., a document could be represented as a bag of sentences, which in turn are bags of words). This framework can be useful in various scenarios, such as text and image classification, but also supervised learning over graphs. As a further advantage, multi-multi instance learn… ▽ More We introduce an extension of the multi-instance learning problem where examples are organized as nested bags of instances (e.g., a document could be represented as a bag of sentences, which in turn are bags of words). This framework can be useful in various scenarios, such as text and image classification, but also supervised learning over graphs. As a further advantage, multi-multi instance learning enables a particular way of interpreting predictions and the decision function. Our approach is based on a special neural network layer, called bag-layer, whose units aggregate bags of inputs of arbitrary size. We prove theoretically that the associated class of functions contains all Boolean functions over sets of sets of instances and we provide empirical evidence that functions of this kind can be actually learned on semi-synthetic datasets. We finally present experiments on text classification, on citation graphs, and social graph data, which show that our model obtains competitive results with respect to accuracy when compared to other approaches such as convolutional networks on graphs, while at the same time it supports a general approach to interpret the learnt model, as well as explain individual predictions. △ Less

Submitted 3 October, 2020; v1 submitted 26 October, 2018; originally announced October 2018.

Comments: JMLR

arXiv:1806.04941 [pdf, other]

Far-HO: A Bilevel Programming Package for Hyperparameter Optimization and Meta-Learning

Authors: Luca Franceschi, Riccardo Grazzi, Massimiliano Pontil, Saverio Salzo, Paolo Frasconi

Abstract: In (Franceschi et al., 2018) we proposed a unified mathematical framework, grounded on bilevel programming, that encompasses gradient-based hyperparameter optimization and meta-learning. We formulated an approximate version of the problem where the inner objective is solved iteratively, and gave sufficient conditions ensuring convergence to the exact problem. In this work we show how to optimize l… ▽ More In (Franceschi et al., 2018) we proposed a unified mathematical framework, grounded on bilevel programming, that encompasses gradient-based hyperparameter optimization and meta-learning. We formulated an approximate version of the problem where the inner objective is solved iteratively, and gave sufficient conditions ensuring convergence to the exact problem. In this work we show how to optimize learning rates, automatically weight the loss of single examples and learn hyper-representations with Far-HO, a software package based on the popular deep learning framework TensorFlow that allows to seamlessly tackle both HO and ML problems. △ Less

Submitted 13 June, 2018; originally announced June 2018.

Comments: This submission is a reduced version of (Franceschi et al., arXiv:1806.04910) which has been accepted at the main ICML 2018 conference. In this paper we illustrate the software framework, material that could not be included in the conference paper

arXiv:1806.04910 [pdf, other]

Bilevel Programming for Hyperparameter Optimization and Meta-Learning

Authors: Luca Franceschi, Paolo Frasconi, Saverio Salzo, Riccardo Grazzi, Massimilano Pontil

Abstract: We introduce a framework based on bilevel programming that unifies gradient-based hyperparameter optimization and meta-learning. We show that an approximate version of the bilevel problem can be solved by taking into explicit account the optimization dynamics for the inner objective. Depending on the specific setting, the outer variables take either the meaning of hyperparameters in a supervised l… ▽ More We introduce a framework based on bilevel programming that unifies gradient-based hyperparameter optimization and meta-learning. We show that an approximate version of the bilevel problem can be solved by taking into explicit account the optimization dynamics for the inner objective. Depending on the specific setting, the outer variables take either the meaning of hyperparameters in a supervised learning problem or parameters of a meta-learner. We provide sufficient conditions under which solutions of the approximate problem converge to those of the exact problem. We instantiate our approach for meta-learning in the case of deep learning where representation layers are treated as hyperparameters shared across a set of training episodes. In experiments, we confirm our theoretical findings, present encouraging results for few-shot learning and contrast the bilevel approach against classical approaches for learning-to-learn. △ Less

Submitted 3 July, 2018; v1 submitted 13 June, 2018; originally announced June 2018.

Comments: ICML 2018; code for replicating experiments at https://github.com/prolearner/hyper-representation, main package (Far-HO) at https://github.com/lucfra/FAR-HO

arXiv:1804.09808 [pdf, other]

Off the Beaten Track: Using Deep Learning to Interpolate Between Music Genres

Authors: Tijn Borghuis, Alessandro Tibo, Simone Conforti, Luca Canciello, Lorenzo Brusci, Paolo Frasconi

Abstract: We describe a system based on deep learning that generates drum patterns in the electronic dance music domain. Experimental results reveal that generated patterns can be employed to produce musically sound and creative transitions between different genres, and that the process of generation is of interest to practitioners in the field. We describe a system based on deep learning that generates drum patterns in the electronic dance music domain. Experimental results reveal that generated patterns can be employed to produce musically sound and creative transitions between different genres, and that the process of generation is of interest to practitioners in the field. △ Less

Submitted 2 May, 2018; v1 submitted 25 April, 2018; originally announced April 2018.

arXiv:1712.06283 [pdf, other]

A Bridge Between Hyperparameter Optimization and Learning-to-learn

Authors: Luca Franceschi, Michele Donini, Paolo Frasconi, Massimiliano Pontil

Abstract: We consider a class of a nested optimization problems involving inner and outer objectives. We observe that by taking into explicit account the optimization dynamics for the inner objective it is possible to derive a general framework that unifies gradient-based hyperparameter optimization and meta-learning (or learning-to-learn). Depending on the specific setting, the variables of the outer objec… ▽ More We consider a class of a nested optimization problems involving inner and outer objectives. We observe that by taking into explicit account the optimization dynamics for the inner objective it is possible to derive a general framework that unifies gradient-based hyperparameter optimization and meta-learning (or learning-to-learn). Depending on the specific setting, the variables of the outer objective take either the meaning of hyperparameters in a supervised learning problem or parameters of a meta-learner. We show that some recently proposed methods in the latter setting can be instantiated in our framework and tackled with the same gradient-based algorithms. Finally, we discuss possible design patterns for learning-to-learn and present encouraging preliminary experiments for few-shot learning. △ Less

Submitted 21 August, 2019; v1 submitted 18 December, 2017; originally announced December 2017.

Comments: NIPS 2017 workshop on Meta-learning (http://metalearning.ml/)

arXiv:1703.05537 [pdf]

doi 10.3389/frobt.2018.00042

Shift Aggregate Extract Networks

Authors: Francesco Orsini, Daniele Baracchi, Paolo Frasconi

Abstract: We introduce an architecture based on deep hierarchical decompositions to learn effective representations of large graphs. Our framework extends classic R-decompositions used in kernel methods, enabling nested part-of-part relations. Unlike recursive neural networks, which unroll a template on input graphs directly, we unroll a neural network template over the decomposition hierarchy, allowing us… ▽ More We introduce an architecture based on deep hierarchical decompositions to learn effective representations of large graphs. Our framework extends classic R-decompositions used in kernel methods, enabling nested part-of-part relations. Unlike recursive neural networks, which unroll a template on input graphs directly, we unroll a neural network template over the decomposition hierarchy, allowing us to deal with the high degree variability that typically characterize social network graphs. Deep hierarchical decompositions are also amenable to domain compression, a technique that reduces both space and time complexity by exploiting symmetries. We show empirically that our approach is able to outperform current state-of-the-art graph classification methods on large social network datasets, while at the same time being competitive on small chemobiological benchmark datasets. △ Less

Submitted 18 March, 2024; v1 submitted 16 March, 2017; originally announced March 2017.

Journal ref: Frontiers Robotics AI, 2018, 5(APR), 42

arXiv:1511.01168 [pdf, other]

Cell identification in whole-brain multiview images of neural activation

Authors: Marco Paciscopi, Ludovico Silvestri, Francesco Saverio Pavone, Paolo Frasconi

Abstract: We present a scalable method for brain cell identification in multiview confocal light sheet microscopy images. Our algorithmic pipeline includes a hierarchical registration approach and a novel multiview version of semantic deconvolution that simultaneously enhance visibility of fluorescent cell bodies, equalize their contrast, and fuses adjacent views into a single 3D images on which cell identi… ▽ More We present a scalable method for brain cell identification in multiview confocal light sheet microscopy images. Our algorithmic pipeline includes a hierarchical registration approach and a novel multiview version of semantic deconvolution that simultaneously enhance visibility of fluorescent cell bodies, equalize their contrast, and fuses adjacent views into a single 3D images on which cell identification is performed with mean shift. We present empirical results on a whole-brain image of an adult Arc-dVenus mouse acquired at 4micron resolution. Based on an annotated test volume containing 3278 cells, our algorithm achieves an $F_1$ measure of 0.89. △ Less

Submitted 3 November, 2015; originally announced November 2015.

ACM Class: J.3

arXiv:1205.3981 [pdf, other]

doi 10.1016/j.artint.2014.08.003

kLog: A Language for Logical and Relational Learning with Kernels

Authors: Paolo Frasconi, Fabrizio Costa, Luc De Raedt, Kurt De Grave

Abstract: We introduce kLog, a novel approach to statistical relational learning. Unlike standard approaches, kLog does not represent a probability distribution directly. It is rather a language to perform kernel-based learning on expressive logical and relational representations. kLog allows users to specify learning problems declaratively. It builds on simple but powerful concepts: learning from interpret… ▽ More We introduce kLog, a novel approach to statistical relational learning. Unlike standard approaches, kLog does not represent a probability distribution directly. It is rather a language to perform kernel-based learning on expressive logical and relational representations. kLog allows users to specify learning problems declaratively. It builds on simple but powerful concepts: learning from interpretations, entity/relationship data modeling, logic programming, and deductive databases. Access by the kernel to the rich representation is mediated by a technique we call graphicalization: the relational representation is first transformed into a graph --- in particular, a grounded entity/relationship diagram. Subsequently, a choice of graph kernel defines the feature space. kLog supports mixed numerical and symbolic data, as well as background knowledge in the form of Prolog or Datalog programs as in inductive logic programming systems. The kLog framework can be applied to tackle the same range of tasks that has made statistical relational learning so popular, including classification, regression, multitask learning, and collective classification. We also report about empirical comparisons, showing that kLog can be either more accurate, or much faster at the same level of accuracy, than Tilde and Alchemy. kLog is GPLv3 licensed and is available at http://klog.dinfo.unifi.it along with tutorials. △ Less

Submitted 28 July, 2014; v1 submitted 17 May, 2012; originally announced May 2012.

ACM Class: I.2.6; I.2.3; D.3.2

arXiv:cs/9510101 [pdf, ps]

Diffusion of Context and Credit Information in Markovian Models

Authors: Y. Bengio, P. Frasconi

Abstract: This paper studies the problem of ergodicity of transition probability matrices in Markovian models, such as hidden Markov models (HMMs), and how it makes very difficult the task of learning to represent long-term context for sequential data. This phenomenon hurts the forward propagation of long-term context information, as well as learning a hidden state representation to represent long-term co… ▽ More This paper studies the problem of ergodicity of transition probability matrices in Markovian models, such as hidden Markov models (HMMs), and how it makes very difficult the task of learning to represent long-term context for sequential data. This phenomenon hurts the forward propagation of long-term context information, as well as learning a hidden state representation to represent long-term context, which depends on propagating credit information backwards in time. Using results from Markov chain theory, we show that this problem of diffusion of context and credit is reduced when the transition probabilities approach 0 or 1, i.e., the transition probability matrices are sparse and the model essentially deterministic. The results found in this paper apply to learning approaches based on continuous optimization, such as gradient descent and the Baum-Welch algorithm. △ Less

Submitted 30 September, 1995; originally announced October 1995.

Comments: See http://www.jair.org/ for any accompanying files

Journal ref: Journal of Artificial Intelligence Research, Vol 3, (1995), 249-270

Showing 1–12 of 12 results for author: Frasconi, P