-
S4ND: Modeling Images and Videos as Multidimensional Signals Using State Spaces
Authors:
Eric Nguyen,
Karan Goel,
Albert Gu,
Gordon W. Downs,
Preey Shah,
Tri Dao,
Stephen A. Baccus,
Christopher Ré
Abstract:
Visual data such as images and videos are typically modeled as discretizations of inherently continuous, multidimensional signals. Existing continuous-signal models attempt to exploit this fact by modeling the underlying signals of visual (e.g., image) data directly. However, these models have not yet been able to achieve competitive performance on practical vision tasks such as large-scale image and video classification. Building on a recent line of work on deep state space models (SSMs), we propose S4ND, a new multidimensional SSM layer that extends the continuous-signal modeling ability of SSMs to multidimensional data including images and videos. We show that S4ND can model large-scale visual data in $1$D, $2$D, and $3$D as continuous multidimensional signals and demonstrates strong performance by simply swapping Conv2D and self-attention layers with S4ND layers in existing state-of-the-art models. On ImageNet-1k, S4ND exceeds the performance of a Vision Transformer baseline by $1.5\%$ when training with a $1$D sequence of patches, and matches ConvNeXt when modeling images in $2$D. For videos, S4ND improves on an inflated $3$D ConvNeXt in activity classification on HMDB-51 by $4\%$. S4ND implicitly learns global, continuous convolutional kernels that are resolution invariant by construction, providing an inductive bias that enables generalization across multiple resolutions. By developing a simple bandlimiting modification to S4 to overcome aliasing, S4ND achieves strong zero-shot (unseen at training time) resolution performance, outperforming a baseline Conv2D by $40\%$ on CIFAR-10 when trained on $8 \times 8$ and tested on $32 \times 32$ images. When trained with progressive resizing, S4ND comes within $\sim 1\%$ of a high-resolution model while training $22\%$ faster.
Submitted 13 October, 2022; v1 submitted 12 October, 2022;
originally announced October 2022.
-
Topographic control of order in quasi-2D granular phase transitions
Authors:
J. G. Downs,
N. D. Smith,
K. K. Mandadapu,
J. P. Garrahan,
M. I. Smith
Abstract:
We experimentally investigate the nature of 2D phase transitions in a quasi-2D granular fluid. Using a surface decorated with periodically spaced dimples, we observe interfacial tension between coexisting liquid and crystal phases. Measurements of the orientational and translational order parameters and associated susceptibilities indicate that the surface topography alters the order of the phase transition from a two-step continuous one to a first-order liquid-crystal one. The interplay of boundary inelasticity and geometry, either order-promoting or inhibiting, controls the wetting of the granular crystal / fluid. This order-induced wetting has important consequences, determining how coexisting phases separate spatially.
Submitted 17 August, 2021;
originally announced August 2021.
-
Scene Categorization from Contours: Medial Axis Based Salience Measures
Authors:
Morteza Rezanejad,
Gabriel Downs,
John Wilder,
Dirk B. Walther,
Allan Jepson,
Sven Dickinson,
Kaleem Siddiqi
Abstract:
The computer vision community has witnessed recent advances in scene categorization from images, with the state-of-the-art systems now achieving impressive recognition rates on challenging benchmarks such as the Places365 dataset. Such systems have been trained on photographs which include color, texture and shading cues. The geometry of shapes and surfaces, as conveyed by scene contours, is not explicitly considered for this task. Remarkably, humans can accurately recognize natural scenes from line drawings, which consist solely of contour-based shape cues. Here we report the first computer vision study on scene categorization of line drawings derived from popular databases including an artist scene database, MIT67, and Places365. Specifically, we use off-the-shelf pre-trained CNNs to perform scene classification given only contour information as input and find performance levels well above chance. We also show that medial-axis-based contour salience methods can be used to select more informative subsets of contour pixels and that the variation in CNN classification performance on various choices for these subsets is qualitatively similar to that observed in human performance. Moreover, when the salience measures are used to weight the contours, as opposed to pruning them, we find that these weights boost our CNN performance above that for unweighted contour input. That is, the medial-axis-based salience weights appear to add useful information that is not available when CNNs are trained to use contours alone.
Submitted 26 November, 2018;
originally announced November 2018.