-
Binary neutron star mergers using a discontinuous Galerkin-finite difference hybrid method
Authors:
Nils Deppe,
Francois Foucart,
Marceline S. Bonilla,
Michael Boyle,
Nicholas J. Corso,
Matthew D. Duez,
Matthew Giesler,
François Hébert,
Lawrence E. Kidder,
Yoonsoo Kim,
Prayush Kumar,
Isaac Legred,
Geoffrey Lovelace,
Elias R. Most,
Jordan Moxon,
Kyle C. Nelli,
Harald P. Pfeiffer,
Mark A. Scheel,
Saul A. Teukolsky,
William Throwe,
Nils L. Vu
Abstract:
We present a discontinuous Galerkin-finite difference hybrid scheme that allows high-order shock capturing with the discontinuous Galerkin method for general relativistic magnetohydrodynamics in dynamical spacetimes. We present several optimizations and stability improvements to our algorithm that allow the hybrid method to successfully simulate single, rotating, and binary neutron stars. The hybrid method achieves the efficiency of discontinuous Galerkin methods throughout almost the entire spacetime during the inspiral phase, while being able to robustly capture shocks and resolve the stellar surfaces. We also use Cauchy-Characteristic evolution to compute the first gravitational waveforms at future null infinity from binary neutron star mergers. The simulations presented here are the first successful binary neutron star inspiral and merger simulations using discontinuous Galerkin methods.
Submitted 27 June, 2024;
originally announced June 2024.
-
Coronal Heating Rate in the Slow Solar Wind
Authors:
Daniele Telloni,
Marco Romoli,
Marco Velli,
Gary P. Zank,
Laxman Adhikari,
Cooper Downs,
Aleksandr Burtovoi,
Roberto Susino,
Daniele Spadaro,
Lingling Zhao,
Alessandro Liberatore,
Chen Shi,
Yara De Leo,
Lucia Abbo,
Federica Frassati,
Giovanna Jerse,
Federico Landini,
Gianalfredo Nicolini,
Maurizio Pancrazzi,
Giuliana Russano,
Clementina Sasso,
Vincenzo Andretta,
Vania Da Deppo,
Silvano Fineschi,
Catia Grimani, et al. (37 additional authors not shown)
Abstract:
This Letter reports the first observational estimate of the heating rate in the slowly expanding solar corona. The analysis exploits the simultaneous remote and local observations of the same coronal plasma volume with the Solar Orbiter/Metis and the Parker Solar Probe instruments, respectively, and relies on the basic solar wind magnetohydrodynamic equations. As expected, energy losses are a minor fraction of the solar wind energy flux, since most of the energy dissipation that feeds the heating and acceleration of the coronal flow occurs much closer to the Sun than the heights probed in the present study, which range from 6.3 to 13.3 solar radii. The energy deposited to the supersonic wind is then used to explain the observed slight residual wind acceleration and to maintain the plasma in a non-adiabatic state. As derived in the Wentzel-Kramers-Brillouin limit, the present energy transfer rate estimates provide a lower limit, which can be very useful in refining the turbulence-based modeling of coronal heating and subsequent solar wind acceleration.
Submitted 19 June, 2023;
originally announced June 2023.
-
A Long-Baseline Atom Interferometer at CERN: Conceptual Feasibility Study
Authors:
G. Arduini,
L. Badurina,
K. Balazs,
C. Baynham,
O. Buchmueller,
M. Buzio,
S. Calatroni,
J. -P. Corso,
J. Ellis,
Ch. Gaignant,
M. Guinchard,
T. Hakulinen,
R. Hobson,
A. Infantino,
D. Lafarge,
R. Langlois,
C. Marcel,
J. Mitchell,
M. Parodi,
M. Pentella,
D. Valuch,
H. Vincke
Abstract:
We present results from exploratory studies, supported by the Physics Beyond Colliders (PBC) Study Group, of the suitability of a CERN site and its infrastructure for hosting a vertical atom interferometer (AI) with a baseline of about 100 m. We first review the scientific motivations for such an experiment to search for ultralight dark matter and measure gravitational waves, and then outline the general technical requirements for such an atom interferometer, using the AION-100 project as an example. We present a possible CERN site in the PX46 access shaft to the Large Hadron Collider (LHC), including the motivations for this choice and a description of its infrastructure. We then assess its compliance with the technical requirements of such an experiment and what upgrades may be needed. We analyse issues related to the proximity of the LHC machine and its ancillary hardware and present a preliminary safety analysis and the required mitigation measures and infrastructure modifications. In conclusion, we identify primary cost drivers and describe constraints on the experimental installation and operation schedules arising from LHC operation. We find no technical obstacles: the CERN site is a very promising location for an AI experiment with a vertical baseline of about 100 m.
Submitted 2 April, 2023;
originally announced April 2023.
-
FLYEYE family tree, from smart fast cameras to MezzoCielo
Authors:
Roberto Ragazzoni,
Silvio Di Rosa,
Carmelo Arcidiacono,
Marco Dima,
Demetrio Magrin,
Alain J. Corso,
Jacopo Farinato,
Maria Pelizzo,
Giovanni L. Santi,
Matteo Simioni,
Simone Zaggia
Abstract:
We developed game-changing concepts for meter(s) class very-wide-field telescopes, spanning three orders of magnitude of the covered field of view. The concepts include multiple cameras and monocentric systems: from the Smart Fast Cameras (with a quasi-monocentric aperture), through the FlyEye, toward a MezzoCielo concept (both with a truly monocentric aperture). MezzoCielo (or "half of the sky") is the latest concept developed for a new class of telescopes. The concept is based on a fully spherical optical surface filled with a low-refractive-index, high-transparency liquid and surrounded by multiple identical cameras. MezzoCielo can reach fields of view in the range of ten to twenty thousand square degrees.
Submitted 1 February, 2023;
originally announced February 2023.
-
Design of The Kinetic Inductance Detector Based Focal Plane Assembly for The Terahertz Intensity Mapper
Authors:
L. -J. Liu,
R. M. J. Janssen,
C. M. Bradford,
S. Hailey-Dunsheath,
J. Fu,
J. P. Filippini,
J. E. Aguirre,
J. S. Bracks,
A. J. Corso,
C. Groppi,
J. Hoh,
R. P. Keenan,
I. N. Lowe,
D. P. Marrone,
P. Mauskopf,
R. Nie,
J. Redford,
I. Trumper,
J. D. Vieira
Abstract:
We report on the kinetic inductance detector (KID) array focal plane assembly design for the Terahertz Intensity Mapper (TIM). Each of the 2 arrays consists of 4 wafer-sized dies (quadrants), and the overall assembly must satisfy thermal and mechanical requirements, while maintaining high optical efficiency and a suitable electromagnetic environment for the KIDs. In particular, our design manages to strictly maintain a 50 $\mathrm{μm}$ air gap between the array and the horn block. We have prototyped and are now testing a sub-scale assembly which houses a single quadrant for characterization before integration into the full array. The initial test result shows a $>$95\% yield, indicating a good performance of our TIM detector packaging design.
Submitted 17 November, 2022;
originally announced November 2022.
-
Design and testing of Kinetic Inductance Detector package for the Terahertz Intensity Mapper
Authors:
L. -J. Liu,
R. M. J. Janssen,
C. M. Bradford,
S. Hailey-Dunsheath,
J. P. Filippini,
J. E. Aguirre,
J. S. Bracks,
A. J. Corso,
J. Fu,
C. Groppi,
J. Hoh,
R. P. Keenan,
I. N. Lowe,
D. P. Marrone,
P. Mauskopf,
R. Nie,
J. Redford,
I. Trumper,
J. D. Vieira
Abstract:
The Terahertz Intensity Mapper (TIM) is designed to probe the star formation history in dust-obscured star-forming galaxies around the peak of cosmic star formation. This will be done via measurements of the redshifted 157.7 um line of singly ionized carbon ([CII]). TIM employs two R $\sim 250$ long-slit grating spectrometers covering 240-420 um. Each is equipped with a focal plane unit containing 4 wafer-sized subarrays of horn-coupled aluminum kinetic inductance detectors (KIDs). We present the design and performance of a prototype focal plane assembly for one of TIM's KID-based subarrays. Our design strictly maintains high optical efficiency and a suitable electromagnetic environment for the KIDs. The prototype detector housing in combination with the first flight-like quadrant are tested at 250 mK. An initial frequency scan shows that many resonances are affected by collisions and/or very shallow transmission dips as a result of a degraded internal quality factor (Q factor). This is attributed to the presence of an external magnetic field during cooldown. We report on a study of the magnetic field dependence of the Q factor of our quadrant array. We implement a Helmholtz coil to vary the magnetic field at the detectors by (partially) nulling Earth's. Our investigation shows that Earth's magnetic field can significantly affect our KIDs' performance by degrading the Q factor by a factor of 2-5, well below those expected from the operational temperature or optical loading. We find that we can sufficiently recover our detectors' quality factor by tuning the current in the coils to generate a field that matches Earth's magnetic field in magnitude to within a few uT. Therefore, it is necessary to employ a properly designed magnetic shield enclosing the TIM focal plane unit. Based on the results presented in this paper, we set a shielding requirement of |B| < 3 uT.
Submitted 16 November, 2022;
originally announced November 2022.
-
The DAMIC-M Experiment: Status and First Results
Authors:
I. Arnquist,
N. Avalos,
P. Bailly,
D. Baxter,
X. Bertou,
M. Bogdan,
C. Bourgeois,
J. Brandt,
A. Cadiou,
N. Castelló-Mor,
A. E. Chavarria,
M. Conde,
N. J. Corso,
J. Cortabitarte Gutiérrez,
J. Cuevas-Zepeda,
A. Dastgheibi-Fard,
C. De Dominicis,
O. Deligny,
R. Desani,
M. Dhellot,
J-J. Dormard,
J. Duarte-Campderros,
E. Estrada,
D. Florin,
N. Gadola, et al. (47 additional authors not shown)
Abstract:
The DAMIC-M (DArk Matter In CCDs at Modane) experiment employs thick, fully depleted silicon charge-coupled devices (CCDs) to search for dark matter particles with a target exposure of 1 kg-year. A novel skipper readout implemented in the CCDs provides single electron resolution through multiple non-destructive measurements of the individual pixel charge, pushing the detection threshold to the eV-scale. DAMIC-M will advance by several orders of magnitude the exploration of the dark matter particle hypothesis, in particular of candidates pertaining to the so-called "hidden sector." A prototype, the Low Background Chamber (LBC), with 20 g of low background Skipper CCDs, has been recently installed at Laboratoire Souterrain de Modane and is currently taking data. We will report the status of the DAMIC-M experiment and first results obtained with LBC commissioning data.
Submitted 25 November, 2022; v1 submitted 11 October, 2022;
originally announced October 2022.
-
Iterative Vision-and-Language Navigation
Authors:
Jacob Krantz,
Shurjo Banerjee,
Wang Zhu,
Jason Corso,
Peter Anderson,
Stefan Lee,
Jesse Thomason
Abstract:
We present Iterative Vision-and-Language Navigation (IVLN), a paradigm for evaluating language-guided agents navigating in a persistent environment over time. Existing Vision-and-Language Navigation (VLN) benchmarks erase the agent's memory at the beginning of every episode, testing the ability to perform cold-start navigation with no prior information. However, deployed robots occupy the same environment for long periods of time. The IVLN paradigm addresses this disparity by training and evaluating VLN agents that maintain memory across tours of scenes that consist of up to 100 ordered instruction-following Room-to-Room (R2R) episodes, each defined by an individual language instruction and a target path. We present discrete and continuous Iterative Room-to-Room (IR2R) benchmarks comprising about 400 tours each in 80 indoor scenes. We find that extending the implicit memory of high-performing transformer VLN agents is not sufficient for IVLN, but agents that build maps can benefit from environment persistence, motivating a renewed focus on map-building agents in VLN.
Submitted 24 December, 2023; v1 submitted 6 October, 2022;
originally announced October 2022.
-
Swarm of lightsail nanosatellites for Solar System exploration
Authors:
Giovanni Santi,
Alain J. Corso,
Denis Garoli,
Giuseppe Emanuele Lio,
Marco Manente,
Giulio Favaro,
Marco Bazzan,
Giampaolo Piotto,
Nicola Andriolli,
Lucanos Strambini,
Daniele Pavarin,
Leonardo Badia,
Remo Proietti Zaccaria,
Philip Lubin,
Roberto Ragazzoni,
Maria G. Pelizzo
Abstract:
This paper presents a study for the realization of a space mission which employs nanosatellites driven by an external laser source impinging on an optimized lightsail, as a valuable technology to launch swarms of spacecraft into the Solar System. Nanosatellites propelled by laser can be useful for the heliosphere exploration and for planetary observation, if suitably equipped with sensors, or be adopted for the establishment of network systems when placed into specific orbits. By varying the area-to-mass ratio (i.e., the ratio between the sail area and the payload weight) and the laser power, it is possible to insert nanosatellites into different hyperbolic orbits with respect to Earth, thus reaching the target by means of controlled trajectories in a relatively short amount of time. A mission involving nanosatellites of the order of 1 kg of mass is envisioned, by describing all the on-board subsystems and satisfying all the requirements in terms of power and mass budget. Particular attention is paid to the telecommunication subsystem, which must offer all the necessary functionalities. To fabricate the lightsail, thin-film technology has been considered, by verifying the sail thermal stability during the thrust phase. Moreover, the problem of mechanical stability of the lightsail has been tackled, showing that the distance between the lightsail structure and the payload plays a pivotal role. Some potential applications of the proposed technology are discussed, such as the mapping of the heliospheric environment.
Submitted 15 September, 2022; v1 submitted 23 August, 2022;
originally announced August 2022.
-
Learning to Estimate External Forces of Human Motion in Video
Authors:
Nathan Louis,
Tylan N. Templin,
Travis D. Eliason,
Daniel P. Nicolella,
Jason J. Corso
Abstract:
Analyzing sports performance or preventing injuries requires capturing ground reaction forces (GRFs) exerted by the human body during certain movements. Standard practice uses physical markers paired with force plates in a controlled environment, but this is marred by high costs, lengthy implementation time, and variance in repeat experiments; hence, we propose GRF inference from video. While recent work has used LSTMs to estimate GRFs from 2D viewpoints, these can be limited in their modeling and representation capacity. First, we propose using a transformer architecture to tackle the GRF from video task, being the first to do so. Then we introduce a new loss to minimize high impact peaks in regressed curves. We also show that pre-training and multi-task learning on 2D-to-3D human pose estimation improves generalization to unseen motions. And pre-training on this different task provides good initial weights when finetuning on smaller (rarer) GRF datasets. We evaluate on LAAS Parkour and a newly collected ForcePose dataset; we show up to 19% decrease in error compared to prior approaches.
Submitted 12 July, 2022;
originally announced July 2022.
-
Precision measurement of Compton scattering in silicon with a skipper CCD for dark matter detection
Authors:
D. Norcini,
N. Castello-Mor,
D. Baxter,
N. J. Corso,
J. Cuevas-Zepeda,
C. De Dominicis,
A. Matalon,
S. Munagavalasa,
S. Paul,
P. Privitera,
K. Ramanathan,
R. Smida,
R. Thomas,
R. Yajur,
A. E. Chavarria,
K. McGuire,
P. Mitra,
A. Piers,
M. Settimo,
J. Cortabitarte Gutierrez,
J. Duarte-Campderros,
A. Lantero-Barreda,
A. Lopez-Virto,
I. Vila,
R. Vilar, et al. (19 additional authors not shown)
Abstract:
Experiments aiming to directly detect dark matter through particle recoils can achieve energy thresholds of $\mathcal{O}(1\,\mathrm{eV})$. In this regime, ionization signals from small-angle Compton scatters of environmental $γ$-rays constitute a significant background. Monte Carlo simulations used to build background models have not been experimentally validated at these low energies. We report a precision measurement of Compton scattering on silicon atomic shell electrons down to 23$\,$eV. A skipper charge-coupled device (CCD) with single-electron resolution, developed for the DAMIC-M experiment, was exposed to a $^{241}$Am $γ$-ray source over several months. Features associated with the silicon K, L$_{1}$, and L$_{2,3}$-shells are clearly identified, and scattering on valence electrons is detected for the first time below 100$\,$eV. We find that the relativistic impulse approximation for Compton scattering, which is implemented in Monte Carlo simulations commonly used by direct detection experiments, does not reproduce the measured spectrum below 0.5$\,$keV. The data are in better agreement with $ab$ $initio$ calculations originally developed for X-ray absorption spectroscopy.
Submitted 2 July, 2022;
originally announced July 2022.
-
Q-TART: Quickly Training for Adversarial Robustness and in-Transferability
Authors:
Madan Ravi Ganesh,
Salimeh Yasaei Sekeh,
Jason J. Corso
Abstract:
Raw deep neural network (DNN) performance is not enough; in real-world settings, computational load, training efficiency and adversarial security are just as or even more important. We propose to simultaneously tackle Performance, Efficiency, and Robustness, using our proposed algorithm Q-TART, Quickly Train for Adversarial Robustness and in-Transferability. Q-TART follows the intuition that samples highly susceptible to noise strongly affect the decision boundaries learned by DNNs, which in turn degrades their performance and adversarial susceptibility. By identifying and removing such samples, we demonstrate improved performance and adversarial robustness while using only a subset of the training data. Through our experiments we highlight Q-TART's high performance across multiple Dataset-DNN combinations, including ImageNet, and provide insights into the complementary behavior of Q-TART alongside existing adversarial training approaches to increase robustness by over 1.3% while using up to 17.9% less training time.
Submitted 14 April, 2022;
originally announced April 2022.
-
The Forward Physics Facility at the High-Luminosity LHC
Authors:
Jonathan L. Feng,
Felix Kling,
Mary Hall Reno,
Juan Rojo,
Dennis Soldin,
Luis A. Anchordoqui,
Jamie Boyd,
Ahmed Ismail,
Lucian Harland-Lang,
Kevin J. Kelly,
Vishvas Pandey,
Sebastian Trojanowski,
Yu-Dai Tsai,
Jean-Marco Alameddine,
Takeshi Araki,
Akitaka Ariga,
Tomoko Ariga,
Kento Asai,
Alessandro Bacchetta,
Kincso Balazs,
Alan J. Barr,
Michele Battistin,
Jianming Bian,
Caterina Bertone,
Weidong Bai, et al. (211 additional authors not shown)
Abstract:
High energy collisions at the High-Luminosity Large Hadron Collider (LHC) produce a large number of particles along the beam collision axis, outside of the acceptance of existing LHC experiments. The proposed Forward Physics Facility (FPF), to be located several hundred meters from the ATLAS interaction point and shielded by concrete and rock, will host a suite of experiments to probe Standard Model (SM) processes and search for physics beyond the Standard Model (BSM). In this report, we review the status of the civil engineering plans and the experiments to explore the diverse physics signals that can be uniquely probed in the forward region. FPF experiments will be sensitive to a broad range of BSM physics through searches for new particle scattering or decay signatures and deviations from SM expectations in high statistics analyses with TeV neutrinos in this low-background environment. High statistics neutrino detection will also provide valuable data for fundamental topics in perturbative and non-perturbative QCD and in weak interactions. Experiments at the FPF will enable synergies between forward particle production at the LHC and astroparticle physics to be exploited. We report here on these physics topics, on infrastructure, detector, and simulation studies, and on future directions to realize the FPF's physics potential.
Submitted 9 March, 2022;
originally announced March 2022.
-
Exploring the Solar Wind from its Source on the Corona into the Inner Heliosphere during the First Solar Orbiter - Parker Solar Probe Quadrature
Authors:
Daniele Telloni,
Vincenzo Andretta,
Ester Antonucci,
Alessandro Bemporad,
Giuseppe E. Capuano,
Silvano Fineschi,
Silvio Giordano,
Shadia Habbal,
Denise Perrone,
Rui F. Pinto,
Luca Sorriso-Valvo,
Daniele Spadaro,
Roberto Susino,
Lloyd D. Woodham,
Gary P. Zank,
Marco Romoli,
Stuart D. Bale,
Justin C. Kasper,
Frédéric Auchère,
Roberto Bruno,
Gerardo Capobianco,
Anthony W. Case,
Chiara Casini,
Marta Casti,
Paolo Chioetto, et al. (46 additional authors not shown)
Abstract:
This Letter addresses the first Solar Orbiter (SO) -- Parker Solar Probe (PSP) quadrature, occurring on January 18, 2021, to investigate the evolution of solar wind from the extended corona to the inner heliosphere. Assuming ballistic propagation, the same plasma volume observed remotely in corona at altitudes between 3.5 and 6.3 solar radii above the solar limb with the Metis coronagraph on SO can be tracked to PSP, orbiting at 0.1 au, thus allowing the local properties of the solar wind to be linked to the coronal source region from where it originated. Thanks to the close approach of PSP to the Sun and the simultaneous Metis observation of the solar corona, the flow-aligned magnetic field and the bulk kinetic energy flux density can be empirically inferred along the coronal current sheet with an unprecedented accuracy, allowing in particular estimation of the Alfvén radius at 8.7 solar radii during the time of this event. This is thus the very first study of the same solar wind plasma as it expands from the sub-Alfvénic solar corona to just above the Alfvén surface.
Submitted 21 October, 2021;
originally announced October 2021.
-
Evaluating and Improving Interactions with Hazy Oracles
Authors:
Stephan J. Lemmer,
Jason J. Corso
Abstract:
Many AI systems integrate sensor inputs, world knowledge, and human-provided information to perform inference. While such systems often treat the human input as flawless, humans are better thought of as hazy oracles whose input may be ambiguous or outside of the AI system's understanding. In such situations it makes sense for the AI system to defer its inference while it disambiguates the human-provided information by, for example, asking the human to rephrase the query. Though this approach has been considered in the past, current work is typically limited to application-specific methods and non-standardized human experiments. We instead introduce and formalize a general notion of deferred inference. Using this formulation, we then propose a novel evaluation centered around the Deferred Error Volume (DEV) metric, which explicitly considers the tradeoff between error reduction and the additional human effort required to achieve it. We demonstrate this new formalization and an innovative deferred inference method on the disparate tasks of Single-Target Video Object Tracking and Referring Expression Comprehension, ultimately reducing error by up to 48% without any change to the underlying model or its parameters.
Submitted 30 November, 2022; v1 submitted 19 October, 2021;
originally announced October 2021.
-
Cross-View Exocentric to Egocentric Video Synthesis
Authors:
Gaowen Liu,
Hao Tang,
Hugo Latapie,
Jason Corso,
Yan Yan
Abstract:
Cross-view video synthesis task seeks to generate video sequences of one view from another dramatically different view. In this paper, we investigate the exocentric (third-person) view to egocentric (first-person) view video generation task. This is challenging because egocentric view sometimes is remarkably different from the exocentric view. Thus, transforming the appearances across the two different views is a non-trivial task. Particularly, we propose a novel Bi-directional Spatial Temporal Attention Fusion Generative Adversarial Network (STA-GAN) to learn both spatial and temporal information to generate egocentric video sequences from the exocentric view. The proposed STA-GAN consists of three parts: temporal branch, spatial branch, and attention fusion. First, the temporal and spatial branches generate a sequence of fake frames and their corresponding features. The fake frames are generated in both downstream and upstream directions for both temporal and spatial branches. Next, the generated four different fake frames and their corresponding features (spatial and temporal branches in two directions) are fed into a novel multi-generation attention fusion module to produce the final video sequence. Meanwhile, we also propose a novel temporal and spatial dual-discriminator for more robust network optimization. Extensive experiments on the Side2Ego and Top2Ego datasets show that the proposed STA-GAN significantly outperforms the existing methods.
Submitted 7 July, 2021;
originally announced July 2021.
-
Future perspectives in solar hot plasma observations in the soft X-rays
Authors:
Alain Jody Corso,
Giulio Del Zanna,
Vanessa Polito
Abstract:
The soft X-rays (SXRs: 90--150 $Å$) are among the most interesting spectral ranges to be investigated in the next generation of solar missions due to their unique capability of diagnosing phenomena involving hot plasma with temperatures up to 15~MK. Multilayer (ML) coatings are crucial for developing SXR instrumentation, as so far they represent the only viable option for the development of high-efficiency mirrors in this spectral range. However, the current standard MLs are characterized by a very narrow spectral band which is incompatible with the science requirements expected for an SXR spectrometer. Nevertheless, recent advancement in the ML technology has made the development of non-periodic stacks repeatable and reliable, enabling the manufacturing of SXR mirrors with a valuable efficiency over a large range of wavelengths.
In this work, after reviewing the state-of-the-art ML coatings for the SXR range, we investigate the possibility of using M-fold and aperiodic stacks for the development of multiband SXR spectrometers. After selecting a possible choice of key spectral lines, some trade-off studies for an eight-band spectrometer are also presented and discussed, giving an evaluation of their feasibility and potential performance.
Submitted 12 May, 2021;
originally announced May 2021.
-
The DEVIL is in the Details: A Diagnostic Evaluation Benchmark for Video Inpainting
Authors:
Ryan Szeto,
Jason J. Corso
Abstract:
Quantitative evaluation has increased dramatically among recent video inpainting work, but the video and mask content used to gauge performance has received relatively little attention. Although attributes such as camera and background scene motion inherently change the difficulty of the task and affect methods differently, existing evaluation schemes fail to control for them, thereby providing minimal insight into inpainting failure modes. To address this gap, we propose the Diagnostic Evaluation of Video Inpainting on Landscapes (DEVIL) benchmark, which consists of two contributions: (i) a novel dataset of videos and masks labeled according to several key inpainting failure modes, and (ii) an evaluation scheme that samples slices of the dataset characterized by a fixed content attribute, and scores performance on each slice according to reconstruction, realism, and temporal consistency quality. By revealing systematic changes in performance induced by particular characteristics of the input content, our challenging benchmark enables more insightful analysis into video inpainting methods and serves as an invaluable diagnostic tool for the field. Our code and data are available at https://github.com/MichiganCOG/devil .
Submitted 25 April, 2022; v1 submitted 11 May, 2021;
originally announced May 2021.
-
High resolution soft X-ray spectroscopy and the quest for the hot (5-10 MK) plasma in solar active regions
Authors:
G. Del Zanna,
V. Andretta,
P. J. Cargill,
A. J. Corso,
A. N. Daw,
L. Golub,
J. A. Klimchuk,
H. E. Mason
Abstract:
We discuss the diagnostics available to study the 5-10 MK plasma in the solar corona, which is key to understanding the heating in the cores of solar active regions. We present several simulated spectra, and show that excellent diagnostics are available in the soft X-rays, around 100 Angstroms, as six ionisation stages of Fe can simultaneously be observed, and electron densities derived, within a narrow spectral region. As this spectral range is almost unexplored, we present an analysis of available and simulated spectra, to compare the hot emission with the cooler component. We adopt recently designed multilayers to present estimates of count rates in the hot lines, with a baseline spectrometer design. Excellent count rates are found, opening up the exciting opportunity to obtain high-resolution spectroscopy of hot plasma.
Submitted 10 March, 2021;
originally announced March 2021.
-
Depth from Camera Motion and Object Detection
Authors:
Brent A. Griffin,
Jason J. Corso
Abstract:
This paper addresses the problem of learning to estimate the depth of detected objects given some measurement of camera motion (e.g., from robot kinematics or vehicle odometry). We achieve this by 1) designing a recurrent neural network (DBox) that estimates the depth of objects using a generalized representation of bounding boxes and uncalibrated camera movement and 2) introducing the Object Depth via Motion and Detection Dataset (ODMD). ODMD training data are extensible and configurable, and the ODMD benchmark includes 21,600 examples across four validation and test sets. These sets include mobile robot experiments using an end-effector camera to locate objects from the YCB dataset and examples with perturbations added to camera motion or bounding box data. In addition to the ODMD benchmark, we evaluate DBox in other monocular application domains, achieving state-of-the-art results on existing driving and robotics benchmarks and estimating the depth of objects using a camera phone.
Submitted 1 March, 2021;
originally announced March 2021.
-
Temporally Guided Articulated Hand Pose Tracking in Surgical Videos
Authors:
Nathan Louis,
Luowei Zhou,
Steven J. Yule,
Roger D. Dias,
Milisa Manojlovich,
Francis D. Pagani,
Donald S. Likosky,
Jason J. Corso
Abstract:
Articulated hand pose tracking is an under-explored problem that carries the potential for use in an extensive number of applications, especially in the medical domain. With a robust and accurate tracking system on in-vivo surgical videos, the motion dynamics and movement patterns of the hands can be captured and analyzed for many rich tasks. In this work, we propose a novel hand pose estimation model, Res152-CondPose, which improves detection and tracking accuracy by incorporating a hand pose prior into its pose prediction. We show improvements over state-of-the-art methods which provide frame-wise independent predictions, by following a temporally guided approach that effectively leverages past predictions. Additionally, we collect the first dataset, Surgical Hands, that provides multi-instance articulated hand pose annotations for in-vivo videos. Our dataset contains 76 video clips from 28 publicly available surgical videos and over 8.1k annotated hand pose instances. We provide bounding boxes, articulated hand pose annotations, and tracking IDs to enable multi-instance area-based and articulated tracking. When evaluated on Surgical Hands, we show our method outperforms the state-of-the-art method using mean Average Precision (mAP), to measure pose estimation accuracy, and Multiple Object Tracking Accuracy (MOTA), to assess pose tracking performance. Both the code and dataset are available at https://github.com/MichiganCOG/Surgical_Hands_RELEASE.
Submitted 20 October, 2021; v1 submitted 11 January, 2021;
originally announced January 2021.
-
Integrating Human Gaze into Attention for Egocentric Activity Recognition
Authors:
Kyle Min,
Jason J. Corso
Abstract:
It is well known that human gaze carries significant information about visual attention. However, there are three main difficulties in incorporating the gaze data in an attention mechanism of deep neural networks: 1) the gaze fixation points are likely to have measurement errors due to blinking and rapid eye movements; 2) it is unclear when and how much the gaze data is correlated with visual attention; and 3) gaze data is not always available in many real-world situations. In this work, we introduce an effective probabilistic approach to integrate human gaze into spatiotemporal attention for egocentric activity recognition. Specifically, we represent the locations of gaze fixation points as structured discrete latent variables to model their uncertainties. In addition, we model the distribution of gaze fixations using a variational method. The gaze distribution is learned during the training process so that the ground-truth annotations of gaze locations are no longer needed in testing situations since they are predicted from the learned gaze distribution. The predicted gaze locations are used to provide informative attentional cues to improve the recognition performance. Our method outperforms all the previous state-of-the-art approaches on EGTEA, which is a large-scale dataset for egocentric activity recognition provided with gaze measurements. We also perform an ablation study and qualitative analysis to demonstrate that our attention mechanism is effective.
Submitted 8 November, 2020;
originally announced November 2020.
-
The RobotSlang Benchmark: Dialog-guided Robot Localization and Navigation
Authors:
Shurjo Banerjee,
Jesse Thomason,
Jason J. Corso
Abstract:
Autonomous robot systems for applications from search and rescue to assistive guidance should be able to engage in natural language dialog with people. To study such cooperative communication, we introduce Robot Simultaneous Localization and Mapping with Natural Language (RobotSlang), a benchmark of 169 natural language dialogs between a human Driver controlling a robot and a human Commander providing guidance towards navigation goals. In each trial, the pair first cooperates to localize the robot on a global map visible to the Commander, then the Driver follows Commander instructions to move the robot to a sequence of target objects. We introduce a Localization from Dialog History (LDH) and a Navigation from Dialog History (NDH) task where a learned agent is given dialog and visual observations from the robot platform as input and must localize in the global map or navigate towards the next target object, respectively. RobotSlang is comprised of nearly 5k utterances and over 1k minutes of robot camera and control streams. We present an initial model for the NDH task, and show that an agent trained in simulation can follow the RobotSlang dialog-based navigation instructions for controlling a physical robot platform. Code and data are available at https://umrobotslang.github.io/.
Submitted 23 October, 2020;
originally announced October 2020.
-
Mirrors for space telescopes: degradation issues
Authors:
D. Garoli,
L. V. Rodriguez De Marcos,
J. I. Larruquert,
A. J. Corso,
R. Proietti Zaccaria,
M. G. Pelizzo
Abstract:
Mirrors are a subset of optical components essential for the success of current and future space missions. Most of the telescopes for space programs ranging from Earth Observation to Astrophysics and covering all the electromagnetic spectrum from X-rays to Far-Infrared are based on reflective optics. Mirrors operate in diverse and harsh environments that range from Low-Earth Orbit to interplanetary orbits and deep space. The operational life of space observatories spans from minutes (sounding rockets) to decades (large observatories), and the performance of the mirrors within the optical system is susceptible to degradation, which results in a transient optical efficiency of the instrument. The degradation that occurs in space environments depends on the operational life and on the orbital properties of the space mission, and it reduces the total system throughput and hence compromises the science return. Therefore, knowledge of potential physical degradation mechanisms, how they affect mirror performance, and how to prevent them is of paramount importance to ensure the long-term success of space telescopes. In this review we report an overview of current mirror technology for space missions with a particular focus on the importance of degradation and radiation resistance of the coating materials. Particular detail will be given to degradation effects on mirrors for the far and extreme UV, as in these ranges the degradation is enhanced by the strong absorption of most contaminants.
Submitted 30 September, 2020;
originally announced October 2020.
-
Ground-truth or DAER: Selective Re-query of Secondary Information
Authors:
Stephan J. Lemmer,
Jason J. Corso
Abstract:
Many vision tasks use secondary information at inference time -- a seed -- to assist a computer vision model in solving a problem. For example, an initial bounding box is needed to initialize visual object tracking. To date, all such work makes the assumption that the seed is a good one. However, in practice, from crowdsourcing to noisy automated seeds, this is often not the case. We hence propose the problem of seed rejection -- determining whether to reject a seed based on the expected performance degradation when it is provided in place of a gold-standard seed. We provide a formal definition to this problem, and focus on two meaningful subgoals: understanding causes of error and understanding the model's response to noisy seeds conditioned on the primary input. With these goals in mind, we propose a novel training method and evaluation metrics for the seed rejection problem. We then use seeded versions of the viewpoint estimation and fine-grained classification tasks to evaluate these contributions. In these experiments, we show our method can reduce the number of seeds that need to be reviewed for a target performance by over 23% compared to strong baselines.
Submitted 2 September, 2021; v1 submitted 15 September, 2020;
originally announced September 2020.
-
Detection and Localization of Robotic Tools in Robot-Assisted Surgery Videos Using Deep Neural Networks for Region Proposal and Detection
Authors:
Duygu Sarikaya,
Jason J. Corso,
Khurshid A. Guru
Abstract:
Video understanding of robot-assisted surgery (RAS) videos is an active research area. Modeling the gestures and skill level of surgeons presents an interesting problem. The insights drawn may be applied in effective skill acquisition, objective skill assessment, real-time feedback, and human-robot collaborative surgeries. We propose a solution to the tool detection and localization open problem in RAS video understanding, using a strictly computer vision approach and the recent advances of deep learning. We propose an architecture using multimodal convolutional neural networks for fast detection and localization of tools in RAS videos. To our knowledge, this approach will be the first to incorporate deep neural networks for tool detection and localization in RAS videos. Our architecture applies a Region Proposal Network (RPN), and a multi-modal two stream convolutional network for object detection, to jointly predict objectness and localization on a fusion of image and temporal motion cues. Our results with an Average Precision (AP) of 91% and a mean computation time of 0.1 seconds per test frame detection indicate that our study is superior to conventionally used methods for medical imaging while also emphasizing the benefits of using RPN for precision and efficiency. We also introduce a new dataset, ATLAS Dione, for RAS video understanding. Our dataset provides video data of ten surgeons from Roswell Park Cancer Institute (RPCI) (Buffalo, NY) performing six different surgical tasks on the daVinci Surgical System (dVSS®) with annotations of robotic tools per frame.
Submitted 29 July, 2020;
originally announced August 2020.
-
Adversarial Background-Aware Loss for Weakly-supervised Temporal Activity Localization
Authors:
Kyle Min,
Jason J. Corso
Abstract:
Temporally localizing activities within untrimmed videos has been extensively studied in recent years. Despite recent advances, existing methods for weakly-supervised temporal activity localization struggle to recognize when an activity is not occurring. To address this issue, we propose a novel method named A2CL-PT. Two triplets of the feature space are considered in our approach: one triplet is used to learn discriminative features for each activity class, and the other one is used to distinguish the features where no activity occurs (i.e. background features) from activity-related features for each video. To further improve the performance, we build our network using two parallel branches which operate in an adversarial way: the first branch localizes the most salient activities of a video and the second one finds other supplementary activities from non-localized parts of the video. Extensive experiments performed on THUMOS14 and ActivityNet datasets demonstrate that our proposed method is effective. Specifically, the average mAP of IoU thresholds from 0.1 to 0.9 on the THUMOS14 dataset is significantly improved from 27.9% to 30.0%.
Submitted 13 July, 2020;
originally announced July 2020.
-
Learning Object Depth from Camera Motion and Video Object Segmentation
Authors:
Brent A. Griffin,
Jason J. Corso
Abstract:
Video object segmentation, i.e., the separation of a target object from background in video, has made significant progress on real and challenging videos in recent years. To leverage this progress in 3D applications, this paper addresses the problem of learning to estimate the depth of segmented objects given some measurement of camera motion (e.g., from robot kinematics or vehicle odometry). We achieve this by, first, introducing a diverse, extensible dataset and, second, designing a novel deep network that estimates the depth of objects using only segmentation masks and uncalibrated camera movement. Our data-generation framework creates artificial object segmentations that are scaled for changes in distance between the camera and object, and our network learns to estimate object depth even with segmentation errors. We demonstrate our approach across domains using a robot camera to locate objects from the YCB dataset and a vehicle camera to locate obstacles while driving.
Submitted 18 December, 2020; v1 submitted 10 July, 2020;
originally announced July 2020.
-
Slimming Neural Networks using Adaptive Connectivity Scores
Authors:
Madan Ravi Ganesh,
Dawsin Blanchard,
Jason J. Corso,
Salimeh Yasaei Sekeh
Abstract:
In general, deep neural network (DNN) pruning methods fall into two categories: 1) Weight-based deterministic constraints, and 2) Probabilistic frameworks. While each approach has its merits and limitations, there is a set of common practical issues, such as trial-and-error to analyze sensitivity and hyper-parameters to prune DNNs, which plague them both. In this work, we propose a new single-shot, fully automated pruning algorithm called Slimming Neural networks using Adaptive Connectivity Scores (SNACS). Our proposed approach combines a probabilistic pruning framework with constraints on the underlying weight matrices, via a novel connectivity measure, at multiple levels to capitalize on the strengths of both approaches while solving their deficiencies. In SNACS, we propose a fast hash-based estimator of Adaptive Conditional Mutual Information (ACMI), which uses a weight-based scaling criterion, to evaluate the connectivity between filters and prune unimportant ones. To automatically determine the limit up to which a layer can be pruned, we propose a set of operating constraints that jointly define the upper pruning percentage limits across all the layers in a deep network. Finally, we define a novel sensitivity criterion for filters that measures the strength of their contributions to the succeeding layer and highlights critical filters that need to be completely protected from pruning. Through our experimental validation we show that SNACS is over 17x faster than the nearest comparable method and is the state-of-the-art single-shot pruning method across three standard Dataset-DNN pruning benchmarks: CIFAR10-VGG16, CIFAR10-ResNet56 and ILSVRC2012-ResNet50.
Submitted 17 December, 2021; v1 submitted 22 June, 2020;
originally announced June 2020.
-
Novel Object Viewpoint Estimation through Reconstruction Alignment
Authors:
Mohamed El Banani,
Jason J. Corso,
David F. Fouhey
Abstract:
The goal of this paper is to estimate the viewpoint for a novel object. Standard viewpoint estimation approaches generally fail on this task due to their reliance on a 3D model for alignment or large amounts of class-specific training data and their corresponding canonical pose. We overcome those limitations by learning a reconstruct and align approach. Our key insight is that although we do not have an explicit 3D model or a predefined canonical pose, we can still learn to estimate the object's shape in the viewer's frame and then use an image to provide our reference model or canonical pose. In particular, we propose learning two networks: the first maps images to a 3D geometry-aware feature bottleneck and is trained via an image-to-image translation loss; the second learns whether two instances of features are aligned. At test time, our model finds the relative transformation that best aligns the bottleneck features of our test image to a reference image. We evaluate our method on novel object viewpoint estimation by generalizing across different datasets, analyzing the impact of our different modules, and providing a qualitative analysis of the learned features to identify what representations are being learnt for alignment.
Submitted 5 June, 2020;
originally announced June 2020.
-
MINT: Deep Network Compression via Mutual Information-based Neuron Trimming
Authors:
Madan Ravi Ganesh,
Jason J. Corso,
Salimeh Yasaei Sekeh
Abstract:
Most approaches to deep neural network compression via pruning either evaluate a filter's importance using its weights or optimize an alternative objective function with sparsity constraints. While these methods offer a useful way to approximate contributions from similar filters, they often either ignore the dependency between layers or solve a more difficult optimization objective than standard cross-entropy. Our method, Mutual Information-based Neuron Trimming (MINT), approaches deep compression via pruning by enforcing sparsity based on the strength of the relationship between filters of adjacent layers, across every pair of layers. The relationship is calculated using conditional geometric mutual information which evaluates the amount of similar information exchanged between the filters using a graph-based criterion. When pruning a network, we ensure that retained filters contribute the majority of the information towards succeeding layers which ensures high performance. Our novel approach outperforms existing state-of-the-art compression-via-pruning methods on the standard benchmarks for this task: MNIST, CIFAR-10, and ILSVRC2012, across a variety of network architectures. In addition, we discuss our observations of a common denominator between our pruning methodology's response to adversarial attacks and calibration statistics when compared to the original network.
Submitted 18 March, 2020;
originally announced March 2020.
-
Rethinking Curriculum Learning with Incremental Labels and Adaptive Compensation
Authors:
Madan Ravi Ganesh,
Jason J. Corso
Abstract:
Like humans, deep networks have been shown to learn better when samples are organized and introduced in a meaningful order or curriculum. Conventional curriculum learning schemes introduce samples in their order of difficulty. This forces models to begin learning from a subset of the available data while adding the external overhead of evaluating the difficulty of samples. In this work, we propose Learning with Incremental Labels and Adaptive Compensation (LILAC), a two-phase method that incrementally increases the number of unique output labels rather than the difficulty of samples while consistently using the entire dataset throughout training. In the first phase, Incremental Label Introduction, we partition data into mutually exclusive subsets, one that contains a subset of the ground-truth labels and another that contains the remaining data attached to a pseudo-label. Throughout the training process, we recursively reveal unseen ground-truth labels in fixed increments until all the labels are known to the model. In the second phase, Adaptive Compensation, we optimize the loss function using altered target vectors of previously misclassified samples. The target vectors of such samples are modified to a smoother distribution to help models learn better. On evaluating across three standard image benchmarks, CIFAR-10, CIFAR-100, and STL-10, we show that LILAC outperforms all comparable baselines. Further, we detail the importance of pacing the introduction of new labels to a model as well as the impact of using a smooth target vector.
Submitted 13 August, 2020; v1 submitted 13 January, 2020;
originally announced January 2020.
-
HyperCon: Image-To-Video Model Transfer for Video-To-Video Translation Tasks
Authors:
Ryan Szeto,
Mostafa El-Khamy,
Jungwon Lee,
Jason J. Corso
Abstract:
Video-to-video translation is more difficult than image-to-image translation due to the temporal consistency problem that, if unaddressed, leads to distracting flickering effects. Although video models designed from scratch produce temporally consistent results, training them to match the vast visual knowledge captured by image models requires an intractable number of videos. To combine the benefits of image and video models, we propose an image-to-video model transfer method called Hyperconsistency (HyperCon) that transforms any well-trained image model into a temporally consistent video model without fine-tuning. HyperCon works by translating a temporally interpolated video frame-wise and then aggregating over temporally localized windows on the interpolated video. It handles both masked and unmasked inputs, enabling support for even more video-to-video translation tasks than prior image-to-video model transfer techniques. We demonstrate HyperCon on video style transfer and inpainting, where it performs favorably compared to prior state-of-the-art methods without training on a single stylized or incomplete video. Our project website is available at https://ryanszeto.com/projects/hypercon .
Submitted 10 November, 2020; v1 submitted 10 December, 2019;
originally announced December 2019.
-
Metis: the Solar Orbiter visible light and ultraviolet coronal imager
Authors:
Ester Antonucci,
Marco Romoli,
Vincenzo Andretta,
Silvano Fineschi,
Petr Heinzel,
J. Daniel Moses,
Giampiero Naletto,
Gianalfredo Nicolini,
Daniele Spadaro,
Luca Teriaca,
Arkadiusz Berlicki,
Gerardo Capobianco,
Giuseppe Crescenzio,
Vania Da Deppo,
Mauro Focardi,
Fabio Frassetto,
Klaus Heerlein,
Federico Landini,
Enrico Magli,
Andrea Marco Malvezzi,
Giuseppe Massone,
Radek Melich,
Piergiorgio Nicolosi,
Giancarlo Noci,
Maurizio Pancrazzi, et al. (78 additional authors not shown)
Abstract:
Metis is the first solar coronagraph designed for a space mission capable of performing simultaneous imaging of the off-limb solar corona in both visible and UV light. The observations obtained with Metis aboard the Solar Orbiter ESA-NASA observatory will enable us to diagnose, with unprecedented temporal coverage and spatial resolution, the structures and dynamics of the full corona from 1.7 $R_\odot$ to about 9 $R_\odot$. Due to the uniqueness of the Solar Orbiter mission profile, Metis will be able to observe the solar corona from a close vantage point (down to 0.28 AU), achieving out-of-ecliptic views with the increase of the orbit inclination over time. Moreover, observations near perihelion, during the phase of lower rotational velocity of the solar surface relative to the spacecraft, will allow longer-term studies of the coronal features. Thanks to a novel occultation design and a combination of a UV interference coating of the mirrors and a spectral bandpass filter, Metis images the solar corona simultaneously in the visible light band, between 580 and 640 nm, and in the UV H I Lyman-α line at 121.6 nm. The coronal images in both the UV Lyman-α and polarised visible light are obtained at high spatial resolution with a spatial scale down to about 2000 km and 15000 km at perihelion, in the cases of the visible and UV light, respectively. A temporal resolution down to 1 second can be achieved when observing coronal fluctuations in visible light. The Metis measurements will allow for complete characterisation of the main physical parameters and dynamics of the electron and neutral hydrogen/proton plasma components of the corona in the region where the solar wind undergoes acceleration and where the onset and initial propagation of coronal mass ejections take place, thus significantly improving our understanding of the region connecting the Sun to the heliosphere.
Submitted 14 November, 2019;
originally announced November 2019.
-
ViP: Video Platform for PyTorch
Authors:
Madan Ravi Ganesh,
Eric Hofesmann,
Nathan Louis,
Jason Corso
Abstract:
This work presents the Video Platform for PyTorch (ViP), a deep learning-based framework designed to handle and extend to any problem domain based on videos. ViP supports (1) a single unified interface applicable to all video problem domains, (2) quick prototyping of video models, (3) executing large-batch operations with reduced memory consumption, and (4) easy and reproducible experimental setups. ViP's core functionality is built with flexibility and modularity in mind to allow for smooth data flow between different parts of the platform and benchmarking against existing methods. In providing a software platform that supports multiple video-based problem domains, we allow for more cross-pollination of models, ideas and stronger generalization in the video understanding research community.
Submitted 7 October, 2019;
originally announced October 2019.
-
A Geometric Approach to Online Streaming Feature Selection
Authors:
Salimeh Yasaei Sekeh,
Madan Ravi Ganesh,
Shurjo Banerjee,
Jason J. Corso,
Alfred O. Hero
Abstract:
Online Streaming Feature Selection (OSFS) is a sequential learning problem where individual features across all samples are made available to algorithms in a streaming fashion. In this work, firstly, we assert that OSFS's main assumption of having data from all the samples available at runtime is unrealistic and introduce a new setting, called OSFS with Streaming Samples (OSFS-SS), where features and samples are streamed concurrently. Secondly, the primary OSFS method, SAOLA, utilizes an unbounded mutual information measure and requires multiple comparison steps between the stored and incoming feature sets to evaluate a feature's importance. We introduce Geometric Online Adaption (GOA), an algorithm that requires fewer feature comparison steps and uses a bounded conditional geometric dependency measure. Our algorithm outperforms several OSFS baselines including SAOLA on a variety of datasets. We also extend SAOLA to work in the OSFS-SS setting and show that GOA continues to achieve the best results. Thirdly, the current paradigm of the OSFS algorithm comparison is flawed. Algorithms are measured by comparing the number of features used and the accuracy obtained by the learner, two properties that are fundamentally at odds with one another. Without fixing a limit on either of these properties, the qualities of features obtained by different algorithms are incomparable. We try to rectify this inconsistency by fixing the maximum number of features available to the learner and comparing algorithms in terms of their accuracy. Additionally, we characterize the behaviour of SAOLA and GOA on feature sets derived from popular deep convolutional featurizers.
Submitted 16 March, 2020; v1 submitted 2 October, 2019;
originally announced October 2019.
-
Unified Vision-Language Pre-Training for Image Captioning and VQA
Authors:
Luowei Zhou,
Hamid Palangi,
Lei Zhang,
Houdong Hu,
Jason J. Corso,
Jianfeng Gao
Abstract:
This paper presents a unified Vision-Language Pre-training (VLP) model. The model is unified in that (1) it can be fine-tuned for either vision-language generation (e.g., image captioning) or understanding (e.g., visual question answering) tasks, and (2) it uses a shared multi-layer transformer network for both encoding and decoding, which differs from many existing methods where the encoder and decoder are implemented using separate models. The unified VLP model is pre-trained on a large amount of image-text pairs using the unsupervised learning objectives of two tasks: bidirectional and sequence-to-sequence (seq2seq) masked vision-language prediction. The two tasks differ solely in what context the prediction conditions on. This is controlled by utilizing specific self-attention masks for the shared transformer network. To the best of our knowledge, VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions, and VQA 2.0. The code and the pre-trained models are available at https://github.com/LuoweiZhou/VLP.
Submitted 4 December, 2019; v1 submitted 24 September, 2019;
originally announced September 2019.
-
TASED-Net: Temporally-Aggregating Spatial Encoder-Decoder Network for Video Saliency Detection
Authors:
Kyle Min,
Jason J. Corso
Abstract:
TASED-Net is a 3D fully-convolutional network architecture for video saliency detection. It consists of two building blocks: first, the encoder network extracts low-resolution spatiotemporal features from an input clip of several consecutive frames, and then the following prediction network decodes the encoded features spatially while aggregating all the temporal information. As a result, a single prediction map is produced from an input clip of multiple frames. Frame-wise saliency maps can be predicted by applying TASED-Net in a sliding-window fashion to a video. The proposed approach assumes that the saliency map of any frame can be predicted by considering a limited number of past frames. The results of our extensive experiments on video saliency detection validate this assumption and demonstrate that our fully-convolutional model with temporal aggregation method is effective. TASED-Net significantly outperforms previous state-of-the-art approaches on all three major large-scale datasets of video saliency detection: DHF1K, Hollywood2, and UCFSports. After analyzing the results qualitatively, we observe that our model is especially better at attending to salient moving objects.
Submitted 15 August, 2019;
originally announced August 2019.
-
Multi-Channel Attention Selection GAN with Cascaded Semantic Guidance for Cross-View Image Translation
Authors:
Hao Tang,
Dan Xu,
Nicu Sebe,
Yanzhi Wang,
Jason J. Corso,
Yan Yan
Abstract:
Cross-view image translation is challenging because it involves images with drastically different views and severe deformation. In this paper, we propose a novel approach named Multi-Channel Attention SelectionGAN (SelectionGAN) that makes it possible to generate images of natural scenes in arbitrary viewpoints, based on an image of the scene and a novel semantic map. The proposed SelectionGAN explicitly utilizes the semantic information and consists of two stages. In the first stage, the condition image and the target semantic map are fed into a cycled semantic-guided generation network to produce initial coarse results. In the second stage, we refine the initial results by using a multi-channel attention selection mechanism. Moreover, uncertainty maps automatically learned from attentions are used to guide the pixel loss for better network optimization. Extensive experiments on Dayton, CVUSA and Ego2Top datasets show that our model is able to generate significantly better results than the state-of-the-art methods. The source code, data and trained models are available at https://github.com/Ha0Tang/SelectionGAN.
Submitted 16 April, 2019; v1 submitted 14 April, 2019;
originally announced April 2019.
-
Robot-Supervised Learning for Object Segmentation
Authors:
Victoria Florence,
Jason J. Corso,
Brent Griffin
Abstract:
To be effective in unstructured and changing environments, robots must learn to recognize new objects. Deep learning has enabled rapid progress for object detection and segmentation in computer vision; however, this progress comes at the price of human annotators labeling many training examples. This paper addresses the problem of extending learning-based segmentation methods to robotics applications where annotated training data is not available. Our method enables pixelwise segmentation of grasped objects. We factor the problem of segmenting the object from the background into two sub-problems: (1) segmenting the robot manipulator and object from the background and (2) segmenting the object from the manipulator. We propose a kinematics-based foreground segmentation technique to solve (1). To solve (2), we train a self-recognition network that segments the robot manipulator. We train this network without human supervision, leveraging our foreground segmentation technique from (1) to label a training set of images containing the robot manipulator without a grasped object. We demonstrate experimentally that our method outperforms state-of-the-art adaptable in-hand object segmentation. We also show that a training set composed of automatically labelled images of grasped objects improves segmentation performance on a test set of images of the same objects in the environment.
Submitted 4 March, 2020; v1 submitted 1 April, 2019;
originally announced April 2019.
-
BubbleNets: Learning to Select the Guidance Frame in Video Object Segmentation by Deep Sorting Frames
Authors:
Brent A. Griffin,
Jason J. Corso
Abstract:
Semi-supervised video object segmentation has made significant progress on real and challenging videos in recent years. The current paradigm for segmentation methods and benchmark datasets is to segment objects in video provided a single annotation in the first frame. However, we find that segmentation performance across the entire video varies dramatically when selecting an alternative frame for annotation. This paper addresses the problem of learning to suggest the single best frame across the video for user annotation (this is, in fact, never the first frame of video). We achieve this by introducing BubbleNets, a novel deep sorting network that learns to select frames using a performance-based loss function that enables the conversion of expansive amounts of training examples from already existing datasets. Using BubbleNets, we are able to achieve an 11% relative improvement in segmentation performance on the DAVIS benchmark without any changes to the underlying method of segmentation.
Submitted 23 November, 2020; v1 submitted 27 March, 2019;
originally announced March 2019.
-
Video Object Segmentation-based Visual Servo Control and Object Depth Estimation on a Mobile Robot
Authors:
Brent A. Griffin,
Victoria Florence,
Jason J. Corso
Abstract:
To be useful in everyday environments, robots must be able to identify and locate real-world objects. In recent years, video object segmentation has made significant progress on densely separating such objects from background in real and challenging videos. Building off of this progress, this paper addresses the problem of identifying generic objects and locating them in 3D using a mobile robot with an RGB camera. We achieve this by, first, introducing a video object segmentation-based approach to visual servo control and active perception and, second, developing a new Hadamard-Broyden update formulation. Our segmentation-based methods are simple but effective, and our update formulation lets a robot quickly learn the relationship between actuators and visual features without any camera calibration. We validate our approach in experiments by learning a variety of actuator-camera configurations on a mobile HSR robot, which subsequently identifies, locates, and grasps objects from the YCB dataset and tracks people and other dynamic articulated objects in real-time.
Submitted 9 January, 2020; v1 submitted 20 March, 2019;
originally announced March 2019.
-
Attribute-Guided Sketch Generation
Authors:
Hao Tang,
Xinya Chen,
Wei Wang,
Dan Xu,
Jason J. Corso,
Nicu Sebe,
Yan Yan
Abstract:
Facial attributes are important since they provide a detailed description and determine the visual appearance of human faces. In this paper, we aim at converting a face image to a sketch while simultaneously generating facial attributes. To this end, we propose a novel Attribute-Guided Sketch Generative Adversarial Network (ASGAN) which is an end-to-end framework and contains two pairs of generators and discriminators, one of which is used to generate faces with attributes while the other one is employed for image-to-sketch translation. The two generators form a W-shaped network (W-net) and they are trained jointly with a weight-sharing constraint. Additionally, we propose two novel discriminators, the residual one focusing on attribute generation and the triplex one helping to generate realistic-looking sketches. To validate our model, we have created a new large dataset with 8,804 images, named the Attribute Face Photo & Sketch (AFPS) dataset, which is the first dataset containing attributes associated with face sketch images. The experimental results demonstrate that the proposed network (i) generates more photo-realistic faces with sharper facial attributes than baselines and (ii) has good generalization capability on different generative tasks.
Submitted 14 April, 2019; v1 submitted 28 January, 2019;
originally announced January 2019.
-
Kinematically-Informed Interactive Perception: Robot-Generated 3D Models for Classification
Authors:
Abhishek Venkataraman,
Brent Griffin,
Jason J. Corso
Abstract:
To be useful in everyday environments, robots must be able to observe and learn about objects. Recent datasets enable progress for classifying data into known object categories; however, it is unclear how to collect reliable object data when operating in cluttered, partially-observable environments. In this paper, we address the problem of building complete 3D models for real-world objects using a robot platform, which can remove objects from clutter for better classification. Furthermore, we are able to learn entirely new object categories as they are encountered, enabling the robot to classify previously unidentifiable objects during future interactions. We build models of grasped objects using simultaneous manipulation and observation, and we guide the processing of visual data using a kinematic description of the robot to combine observations from different view-points and remove background noise. To test our framework, we use a mobile manipulation robot equipped with an RGBD camera to build voxelized representations of unknown objects and then classify them into new categories. We then have the robot remove objects from clutter to manipulate, observe, and classify them in real-time.
Submitted 16 January, 2019;
originally announced January 2019.
-
Grounded Video Description
Authors:
Luowei Zhou,
Yannis Kalantidis,
Xinlei Chen,
Jason J. Corso,
Marcus Rohrbach
Abstract:
Video description is one of the most challenging problems in vision and language understanding due to the large variability both on the video and language side. Models, hence, typically shortcut the difficulty in recognition and generate plausible sentences that are based on priors but are not necessarily grounded in the video. In this work, we explicitly link the sentence to the evidence in the video by annotating each noun phrase in a sentence with the corresponding bounding box in one of the frames of a video. Our dataset, ActivityNet-Entities, augments the challenging ActivityNet Captions dataset with 158k bounding box annotations, each grounding a noun phrase. This allows training video description models with this data and, importantly, evaluating how grounded or "true" such models are to the video they describe. To generate grounded captions, we propose a novel video description model which is able to exploit these bounding box annotations. We demonstrate the effectiveness of our model on our dataset, but also show how it can be applied to image description on the Flickr30k Entities dataset. We achieve state-of-the-art performance on video description, video paragraph description, and image description and demonstrate our generated sentences are better grounded in the video.
Submitted 5 May, 2019; v1 submitted 16 December, 2018;
originally announced December 2018.
-
Dynamic Graph Modules for Modeling Object-Object Interactions in Activity Recognition
Authors:
Hao Huang,
Luowei Zhou,
Wei Zhang,
Jason J. Corso,
Chenliang Xu
Abstract:
Video action recognition, a critical problem in video understanding, has been gaining increasing attention. To identify actions induced by complex object-object interactions, we need to consider not only spatial relations among objects in a single frame, but also temporal relations among different or the same objects across multiple frames. However, existing approaches that model video representations and non-local features are either incapable of explicitly modeling relations at the object-object level or unable to handle streaming videos. In this paper, we propose a novel dynamic hidden graph module to model complex object-object interactions in videos, of which two instantiations are considered: a visual graph that captures appearance/motion changes among objects and a location graph that captures relative spatiotemporal position changes among objects. Additionally, the proposed graph module allows us to process streaming videos, setting it apart from existing methods. Experimental results on benchmark datasets, Something-Something and ActivityNet, show the competitive performance of our method.
Submitted 7 May, 2019; v1 submitted 13 December, 2018;
originally announced December 2018.
-
Tukey-Inspired Video Object Segmentation
Authors:
Brent A. Griffin,
Jason J. Corso
Abstract:
We investigate the problem of strictly unsupervised video object segmentation, i.e., the separation of a primary object from background in video without a user-provided object mask or any training on an annotated dataset. We find foreground objects in low-level vision data using a John Tukey-inspired measure of "outlierness". This Tukey-inspired measure also estimates the reliability of each data source as video characteristics change (e.g., a camera starts moving). The proposed method achieves state-of-the-art results for strictly unsupervised video object segmentation on the challenging DAVIS dataset. Finally, we use a variant of the Tukey-inspired measure to combine the output of multiple segmentation methods, including those using supervision during training, runtime, or both. This collectively more robust method of segmentation improves the Jaccard measure of its constituent methods by as much as 28%.
Submitted 29 November, 2018; v1 submitted 19 November, 2018;
originally announced November 2018.
-
Floyd-Warshall Reinforcement Learning: Learning from Past Experiences to Reach New Goals
Authors:
Vikas Dhiman,
Shurjo Banerjee,
Jeffrey M. Siskind,
Jason J. Corso
Abstract:
Consider multi-goal tasks that involve static environments and dynamic goals. Examples of such tasks, such as goal-directed navigation and pick-and-place in robotics, abound. Two types of Reinforcement Learning (RL) algorithms are used for such tasks: model-free or model-based. Each of these approaches has limitations. Model-free RL struggles to transfer learned information when the goal location changes, but achieves high asymptotic accuracy in single goal tasks. Model-based RL can transfer learned information to new goal locations by retaining the explicitly learned state-dynamics, but is limited by the fact that small errors in modelling these dynamics accumulate over long-term planning. In this work, we improve upon the limitations of model-free RL in multi-goal domains. We do this by adapting the Floyd-Warshall algorithm for RL and call the adaptation Floyd-Warshall RL (FWRL). The proposed algorithm learns a goal-conditioned action-value function by constraining the value of the optimal path between any two states to be greater than or equal to the value of paths via intermediary states. Experimentally, we show that FWRL is more sample-efficient and learns higher reward strategies in multi-goal tasks as compared to Q-learning, model-based RL and other relevant baselines in a tabular domain.
Submitted 4 January, 2019; v1 submitted 25 September, 2018;
originally announced September 2018.
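The path constraint stated in the FWRL abstract above (the value of the best path between two states is at least the value of any path through an intermediary state) is the classic Floyd-Warshall relaxation with max in place of min. The sketch below applies it to a small tabular state-to-state value table. It is a simplified illustration, not the authors' algorithm: the function name, the state-to-state table (rather than a goal-conditioned action-value function), and the example rewards are all assumptions.

```python
import math


def floyd_warshall_values(direct):
    """Relax a table of path values so that value[i][j] >= value[i][k] + value[k][j].

    `direct[i][j]` is the known value (sum of rewards) of going from state i
    to state j directly, or -inf if no direct transition has been observed.
    """
    n = len(direct)
    value = [row[:] for row in direct]  # copy so the input is not modified
    for k in range(n):          # candidate intermediary state
        for i in range(n):      # start state
            for j in range(n):  # goal state
                via_k = value[i][k] + value[k][j]
                if via_k > value[i][j]:
                    value[i][j] = via_k
    return value


NEG_INF = -math.inf
# Hypothetical 3-state chain 0 -> 1 -> 2 with a reward of -1 per step.
direct = [
    [0.0, -1.0, NEG_INF],
    [NEG_INF, 0.0, -1.0],
    [NEG_INF, NEG_INF, 0.0],
]
values = floyd_warshall_values(direct)
```

After relaxation, the value of reaching state 2 from state 0 is filled in from the two observed one-step transitions, even though that pair was never observed directly; unreachable pairs stay at negative infinity.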
-
Weakly-Supervised Video Object Grounding from Text by Loss Weighting and Object Interaction
Authors:
Luowei Zhou,
Nathan Louis,
Jason J. Corso
Abstract:
We study weakly-supervised video object grounding: given a video segment and a corresponding descriptive sentence, the goal is to localize in the video the objects that are mentioned in the sentence. During training, no object bounding boxes are available, but the set of possible objects to be grounded is known beforehand. Existing approaches in the image domain use Multiple Instance Learning (MIL) to ground objects by enforcing matches between visual and semantic features. A naive extension of this approach to the video domain is to treat the entire segment as a bag of spatial object proposals. However, an object existing sparsely across multiple frames might not be detected completely since successfully spotting it from one single frame would trigger a satisfactory match. To this end, we propagate the weak supervisory signal from the segment level to frames that likely contain the target object. For frames that are unlikely to contain the target objects, we use an alternative penalty loss. We also leverage the interactions among objects as a textual guide for the grounding. We evaluate our model on the newly-collected benchmark YouCook2-BoundingBox and show improvements over competitive baselines.
Submitted 20 July, 2018; v1 submitted 8 May, 2018;
originally announced May 2018.
-
Joint Surgical Gesture and Task Classification with Multi-Task and Multimodal Learning
Authors:
Duygu Sarikaya,
Khurshid A. Guru,
Jason J. Corso
Abstract:
We propose a novel multi-modal and multi-task architecture for simultaneous low-level gesture and surgical task classification in Robot Assisted Surgery (RAS) videos. Our end-to-end architecture is based on the principles of a long short-term memory network (LSTM) that jointly learns temporal dynamics on rich representations of visual and motion features, while simultaneously classifying activities of low-level gestures and surgical tasks. Our experimental results show that our approach is superior to an architecture that classifies the gestures and surgical tasks separately on visual cues and motion cues respectively. We train our model on a fixed random set of 1200 gesture video segments and use the remaining 422 for testing. This results in around 42,000 gesture frames sampled for training and 14,500 for testing. In a 6-split experiment, while the conventional approach reaches an Average Precision (AP) of only 29% (29.13%), our architecture reaches an AP of 51% (50.83%) for 3 tasks and 14 possible gesture labels, resulting in an improvement of 22% (21.7%). Our architecture learns temporal dynamics on rich representations of visual and motion features that complement each other for classification of low-level gestures and surgical tasks. Its multi-task learning nature makes use of learned joint relationships and combinations of shared and task-specific representations. While benchmark studies focus on recognizing gestures that take place under specific tasks, we focus on recognizing common gestures that reoccur across different tasks and settings and perform significantly better compared to conventional architectures.
Submitted 2 May, 2018;
originally announced May 2018.