-
Virchow: A Million-Slide Digital Pathology Foundation Model
Authors:
Eugene Vorontsov,
Alican Bozkurt,
Adam Casson,
George Shaikovski,
Michal Zelechowski,
Siqi Liu,
Kristen Severson,
Eric Zimmermann,
James Hall,
Neil Tenenholtz,
Nicolo Fusi,
Philippe Mathieu,
Alexander van Eck,
Donghun Lee,
Julian Viret,
Eric Robert,
Yi Kan Wang,
Jeremy D. Kunz,
Matthew C. H. Lee,
Jan Bernhard,
Ran A. Godrich,
Gerard Oakley,
Ewan Millar,
Matthew Hanna,
Juan Retamero, et al. (6 additional authors not shown)
Abstract:
The use of artificial intelligence to enable precision medicine and decision support systems through the analysis of pathology images has the potential to revolutionize the diagnosis and treatment of cancer. Such applications will depend on models' abilities to capture the diverse patterns observed in pathology images. To address this challenge, we present Virchow, a foundation model for computational pathology. Virchow is a vision transformer with 632 million parameters, trained with self-supervised learning using the DINOv2 algorithm on 1.5 million hematoxylin and eosin stained whole slide images from diverse tissue and specimen types, which is orders of magnitude more data than previous works. The Virchow model enables the development of a pan-cancer detection system with 0.949 overall specimen-level AUC across 17 different cancer types, while also achieving 0.937 AUC on 7 rare cancer types. The Virchow model sets the state of the art on internal and external tile-level benchmarks and on slide-level biomarker prediction tasks. The gains in performance highlight the importance of training on massive pathology image datasets, suggesting that scaling up the data and network architecture can improve accuracy for many high-impact computational pathology applications where limited amounts of training data are available.
Submitted 17 January, 2024; v1 submitted 14 September, 2023;
originally announced September 2023.
-
Dynamics of the no-slip Galton board
Authors:
Jan Ahmed,
Timothy Chumley,
Scott Cook,
Christopher Cox,
Hakiem Grant,
Nicholas Petela,
Bethany Rothrock,
Ridnald Xhafaj
Abstract:
The ideal Galton board and Lorentz gas billiard models have been studied numerically and analytically primarily in settings where friction and rotational velocity are neglected. We eliminate these simplifying assumptions and study the resulting dynamics of a more general model using no-slip collisions, in which particles rotate and may exchange linear and angular momentum at collisions while adhering to certain conservation laws. Using numerical experiments and phase portrait analysis, we show that (in contrast to specular dispersing billiards) regularity persists when a small force is introduced, while (consistent with specular billiards) new structure, including invariant regions, may arise under a stronger force. We also show analytically that with the introduction of an external force periodicity proliferates, with new types of periodic orbits not present in the no-force case.
Submitted 16 August, 2022;
originally announced August 2022.
-
Privacy-Preserving Human Activity Recognition from Extreme Low Resolution
Authors:
Michael S. Ryoo,
Brandon Rothrock,
Charles Fleming,
Hyun Jong Yang
Abstract:
Privacy protection from surreptitious video recordings is an important societal challenge. We desire a computer vision system (e.g., a robot) that can recognize human activities and assist us in daily life, yet ensure that it is not recording video that may invade our privacy. This paper presents a fundamental approach to address such conflicting objectives: human activity recognition using only extreme low-resolution (e.g., 16x12) anonymized videos. We introduce the paradigm of inverse super resolution (ISR), the concept of learning the optimal set of image transformations to generate multiple low-resolution (LR) training videos from a single video. Our ISR learns different types of sub-pixel transformations optimized for activity classification, allowing the classifier to best take advantage of existing high-resolution videos (e.g., YouTube videos) by creating multiple LR training videos tailored for the problem. We experimentally confirm that the paradigm of inverse super resolution benefits activity recognition from extreme low-resolution videos.
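To make the idea of generating several LR training frames from one high-resolution frame concrete, here is a minimal sketch. It is an assumption-laden illustration, not the paper's method: the paper learns its set of sub-pixel transformations for the classifier, whereas this sketch applies a fixed, hand-picked list of shifts, uses bilinear interpolation, and downsamples by block averaging. The function names `subpixel_shift`, `downsample`, and `inverse_super_resolution` are hypothetical.

```python
import numpy as np


def subpixel_shift(frame: np.ndarray, dx: float, dy: float) -> np.ndarray:
    """Shift a 2-D grayscale frame by (dx, dy) high-res pixels (bilinear interpolation).

    Source coordinates falling outside the frame are clamped to the border.
    """
    h, w = frame.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # For each output pixel, the (possibly fractional) source location.
    src_x = np.clip(xs - dx, 0, w - 1)
    src_y = np.clip(ys - dy, 0, h - 1)
    x0 = np.floor(src_x).astype(int)
    y0 = np.floor(src_y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    fx = src_x - x0
    fy = src_y - y0
    # Interpolate horizontally on two rows, then vertically between them.
    top = frame[y0, x0] * (1 - fx) + frame[y0, x1] * fx
    bottom = frame[y1, x0] * (1 - fx) + frame[y1, x1] * fx
    return top * (1 - fy) + bottom * fy


def downsample(frame: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Block-average a frame whose size is an integer multiple of (out_h, out_w)."""
    h, w = frame.shape
    if h % out_h or w % out_w:
        raise ValueError("frame size must be a multiple of the output size")
    return frame.reshape(out_h, h // out_h, out_w, w // out_w).mean(axis=(1, 3))


def inverse_super_resolution(frame, shifts, out_h=12, out_w=16):
    """Produce one LR frame per (dx, dy) shift; the shift list stands in for learned transforms."""
    return [downsample(subpixel_shift(frame, dx, dy), out_h, out_w) for dx, dy in shifts]
```

For a 160x120 frame downsampled to 16x12, one LR pixel covers 10x10 high-resolution pixels, so shifts such as `[(0, 0), (5, 0), (0, 5), (5, 5)]` move the LR sampling grid by half an LR pixel, yielding four distinct 16x12 training frames from a single source frame.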
Submitted 26 December, 2016; v1 submitted 11 April, 2016;
originally announced April 2016.
-
Joint Inference of Groups, Events and Human Roles in Aerial Videos
Authors:
Tianmin Shu,
Dan Xie,
Brandon Rothrock,
Sinisa Todorovic,
Song-Chun Zhu
Abstract:
With the advent of drones, aerial video analysis is becoming increasingly important; yet it has received scant attention in the literature. This paper addresses a new problem of parsing low-resolution aerial videos of large spatial areas, in terms of 1) grouping, 2) recognizing events, and 3) assigning roles to people engaged in events. We propose a novel framework aimed at conducting joint inference of the above tasks, as reasoning about each in isolation typically fails in our setting. Given noisy tracklets of people and detections of large objects and scene surfaces (e.g., building, grass), we use a spatiotemporal AND-OR graph to drive our joint inference, using Markov Chain Monte Carlo and dynamic programming. We also introduce a new formalism of spatiotemporal templates characterizing latent sub-events. For evaluation, we have collected and released a new aerial video dataset using a hex-rotor flying over picnic areas rich with group events. Our results demonstrate that we successfully address the above inference tasks under challenging conditions.
Submitted 22 May, 2015;
originally announced May 2015.
-
Pooled Motion Features for First-Person Videos
Authors:
M. S. Ryoo,
Brandon Rothrock,
Larry Matthies
Abstract:
In this paper, we present a new feature representation for first-person videos. In first-person video understanding (e.g., activity recognition), it is very important to capture both entire scene dynamics (i.e., egomotion) and salient local motion observed in videos. We describe a representation framework based on time series pooling, which is designed to abstract short-term/long-term changes in feature descriptor elements. The idea is to keep track of how descriptor values change over time and summarize them to represent motion in the activity video. The framework is general, handling any type of per-frame feature descriptor, including conventional motion descriptors like histograms of optical flow (HOF) as well as appearance descriptors from more recent convolutional neural networks (CNN). We experimentally confirm that our approach clearly outperforms previous feature representations, including bag-of-visual-words and improved Fisher vector (IFV), when using identical underlying feature descriptors. We also confirm that our feature representation has superior performance to existing state-of-the-art features like local spatio-temporal features and Improved Trajectory Features (originally developed for 3rd-person videos) when handling first-person videos. Multiple first-person activity datasets were tested under various settings to confirm these findings.
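The core idea of time series pooling, treating each descriptor element as a time series and summarizing how it changes, can be sketched as follows. This is a minimal illustration under assumptions: the pooling operators chosen here (max, sum, and counts of increases and decreases between consecutive frames) and the two-level temporal pyramid are illustrative choices, not necessarily the exact operators or pyramid used in the paper, and the function names `pool_segment` and `pooled_time_series` are hypothetical.

```python
from typing import List, Sequence


def pool_segment(series: Sequence[Sequence[float]]) -> List[float]:
    """Pool one temporal segment of per-frame descriptors (frames x dims).

    For every descriptor element: max, sum, number of increases, number of decreases.
    """
    dims = len(series[0])
    out: List[float] = []
    for d in range(dims):
        values = [frame[d] for frame in series]
        # Frame-to-frame changes of this descriptor element.
        diffs = [b - a for a, b in zip(values, values[1:])]
        out.extend([
            max(values),
            sum(values),
            float(sum(1 for g in diffs if g > 0)),
            float(sum(1 for g in diffs if g < 0)),
        ])
    return out


def pooled_time_series(descriptors: Sequence[Sequence[float]], levels: int = 2) -> List[float]:
    """Concatenate pooled vectors over a temporal pyramid.

    Level k splits the video into 2**k equal segments, capturing long-term
    (whole video) and shorter-term (sub-segment) changes.
    """
    n = len(descriptors)
    if n < 2 ** (levels - 1):
        raise ValueError("too few frames for the requested pyramid depth")
    feature: List[float] = []
    for level in range(levels):
        parts = 2 ** level
        for p in range(parts):
            start, end = p * n // parts, (p + 1) * n // parts
            feature.extend(pool_segment(descriptors[start:end]))
    return feature
```

For example, four frames of a 2-D descriptor, `[[0.0, 1.0], [1.0, 0.5], [2.0, 0.25], [1.5, 0.0]]`, yield a 24-element vector: 8 values for the whole clip plus 8 for each half. Because the pooling works element by element, the same code applies unchanged to HOF or CNN descriptors of any dimension.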
Submitted 6 May, 2015; v1 submitted 19 December, 2014;
originally announced December 2014.