-
DISC: Latent Diffusion Models with Self-Distillation from Separated Conditions for Prostate Cancer Grading
Authors:
Man M. Ho,
Elham Ghelichkhan,
Yosep Chong,
Yufei Zhou,
Beatrice Knudsen,
Tolga Tasdizen
Abstract:
Latent Diffusion Models (LDMs) can generate high-fidelity images from noise, offering a promising approach for augmenting histopathology images for training cancer grading models. While previous works successfully generated high-fidelity histopathology images using LDMs, the generation of image tiles to improve prostate cancer grading has not yet been explored. Additionally, LDMs face challenges in accurately generating admixtures of multiple cancer grades in a tile when conditioned by a tile mask. In this study, we train specific LDMs to generate synthetic tiles that contain multiple Gleason Grades (GGs) by leveraging pixel-wise annotations in input tiles. We introduce a novel framework named Self-Distillation from Separated Conditions (DISC) that generates GG patterns guided by GG masks. Finally, we deploy a training framework for pixel-level and slide-level prostate cancer grading, where synthetic tiles are effectively utilized to improve the cancer grading performance of existing models. As a result, this work surpasses previous works in two domains: 1) our LDMs enhanced with DISC produce more accurate tiles in terms of GG patterns, and 2) our training scheme, incorporating synthetic data, significantly improves the generalization of the baseline model for prostate cancer grading, particularly in challenging cases of rare GG5, demonstrating the potential of generative models to enhance cancer grading when data is limited.
Submitted 19 April, 2024;
originally announced April 2024.
-
F2FLDM: Latent Diffusion Models with Histopathology Pre-Trained Embeddings for Unpaired Frozen Section to FFPE Translation
Authors:
Man M. Ho,
Shikha Dubey,
Yosep Chong,
Beatrice Knudsen,
Tolga Tasdizen
Abstract:
The Frozen Section (FS) technique is a rapid and efficient method, taking only 15-30 minutes to prepare slides for pathologists' evaluation during surgery, enabling immediate decisions on further surgical interventions. However, the FS process often introduces artifacts and distortions like folds and ice-crystal effects. In contrast, these artifacts and distortions are absent in the higher-quality formalin-fixed paraffin-embedded (FFPE) slides, which require 2-3 days to prepare. While Generative Adversarial Network (GAN)-based methods have been used to translate FS to FFPE images (F2F), they may leave morphological inaccuracies with remaining FS artifacts or introduce new artifacts, reducing the quality of these translations for clinical assessments. In this study, we benchmark recent generative models, focusing on GANs and Latent Diffusion Models (LDMs), to overcome these limitations. We introduce a novel approach that combines LDMs with Histopathology Pre-Trained Embeddings to enhance restoration of FS images. Our framework leverages LDMs conditioned by both text and pre-trained embeddings to learn meaningful features of FS and FFPE histopathology images. Through diffusion and denoising techniques, our approach not only preserves essential diagnostic attributes like color staining and tissue morphology but also proposes an embedding translation mechanism to better predict the targeted FFPE representation of input FS images. As a result, this work achieves a significant improvement in classification performance, with the Area Under the Curve rising from 81.99% to 94.64%, accompanied by an advantageous CaseFD. This work establishes a new benchmark for FS to FFPE image translation quality, promising enhanced reliability and accuracy in histopathology FS image analysis. Our work is available at https://minhmanho.github.io/f2f_ldm/.
Submitted 19 April, 2024;
originally announced April 2024.
-
StainDiffuser: MultiTask Dual Diffusion Model for Virtual Staining
Authors:
Tushar Kataria,
Beatrice Knudsen,
Shireen Y. Elhabian
Abstract:
Hematoxylin and Eosin (H&E) staining is the most commonly used stain for disease diagnosis and tumor recurrence tracking. Hematoxylin excels at highlighting nuclei, whereas eosin stains the cytoplasm. However, the H&E stain lacks the detail needed to differentiate the cell types relevant to identifying the grade of the disease or the response to specific treatment variations. Pathologists require special immunohistochemical (IHC) stains that highlight different cell types. These stains help in accurately identifying different regions of disease growth and their interactions with the cell's microenvironment. The advent of deep learning models has made Image-to-Image (I2I) translation a key research area, reducing the need for expensive physical staining processes. Pix2Pix and CycleGAN are still the most commonly used methods for virtual staining applications. However, both suffer from hallucinations or staining irregularities when the H&E stain carries less discriminative information about the underlying cells that IHC needs to highlight (e.g., CD3 lymphocytes). Diffusion models are currently the state-of-the-art models for image generation and conditional generation tasks. However, they require extensive and diverse datasets (millions of samples) to converge, which is less feasible for virtual staining applications. Inspired by the success of multitask deep learning models for limited dataset size, we propose StainDiffuser, a novel multitask dual diffusion architecture for virtual staining that converges under a limited training budget. StainDiffuser trains two diffusion processes simultaneously: (a) generation of cell-specific IHC stain from H&E and (b) H&E-based cell segmentation using coarse segmentation only during training. Our results show that StainDiffuser produces high-quality results for both easier (CK8/18, epithelial marker) and more difficult (CD3, lymphocytes) stains.
Submitted 17 March, 2024;
originally announced March 2024.
-
CLASS-M: Adaptive stain separation-based contrastive learning with pseudo-labeling for histopathological image classification
Authors:
Bodong Zhang,
Hamid Manoochehri,
Man Minh Ho,
Fahimeh Fooladgar,
Yosep Chong,
Beatrice S. Knudsen,
Deepika Sirohi,
Tolga Tasdizen
Abstract:
Histopathological image classification is an important task in medical image analysis. Recent approaches generally rely on weakly supervised learning due to the ease of acquiring case-level labels from pathology reports. However, patch-level classification is preferable in applications where only a limited number of cases are available or when local prediction accuracy is critical. On the other hand, acquiring extensive datasets with localized labels for training is not feasible. In this paper, we propose a semi-supervised patch-level histopathological image classification model, named CLASS-M, that does not require extensively labeled datasets. CLASS-M is formed by two main parts: a contrastive learning module that uses separated Hematoxylin and Eosin images generated through an adaptive stain separation process, and a module with pseudo-labels using MixUp. We compare our model with other state-of-the-art models on two clear cell renal cell carcinoma datasets. We demonstrate that our CLASS-M model has the best performance on both datasets. Our code is available at github.com/BzhangURU/Paper_CLASS-M/tree/main
Submitted 4 January, 2024; v1 submitted 11 December, 2023;
originally announced December 2023.
-
Structural Cycle GAN for Virtual Immunohistochemistry Staining of Gland Markers in the Colon
Authors:
Shikha Dubey,
Tushar Kataria,
Beatrice Knudsen,
Shireen Y. Elhabian
Abstract:
With the advent of digital scanners and deep learning, diagnostic operations may move from a microscope to a desktop. Hematoxylin and Eosin (H&E) staining is one of the most frequently used stains for disease analysis, diagnosis, and grading, but pathologists do need different immunohistochemical (IHC) stains to analyze specific structures or cells. Obtaining all of these stains (H&E and different IHCs) on a single specimen is a tedious and time-consuming task. Consequently, virtual staining has emerged as an essential research direction. Here, we propose a novel generative model, Structural Cycle-GAN (SC-GAN), for synthesizing IHC stains from H&E images, and vice versa. Our method expressly incorporates structural information in the form of edges (in addition to color data) and employs attention modules exclusively in the decoder of the proposed generator model. This integration enhances feature localization and preserves contextual information during the generation process. In addition, a structural loss is incorporated to ensure accurate structure alignment between the generated and input markers. To demonstrate the efficacy of the proposed model, experiments are conducted with two IHC markers emphasizing distinct structures of glands in the colon: the nucleus of epithelial cells (CDX2) and the cytoplasm (CK818). Quantitative metrics such as FID and SSIM are frequently used for the analysis of generative models, but they do not correlate explicitly with higher-quality virtual staining results. Therefore, we propose two new quantitative metrics that correlate directly with the virtual staining specificity of IHC markers.
Submitted 25 August, 2023;
originally announced August 2023.
-
To pretrain or not to pretrain? A case study of domain-specific pretraining for semantic segmentation in histopathology
Authors:
Tushar Kataria,
Beatrice Knudsen,
Shireen Elhabian
Abstract:
Annotating medical imaging datasets is costly, so fine-tuning (or transfer learning) is the most effective method for digital pathology vision applications such as disease classification and semantic segmentation. However, due to texture bias in models trained on real-world images, transfer learning for histopathology applications might result in underperforming models, which necessitates using unlabeled histopathology data and self-supervised methods to discover domain-specific characteristics. Here, we tested the premise that histopathology-specific pretrained models provide better initializations for pathology vision tasks, i.e., gland and cell segmentation. In this study, we compare the performance of gland and cell segmentation tasks with histopathology domain-specific and non-domain-specific (real-world images) pretrained weights. Moreover, we investigate the dataset size at which domain-specific pretraining produces significant gains in performance. In addition, we investigated whether domain-specific initialization improves the effectiveness of out-of-distribution testing on distinct datasets but the same task. The results indicate that performance gain using domain-specific pretrained weights depends on both the task and the size of the training dataset. In instances with limited dataset sizes, a significant improvement in gland segmentation performance was also observed, whereas models trained on cell segmentation datasets exhibit no improvement.
Submitted 21 August, 2023; v1 submitted 6 July, 2023;
originally announced July 2023.
-
Unsupervised Domain Adaptation for Medical Image Segmentation via Feature-space Density Matching
Authors:
Tushar Kataria,
Beatrice Knudsen,
Shireen Elhabian
Abstract:
Semantic segmentation is a critical step in automated image interpretation and analysis where pixels are classified into one or more predefined semantically meaningful classes. Deep learning approaches for semantic segmentation rely on harnessing the power of annotated images to learn features indicative of these semantic classes. Nonetheless, they often fail to generalize when there is a significant domain (i.e., distributional) shift between the training (i.e., source) data and the dataset(s) encountered when deployed (i.e., target), necessitating manual annotations for the target data to achieve acceptable performance. This is especially important in medical imaging because different image modalities have significant intra- and inter-site variations due to protocol and vendor variability. Current techniques are sensitive to hyperparameter tuning and target dataset size. This paper presents an unsupervised domain adaptation approach for semantic segmentation that alleviates the need for annotating target data. Using kernel density estimation, we match the target data distribution to the source in the feature space, particularly when the number of target samples is limited (3% of the target dataset size). We demonstrate the efficacy of our proposed approach on 2 datasets, multisite prostate MRI and histopathology images.
Submitted 6 July, 2023; v1 submitted 9 May, 2023;
originally announced May 2023.
-
A Pathologist-Informed Workflow for Classification of Prostate Glands in Histopathology
Authors:
Alessandro Ferrero,
Beatrice Knudsen,
Deepika Sirohi,
Ross Whitaker
Abstract:
Pathologists diagnose and grade prostate cancer by examining tissue from needle biopsies on glass slides. The cancer's severity and risk of metastasis are determined by the Gleason grade, a score based on the organization and morphology of prostate cancer glands. For diagnostic work-up, pathologists first locate glands in the whole biopsy core, and -- if they detect cancer -- they assign a Gleason grade. This time-consuming process is subject to errors and significant inter-observer variability, despite strict diagnostic criteria. This paper proposes an automated workflow that follows pathologists' modus operandi, isolating and classifying multi-scale patches of individual glands in whole slide images (WSI) of biopsy tissues using distinct steps: (1) two fully convolutional networks segment epithelium versus stroma and gland boundaries, respectively; (2) a classifier network separates benign from cancer glands at high magnification; and (3) an additional classifier predicts the grade of each cancer gland at low magnification. Altogether, this process provides a gland-specific approach for prostate cancer grading that we compare against other machine-learning-based grading methods.
Submitted 27 September, 2022;
originally announced September 2022.
-
Stain Based Contrastive Co-training for Histopathological Image Analysis
Authors:
Bodong Zhang,
Beatrice Knudsen,
Deepika Sirohi,
Alessandro Ferrero,
Tolga Tasdizen
Abstract:
We propose a novel semi-supervised learning approach for classification of histopathology images. We employ strong supervision with patch-level annotations combined with a novel co-training loss to create a semi-supervised learning framework. Co-training relies on multiple conditionally independent and sufficient views of the data. We separate the hematoxylin and eosin channels in pathology images using color deconvolution to create two views of each slide that can partially fulfill these requirements. Two separate CNNs are used to embed the two views into a joint feature space. We use a contrastive loss between the views in this feature space to implement co-training. We evaluate our approach in clear cell renal cell and prostate carcinomas, and demonstrate improvement over state-of-the-art semi-supervised learning methods.
Submitted 26 August, 2022; v1 submitted 24 June, 2022;
originally announced June 2022.
-
Visual attention analysis of pathologists examining whole slide images of Prostate cancer
Authors:
Souradeep Chakraborty,
Ke Ma,
Rajarsi Gupta,
Beatrice Knudsen,
Gregory J. Zelinsky,
Joel H. Saltz,
Dimitris Samaras
Abstract:
We study the attention of pathologists as they examine whole-slide images (WSIs) of prostate cancer tissue using a digital microscope. To the best of our knowledge, our study is the first to report in detail how pathologists navigate WSIs of prostate cancer as they accumulate information for their diagnoses. We collected slide navigation data (i.e., viewport location, magnification level, and time) from 13 pathologists in 2 groups (5 genitourinary (GU) specialists and 8 general pathologists) and generated visual attention heatmaps and scanpaths. Each pathologist examined five WSIs from the TCGA PRAD dataset, which were selected by a GU pathology specialist. We examined and analyzed the distributions of visual attention for each group of pathologists after each WSI was examined. To quantify the relationship between a pathologist's attention and evidence for cancer in the WSI, we obtained tumor annotations from a genitourinary specialist. We used these annotations to compute the overlap between the distribution of visual attention and annotated tumor region to identify strong correlations. Motivated by this analysis, we trained a deep learning model to predict visual attention on unseen WSIs. We find that the attention heatmaps predicted by our model correlate quite well with the ground truth attention heatmap and tumor annotations on a test set of 17 WSIs by using various spatial and temporal evaluation metrics.
Submitted 2 May, 2022; v1 submitted 16 February, 2022;
originally announced February 2022.
-
Load Balancing with Dynamic Set of Balls and Bins
Authors:
Anders Aamand,
Jakob Bæk Tejs Knudsen,
Mikkel Thorup
Abstract:
In dynamic load balancing, we wish to distribute balls into bins in an environment where both balls and bins can be added and removed. We want to minimize the maximum load of any bin but we also want to minimize the number of balls and bins affected when adding or removing a ball or a bin. We want a hashing-style solution where, given the ID of a ball, we can find its bin efficiently.
We are given a balancing parameter $c=1+ε$, where $ε\in (0,1)$. With $n$ and $m$ the current numbers of balls and bins, we want no bin with load above $C=\lceil c n/m\rceil$, referred to as the capacity of the bins.
We present a scheme where we can locate a ball checking $1+O(\log 1/ε)$ bins in expectation. When inserting or deleting a ball, we expect to move $O(1/ε)$ balls, and when inserting or deleting a bin, we expect to move $O(C/ε)$ balls. Previous bounds were off by a factor $1/ε$.
These bounds are best possible when $C=O(1)$ but for larger $C$, we can do much better: Let $f=εC$ if $C\leq \log 1/ε$, $f=ε\sqrt{C}\cdot \sqrt{\log(1/(ε\sqrt{C}))}$ if $\log 1/ε\leq C<\tfrac{1}{2ε^2}$, and $f=1$ if $C\geq \tfrac{1}{2ε^2}$. We show that we expect to move $O(1/f)$ balls when inserting or deleting a ball, and $O(C/f)$ balls when inserting or deleting a bin.
For the bounds with larger $C$, we first have to resolve a much simpler probabilistic problem. Place $n$ balls in $m$ bins of capacity $C$, one ball at a time. Each ball picks a uniformly random non-full bin. We show that in expectation and with high probability, the fraction of non-full bins is $Θ(f)$. Then the expected number of bins that a new ball would have to visit to find one that is not full is $Θ(1/f)$. As it turns out, we obtain the same complexity in our more complicated scheme where both balls and bins can be added and removed.
Submitted 11 April, 2021;
originally announced April 2021.
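The simpler probabilistic problem described in the abstract above (each ball picks a uniformly random non-full bin) can be simulated directly. The following is a minimal sketch, not the authors' scheme: the function name `nonfull_fraction` and the parameter values in the usage note are illustrative assumptions, and the simulation covers only the static process, not insertions and deletions of bins.

```python
import random


def nonfull_fraction(n, m, capacity, seed=0):
    """Place n balls into m bins of the given capacity, one ball at a time.

    Each ball picks a uniformly random bin among those that are not yet full.
    Returns (fraction of non-full bins at the end, list of final loads).
    """
    if n > m * capacity:
        raise ValueError("more balls than total capacity")
    rng = random.Random(seed)
    loads = [0] * m
    nonfull = list(range(m))  # indices of bins with load < capacity
    for _ in range(n):
        i = rng.randrange(len(nonfull))
        b = nonfull[i]
        loads[b] += 1
        if loads[b] == capacity:
            # Swap-remove the bin that just became full (O(1) per ball).
            nonfull[i] = nonfull[-1]
            nonfull.pop()
    return len(nonfull) / m, loads
```

For example, with ε = 0.25, m = 1000 and n = 4000, the capacity is C = ⌈1.25 · 4⌉ = 5, and `nonfull_fraction(4000, 1000, 5)` returns an empirical fraction that can be compared against the Θ(f) behaviour stated in the abstract.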
-
The Power of Hashing with Mersenne Primes
Authors:
Thomas Dybdahl Ahle,
Jakob Tejs Bæk Knudsen,
Mikkel Thorup
Abstract:
The classic way of computing a $k$-universal hash function is to use a random degree-$(k-1)$ polynomial over a prime field $\mathbb Z_p$. For a fast computation of the polynomial, the prime $p$ is often chosen as a Mersenne prime $p=2^b-1$.
In this paper, we show that there are other nice advantages to using Mersenne primes. Our view is that the hash function's output is a $b$-bit integer that is uniformly distributed in $\{0, \dots, 2^b-1\}$, except that $p$ (the all-1s value in binary) is missing. Uniform bit strings have many nice properties, such as splitting into substrings which gives us two or more hash functions for the cost of one, while preserving strong theoretical qualities. We call this trick "Two for one" hashing, and we demonstrate it on 4-universal hashing in the classic Count Sketch algorithm for second-moment estimation.
We also provide new fast branch-free code for division and modulus with Mersenne primes. In contrast to our analytic work, this code generalizes to any pseudo-Mersenne prime $p=2^b-c$ for small $c$.
Submitted 6 May, 2021; v1 submitted 19 August, 2020;
originally announced August 2020.
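The reason Mersenne primes make polynomial hashing fast is that $2^b \equiv 1 \pmod{2^b-1}$, so a reduction modulo $p$ can be done with shifts, masks and additions instead of a division. The sketch below illustrates that textbook folding trick together with bit-splitting in the spirit of "Two for one" hashing. It is not the paper's branch-free code: the choice $b=61$ and the names `mod_mersenne`, `poly_hash` and `split_two_for_one` are assumptions made for illustration.

```python
B = 61
P = (1 << B) - 1  # 2**61 - 1, a Mersenne prime


def mod_mersenne(x, b=B):
    """Reduce 0 <= x < 2**(2*b) modulo p = 2**b - 1 without a division."""
    p = (1 << b) - 1
    y = (x & p) + (x >> b)  # fold once: 2**b is congruent to 1 mod p, so y <= 2p
    y = (y & p) + (y >> b)  # fold again: now y <= p
    return 0 if y == p else y  # p itself represents the residue 0


def poly_hash(coeffs, x):
    """Evaluate a polynomial over Z_p by Horner's rule.

    coeffs lists the coefficients from highest degree to constant term;
    all coefficients and the key x are assumed to lie in [0, p).
    """
    acc = 0
    for a in coeffs:
        # acc * x + a <= (p - 1) * p < 2**(2*b), so one call suffices.
        acc = mod_mersenne(acc * x + a)
    return acc


def split_two_for_one(h, low_bits):
    """Split one hash value into (high part, low part)."""
    return h >> low_bits, h & ((1 << low_bits) - 1)
```

A degree-3 polynomial with four random coefficients gives the classic 4-universal construction mentioned in the abstract; `split_two_for_one(poly_hash(coeffs, x), 30)` then yields two values from a single evaluation.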
-
No Repetition: Fast Streaming with Highly Concentrated Hashing
Authors:
Anders Aamand,
Debarati Das,
Evangelos Kipouridis,
Jakob B. T. Knudsen,
Peter M. R. Rasmussen,
Mikkel Thorup
Abstract:
To get estimators that work within a certain error bound with high probability, a common strategy is to design one that works with constant probability, and then boost the probability using independent repetitions. Important examples of this approach are small space algorithms for estimating the number of distinct elements in a stream, or estimating the set similarity between large sets. Using standard strongly universal hashing to process each element, we get a sketch based estimator where the probability of a too large error is, say, 1/4. By performing $r$ independent repetitions and taking the median of the estimators, the error probability falls exponentially in $r$. However, running $r$ independent experiments increases the processing time by a factor $r$.
Here we make the point that if we have a hash function with strong concentration bounds, then we get the same high probability bounds without any need for repetitions. Instead of $r$ independent sketches, we have a single sketch that is $r$ times bigger, so the total space is the same. However, we only apply a single hash function, so we save a factor $r$ in time, and the overall algorithms just get simpler.
Fast practical hash functions with strong concentration bounds were recently proposed by Aamand et al. (to appear in STOC 2020). Using their hashing schemes, the algorithms thus become very fast and practical, suitable for online processing of high volume data streams.
Submitted 2 April, 2020;
originally announced April 2020.
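The baseline strategy the abstract above argues against, boosting a constant-probability estimator by taking the median of $r$ independent repetitions, can be quantified with a short calculation. If each repetition is bad with probability $q$, the median can only be bad when at least half of the repetitions are bad, which gives a binomial tail bound. The sketch below computes that bound exactly; the function name `median_failure_bound` is an assumption, and the default $q = 1/4$ is taken from the abstract's example.

```python
from math import comb


def median_failure_bound(r, q=0.25):
    """Upper bound on the probability that the median of r independent
    estimates is bad, when each estimate is bad with probability q.

    The median can be bad only if at least ceil(r / 2) estimates are bad.
    """
    need = (r + 1) // 2
    return sum(comb(r, k) * q**k * (1 - q) ** (r - k) for k in range(need, r + 1))
```

With one repetition the bound is simply 0.25; with three it is 10/64 = 0.15625, and it keeps shrinking as $r$ grows, at the cost of the factor-$r$ processing time that a single, highly concentrated hash function avoids.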
-
Almost Optimal Tensor Sketch
Authors:
Thomas D. Ahle,
Jakob B. T. Knudsen
Abstract:
We construct a matrix $M\in R^{m\times d^c}$ with just $m=O(c\,λ\,\varepsilon^{-2}\text{poly}\log1/\varepsilonδ)$ rows, which preserves the norm $\|Mx\|_2=(1\pm\varepsilon)\|x\|_2$ of all $x$ in any given $λ$ dimensional subspace of $ R^d$ with probability at least $1-δ$. This matrix can be applied to tensors $x^{(1)}\otimes\dots\otimes x^{(c)}\in R^{d^c}$ in $O(c\, m \min\{d,m\})$ time -- hence the name "Tensor Sketch". (Here $x\otimes y = \text{asvec}(xy^T) = [x_1y_1, x_1y_2,\dots,x_1y_m,x_2y_1,\dots,x_ny_m]\in R^{nm}$.)
This improves upon earlier Tensor Sketch constructions by Pagh and Pham [TOCT 2013, SIGKDD 2013] and Avron et al. [NIPS 2014] which require $m=Ω(3^cλ^2δ^{-1})$ rows for the same guarantees. The factors of $λ$, $\varepsilon^{-2}$ and $\log1/δ$ can all be shown to be necessary, making our sketch optimal up to log factors.
With another construction we get $λ$ times more rows $m=\tilde O(c\,λ^2\,\varepsilon^{-2}(\log1/δ)^3)$, but the matrix can be applied to any vector $x^{(1)}\otimes\dots\otimes x^{(c)}\in R^{d^c}$ in just $\tilde O(c\, (d+m))$ time. This matches the application time of Tensor Sketch while still improving the exponential dependencies in $c$ and $\log1/δ$.
Technically, we show two main lemmas: (1) For many Johnson-Lindenstrauss (JL) constructions, if $Q,Q'\in R^{m\times d}$ are independent JL matrices, the element-wise product $Qx \circ Q'y$ equals $M(x\otimes y)$ for some $M\in R^{m\times d^2}$ which is itself a JL matrix. (2) If $M^{(i)}\in R^{m\times md}$ are independent JL matrices, then $M^{(1)}(x \otimes (M^{(2)}y \otimes \dots)) = M(x\otimes y\otimes \dots)$ for some $M\in R^{m\times d^c}$ which is itself a JL matrix. Combining these two results gives an efficient sketch for tensors of any size.
Submitted 3 September, 2019;
originally announced September 2019.
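The algebraic half of lemma (1) in the abstract above is easy to check numerically: the element-wise product $Qx \circ Q'y$ equals $M(x\otimes y)$ when row $i$ of $M$ is the Kronecker product of row $i$ of $Q$ with row $i$ of $Q'$. The sketch below verifies only this identity, using unnormalized random sign matrices and integer vectors so the arithmetic is exact. It does not test the JL property itself, and the helper names and the sign-matrix choice are assumptions made for illustration.

```python
import random


def kron(u, v):
    """Kronecker product of two vectors: [u_1 v_1, u_1 v_2, ..., u_n v_m]."""
    return [a * b for a in u for b in v]


def matvec(rows, x):
    """Multiply a matrix, given as a list of rows, by a vector."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in rows]


def check_identity(m, d, seed=0):
    """Return (Qx ∘ Q'y, M(x ⊗ y)) for random sign matrices Q, Q'.

    Row i of M is kron(Q[i], Q'[i]); the usual 1/sqrt(m) scaling is
    omitted so that all values stay integers.
    """
    rng = random.Random(seed)
    Q = [[rng.choice((-1, 1)) for _ in range(d)] for _ in range(m)]
    Qp = [[rng.choice((-1, 1)) for _ in range(d)] for _ in range(m)]
    x = [rng.randint(-5, 5) for _ in range(d)]
    y = [rng.randint(-5, 5) for _ in range(d)]
    lhs = [a * b for a, b in zip(matvec(Q, x), matvec(Qp, y))]
    M = [kron(q, qp) for q, qp in zip(Q, Qp)]
    rhs = matvec(M, kron(x, y))
    return lhs, rhs
```

The left-hand side costs $O(md)$ arithmetic operations, while forming $M$ and $x\otimes y$ explicitly costs $O(md^2)$, which is the saving the sketch exploits.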
-
Oblivious Sketching of High-Degree Polynomial Kernels
Authors:
Thomas D. Ahle,
Michael Kapralov,
Jakob B. T. Knudsen,
Rasmus Pagh,
Ameya Velingker,
David Woodruff,
Amir Zandieh
Abstract:
Kernel methods are fundamental tools in machine learning that allow detection of non-linear dependencies between data without explicitly constructing feature vectors in high dimensional spaces. A major disadvantage of kernel methods is their poor scalability: primitives such as kernel PCA or kernel ridge regression generally take prohibitively large quadratic space and (at least) quadratic time, as kernel matrices are usually dense. Some methods for speeding up kernel linear algebra are known, but they all invariably take time exponential in either the dimension of the input point set (e.g., fast multipole methods suffer from the curse of dimensionality) or in the degree of the kernel function.
Oblivious sketching has emerged as a powerful approach to speeding up numerical linear algebra over the past decade, but our understanding of oblivious sketching solutions for kernel matrices has remained quite limited, suffering from the aforementioned exponential dependence on input parameters. Our main contribution is a general method for applying sketching solutions developed in numerical linear algebra over the past decade to a tensoring of data points without forming the tensoring explicitly. This leads to the first oblivious sketch for the polynomial kernel with a target dimension that is only polynomially dependent on the degree of the kernel function, as well as the first oblivious sketch for the Gaussian kernel on bounded datasets that does not suffer from an exponential dependence on the dimensionality of input data points.
Submitted 22 December, 2020; v1 submitted 3 September, 2019;
originally announced September 2019.
-
An attention-based multi-resolution model for prostate whole slide image classification and localization
Authors:
Jiayun Li,
Wenyuan Li,
Arkadiusz Gertych,
Beatrice S. Knudsen,
William Speier,
Corey W. Arnold
Abstract:
Histology review is often used as the "gold standard" for disease diagnosis. Computer-aided diagnosis tools can potentially help improve current pathology workflows by reducing examination time and interobserver variability. Previous work in cancer grading has focused mainly on classifying pre-defined regions of interest (ROIs), or relied on large amounts of fine-grained labels. In this paper, we propose a two-stage attention-based multiple instance learning model for slide-level cancer grading and weakly-supervised ROI detection and demonstrate its use in prostate cancer. Compared with existing Gleason classification models, our model goes a step further by utilizing visualized saliency maps to select informative tiles for fine-grained grade classification. The model was primarily developed on a large-scale whole slide dataset consisting of 3,521 prostate biopsy slides with only slide-level labels from 718 patients. The model achieved state-of-the-art performance for prostate cancer grading with an accuracy of 85.11% for classifying benign, low-grade (Gleason grade 3+3 or 3+4), and high-grade (Gleason grade 4+3 or higher) slides on an independent test set.
Submitted 30 May, 2019;
originally announced May 2019.
-
Fast hashing with Strong Concentration Bounds
Authors:
Anders Aamand,
Jakob B. T. Knudsen,
Mathias B. T. Knudsen,
Peter M. R. Rasmussen,
Mikkel Thorup
Abstract:
Previous work on tabulation hashing by Patrascu and Thorup from STOC'11 on simple tabulation and from SODA'13 on twisted tabulation offered Chernoff-style concentration bounds on hash-based sums, e.g., the number of balls/keys hashing to a given bin, but under some quite severe restrictions on the expected values of these sums. The basic idea in tabulation hashing is to view a key as consisting of $c=O(1)$ characters, e.g., a 64-bit key as $c=8$ characters of 8 bits. The character domain $Σ$ should be small enough that character tables of size $|Σ|$ fit in fast cache. The schemes then use $O(1)$ tables of this size, so the space of tabulation hashing is $O(|Σ|)$. However, the concentration bounds by Patrascu and Thorup only apply if the expected sums are $\ll |Σ|$.
To see the problem, consider the very simple case where we use tabulation hashing to throw $n$ balls into $m$ bins and want to analyse the number of balls in a given bin. With their concentration bounds, we are fine if $n=m$, for then the expected value is $1$. However, if $m=2$, as when tossing $n$ unbiased coins, the expected value $n/2$ is $\gg |Σ|$ for large data sets, e.g., data sets that do not fit in fast cache.
To handle expectations that go beyond the limits of our small space, we need a much more advanced analysis of simple tabulation, plus a new tabulation technique that we call \emph{tabulation-permutation} hashing which is at most twice as slow as simple tabulation. No other hashing scheme of comparable speed offers similar Chernoff-style concentration bounds.
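As an illustration of the basic idea described above (a key viewed as $c$ characters, one lookup table per character), the following is a minimal Python sketch of simple tabulation, which XORs one table entry per character. It is not the tabulation-permutation scheme of the paper; the names `make_tables` and `simple_tabulation`, the 32-bit output width, and the seed are assumptions made for this example.

```python
import random

C = 8            # characters per key (assumption: 64-bit keys)
CHAR_BITS = 8    # bits per character, so each table has 2**8 entries
MASK = (1 << CHAR_BITS) - 1


def make_tables(seed, out_bits=32):
    """Build C tables, each holding 2**CHAR_BITS random out_bits-bit values."""
    rng = random.Random(seed)
    return [[rng.getrandbits(out_bits) for _ in range(1 << CHAR_BITS)]
            for _ in range(C)]


def simple_tabulation(tables, key):
    """Hash a 64-bit key by XORing one table lookup per character."""
    h = 0
    for i in range(C):
        char = (key >> (i * CHAR_BITS)) & MASK  # i-th 8-bit character
        h ^= tables[i][char]
    return h


# Throw 10,000 keys into m = 2 bins using the lowest output bit,
# mirroring the coin-tossing case mentioned in the abstract.
tables = make_tables(seed=1)
bins = [0, 0]
for key in range(10_000):
    bins[simple_tabulation(tables, key) & 1] += 1
```

The tables here occupy $8 \cdot 256$ entries in total, which is the $O(|Σ|)$ space the abstract refers to; the expected bin size of 5,000 is far above $|Σ| = 256$, the regime the earlier bounds did not cover.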
Submitted 10 August, 2020; v1 submitted 1 May, 2019;
originally announced May 2019.
-
Subsets and Supermajorities: Optimal Hashing-based Set Similarity Search
Authors:
Thomas Dybdahl Ahle,
Jakob Bæk Tejs Knudsen
Abstract:
We formulate and optimally solve a new generalized Set Similarity Search problem, which assumes the size of the database and query sets are known in advance. By creating polylog copies of our data-structure, we optimally solve any symmetric Approximate Set Similarity Search problem, including approximate versions of Subset Search, Maximum Inner Product Search (MIPS), Jaccard Similarity Search and Partial Match.
Our algorithm can be seen as a natural generalization of previous work on Set as well as Euclidean Similarity Search, but conceptually it differs by optimally exploiting the information present in the sets as well as their complements, and doing so asymmetrically between queries and stored sets. Doing so we improve upon the best previous work: MinHash [J. Discrete Algorithms 1998], SimHash [STOC 2002], Spherical LSF [SODA 2016, 2017] and Chosen Path [STOC 2017] by as much as a factor $n^{0.14}$ in both time and space; or in the near-constant time regime, in space, by an arbitrarily large polynomial factor.
Turning the geometric concept, based on Boolean supermajority functions, into a practical algorithm requires ideas from branching random walks on $\mathbb Z^2$, for which we give the first non-asymptotic near tight analysis.
Our lower bounds follow from new hypercontractive arguments, which can be seen as characterizing the exact family of similarity search problems for which supermajorities are optimal. The optimality holds among all hashing-based data structures in the random setting, and by reductions, for 1-cell and 2-cell probe data structures. As a side effect, we obtain new hypercontractive bounds on the directed noise operator $T^{p_1 \to p_2}_ρ$.
Submitted 20 April, 2020; v1 submitted 8 April, 2019;
originally announced April 2019.
-
Classifying Convex Bodies by their Contact and Intersection Graphs
Authors:
Anders Aamand,
Mikkel Abrahamsen,
Jakob Bæk Tejs Knudsen,
Peter Michael Reichstein Rasmussen
Abstract:
Suppose that $A$ is a convex body in the plane and that $A_1,\dots,A_n$ are translates of $A$. Such translates give rise to an intersection graph of $A$, $G=(V,E)$, with vertices $V=\{1,\dots,n\}$ and edges $E=\{uv\mid A_u\cap A_v\neq \emptyset\}$. The subgraph $G'=(V, E')$ satisfying that $E'\subset E$ is the set of edges $uv$ for which the interiors of $A_u$ and $A_v$ are disjoint is a unit distance graph of $A$. If furthermore $G'=G$, i.e., if the interiors of $A_u$ and $A_v$ are disjoint whenever $u\neq v$, then $G$ is a contact graph of $A$.
In this paper we study which pairs of convex bodies have the same contact, unit distance, or intersection graphs. We say that two convex bodies $A$ and $B$ are equivalent if there exists a linear transformation $B'$ of $B$ such that for any slope, the longest line segments with that slope contained in $A$ and $B'$, respectively, are equally long. For a broad class of convex bodies, including all strictly convex bodies and linear transformations of regular polygons, we show that the contact graphs of $A$ and $B$ are the same if and only if $A$ and $B$ are equivalent. We prove the same statement for unit distance and intersection graphs.
Submitted 5 February, 2019;
originally announced February 2019.
-
Power of $d$ Choices with Simple Tabulation
Authors:
Anders Aamand,
Mathias Bæk Tejs Knudsen,
Mikkel Thorup
Abstract:
Suppose that we are to place $m$ balls into $n$ bins sequentially using the $d$-choice paradigm: For each ball we are given a choice of $d$ bins, according to $d$ hash functions $h_1,\dots,h_d$ and we place the ball in the least loaded of these bins breaking ties arbitrarily. Our interest is in the number of balls in the fullest bin after all $m$ balls have been placed.
Azar et al. [STOC'94] proved that when $m=O(n)$ and when the hash functions are fully random the maximum load is at most $\frac{\lg \lg n }{\lg d}+O(1)$ whp (i.e. with probability $1-O(n^{-γ})$ for any choice of $γ$).
In this paper we suppose that the $h_1,\dots,h_d$ are simple tabulation hash functions. Generalising a result by Dahlgaard et al. [SODA'16] we show that for an arbitrary constant $d\geq 2$ the maximum load is $O(\lg \lg n)$ whp, and that the expected maximum load is at most $\frac{\lg \lg n}{\lg d}+O(1)$. We further show that by using a simple tie-breaking algorithm introduced by Vöcking [J.ACM'03] the expected maximum load drops to $\frac{\lg \lg n}{d\lg \varphi_d}+O(1)$ where $\varphi_d$ is the rate of growth of the $d$-ary Fibonacci numbers. Both of these expected bounds match those of the fully random setting.
The analysis by Dahlgaard et al. relies on a proof by Pătraşcu and Thorup [J.ACM'11] concerning the use of simple tabulation for cuckoo hashing. We need here a generalisation to $d>2$ hash functions, but the original proof is an 8-page tour de force of ad-hoc arguments that do not appear to generalise. Our main technical contribution is a shorter, simpler and more accessible proof of the result by Pătraşcu and Thorup, where the relevant parts generalise nicely to the analysis of $d$ choices.
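The $d$-choice placement rule described above can be sketched as a small simulation. This is only an illustration under stated assumptions: it draws the $d$ candidate bins from Python's pseudo-random generator (standing in for fully random hash functions, not simple tabulation), and the function name `max_load_d_choice` and its parameters are hypothetical.

```python
import random


def max_load_d_choice(n_bins, n_balls, d, seed=0):
    """Place balls one at a time into the least loaded of d random bins.

    Returns the number of balls in the fullest bin at the end.
    """
    rng = random.Random(seed)
    load = [0] * n_bins
    for _ in range(n_balls):
        choices = [rng.randrange(n_bins) for _ in range(d)]
        # min() keeps the first minimum it sees, which is one
        # arbitrary way of breaking ties between equally loaded bins.
        target = min(choices, key=lambda b: load[b])
        load[target] += 1
    return max(load)


n = 1 << 14
one_choice = max_load_d_choice(n, n, d=1)
two_choices = max_load_d_choice(n, n, d=2)
print(one_choice, two_choices)
```

Comparing `one_choice` with `two_choices` for $m = n$ gives a feel for the gap between a single choice and the $\frac{\lg \lg n}{\lg d}+O(1)$ regime; the exact values depend on the seed.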
Submitted 25 April, 2018;
originally announced April 2018.
-
Practical Hash Functions for Similarity Estimation and Dimensionality Reduction
Authors:
Søren Dahlgaard,
Mathias Bæk Tejs Knudsen,
Mikkel Thorup
Abstract:
Hashing is a basic tool for dimensionality reduction employed in several aspects of machine learning. However, the performance analysis is often carried out under the abstract assumption that a truly random unit cost hash function is used, without concern for which concrete hash function is employed. The concrete hash function may work fine on sufficiently random input. The question is if it can be trusted in the real world when faced with more structured input.
In this paper we focus on two prominent applications of hashing, namely similarity estimation with the one permutation hashing (OPH) scheme of Li et al. [NIPS'12] and feature hashing (FH) of Weinberger et al. [ICML'09], both of which have found numerous applications, e.g., in approximate near-neighbour search with LSH and large-scale classification with SVM.
We consider mixed tabulation hashing of Dahlgaard et al. [FOCS'15] which was proved to perform like a truly random hash function in many applications, including OPH. Here we first show improved concentration bounds for FH with truly random hashing and then argue that mixed tabulation performs similarly for sparse input. Our main contribution, however, is an experimental comparison of different hashing schemes when used inside FH, OPH, and LSH.
We find that mixed tabulation hashing is almost as fast as the multiply-mod-prime scheme ax+b mod p. Multiply-mod-prime is guaranteed to work well on sufficiently random data, but we demonstrate that in the above applications, it can lead to bias and poor concentration on both real-world and synthetic data. We also compare with the popular MurmurHash3, which has no proven guarantees. Mixed tabulation and MurmurHash3 both perform similarly to truly random hashing in our experiments. However, mixed tabulation is 40% faster than MurmurHash3, and it has the proven guarantee of good performance on all possible inputs.
Submitted 23 November, 2017;
originally announced November 2017.
-
Linear Hashing is Awesome
Authors:
Mathias Bæk Tejs Knudsen
Abstract:
We consider the hash function $h(x) = ((ax+b) \bmod p) \bmod n$ where $a,b$ are chosen uniformly at random from $\{0,1,\ldots,p-1\}$. We prove that when we use $h(x)$ in hashing with chaining to insert $n$ elements into a table of size $n$ the expected length of the longest chain is $\tilde{O}\!\left(n^{1/3}\right)$. The proof also generalises to give the same bound when we use the multiply-shift hash function by Dietzfelbinger et al. [Journal of Algorithms 1997].
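The hash function $h(x) = ((ax+b) \bmod p) \bmod n$ and the longest-chain quantity from the abstract can be sketched as follows. The choice of $p = 2^{61}-1$ (a Mersenne prime), the helper names `make_linear_hash` and `longest_chain`, and the key set are assumptions for this example; the sketch measures one sample and does not reproduce the paper's analysis.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime; assumed larger than every key


def make_linear_hash(n, rng):
    """Draw a, b uniformly from {0, ..., P-1} and return h(x)."""
    a = rng.randrange(P)
    b = rng.randrange(P)
    return lambda x: ((a * x + b) % P) % n


def longest_chain(keys, n, seed=0):
    """Insert keys into a chained table of size n; return the longest chain."""
    h = make_linear_hash(n, random.Random(seed))
    chains = [0] * n  # only chain lengths are tracked, not the keys
    for x in keys:
        chains[h(x)] += 1
    return max(chains)


n = 1000
print(longest_chain(range(n), n))
```

Averaging `longest_chain` over many seeds and adversarially structured key sets would be the natural way to compare against the $\tilde{O}(n^{1/3})$ bound on the expectation.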
Submitted 8 June, 2017;
originally announced June 2017.
-
The Entropy of Backwards Analysis
Authors:
Mathias Bæk Tejs Knudsen,
Mikkel Thorup
Abstract:
Backwards analysis, first popularized by Seidel, is often the simplest and most elegant way of analyzing a randomized algorithm. It applies to incremental algorithms where elements are added incrementally, following some random permutation, e.g., incremental Delaunay triangulation of a pointset, where points are added one by one, and where we always maintain the Delaunay triangulation of the points added thus far. For backwards analysis, we think of the permutation as generated backwards, implying that the $i$th point in the permutation is picked uniformly at random from the $i$ points not picked yet in the backwards direction. Backwards analysis has also been applied elegantly by Chan to the randomized linear time minimum spanning tree algorithm of Karger, Klein, and Tarjan.
The question considered in this paper is how much randomness we need in order to trust the expected bounds obtained using backwards analysis, exactly and approximately. For the exact case, it turns out that a random permutation works if and only if it is minwise, that is, for any given subset, each element has the same chance of being first. Minwise permutations are known to have $Θ(n)$ entropy, and this is then also what we need for exact backwards analysis.
However, when it comes to approximation, the two concepts diverge dramatically. To get backwards analysis to hold within a factor $α$, the random permutation needs entropy $Ω(n/α)$. This contrasts with minwise permutations, where it is known that a $1+\varepsilon$ approximation only needs $Θ(\log (n/\varepsilon))$ entropy. Our negative result for backwards analysis essentially shows that it is as abstract as any analysis based on full randomness.
Submitted 14 April, 2017;
originally announced April 2017.
-
Additive Spanners and Distance Oracles in Quadratic Time
Authors:
Mathias Bæk Tejs Knudsen
Abstract:
Let $G$ be an unweighted, undirected graph. An additive $k$-spanner of $G$ is a subgraph $H$ that approximates all distances between pairs of nodes up to an additive error of $+k$, that is, it satisfies $d_H(u,v) \le d_G(u,v)+k$ for all nodes $u,v$, where $d$ is the shortest path distance. We give a deterministic algorithm that constructs an additive $O\!\left(1\right)$-spanner with $O\!\left(n^{4/3}\right)$ edges in $O\!\left(n^2\right)$ time. This should be compared with the randomized Monte Carlo algorithm by Woodruff [ICALP 2010] giving an additive $6$-spanner with $O\!\left(n^{4/3}\log^3 n\right)$ edges in expected time $O\!\left(n^2\log^2 n\right)$.
An $(α,β)$-approximate distance oracle for $G$ is a data structure that supports the following distance queries between pairs of nodes in $G$. Given two nodes $u$, $v$ it can in constant time compute a distance estimate $\tilde{d}$ that satisfies $d \le \tilde{d} \le αd + β$ where $d$ is the distance between $u$ and $v$ in $G$. Sommer [ICALP 2016] gave a randomized Monte Carlo $(2,1)$-distance oracle of size $O\!\left(n^{5/3}\text{poly} \log n\right)$ in expected time $O\!\left(n^2\text{poly} \log n\right)$. As an application of the additive $O(1)$-spanner we improve the construction by Sommer [ICALP 2016] and give a Las Vegas $(2,1)$-distance oracle of size $O\!\left(n^{5/3}\right)$ in time $O\!\left(n^2\right)$. This also implies an algorithm that in $O\!\left(n^2\right)$ gives approximate distance for all pairs of nodes in $G$ improving on the $O\!\left(n^2 \log n\right)$ algorithm by Baswana and Kavitha [SICOMP 2010].
Submitted 14 April, 2017;
originally announced April 2017.
-
Maximal Unbordered Factors of Random Strings
Authors:
Patrick Hagge Cording,
Travis Gagie,
Mathias Bæk Tejs Knudsen,
Tomasz Kociumaka
Abstract:
A border of a string is a non-empty prefix of the string that is also a suffix of the string, and a string is unbordered if it has no border other than itself. Loptev, Kucherov, and Starikovskaya [CPM 2015] conjectured the following: If we pick a string of length $n$ from a fixed non-unary alphabet uniformly at random, then the expected maximum length of its unbordered factors is $n - O(1)$. We confirm this conjecture by proving that the expected value is, in fact, ${n - Θ(σ^{-1})}$, where $σ$ is the size of the alphabet. This immediately implies that we can find such a maximal unbordered factor in linear time on average. However, we go further and show that the optimum average-case running time is in $Ω(\sqrt{n}) \cap O (\sqrt{n \log_σn})$ due to analogous bounds by Czumaj and Gąsieniec [CPM 2000] for the problem of computing the shortest period of a uniformly random string.
Submitted 17 December, 2018; v1 submitted 14 April, 2017;
originally announced April 2017.
-
New Subquadratic Approximation Algorithms for the Girth
Authors:
Søren Dahlgaard,
Mathias Bæk Tejs Knudsen,
Morten Stöckel
Abstract:
We consider the problem of approximating the girth, $g$, of an unweighted and undirected graph $G=(V,E)$ with $n$ nodes and $m$ edges. A seminal result of Itai and Rodeh [SICOMP'78] gave an additive $1$-approximation in $O(n^2)$ time, and the main open question is thus how well we can do in subquadratic time.
In this paper we present two main results. The first is a $(1+\varepsilon,O(1))$-approximation in truly subquadratic time. Specifically, for any $k\ge 2$ our algorithm returns a cycle of length $2\lceil g/2\rceil+2\left\lceil\frac{g}{2(k-1)}\right\rceil$ in $\tilde{O}(n^{2-1/k})$ time. This generalizes the results of Lingas and Lundell [IPL'09] who showed it for the special case of $k=2$ and Roditty and Vassilevska Williams [SODA'12] who showed it for $k=3$. Our second result is to present an $O(1)$-approximation running in $O(n^{1+\varepsilon})$ time for any $\varepsilon > 0$. Prior to this work the fastest constant-factor approximation was the $\tilde{O}(n^{3/2})$ time $8/3$-approximation of Lingas and Lundell [IPL'09] using the algorithm corresponding to the special case $k=2$ of our first result.
Submitted 7 April, 2017;
originally announced April 2017.
-
Finding Even Cycles Faster via Capped k-Walks
Authors:
Søren Dahlgaard,
Mathias Bæk Tejs Knudsen,
Morten Stöckel
Abstract:
In this paper, we consider the problem of finding a cycle of length $2k$ (a $C_{2k}$) in an undirected graph $G$ with $n$ nodes and $m$ edges for constant $k\ge2$. A classic result by Bondy and Simonovits [J.Comb.Th.'74] implies that if $m \ge100k n^{1+1/k}$, then $G$ contains a $C_{2k}$, further implying that one needs to consider only graphs with $m = O(n^{1+1/k})$.
Previously the best known algorithms were an $O(n^2)$ algorithm due to Yuster and Zwick [J.Disc.Math'97] as well as an $O(m^{2-(1+\lceil k/2\rceil^{-1})/(k+1)})$ algorithm by Alon et al. [Algorithmica'97].
We present an algorithm that uses $O(m^{2k/(k+1)})$ time and finds a $C_{2k}$ if one exists. This bound is $O(n^2)$ exactly when $m=Θ(n^{1+1/k})$. For $4$-cycles our new bound coincides with Alon et al., while for every $k>2$ our bound yields a polynomial improvement in $m$.
Yuster and Zwick noted that it is "plausible to conjecture that $O(n^2)$ is the best possible bound in terms of $n$". We show "conditional optimality": if this hypothesis holds then our $O(m^{2k/(k+1)})$ algorithm is tight as well. Furthermore, a folklore reduction implies that no combinatorial algorithm can determine if a graph contains a $6$-cycle in time $O(m^{3/2-ε})$ for any $ε>0$ under the widely believed combinatorial BMM conjecture. Coupled with our main result, this gives tight bounds for finding $6$-cycles combinatorially and also separates the complexity of finding $4$- and $6$-cycles giving evidence that the exponent of $m$ in the running time should indeed increase with $k$.
The key ingredient in our algorithm is a new notion of capped $k$-walks, which are walks of length $k$ that visit only nodes according to a fixed ordering. Our main technical contribution is an involved analysis proving several properties of such walks which may be of independent interest.
Submitted 30 March, 2017;
originally announced March 2017.
-
Near-Optimal Induced Universal Graphs for Bounded Degree Graphs
Authors:
Mikkel Abrahamsen,
Stephen Alstrup,
Jacob Holm,
Mathias Bæk Tejs Knudsen,
Morten Stöckel
Abstract:
A graph $U$ is an induced universal graph for a family $F$ of graphs if every graph in $F$ is a vertex-induced subgraph of $U$. For the family of all undirected graphs on $n$ vertices Alstrup, Kaplan, Thorup, and Zwick [STOC 2015] give an induced universal graph with $O\!\left(2^{n/2}\right)$ vertices, matching a lower bound by Moon [Proc. Glasgow Math. Assoc. 1965].
Let $k= \lceil D/2 \rceil$. Improving asymptotically on previous results by Butler [Graphs and Combinatorics 2009] and Esperet, Arnaud and Ochem [IPL 2008], we give an induced universal graph with $O\!\left(\frac{k2^k}{k!}n^k \right)$ vertices for the family of graphs with $n$ vertices of maximum degree $D$. For constant $D$, Butler gives a lower bound of $Ω\!\left(n^{D/2}\right)$. For an odd constant $D\geq 3$, Esperet et al. and Alon and Capalbo [SODA 2008] give a graph with $O\!\left(n^{k-\frac{1}{D}}\right)$ vertices. Using their techniques for any (including constant) even values of $D$ gives asymptotically worse bounds than we present.
For large $D$, i.e. when $D = Ω\left(\log^3 n\right)$, the previous best upper bound was ${n\choose\lceil D/2\rceil} n^{O(1)}$ due to Adjiashvili and Rotbart [ICALP 2014]. We give upper and lower bounds showing that the size is ${\lfloor n/2\rfloor\choose\lfloor D/2 \rfloor}2^{\pm\tilde{O}\left(\sqrt{D}\right)}$. Hence the optimal size is $2^{\tilde{O}(D)}$ and our construction is within a factor of $2^{\tilde{O}\left(\sqrt{D}\right)}$ from this. The previous results were larger by at least a factor of $2^{Ω(D)}$.
As a part of the above, proving a conjecture by Esperet et al., we construct an induced universal graph with $2n-1$ vertices for the family of graphs with max degree $2$. In addition, we give results for acyclic graphs with max degree $2$ and cycle graphs. Our results imply the first labeling schemes that for any $D$ are at most $o(n)$ bits from optimal.
Submitted 21 July, 2016; v1 submitted 17 July, 2016;
originally announced July 2016.
-
Sublinear Distance Labeling
Authors:
Stephen Alstrup,
Søren Dahlgaard,
Mathias Bæk Tejs Knudsen,
Ely Porat
Abstract:
A distance labeling scheme labels the $n$ nodes of a graph with binary strings such that, given the labels of any two nodes, one can determine the distance in the graph between the two nodes by looking only at the labels. A $D$-preserving distance labeling scheme only returns precise distances between pairs of nodes that are at distance at least $D$ from each other. In this paper we consider distance labeling schemes for the classical case of unweighted graphs with both directed and undirected edges.
We present an $O(\frac{n}{D}\log^2 D)$ bit $D$-preserving distance labeling scheme, improving the previous bound by Bollobás et al. [SIAM J. Discrete Math. 2005]. We also give an almost matching lower bound of $Ω(\frac{n}{D})$. With our $D$-preserving distance labeling scheme as a building block, we additionally achieve the following results:
1. We present the first distance labeling scheme of size $o(n)$ for sparse graphs (and hence bounded degree graphs). This addresses an open problem by Gavoille et al. [J. Algo. 2004], thereby separating the complexity from that of distance labeling in general graphs, which requires $Ω(n)$ bits (Moon [Proc. of Glasgow Math. Association 1965]).
2. For approximate $r$-additive labeling schemes, which return distances within an additive error of $r$, we show a scheme of size $O\left ( \frac{n}{r} \cdot\frac{\operatorname{polylog} (r\log n)}{\log n} \right )$ for $r \ge 2$. This improves on the current best bound of $O\left(\frac{n}{r}\right)$ by Alstrup et al. [SODA 2016] for sub-polynomial $r$, and is a generalization of a result by Gawrychowski et al. [arXiv preprint 2015] who showed this for $r=2$.
Submitted 8 September, 2016; v1 submitted 9 July, 2015;
originally announced July 2015.
-
Longest Common Extensions in Sublinear Space
Authors:
Philip Bille,
Inge Li Gørtz,
Mathias Bæk Tejs Knudsen,
Moshe Lewenstein,
Hjalte Wedel Vildhøj
Abstract:
The longest common extension problem (LCE problem) is to construct a data structure for an input string $T$ of length $n$ that supports LCE$(i,j)$ queries. Such a query returns the length of the longest common prefix of the suffixes starting at positions $i$ and $j$ in $T$. This classic problem has a well-known solution that uses $O(n)$ space and $O(1)$ query time. In this paper we show that for any trade-off parameter $1 \leq τ\leq n$, the problem can be solved in $O(\frac{n}τ)$ space and $O(τ)$ query time. This significantly improves the previously best known time-space trade-offs, and almost matches the best known time-space product lower bound.
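To make the LCE$(i,j)$ query concrete, here is a naive baseline in Python that scans the two suffixes directly. It stores nothing beyond the string and may take time linear in $n$ per query, so it only illustrates what a query returns; it is not the $O(\frac{n}{τ})$-space structure of the paper. The function name `lce` and the use of 0-indexed positions are assumptions of this sketch.

```python
def lce(t, i, j):
    """Length of the longest common prefix of t[i:] and t[j:] (0-indexed)."""
    n = len(t)
    k = 0
    # Extend the match one character at a time until a mismatch
    # or until either suffix runs out.
    while i + k < n and j + k < n and t[i + k] == t[j + k]:
        k += 1
    return k


text = "abracadabra"
print(lce(text, 0, 7))  # suffixes "abracadabra" and "abra" share "abra"
```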
Submitted 10 April, 2015;
originally announced April 2015.
-
Optimal induced universal graphs and adjacency labeling for trees
Authors:
Stephen Alstrup,
Søren Dahlgaard,
Mathias Bæk Tejs Knudsen
Abstract:
We show that there exists a graph $G$ with $O(n)$ nodes, where any forest of $n$ nodes is a node-induced subgraph of $G$. Furthermore, for constant arboricity $k$, the result implies the existence of a graph with $O(n^k)$ nodes that contains all $n$-node graphs as node-induced subgraphs, matching a $Ω(n^k)$ lower bound. The lower bound and previously best upper bounds were presented in Alstrup and Rauhe (FOCS'02). Our upper bounds are obtained through a $\log_2 n +O(1)$ labeling scheme for adjacency queries in forests.
We thereby solve an open problem that has been raised repeatedly over decades, e.g., in Kannan, Naor, Rudich (STOC 1988), Chung (J. of Graph Theory 1990), Fraigniaud and Korman (SODA 2010).
Submitted 15 February, 2016; v1 submitted 9 April, 2015;
originally announced April 2015.
-
Quicksort, Largest Bucket, and Min-Wise Hashing with Limited Independence
Authors:
Mathias Bæk Tejs Knudsen,
Morten Stöckel
Abstract:
Randomized algorithms and data structures are often analyzed under the assumption of access to a perfect source of randomness. The most fundamental metric used to measure how "random" a hash function or a random number generator is, is its independence: a sequence of random variables is said to be $k$-independent if every variable is uniform and every size $k$ subset is independent. In this paper we consider three classic algorithms under limited independence. We provide new bounds for randomized quicksort, min-wise hashing and largest bucket size under limited independence. Our results can be summarized as follows.
- Randomized quicksort. When pivot elements are computed using a $5$-independent hash function, Karloff and Raghavan, J.ACM'93 showed $O(n \log n)$ expected worst-case running time for a special version of quicksort. We improve upon this, showing that the same running time is achieved with only $4$-independence.
- Min-wise hashing. For a set $A$, consider the probability of a particular element being mapped to the smallest hash value. It is known that $5$-independence implies the optimal probability $O(1/n)$. Broder et al., STOC'98 showed that $2$-independence implies it is $O(1 / \sqrt{|A|})$. We show a matching lower bound as well as new tight bounds for $3$- and $4$-independent hash functions.
- Largest bucket. We consider the case where $n$ balls are distributed to $n$ buckets using a $k$-independent hash function and analyze the largest bucket size. Alon et al., STOC'97 showed that there exists a $2$-independent hash function implying a bucket of size $\Omega(n^{1/2})$. We generalize the bound, providing a $k$-independent family of functions that imply size $\Omega(n^{1/k})$.
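As an illustration of what a $k$-independent hash function can look like, the sketch below uses the standard construction of a random polynomial of degree $k-1$ over a prime field, which is $k$-independent on keys in $[0, P)$. This is a textbook family, not necessarily the one analyzed in the paper; the name `make_k_independent_hash` and the choice of the Mersenne prime $2^{61}-1$ are assumptions for illustration.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime; keys are assumed to lie in [0, P)


def make_k_independent_hash(k: int, rng: random.Random):
    """Return h(x) = a_{k-1} x^{k-1} + ... + a_1 x + a_0 mod P with random coefficients."""
    coeffs = [rng.randrange(P) for _ in range(k)]

    def h(x: int) -> int:
        acc = 0
        # Horner's rule: evaluate the polynomial with k multiplications.
        for a in coeffs:
            acc = (acc * x + a) % P
        return acc

    return h


if __name__ == "__main__":
    h = make_k_independent_hash(4, random.Random(0))
    # Reducing modulo the number of buckets adds a small bias; fine for a sketch.
    print([h(x) % 10 for x in range(5)])
```

Evaluation cost grows linearly with $k$, which is one reason results that lower the required independence (for example from $5$ to $4$) matter in practice.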
Submitted 19 February, 2015;
originally announced February 2015.
-
Hashing for statistics over k-partitions
Authors:
Søren Dahlgaard,
Mathias Bæk Tejs Knudsen,
Eva Rotenberg,
Mikkel Thorup
Abstract:
In this paper we analyze a hash function for $k$-partitioning a set into bins, obtaining strong concentration bounds for standard algorithms combining statistics from each bin.
This generic method was originally introduced by Flajolet and Martin [FOCS'83] in order to save a factor $\Omega(k)$ of time per element over $k$ independent samples when estimating the number of distinct elements in a data stream. It was also used in the widely used HyperLogLog algorithm of Flajolet et al. [AOFA'97] and in large-scale machine learning by Li et al. [NIPS'12] for minwise estimation of set similarity.
The main issue with $k$-partitioning is that the contents of different bins may be highly correlated when using popular hash functions. This means that methods of analyzing the marginal distribution for a single bin do not apply. Here we show that a tabulation-based hash function, mixed tabulation, does yield strong concentration bounds on the most popular applications of $k$-partitioning similar to those we would get using a truly random hash function. The analysis is very involved and implies several new results of independent interest for both simple and double tabulation, e.g. a simple and efficient construction for invertible Bloom filters and uniform hashing on a given set.
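The $k$-partitioning idea can be sketched for set similarity as follows: each element is hashed once, part of the hash selects one of $k$ bins, and each bin keeps the minimum of the remaining hash value. This is a rough illustration only. It uses `hashlib.blake2b` as a stand-in hash (an assumption, not mixed tabulation), and the helper names `k_partition_sketch` and `estimate_jaccard` are hypothetical.

```python
import hashlib


def _hash64(item: str) -> int:
    """Stand-in 64-bit hash; the paper analyzes mixed tabulation instead."""
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def k_partition_sketch(items, k: int):
    """One hash evaluation per element: pick a bin, keep the minimum value per bin."""
    sketch = [None] * k
    for item in items:
        h = _hash64(item)
        bin_index, value = h % k, h // k
        if sketch[bin_index] is None or value < sketch[bin_index]:
            sketch[bin_index] = value
    return sketch


def estimate_jaccard(sketch_a, sketch_b) -> float:
    """Fraction of agreeing bins among bins that are non-empty in at least one sketch."""
    used = matches = 0
    for a, b in zip(sketch_a, sketch_b):
        if a is None and b is None:
            continue
        used += 1
        matches += a == b
    return matches / used if used else 0.0


if __name__ == "__main__":
    a = k_partition_sketch(["apple", "pear", "plum", "fig"], 16)
    b = k_partition_sketch(["apple", "pear", "plum", "kiwi"], 16)
    print(estimate_jaccard(a, b))
```

Because all bins are filled from a single hash function, their contents are not independent, which is exactly the correlation issue the abstract describes.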
Submitted 15 February, 2016; v1 submitted 26 November, 2014;
originally announced November 2014.
-
The Power of Two Choices with Simple Tabulation
Authors:
Søren Dahlgaard,
Mathias Bæk Tejs Knudsen,
Eva Rotenberg,
Mikkel Thorup
Abstract:
The power of two choices is a classic paradigm for load balancing when assigning $m$ balls to $n$ bins. When placing a ball, we pick two bins according to two hash functions $h_0$ and $h_1$, and place the ball in the least loaded bin. Assuming fully random hash functions, when $m=O(n)$, Azar et al. [STOC'94] proved that the maximum load is $\lg \lg n + O(1)$ with high probability.
In this paper, we investigate the power of two choices when the hash functions $h_0$ and $h_1$ are implemented with simple tabulation, which is a very efficient hash function evaluated in constant time. Following their analysis of Cuckoo hashing [J.ACM'12], Pǎtraşcu and Thorup claimed that the expected maximum load with simple tabulation is $O(\lg\lg n)$. This did not include any high probability guarantee, so the load balancing could not yet be trusted.
Here, we show that with simple tabulation, the maximum load is $O(\lg\lg n)$ with high probability, giving the first constant time hash function with this guarantee. We also give a concrete example where, unlike with fully random hashing, the maximum load is not bounded by $\lg \lg n + O(1)$, or even $(1+o(1))\lg \lg n$ with high probability. Finally, we show that the expected maximum load is $\lg \lg n + O(1)$, just like with fully random hashing.
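The setting can be simulated with a short sketch: simple tabulation splits a key into characters, looks each one up in its own random table, and XORs the results; each ball then goes to the less loaded of its two hashed bins. The parameters (4 characters of 8 bits, 32-bit table entries, ties broken toward $h_0$) and the function names are assumptions for illustration, not details taken from the paper.

```python
import random


def make_simple_tabulation(rng: random.Random, chars: int = 4, bits: int = 32):
    """Simple tabulation: one random table per 8-bit character, results XORed."""
    tables = [[rng.getrandbits(bits) for _ in range(256)] for _ in range(chars)]

    def h(x: int) -> int:
        out = 0
        for table in tables:
            out ^= table[x & 0xFF]  # look up the lowest remaining character
            x >>= 8
        return out

    return h


def two_choice_loads(keys, n: int, h0, h1):
    """Place each key in the less loaded of bins h0(key) % n and h1(key) % n."""
    loads = [0] * n
    for x in keys:
        a, b = h0(x) % n, h1(x) % n
        target = a if loads[a] <= loads[b] else b  # ties go to the h0 bin
        loads[target] += 1
    return loads


if __name__ == "__main__":
    rng = random.Random(0)
    h0, h1 = make_simple_tabulation(rng), make_simple_tabulation(rng)
    loads = two_choice_loads(range(1000), 1000, h0, h1)
    print(max(loads))
```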
Submitted 25 January, 2016; v1 submitted 25 July, 2014;
originally announced July 2014.
-
A simple and optimal ancestry labeling scheme for trees
Authors:
Søren Dahlgaard,
Mathias Bæk Tejs Knudsen,
Noy Rotbart
Abstract:
We present a $\lg n + 2 \lg \lg n+3$ ancestry labeling scheme for trees. The problem was first presented by Kannan et al. [STOC '88] along with a simple $2 \lg n$ solution. Motivated by applications to XML files, the label size was improved incrementally over the course of more than 20 years by a series of papers. The last, due to Fraigniaud and Korman [STOC '10], presented an asymptotically optimal $\lg n + 4 \lg \lg n+O(1)$ labeling scheme using non-trivial tree-decomposition techniques. By providing a framework generalizing interval based labeling schemes, we obtain a simple, yet asymptotically optimal solution to the problem. Furthermore, our labeling scheme is attained by a small modification of the original $2 \lg n$ solution.
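For intuition, the sketch below shows the classic interval idea behind a $2 \lg n$-bit ancestry label: number the nodes in DFS order and label each node with its own number and the largest number in its subtree. It is assumed here that this matches the spirit of the simple solution mentioned above; it is not the improved scheme of the paper, and the function names are hypothetical.

```python
def interval_labels(children: dict, root):
    """Label each node with (DFS number, largest DFS number in its subtree)."""
    labels, start = {}, {}
    counter = 0
    stack = [(root, False)]
    while stack:
        node, finished = stack.pop()
        if finished:
            # All descendants have been numbered; the last number used closes the interval.
            labels[node] = (start[node], counter - 1)
        else:
            start[node] = counter
            counter += 1
            stack.append((node, True))
            for child in reversed(children.get(node, [])):
                stack.append((child, False))
    return labels


def is_ancestor(label_u, label_v) -> bool:
    """True if u is an ancestor of v (or u == v), decided from the two labels alone."""
    return label_u[0] <= label_v[0] <= label_u[1]


if __name__ == "__main__":
    tree = {0: [1, 2], 1: [3, 4], 2: [5]}
    labels = interval_labels(tree, 0)
    print(is_ancestor(labels[1], labels[4]), is_ancestor(labels[1], labels[5]))
```

Each label stores two numbers below $n$, hence roughly $2 \lg n$ bits; the schemes discussed in the abstract reduce the cost of the second number.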
Submitted 26 April, 2015; v1 submitted 18 July, 2014;
originally announced July 2014.
-
Dynamic and Multi-functional Labeling Schemes
Authors:
Søren Dahlgaard,
Mathias Bæk Tejs Knudsen,
Noy Rotbart
Abstract:
We investigate labeling schemes supporting adjacency, ancestry, sibling, and connectivity queries in forests. In the course of more than 20 years, the existence of $\log n + O(\log \log n)$ labeling schemes supporting each of these functions was proven, with the most recent being ancestry [Fraigniaud and Korman, STOC '10]. Several multi-functional labeling schemes also enjoy lower or upper bounds of $\log n + \Omega(\log \log n)$ or $\log n + O(\log \log n)$ respectively. Notable examples are an upper bound of $\log n + 5\log \log n$ for adjacency+siblings and a lower bound of $\log n + \log \log n$ for each of the functions siblings, ancestry, and connectivity [Alstrup et al., SODA '03]. We improve the constants hidden in the $O$-notation. In particular we show a $\log n + 2\log \log n$ lower bound for connectivity+ancestry and connectivity+siblings, as well as an upper bound of $\log n + 3\log \log n + O(\log \log \log n)$ for connectivity+adjacency+siblings by altering existing methods.
In the context of dynamic labeling schemes it is known that ancestry requires $\Omega(n)$ bits [Cohen et al., PODS '02]. In contrast, we show upper and lower bounds on the label size for adjacency, siblings, and connectivity of $2\log n$ bits, and $3 \log n$ to support all three functions. There exist efficient adjacency labeling schemes for planar, bounded treewidth, bounded arboricity and interval graphs. In a dynamic setting, we show a lower bound of $\Omega(n)$ for each of those families.
Submitted 19 April, 2014;
originally announced April 2014.
-
Additive Spanners: A Simple Construction
Authors:
Mathias Bæk Tejs Knudsen
Abstract:
We consider additive spanners of unweighted undirected graphs. Let $G$ be a graph and $H$ a subgraph of $G$. The most naïve way to construct an additive $k$-spanner of $G$ is the following: As long as $H$ is not an additive $k$-spanner repeat: Find a pair $(u,v) \in H$ that violates the spanner-condition and a shortest path from $u$ to $v$ in $G$. Add the edges of this path to $H$.
We show that, with a very simple initial graph $H$, this naïve method gives additive $6$- and $2$-spanners of sizes matching the best known upper bounds. For additive $2$-spanners we start with $H=\emptyset$ and end with $O(n^{3/2})$ edges in the spanner. For additive $6$-spanners we start with $H$ containing $\lfloor n^{1/3} \rfloor$ arbitrary edges incident to each node and end with a spanner of size $O(n^{4/3})$.
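The naïve procedure described above can be sketched directly for small graphs: repeatedly find a pair whose distance in $H$ exceeds its distance in $G$ by more than $k$, and add a shortest $G$-path between them to $H$. This version recomputes BFS distances freely and is meant only to illustrate the loop, not to be efficient; the adjacency-set representation and the names `bfs` and `naive_additive_spanner` are assumptions.

```python
from collections import deque


def bfs(adj: dict, source):
    """Unweighted shortest-path distances and BFS parents from source."""
    dist, parent = {source: 0}, {source: None}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v], parent[v] = dist[u] + 1, u
                queue.append(v)
    return dist, parent


def naive_additive_spanner(adj_g: dict, k: int, initial_edges=()):
    """Grow H until dist_H(u, v) <= dist_G(u, v) + k for every connected pair."""
    adj_h = {u: set() for u in adj_g}
    for u, v in initial_edges:
        adj_h[u].add(v)
        adj_h[v].add(u)
    changed = True
    while changed:
        changed = False
        for u in adj_g:
            dist_g, parent_g = bfs(adj_g, u)
            dist_h, _ = bfs(adj_h, u)
            for v, d in dist_g.items():
                if dist_h.get(v, float("inf")) > d + k:
                    # Violating pair: add a shortest u-v path of G to H.
                    x = v
                    while parent_g[x] is not None:
                        p = parent_g[x]
                        adj_h[x].add(p)
                        adj_h[p].add(x)
                        x = p
                    changed = True
                    dist_h, _ = bfs(adj_h, u)
    return adj_h


if __name__ == "__main__":
    cycle = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
    spanner = naive_additive_spanner(cycle, 2)
    print(sum(len(nbrs) for nbrs in spanner.values()) // 2, "edges")
```

Every violation adds at least one new edge to $H$, so the loop terminates; the size bounds quoted in the abstract depend on the choice of initial graph and on the analysis in the paper, which this sketch does not reproduce.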
Submitted 23 November, 2014; v1 submitted 2 March, 2014;
originally announced March 2014.
-
Statistical methods for tissue array images - algorithmic scoring and co-training
Authors:
Donghui Yan,
Pei Wang,
Michael Linden,
Beatrice Knudsen,
Timothy Randolph
Abstract:
Recent advances in tissue microarray technology have allowed immunohistochemistry to become a powerful medium-to-high throughput analysis tool, particularly for the validation of diagnostic and prognostic biomarkers. However, as study size grows, the manual evaluation of these assays becomes a prohibitive limitation; it vastly reduces throughput and greatly increases variability and expense. We propose an algorithm - Tissue Array Co-Occurrence Matrix Analysis (TACOMA) - for quantifying cellular phenotypes based on textural regularity summarized by local inter-pixel relationships. The algorithm can be easily trained for any staining pattern, has no sensitive tuning parameters, and can report the salient pixels in an image that contribute to its score. Pathologists' input via informative training patches is an important aspect of the algorithm that allows training for any specific marker or cell type. With co-training, the error rate of TACOMA can be reduced substantially for a very small training sample (e.g., with size 30). We give theoretical insights into the success of co-training via thinning of the feature set in a high-dimensional setting when there is "sufficient" redundancy among the features. TACOMA is flexible, transparent and provides a scoring process that can be evaluated with clarity and confidence. In a study based on an estrogen receptor (ER) marker, we show that TACOMA is comparable to, or outperforms, pathologists' performance in terms of accuracy and repeatability.
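The "local inter-pixel relationships" mentioned above are typically summarized by a gray-level co-occurrence matrix, which counts how often a pixel of level $a$ has a neighbor of level $b$ at a fixed offset. The sketch below shows only this generic building block under the assumption that the image is already quantized to integer levels; it is not the TACOMA scoring algorithm, and the function name is hypothetical.

```python
def cooccurrence_matrix(image, levels: int, offset=(0, 1)):
    """Count pairs (image[r][c], image[r + dr][c + dc]) for a fixed offset (dr, dc)."""
    dr, dc = offset
    rows, cols = len(image), len(image[0])
    matrix = [[0] * levels for _ in range(levels)]
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            # Skip neighbors that fall outside the image.
            if 0 <= r2 < rows and 0 <= c2 < cols:
                matrix[image[r][c]][image[r2][c2]] += 1
    return matrix


if __name__ == "__main__":
    tiny = [[0, 0, 1],
            [1, 1, 0]]
    print(cooccurrence_matrix(tiny, levels=2))  # horizontal right-neighbor pairs
```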
Submitted 1 October, 2012; v1 submitted 31 January, 2011;
originally announced February 2011.