Search | arXiv e-print repository

From Images to Connections: Can DQN with GNNs learn the Strategic Game of Hex?

Authors: Yannik Keller, Jannis Blüml, Gopika Sudhakaran, Kristian Kersting

Abstract: The gameplay of strategic board games such as chess, Go and Hex is often characterized by combinatorial, relational structures -- capturing distinct interactions and non-local patterns -- and not just images. Nonetheless, most common self-play reinforcement learning (RL) approaches simply approximate policy and value functions using convolutional neural networks (CNN). A key feature of CNNs is the… ▽ More The gameplay of strategic board games such as chess, Go and Hex is often characterized by combinatorial, relational structures -- capturing distinct interactions and non-local patterns -- and not just images. Nonetheless, most common self-play reinforcement learning (RL) approaches simply approximate policy and value functions using convolutional neural networks (CNN). A key feature of CNNs is their relational inductive bias towards locality and translational invariance. In contrast, graph neural networks (GNN) can encode more complicated and distinct relational structures. Hence, we investigate the crucial question: Can GNNs, with their ability to encode complex connections, replace CNNs in self-play reinforcement learning? To this end, we do a comparison with Hex -- an abstract yet strategically rich board game -- serving as our experimental platform. Our findings reveal that GNNs excel at dealing with long range dependency situations in game states and are less prone to overfitting, but also showing a reduced proficiency in discerning local patterns. This suggests a potential paradigm shift, signaling the use of game-specific structures to reshape self-play reinforcement learning. △ Less

Submitted 22 November, 2023; originally announced November 2023.

arXiv:2308.14075 [pdf, other]

FaceCoresetNet: Differentiable Coresets for Face Set Recognition

Authors: Gil Shapira, Yosi Keller

Abstract: In set-based face recognition, we aim to compute the most discriminative descriptor from an unbounded set of images and videos showing a single person. A discriminative descriptor balances two policies when aggregating information from a given set. The first is a quality-based policy: emphasizing high-quality and down-weighting low-quality images. The second is a diversity-based policy: emphasizin… ▽ More In set-based face recognition, we aim to compute the most discriminative descriptor from an unbounded set of images and videos showing a single person. A discriminative descriptor balances two policies when aggregating information from a given set. The first is a quality-based policy: emphasizing high-quality and down-weighting low-quality images. The second is a diversity-based policy: emphasizing unique images in the set and down-weighting multiple occurrences of similar images as found in video clips which can overwhelm the set representation. This work frames face-set representation as a differentiable coreset selection problem. Our model learns how to select a small coreset of the input set that balances quality and diversity policies using a learned metric parameterized by the face quality, optimized end-to-end. The selection process is a differentiable farthest-point sampling (FPS) realized by approximating the non-differentiable Argmax operation with differentiable sampling from the Gumbel-Softmax distribution of distances. The small coreset is later used as queries in a self and cross-attention architecture to enrich the descriptor with information from the whole set. Our model is order-invariant and linear in the input set size. We set a new SOTA to set face verification on the IJB-B and IJB-C datasets. Our code is publicly available. △ Less

Submitted 13 December, 2023; v1 submitted 27 August, 2023; originally announced August 2023.

Comments: Accepted to AAAI-24

arXiv:2308.11783 [pdf, other]

Coarse-to-Fine Multi-Scene Pose Regression with Transformers

Authors: Yoli Shavit, Ron Ferens, Yosi Keller

Abstract: Absolute camera pose regressors estimate the position and orientation of a camera given the captured image alone. Typically, a convolutional backbone with a multi-layer perceptron (MLP) head is trained using images and pose labels to embed a single reference scene at a time. Recently, this scheme was extended to learn multiple scenes by replacing the MLP head with a set of fully connected layers.… ▽ More Absolute camera pose regressors estimate the position and orientation of a camera given the captured image alone. Typically, a convolutional backbone with a multi-layer perceptron (MLP) head is trained using images and pose labels to embed a single reference scene at a time. Recently, this scheme was extended to learn multiple scenes by replacing the MLP head with a set of fully connected layers. In this work, we propose to learn multi-scene absolute camera pose regression with Transformers, where encoders are used to aggregate activation maps with self-attention and decoders transform latent features and scenes encoding into pose predictions. This allows our model to focus on general features that are informative for localization, while embedding multiple scenes in parallel. We extend our previous MS-Transformer approach \cite{shavit2021learning} by introducing a mixed classification-regression architecture that improves the localization accuracy. Our method is evaluated on commonly benchmark indoor and outdoor datasets and has been shown to exceed both multi-scene and state-of-the-art single-scene absolute pose regressors. △ Less

Submitted 22 August, 2023; originally announced August 2023.

Comments: Accepted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). arXiv admin note: substantial text overlap with arXiv:2103.11468

arXiv:2305.07954 [pdf, other]

doi 10.1109/TIP.2016.2590832

Image Segmentation via Probabilistic Graph Matching

Authors: Ayelet Heimowitz, Yosi Keller

Abstract: This work presents an unsupervised and semi-automatic image segmentation approach where we formulate the segmentation as a inference problem based on unary and pairwise assignment probabilities computed using low-level image cues. The inference is solved via a probabilistic graph matching scheme, which allows rigorous incorporation of low level image cues and automatic tuning of parameters. The pr… ▽ More This work presents an unsupervised and semi-automatic image segmentation approach where we formulate the segmentation as a inference problem based on unary and pairwise assignment probabilities computed using low-level image cues. The inference is solved via a probabilistic graph matching scheme, which allows rigorous incorporation of low level image cues and automatic tuning of parameters. The proposed scheme is experimentally shown to compare favorably with contemporary semi-supervised and unsupervised image segmentation schemes, when applied to contemporary state-of-the-art image sets. △ Less

Submitted 13 May, 2023; originally announced May 2023.

Journal ref: IEEE Transactions on Image Processing, vol. 25, no. 10, pp. 4743-4752, Oct. 2016

arXiv:2305.02745 [pdf, other]

Age-Invariant Face Embedding using the Wasserstein Distance

Authors: Eran Dahan, Yosi Keller

Abstract: In this work, we study face verification in datasets where images of the same individuals exhibit significant age differences. This poses a major challenge for current face recognition and verification techniques. To address this issue, we propose a novel approach that utilizes multitask learning and a Wasserstein distance discriminator to disentangle age and identity embeddings of facial images.… ▽ More In this work, we study face verification in datasets where images of the same individuals exhibit significant age differences. This poses a major challenge for current face recognition and verification techniques. To address this issue, we propose a novel approach that utilizes multitask learning and a Wasserstein distance discriminator to disentangle age and identity embeddings of facial images. Our approach employs multitask learning with a Wasserstein distance discriminator that minimizes the mutual information between the age and identity embeddings by minimizing the Jensen-Shannon divergence. This improves the encoding of age and identity information in face images and enhances the performance of face verification in age-variant datasets. We evaluate the effectiveness of our approach using multiple age-variant face datasets and demonstrate its superiority over state-of-the-art methods in terms of face verification accuracy. △ Less

Submitted 4 May, 2023; originally announced May 2023.

arXiv:2304.11706 [pdf, other]

Deep Convolutional Tables: Deep Learning without Convolutions

Authors: Shay Dekel, Yosi Keller, Aharon Bar-Hillel

Abstract: We propose a novel formulation of deep networks that do not use dot-product neurons and rely on a hierarchy of voting tables instead, denoted as Convolutional Tables (CT), to enable accelerated CPU-based inference. Convolutional layers are the most time-consuming bottleneck in contemporary deep learning techniques, severely limiting their use in Internet of Things and CPU-based devices. The propos… ▽ More We propose a novel formulation of deep networks that do not use dot-product neurons and rely on a hierarchy of voting tables instead, denoted as Convolutional Tables (CT), to enable accelerated CPU-based inference. Convolutional layers are the most time-consuming bottleneck in contemporary deep learning techniques, severely limiting their use in Internet of Things and CPU-based devices. The proposed CT performs a fern operation at each image location: it encodes the location environment into a binary index and uses the index to retrieve the desired local output from a table. The results of multiple tables are combined to derive the final output. The computational complexity of a CT transformation is independent of the patch (filter) size and grows gracefully with the number of channels, outperforming comparable convolutional layers. It is shown to have a better capacity:compute ratio than dot-product neurons, and that deep CT networks exhibit a universal approximation property similar to neural networks. As the transformation involves computing discrete indices, we derive a soft relaxation and gradient-based approach for training the CT hierarchy. Deep CT networks have been experimentally shown to have accuracy comparable to that of CNNs of similar architectures. In the low compute regime, they enable an error:speed trade-off superior to alternative efficient CNN architectures. △ Less

Submitted 23 April, 2023; originally announced April 2023.

Comments: Accepted for publication. IEEE Transactions on Neural Networks and Learning Systems

arXiv:2303.02717 [pdf, other]

Learning to Localize in Unseen Scenes with Relative Pose Regressors

Authors: Ofer Idan, Yoli Shavit, Yosi Keller

Abstract: Relative pose regressors (RPRs) localize a camera by estimating its relative translation and rotation to a pose-labelled reference. Unlike scene coordinate regression and absolute pose regression methods, which learn absolute scene parameters, RPRs can (theoretically) localize in unseen environments, since they only learn the residual pose between camera pairs. In practice, however, the performanc… ▽ More Relative pose regressors (RPRs) localize a camera by estimating its relative translation and rotation to a pose-labelled reference. Unlike scene coordinate regression and absolute pose regression methods, which learn absolute scene parameters, RPRs can (theoretically) localize in unseen environments, since they only learn the residual pose between camera pairs. In practice, however, the performance of RPRs is significantly degraded in unseen scenes. In this work, we propose to aggregate paired feature maps into latent codes, instead of operating on global image descriptors, in order to improve the generalization of RPRs. We implement aggregation with concatenation, projection, and attention operations (Transformer Encoders) and learn to regress the relative pose parameters from the resulting latent codes. We further make use of a recently proposed continuous representation of rotation matrices, which alleviates the limitations of the commonly used quaternions. Compared to state-of-the-art RPRs, our model is shown to localize significantly better in unseen environments, across both indoor and outdoor benchmarks, while maintaining competitive performance in seen scenes. We validate our findings and architecture design through multiple ablations. Our code and pretrained models is publicly available. △ Less

Submitted 5 March, 2023; originally announced March 2023.

arXiv:2303.02615 [pdf, other]

Estimating Extreme 3D Image Rotation with Transformer Cross-Attention

Authors: Shay Dekel, Yosi Keller, Martin Cadik

Abstract: The estimation of large and extreme image rotation plays a key role in multiple computer vision domains, where the rotated images are related by a limited or a non-overlap** field of view. Contemporary approaches apply convolutional neural networks to compute a 4D correlation volume to estimate the relative rotation between image pairs. In this work, we propose a cross-attention-based approach t… ▽ More The estimation of large and extreme image rotation plays a key role in multiple computer vision domains, where the rotated images are related by a limited or a non-overlap** field of view. Contemporary approaches apply convolutional neural networks to compute a 4D correlation volume to estimate the relative rotation between image pairs. In this work, we propose a cross-attention-based approach that utilizes CNN feature maps and a Transformer-Encoder, to compute the cross-attention between the activation maps of the image pairs, which is shown to be an improved equivalent of the 4D correlation volume, used in previous works. In the suggested approach, higher attention scores are associated with image regions that encode visual cues of rotation. Our approach is end-to-end trainable and optimizes a simple regression loss. It is experimentally shown to outperform contemporary state-of-the-art schemes when applied to commonly used image rotation datasets and benchmarks, and establishes a new state-of-the-art accuracy on these datasets. We make our code publicly available. △ Less

Submitted 8 March, 2024; v1 submitted 5 March, 2023; originally announced March 2023.

Journal ref: CVPR 2024

arXiv:2303.02610 [pdf, other]

HyperPose: Camera Pose Localization using Attention Hypernetworks

Authors: Ron Ferens, Yosi Keller

Abstract: In this study, we propose the use of attention hypernetworks in camera pose localization. The dynamic nature of natural scenes, including changes in environment, perspective, and lighting, creates an inherent domain gap between the training and test sets that limits the accuracy of contemporary localization networks. To overcome this issue, we suggest a camera pose regressor that integrates a hype… ▽ More In this study, we propose the use of attention hypernetworks in camera pose localization. The dynamic nature of natural scenes, including changes in environment, perspective, and lighting, creates an inherent domain gap between the training and test sets that limits the accuracy of contemporary localization networks. To overcome this issue, we suggest a camera pose regressor that integrates a hypernetwork. During inference, the hypernetwork generates adaptive weights for the localization regression heads based on the input image, effectively reducing the domain gap. We also suggest the use of a Transformer-Encoder as the hypernetwork, instead of the common multilayer perceptron, to derive an attention hypernetwork. The proposed approach achieves superior results compared to state-of-the-art methods on contemporary datasets. To the best of our knowledge, this is the first instance of using hypernetworks in camera pose regression, as well as using Transformer-Encoders as hypernetworks. We make our code publicly available. △ Less

Submitted 5 March, 2023; originally announced March 2023.

arXiv:2210.03838 [pdf, other]

doi 10.1109/TPAMI.2021.3132163

Learning to embed semantic similarity for joint image-text retrieval

Authors: Noam Malali, Yosi Keller

Abstract: We present a deep learning approach for learning the joint semantic embeddings of images and captions in a Euclidean space, such that the semantic similarity is approximated by the L2 distances in the embedding space. For that, we introduce a metric learning scheme that utilizes multitask learning to learn the embedding of identical semantic concepts using a center loss. By introducing a different… ▽ More We present a deep learning approach for learning the joint semantic embeddings of images and captions in a Euclidean space, such that the semantic similarity is approximated by the L2 distances in the embedding space. For that, we introduce a metric learning scheme that utilizes multitask learning to learn the embedding of identical semantic concepts using a center loss. By introducing a differentiable quantization scheme into the end-to-end trainable network, we derive a semantic embedding of semantically similar concepts in Euclidean space. We also propose a novel metric learning formulation using an adaptive margin hinge loss, that is refined during the training phase. The proposed scheme was applied to the MS-COCO, Flicke30K and Flickr8K datasets, and was shown to compare favorably with contemporary state-of-the-art approaches. △ Less

Submitted 7 October, 2022; originally announced October 2022.

Comments: in IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023

arXiv:2207.05530 [pdf, other]

Camera Pose Auto-Encoders for Improving Pose Regression

Authors: Yoli Shavit, Yosi Keller

Abstract: Absolute pose regressor (APR) networks are trained to estimate the pose of the camera given a captured image. They compute latent image representations from which the camera position and orientation are regressed. APRs provide a different tradeoff between localization accuracy, runtime, and memory, compared to structure-based localization schemes that provide state-of-the-art accuracy. In this wor… ▽ More Absolute pose regressor (APR) networks are trained to estimate the pose of the camera given a captured image. They compute latent image representations from which the camera position and orientation are regressed. APRs provide a different tradeoff between localization accuracy, runtime, and memory, compared to structure-based localization schemes that provide state-of-the-art accuracy. In this work, we introduce Camera Pose Auto-Encoders (PAEs), multilayer perceptrons that are trained via a Teacher-Student approach to encode camera poses using APRs as their teachers. We show that the resulting latent pose representations can closely reproduce APR performance and demonstrate their effectiveness for related tasks. Specifically, we propose a light-weight test-time optimization in which the closest train poses are encoded and used to refine camera position estimation. This procedure achieves a new state-of-the-art position accuracy for APRs, on both the CambridgeLandmarks and 7Scenes benchmarks. We also show that train images can be reconstructed from the learned pose encoding, paving the way for integrating visual information from the train set at a low memory cost. Our code and pre-trained models are available at https://github.com/yolish/camera-pose-auto-encoders. △ Less

Submitted 12 July, 2022; originally announced July 2022.

Comments: Accepted to ECCV22

arXiv:2202.12972 [pdf, other]

FSGANv2: Improved Subject Agnostic Face Swap** and Reenactment

Authors: Yuval Nirkin, Yosi Keller, Tal Hassner

Abstract: We present Face Swap** GAN (FSGAN) for face swap** and reenactment. Unlike previous work, we offer a subject agnostic swap** scheme that can be applied to pairs of faces without requiring training on those faces. We derive a novel iterative deep learning--based approach for face reenactment which adjusts significant pose and expression variations that can be applied to a single image or a vi… ▽ More We present Face Swap** GAN (FSGAN) for face swap** and reenactment. Unlike previous work, we offer a subject agnostic swap** scheme that can be applied to pairs of faces without requiring training on those faces. We derive a novel iterative deep learning--based approach for face reenactment which adjusts significant pose and expression variations that can be applied to a single image or a video sequence. For video sequences, we introduce a continuous interpolation of the face views based on reenactment, Delaunay Triangulation, and barycentric coordinates. Occluded face regions are handled by a face completion network. Finally, we use a face blending network for seamless blending of the two faces while preserving the target skin color and lighting conditions. This network uses a novel Poisson blending loss combining Poisson optimization with a perceptual loss. We compare our approach to existing state-of-the-art systems and show our results to be both qualitatively and quantitatively superior. This work describes extensions of the FSGAN method, proposed in an earlier conference version of our work, as well as additional experiments and results. △ Less

Submitted 25 February, 2022; originally announced February 2022.

Comments: arXiv admin note: text overlap with arXiv:1908.05932

Journal ref: IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2022

arXiv:2201.06164 [pdf, other]

Synthesis and Reconstruction of Fingerprints using Generative Adversarial Networks

Authors: Rafael Bouzaglo, Yosi Keller

Abstract: Deep learning-based models have been shown to improve the accuracy of fingerprint recognition. While these algorithms show exceptional performance, they require large-scale fingerprint datasets for training and evaluation. In this work, we propose a novel fingerprint synthesis and reconstruction framework based on the StyleGan2 architecture, to address the privacy issues related to the acquisition… ▽ More Deep learning-based models have been shown to improve the accuracy of fingerprint recognition. While these algorithms show exceptional performance, they require large-scale fingerprint datasets for training and evaluation. In this work, we propose a novel fingerprint synthesis and reconstruction framework based on the StyleGan2 architecture, to address the privacy issues related to the acquisition of such large-scale datasets. We also derive a computational approach to modify the attributes of the generated fingerprint while preserving their identity. This allows synthesizing multiple different fingerprint images per finger. In particular, we introduce the SynFing synthetic fingerprints dataset consisting of 100K image pairs, each pair corresponding to the same identity. The proposed framework was experimentally shown to outperform contemporary state-of-the-art approaches for both fingerprint synthesis and reconstruction. It significantly improved the realism of the generated fingerprints, both visually and in terms of their ability to spoof fingerprint-based verification systems. The code and fingerprints dataset are publicly available: https://github.com/rafaelbou/fingerprint_generator. △ Less

Submitted 12 March, 2023; v1 submitted 16 January, 2022; originally announced January 2022.

arXiv:2106.01452 [pdf, other]

BERT-Defense: A Probabilistic Model Based on BERT to Combat Cognitively Inspired Orthographic Adversarial Attacks

Authors: Yannik Keller, Jan Mackensen, Steffen Eger

Abstract: Adversarial attacks expose important blind spots of deep learning systems. While word- and sentence-level attack scenarios mostly deal with finding semantic paraphrases of the input that fool NLP models, character-level attacks typically insert typos into the input stream. It is commonly thought that these are easier to defend via spelling correction modules. In this work, we show that both a stan… ▽ More Adversarial attacks expose important blind spots of deep learning systems. While word- and sentence-level attack scenarios mostly deal with finding semantic paraphrases of the input that fool NLP models, character-level attacks typically insert typos into the input stream. It is commonly thought that these are easier to defend via spelling correction modules. In this work, we show that both a standard spellchecker and the approach of Pruthi et al. (2019), which trains to defend against insertions, deletions and swaps, perform poorly on the character-level benchmark recently proposed in Eger and Benz (2020) which includes more challenging attacks such as visual and phonetic perturbations and missing word segmentations. In contrast, we show that an untrained iterative approach which combines context-independent character-level information with context-dependent information from BERT's masked language modeling can perform on par with human crowd-workers from Amazon Mechanical Turk (AMT) supervised via 3-shot learning. △ Less

Submitted 2 June, 2021; originally announced June 2021.

Comments: Findings of ACL 2021

arXiv:2104.10741 [pdf, other]

doi 10.1145/3411764.3445140

AdaptiFont: Increasing Individuals' Reading Speed with a Generative Font Model and Bayesian Optimization

Authors: Florian Kadner, Yannik Keller, Constantin A. Rothkopf

Abstract: Digital text has become one of the primary ways of exchanging knowledge, but text needs to be rendered to a screen to be read. We present AdaptiFont, a human-in-the-loop system that is aimed at interactively increasing readability of text displayed on a monitor. To this end, we first learn a generative font space with non-negative matrix factorization from a set of classic fonts. In this space we… ▽ More Digital text has become one of the primary ways of exchanging knowledge, but text needs to be rendered to a screen to be read. We present AdaptiFont, a human-in-the-loop system that is aimed at interactively increasing readability of text displayed on a monitor. To this end, we first learn a generative font space with non-negative matrix factorization from a set of classic fonts. In this space we generate new true-type-fonts through active learning, render texts with the new font, and measure individual users' reading speed. Bayesian optimization sequentially generates new fonts on the fly to progressively increase individuals' reading speed. The results of a user study show that this adaptive font generation system finds regions in the font space corresponding to high reading speeds, that these fonts significantly increase participants' reading speed, and that the found fonts are significantly different across individual readers. △ Less

Submitted 21 April, 2021; originally announced April 2021.

Comments: 18 pages, 11 figures

Journal ref: In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (CHI '21). Association for Computing Machinery, New York, NY, USA, Article 585, 1-11

arXiv:2103.11477 [pdf, other]

Paying Attention to Activation Maps in Camera Pose Regression

Authors: Yoli Shavit, Ron Ferens, Yosi Keller

Abstract: Camera pose regression methods apply a single forward pass to the query image to estimate the camera pose. As such, they offer a fast and light-weight alternative to traditional localization schemes based on image retrieval. Pose regression approaches simultaneously learn two regression tasks, aiming to jointly estimate the camera position and orientation using a single embedding vector computed b… ▽ More Camera pose regression methods apply a single forward pass to the query image to estimate the camera pose. As such, they offer a fast and light-weight alternative to traditional localization schemes based on image retrieval. Pose regression approaches simultaneously learn two regression tasks, aiming to jointly estimate the camera position and orientation using a single embedding vector computed by a convolutional backbone. We propose an attention-based approach for pose regression, where the convolutional activation maps are used as sequential inputs. Transformers are applied to encode the sequential activation maps as latent vectors, used for camera pose regression. This allows us to pay attention to spatially-varying deep features. Using two Transformer heads, we separately focus on the features for camera position and orientation, based on how informative they are per task. Our proposed approach is shown to compare favorably to contemporary pose regressors schemes and achieves state-of-the-art accuracy across multiple outdoor and indoor benchmarks. In particular, to the best of our knowledge, our approach is the only method to attain sub-meter average accuracy across outdoor scenes. We make our code publicly available from here. △ Less

Submitted 11 April, 2021; v1 submitted 21 March, 2021; originally announced March 2021.

arXiv:2103.11468 [pdf, other]

Learning Multi-Scene Absolute Pose Regression with Transformers

Authors: Yoli Shavit, Ron Ferens, Yosi Keller

Abstract: Absolute camera pose regressors estimate the position and orientation of a camera from the captured image alone. Typically, a convolutional backbone with a multi-layer perceptron head is trained with images and pose labels to embed a single reference scene at a time. Recently, this scheme was extended for learning multiple scenes by replacing the MLP head with a set of fully connected layers. In t… ▽ More Absolute camera pose regressors estimate the position and orientation of a camera from the captured image alone. Typically, a convolutional backbone with a multi-layer perceptron head is trained with images and pose labels to embed a single reference scene at a time. Recently, this scheme was extended for learning multiple scenes by replacing the MLP head with a set of fully connected layers. In this work, we propose to learn multi-scene absolute camera pose regression with Transformers, where encoders are used to aggregate activation maps with self-attention and decoders transform latent features and scenes encoding into candidate pose predictions. This mechanism allows our model to focus on general features that are informative for localization while embedding multiple scenes in parallel. We evaluate our method on commonly benchmarked indoor and outdoor datasets and show that it surpasses both multi-scene and state-of-the-art single-scene absolute pose regressors. We make our code publicly available from https://github.com/yolish/multi-scene-pose-transformer. △ Less

Submitted 26 July, 2021; v1 submitted 21 March, 2021; originally announced March 2021.

arXiv:2103.11247 [pdf, other]

Attention-Based Multimodal Image Matching

Authors: Aviad Moreshet, Yosi Keller

Abstract: We propose an attention-based approach for multimodal image patch matching using a Transformer encoder attending to the feature maps of a multiscale Siamese CNN. Our encoder is shown to efficiently aggregate multiscale image embeddings while emphasizing task-specific appearance-invariant image cues. We also introduce an attention-residual architecture, using a residual connection bypassing the enc… ▽ More We propose an attention-based approach for multimodal image patch matching using a Transformer encoder attending to the feature maps of a multiscale Siamese CNN. Our encoder is shown to efficiently aggregate multiscale image embeddings while emphasizing task-specific appearance-invariant image cues. We also introduce an attention-residual architecture, using a residual connection bypassing the encoder. This additional learning signal facilitates end-to-end training from scratch. Our approach is experimentally shown to achieve new state-of-the-art accuracy on both multimodal and single modality benchmarks, illustrating its general applicability. To the best of our knowledge, this is the first successful implementation of the Transformer encoder architecture to the multimodal image patch matching task. △ Less

Submitted 24 September, 2023; v1 submitted 20 March, 2021; originally announced March 2021.

arXiv:2103.09882 [pdf, other]

Hierarchical Attention-based Age Estimation and Bias Estimation

Authors: Shakediel Hiba, Yosi Keller

Abstract: In this work we propose a novel deep-learning approach for age estimation based on face images. We first introduce a dual image augmentation-aggregation approach based on attention. This allows the network to jointly utilize multiple face image augmentations whose embeddings are aggregated by a Transformer-Encoder. The resulting aggregated embedding is shown to better encode the face image attribu… ▽ More In this work we propose a novel deep-learning approach for age estimation based on face images. We first introduce a dual image augmentation-aggregation approach based on attention. This allows the network to jointly utilize multiple face image augmentations whose embeddings are aggregated by a Transformer-Encoder. The resulting aggregated embedding is shown to better encode the face image attributes. We then propose a probabilistic hierarchical regression framework that combines a discrete probabilistic estimate of age labels, with a corresponding ensemble of regressors. Each regressor is particularly adapted and trained to refine the probabilistic estimate over a range of ages. Our scheme is shown to outperform contemporary schemes and provide a new state-of-the-art age estimation accuracy, when applied to the MORPH II dataset for age estimation. Last, we introduce a bias analysis of state-of-the-art age estimation results. △ Less

Submitted 27 September, 2023; v1 submitted 17 March, 2021; originally announced March 2021.

Comments: 11 pages, 7 figures

arXiv:2009.05871 [pdf, other]

A Unified Approach to Kinship Verification

Authors: Eran Dahan, Yosi Keller

Abstract: In this work, we propose a deep learning-based approach for kin verification using a unified multi-task learning scheme where all kinship classes are jointly learned. This allows us to better utilize small training sets that are typical of kin verification. We introduce a novel approach for fusing the embeddings of kin images, to avoid overfitting, which is a common issue in training such networks… ▽ More In this work, we propose a deep learning-based approach for kin verification using a unified multi-task learning scheme where all kinship classes are jointly learned. This allows us to better utilize small training sets that are typical of kin verification. We introduce a novel approach for fusing the embeddings of kin images, to avoid overfitting, which is a common issue in training such networks. An adaptive sampling scheme is derived for the training set images to resolve the inherent imbalance in kin verification datasets. A thorough ablation study exemplifies the effectivity of our approach, which is experimentally shown to outperform contemporary state-of-the-art kin verification results when applied to the Families In the Wild, FG2018, and FG2020 datasets. △ Less

Submitted 12 September, 2020; originally announced September 2020.

arXiv:2008.12262 [pdf, other]

DeepFake Detection Based on the Discrepancy Between the Face and its Context

Authors: Yuval Nirkin, Lior Wolf, Yosi Keller, Tal Hassner

Abstract: We propose a method for detecting face swap** and other identity manipulations in single images. Face swap** methods, such as DeepFake, manipulate the face region, aiming to adjust the face to the appearance of its context, while leaving the context unchanged. We show that this modus operandi produces discrepancies between the two regions. These discrepancies offer exploitable telltale signs o… ▽ More We propose a method for detecting face swap** and other identity manipulations in single images. Face swap** methods, such as DeepFake, manipulate the face region, aiming to adjust the face to the appearance of its context, while leaving the context unchanged. We show that this modus operandi produces discrepancies between the two regions. These discrepancies offer exploitable telltale signs of manipulation. Our approach involves two networks: (i) a face identification network that considers the face region bounded by a tight semantic segmentation, and (ii) a context recognition network that considers the face context (e.g., hair, ears, neck). We describe a method which uses the recognition signals from our two networks to detect such discrepancies, providing a complementary detection signal that improves conventional real vs. fake classifiers commonly used for detecting fake images. Our method achieves state of the art results on the FaceForensics++, Celeb-DF-v2, and DFDC benchmarks for face manipulation detection, and even generalizes to detect fakes produced by unseen methods. △ Less

Submitted 27 August, 2020; originally announced August 2020.

arXiv:1908.05932 [pdf, other]

FSGAN: Subject Agnostic Face Swap** and Reenactment

Authors: Yuval Nirkin, Yosi Keller, Tal Hassner

Abstract: We present Face Swap** GAN (FSGAN) for face swap** and reenactment. Unlike previous work, FSGAN is subject agnostic and can be applied to pairs of faces without requiring training on those faces. To this end, we describe a number of technical contributions. We derive a novel recurrent neural network (RNN)-based approach for face reenactment which adjusts for both pose and expression variations… ▽ More We present Face Swap** GAN (FSGAN) for face swap** and reenactment. Unlike previous work, FSGAN is subject agnostic and can be applied to pairs of faces without requiring training on those faces. To this end, we describe a number of technical contributions. We derive a novel recurrent neural network (RNN)-based approach for face reenactment which adjusts for both pose and expression variations and can be applied to a single image or a video sequence. For video sequences, we introduce continuous interpolation of the face views based on reenactment, Delaunay Triangulation, and barycentric coordinates. Occluded face regions are handled by a face completion network. Finally, we use a face blending network for seamless blending of the two faces while preserving target skin color and lighting conditions. This network uses a novel Poisson blending loss which combines Poisson optimization with perceptual loss. We compare our approach to existing state-of-the-art systems and show our results to be both qualitatively and quantitatively superior. △ Less

Submitted 16 August, 2019; originally announced August 2019.

Comments: 2019 IEEE/CVF International Conference on Computer Vision (ICCV)

arXiv:1811.01290 [pdf, ps, other]

Auto-ML Deep Learning for Rashi Scripts OCR

Authors: Shahar Mahpod, Yosi Keller

Abstract: In this work we propose an OCR scheme for manuscripts printed in Rashi font that is an ancient Hebrew font and corresponding dialect used in religious Jewish literature, for more than 600 years. The proposed scheme utilizes a convolution neural network (CNN) for visual inference and Long-Short Term Memory (LSTM) to learn the Rashi scripts dialect. In particular, we derive an AutoML scheme to optim… ▽ More In this work we propose an OCR scheme for manuscripts printed in Rashi font that is an ancient Hebrew font and corresponding dialect used in religious Jewish literature, for more than 600 years. The proposed scheme utilizes a convolution neural network (CNN) for visual inference and Long-Short Term Memory (LSTM) to learn the Rashi scripts dialect. In particular, we derive an AutoML scheme to optimize the CNN architecture, and a book-specific CNN training to improve the OCR accuracy. The proposed scheme achieved an accuracy of more than 99.8% using a dataset of more than 3M annotated letters from the Responsa Project dataset. △ Less

Submitted 22 February, 2020; v1 submitted 3 November, 2018; originally announced November 2018.

Comments: The paper is under consideration at Pattern Recognition Letters

arXiv:1810.12941 [pdf, other]

Joint detection and matching of feature points in multimodal images

Authors: Elad Ben Baruch, Yosi Keller

Abstract: In this work, we propose a novel Convolutional Neural Network (CNN) architecture for the joint detection and matching of feature points in images acquired by different sensors using a single forward pass. The resulting feature detector is tightly coupled with the feature descriptor, in contrast to classical approaches (SIFT, etc.), where the detection phase precedes and differs from computing the… ▽ More In this work, we propose a novel Convolutional Neural Network (CNN) architecture for the joint detection and matching of feature points in images acquired by different sensors using a single forward pass. The resulting feature detector is tightly coupled with the feature descriptor, in contrast to classical approaches (SIFT, etc.), where the detection phase precedes and differs from computing the descriptor. Our approach utilizes two CNN subnetworks, the first being a Siamese CNN and the second, consisting of dual non-weight-sharing CNNs. This allows simultaneous processing and fusion of the joint and disjoint cues in the multimodal image patches. The proposed approach is experimentally shown to outperform contemporary state-of-the-art schemes when applied to multiple datasets of multimodal images. It is also shown to provide repeatable feature points detections across multisensor images, outperforming state-of-the-art detectors. To the best of our knowledge, it is the first unified approach for the detection and matching of such images. △ Less

Submitted 16 June, 2021; v1 submitted 30 October, 2018; originally announced October 2018.

arXiv:1809.08493 [pdf, ps, other]

SelfKin: Self Adjusted Deep Model For Kinship Verification

Authors: Eran Dahan, Yosi Keller

Abstract: One of the unsolved challenges in the field of biometrics and face recognition is Kinship Verification. This problem aims to understand if two people are family-related and how (sisters, brothers, etc.) Solving this problem can give rise to varied tasks and applications. In the area of homeland security (HLS) it is crucial to auto-detect if the person questioned is related to a wanted suspect, In… ▽ More One of the unsolved challenges in the field of biometrics and face recognition is Kinship Verification. This problem aims to understand if two people are family-related and how (sisters, brothers, etc.) Solving this problem can give rise to varied tasks and applications. In the area of homeland security (HLS) it is crucial to auto-detect if the person questioned is related to a wanted suspect, In the field of biometrics, kinship-verification can help to discriminate between families by photos and in the field of predicting or fashion it can help to predict an older or younger model of people faces. Lately, and with the advanced deep learning technology, this problem has gained focus from the research community in matters of data and research. In this article, we propose using a Deep Learning approach for solving the Kinship-Verification problem. Further, we offer a novel self-learning deep model, which learns the essential features from different faces. We show that our model wins the Recognize Families In the Wild(RFIW2018,FG2018) challenge and obtains state-of-the-art results. Moreover, we show that our proposed model can reduce the size of the network by half without loss in performance. △ Less

Submitted 22 September, 2018; originally announced September 2018.

arXiv:1805.01760 [pdf, other]

doi 10.1016/j.cviu.2021.103171

Facial Landmarks Localization using Cascaded Neural Networks

Authors: Shahar Mahpod, Rig Das, Emanuele Maiorana, Yosi Keller, Patrizio Campisi

Abstract: The accurate localization of facial landmarks is at the core of face analysis tasks, such as face recognition and facial expression analysis, to name a few. In this work, we propose a novel localization approach based on a deep learning architecture that utilizes cascaded subnetworks with convolutional neural network units. The cascaded units of the first subnetwork estimate heatmap-based encoding… ▽ More The accurate localization of facial landmarks is at the core of face analysis tasks, such as face recognition and facial expression analysis, to name a few. In this work, we propose a novel localization approach based on a deep learning architecture that utilizes cascaded subnetworks with convolutional neural network units. The cascaded units of the first subnetwork estimate heatmap-based encodings of the landmarks locations, while the cascaded units of the second subnetwork receive as input the output of the corresponding heatmap estimation units, and refine them through regression. The proposed scheme is experimentally shown to compare favorably with contemporary state-of-the-art schemes, especially when applied to images depicting challenging localization conditions. △ Less

Submitted 19 July, 2021; v1 submitted 3 May, 2018; originally announced May 2018.

arXiv:1803.09420 [pdf, other]

doi 10.1109/ICPR48806.2021.9413325

Multi-scale Processing of Noisy Images using Edge Preservation Losses

Authors: Nati Ofir, Yosi Keller

Abstract: Noisy images processing is a fundamental task of computer vision. The first example is the detection of faint edges in noisy images, a challenging problem studied in the last decades. A recent study introduced a fast method to detect faint edges in the highest accuracy among all the existing approaches. Their complexity is nearly linear in the image's pixels and their runtime is seconds for a nois… ▽ More Noisy images processing is a fundamental task of computer vision. The first example is the detection of faint edges in noisy images, a challenging problem studied in the last decades. A recent study introduced a fast method to detect faint edges in the highest accuracy among all the existing approaches. Their complexity is nearly linear in the image's pixels and their runtime is seconds for a noisy image. Their approach utilizes a multi-scale binary partitioning of the image. By utilizing the multi-scale U-net architecture, we show in this paper that their method can be dramatically improved in both aspects of run time and accuracy. By training the network on a dataset of binary images, we developed an approach for faint edge detection that works in a linear complexity. Our runtime of a noisy image is milliseconds on a GPU. Even though our method is orders of magnitude faster, we still achieve higher accuracy of detection under many challenging scenarios. In addition, we show that our approach to performing multi-scale preprocessing of noisy images using U-net improves the ability to perform other vision tasks under the presence of noise. We prove it on the problems of noisy objects classification and classical image denoising. We show that multi-scale denoising can be carried out by a novel edge preservation loss. As our experiments show, we achieve high-quality results in the three aspects of faint edge detection, noisy image classification and natural image denoising. △ Less

Submitted 21 March, 2019; v1 submitted 26 March, 2018; originally announced March 2018.

arXiv:1801.05171 [pdf, other]

doi 10.1109/ICIP.2018.8451640

Deep Multi-Spectral Registration Using Invariant Descriptor Learning

Authors: Nati Ofir, Shai Silberstein, Hila Levi, Dani Rozenbaum, Yosi Keller, Sharon Duvdevani Bar

Abstract: In this paper, we introduce a novel deep-learning method to align cross-spectral images. Our approach relies on a learned descriptor which is invariant to different spectra. Multi-modal images of the same scene capture different signals and therefore their registration is challenging and it is not solved by classic approaches. To that end, we developed a feature-based approach that solves the visi… ▽ More In this paper, we introduce a novel deep-learning method to align cross-spectral images. Our approach relies on a learned descriptor which is invariant to different spectra. Multi-modal images of the same scene capture different signals and therefore their registration is challenging and it is not solved by classic approaches. To that end, we developed a feature-based approach that solves the visible (VIS) to Near-Infra-Red (NIR) registration problem. Our algorithm detects corners by Harris and matches them by a patch-metric learned on top of CIFAR-10 network descriptor. As our experiments demonstrate we achieve a high-quality alignment of cross-spectral images with a sub-pixel accuracy. Comparing to other existing methods, our approach is more accurate in the task of VIS to NIR registration. △ Less

Submitted 23 May, 2018; v1 submitted 16 January, 2018; originally announced January 2018.

arXiv:1711.01543 [pdf, other]

doi 10.1109/ICIP.2018.8451650

Registration and Fusion of Multi-Spectral Images Using a Novel Edge Descriptor

Authors: Nati Ofir, Shai Silberstein, Dani Rozenbaum, Yosi Keller, Sharon Duvdevani Bar

Abstract: In this paper we introduce a fully end-to-end approach for multi-spectral image registration and fusion. Our method for fusion combines images from different spectral channels into a single fused image by different approaches for low and high frequency signals. A prerequisite of fusion is a stage of geometric alignment between the spectral bands, commonly referred to as registration. Unfortunately… ▽ More In this paper we introduce a fully end-to-end approach for multi-spectral image registration and fusion. Our method for fusion combines images from different spectral channels into a single fused image by different approaches for low and high frequency signals. A prerequisite of fusion is a stage of geometric alignment between the spectral bands, commonly referred to as registration. Unfortunately, common methods for image registration of a single spectral channel do not yield reasonable results on images from different modalities. For that end, we introduce a new algorithm for multi-spectral image registration, based on a novel edge descriptor of feature points. Our method achieves an accurate alignment of a level that allows us to further fuse the images. As our experiments show, we produce a high quality of multi-spectral image registration and fusion under many challenging scenarios. △ Less

Submitted 28 May, 2018; v1 submitted 5 November, 2017; originally announced November 2017.

arXiv:1412.2067 [pdf, ps, other]

An algorithm for improving Non-Local Means operators via low-rank approximation

Authors: Victor May, Yosi Keller, Nir Sharon, Yoel Shkolnisky

Abstract: We present a method for improving a Non Local Means operator by computing its low-rank approximation. The low-rank operator is constructed by applying a filter to the spectrum of the original Non Local Means operator. This results in an operator which is less sensitive to noise while preserving important properties of the original operator. The method is efficiently implemented based on Chebyshev… ▽ More We present a method for improving a Non Local Means operator by computing its low-rank approximation. The low-rank operator is constructed by applying a filter to the spectrum of the original Non Local Means operator. This results in an operator which is less sensitive to noise while preserving important properties of the original operator. The method is efficiently implemented based on Chebyshev polynomials and is demonstrated on the application of natural images denoising. For this application, we provide a comprehensive comparison of our method with leading denoising methods. △ Less

Submitted 20 November, 2014; originally announced December 2014.

arXiv:1102.4258 [pdf, other]

SHREC 2011: robust feature detection and description benchmark

Authors: E. Boyer, A. M. Bronstein, M. M. Bronstein, B. Bustos, T. Darom, R. Horaud, I. Hotz, Y. Keller, J. Keustermans, A. Kovnatsky, R. Litman, J. Reininghaus, I. Sipiran, D. Smeets, P. Suetens, D. Vandermeulen, A. Zaharescu, V. Zobel

Abstract: Feature-based approaches have recently become very popular in computer vision and image analysis applications, and are becoming a promising direction in shape retrieval. SHREC'11 robust feature detection and description benchmark simulates the feature detection and description stages of feature-based shape retrieval algorithms. The benchmark tests the performance of shape feature detectors and des… ▽ More Feature-based approaches have recently become very popular in computer vision and image analysis applications, and are becoming a promising direction in shape retrieval. SHREC'11 robust feature detection and description benchmark simulates the feature detection and description stages of feature-based shape retrieval algorithms. The benchmark tests the performance of shape feature detectors and descriptors under a wide variety of transformations. The benchmark allows evaluating how algorithms cope with certain classes of transformations and strength of the transformations that can be dealt with. The present paper is a report of the SHREC'11 robust feature detection and description benchmark results. △ Less

Submitted 21 February, 2011; originally announced February 2011.

Comments: This is a full version of the SHREC'11 report published in 3DOR

Showing 1–31 of 31 results for author: Keller, Y