-
Coarse-to-Fine Multi-Scene Pose Regression with Transformers
Authors:
Yoli Shavit,
Ron Ferens,
Yosi Keller
Abstract:
Absolute camera pose regressors estimate the position and orientation of a camera given the captured image alone. Typically, a convolutional backbone with a multi-layer perceptron (MLP) head is trained using images and pose labels to embed a single reference scene at a time. Recently, this scheme was extended to learn multiple scenes by replacing the MLP head with a set of fully connected layers. In this work, we propose to learn multi-scene absolute camera pose regression with Transformers, where encoders are used to aggregate activation maps with self-attention and decoders transform latent features and scene encodings into pose predictions. This allows our model to focus on general features that are informative for localization, while embedding multiple scenes in parallel. We extend our previous MS-Transformer approach \cite{shavit2021learning} by introducing a mixed classification-regression architecture that improves the localization accuracy. Our method is evaluated on commonly benchmarked indoor and outdoor datasets and is shown to surpass both multi-scene and state-of-the-art single-scene absolute pose regressors.
Submitted 22 August, 2023;
originally announced August 2023.
-
HyperPose: Camera Pose Localization using Attention Hypernetworks
Authors:
Ron Ferens,
Yosi Keller
Abstract:
In this study, we propose the use of attention hypernetworks in camera pose localization. The dynamic nature of natural scenes, including changes in environment, perspective, and lighting, creates an inherent domain gap between the training and test sets that limits the accuracy of contemporary localization networks. To overcome this issue, we suggest a camera pose regressor that integrates a hypernetwork. During inference, the hypernetwork generates adaptive weights for the localization regression heads based on the input image, effectively reducing the domain gap. We also suggest the use of a Transformer-Encoder as the hypernetwork, instead of the common multilayer perceptron, to derive an attention hypernetwork. The proposed approach achieves superior results compared to state-of-the-art methods on contemporary datasets. To the best of our knowledge, this is the first instance of using hypernetworks in camera pose regression, as well as using Transformer-Encoders as hypernetworks. We make our code publicly available.
Submitted 5 March, 2023;
originally announced March 2023.
-
Completely reachable automata: a quadratic decision algorithm and a quadratic upper bound on the reaching threshold
Authors:
Robert Ferens,
Marek Szykuła
Abstract:
A complete deterministic finite (semi)automaton (DFA) with a set of states $Q$ is \emph{completely reachable} if every nonempty subset of $Q$ is the image of the action of some word applied to $Q$. The concept of completely reachable automata appeared, in particular, in connection with synchronizing automata; the class contains the Černý automata and covers several independently investigated subclasses. The notion was introduced by Bondar and Volkov (2016), who also raised the question about the complexity of deciding if an automaton is completely reachable. We develop an algorithm solving this problem, which works in $O(|\Sigma|\cdot n^2)$ time and $O(|\Sigma|\cdot n)$ space, where $n=|Q|$ is the number of states and $|\Sigma|$ is the size of the input alphabet. In the second part, we prove a weak Don's conjecture for this class of automata: a subset of states $S \subseteq Q$ is reachable with a word of length at most $2n(n-|S|) - n \cdot H_{n-|S|}$, where $H_i$ is the $i$-th harmonic number. This implies a quadratic upper bound in $n$ on the length of the shortest synchronizing words (reset threshold) for the class of completely reachable automata and generalizes earlier upper bounds derived for its subclasses.
Submitted 11 June, 2024; v1 submitted 11 August, 2022;
originally announced August 2022.
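To make the definition in this abstract concrete, here is a minimal brute-force sketch. It is not the quadratic algorithm of the paper: it enumerates reachable subsets by breadth-first search and is exponential in the number of states. The transition-table layout (`delta[q][a]`), the function names, and the choice of the 3-state Černý automaton as an example are assumptions made for illustration. The helper `reach_bound` only evaluates the stated bound $2n(n-|S|) - n \cdot H_{n-|S|}$.

```python
from fractions import Fraction


def reachable_subsets(delta, n_letters):
    """All subsets of states that are images of Q under some word (BFS over subsets)."""
    start = frozenset(range(len(delta)))  # Q itself, reached by the empty word
    seen = {start}
    frontier = [start]
    while frontier:
        next_frontier = []
        for subset in frontier:
            for a in range(n_letters):
                img = frozenset(delta[q][a] for q in subset)
                if img not in seen:
                    seen.add(img)
                    next_frontier.append(img)
        frontier = next_frontier
    return seen


def is_completely_reachable(delta, n_letters):
    """True if every nonempty subset of Q is reachable (exponential-time check)."""
    n = len(delta)
    # Images of a nonempty set under a complete DFA are nonempty,
    # so it suffices to count the reachable subsets.
    return len(reachable_subsets(delta, n_letters)) == 2 ** n - 1


def reach_bound(n, subset_size):
    """Evaluate 2n(n-|S|) - n*H_{n-|S|} exactly, with H_k the k-th harmonic number."""
    k = n - subset_size
    harmonic = sum(Fraction(1, i) for i in range(1, k + 1))
    return 2 * n * k - n * harmonic


# Hypothetical example: 3-state Cerny automaton.
# Letter 0 is the cyclic shift; letter 1 maps state 2 to 0 and fixes the rest.
cerny3 = [[1, 0], [2, 1], [0, 0]]
print(is_completely_reachable(cerny3, 2))
print(reach_bound(3, 1))
```

An automaton with only the cyclic-shift letter would fail the check, since Q is then the only reachable subset.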
-
Paying Attention to Activation Maps in Camera Pose Regression
Authors:
Yoli Shavit,
Ron Ferens,
Yosi Keller
Abstract:
Camera pose regression methods apply a single forward pass to the query image to estimate the camera pose. As such, they offer a fast and light-weight alternative to traditional localization schemes based on image retrieval. Pose regression approaches simultaneously learn two regression tasks, aiming to jointly estimate the camera position and orientation using a single embedding vector computed by a convolutional backbone. We propose an attention-based approach for pose regression, where the convolutional activation maps are used as sequential inputs. Transformers are applied to encode the sequential activation maps as latent vectors, used for camera pose regression. This allows us to pay attention to spatially-varying deep features. Using two Transformer heads, we separately focus on the features for camera position and orientation, based on how informative they are per task. Our proposed approach is shown to compare favorably to contemporary pose regressor schemes and achieves state-of-the-art accuracy across multiple outdoor and indoor benchmarks. In particular, to the best of our knowledge, our approach is the only method to attain sub-meter average accuracy across outdoor scenes. We make our code publicly available.
Submitted 11 April, 2021; v1 submitted 21 March, 2021;
originally announced March 2021.
-
Learning Multi-Scene Absolute Pose Regression with Transformers
Authors:
Yoli Shavit,
Ron Ferens,
Yosi Keller
Abstract:
Absolute camera pose regressors estimate the position and orientation of a camera from the captured image alone. Typically, a convolutional backbone with a multi-layer perceptron head is trained with images and pose labels to embed a single reference scene at a time. Recently, this scheme was extended for learning multiple scenes by replacing the MLP head with a set of fully connected layers. In this work, we propose to learn multi-scene absolute camera pose regression with Transformers, where encoders are used to aggregate activation maps with self-attention and decoders transform latent features and scenes encoding into candidate pose predictions. This mechanism allows our model to focus on general features that are informative for localization while embedding multiple scenes in parallel. We evaluate our method on commonly benchmarked indoor and outdoor datasets and show that it surpasses both multi-scene and state-of-the-art single-scene absolute pose regressors. We make our code publicly available from https://github.com/yolish/multi-scene-pose-transformer.
Submitted 26 July, 2021; v1 submitted 21 March, 2021;
originally announced March 2021.
-
Solving one variable word equations in the free group in cubic time
Authors:
Robert Ferens,
Artur Jeż
Abstract:
A word equation with one variable in a free group is given as $U = V$, where both $U$ and $V$ are words over the alphabet of generators of the free group and $X, X^{-1}$, for a fixed variable $X$. An element of the free group is a solution when substituting it for $X$ yields a true equality (interpreted in the free group) of left- and right-hand sides. It is known that the set of all solutions of a given word equation with one variable is a finite union of sets of the form $\{\alpha w^i \beta \,:\, i \in \mathbb{Z}\}$, where $\alpha, w, \beta$ are reduced words over the alphabet of generators, and a polynomial-time algorithm (of a high degree) computing this set is known. We provide a cubic time algorithm for this problem, which also shows that the set of solutions consists of at most a quadratic number of the above-mentioned sets. The algorithm uses only simple tools of word combinatorics and group theory and is simple to state. Its analysis is involved and focuses on the combinatorics of occurrences of powers of a word within a larger word.
Submitted 15 January, 2021;
originally announced January 2021.
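The notion of a solution described in this abstract can be illustrated with a small sketch that only verifies a candidate; it does not compute the solution set and is unrelated to the cubic-time algorithm. The encoding is an assumption made for illustration: generators are nonzero integers (`1` for $a$, `-1` for $a^{-1}$, `2` for $b$, and so on), and the variable appears as the tokens `"X"` and `"X^-1"`.

```python
def reduce_word(word):
    """Freely reduce a word by cancelling adjacent g, g^{-1} pairs (stack-based)."""
    out = []
    for g in word:
        if out and out[-1] == -g:
            out.pop()  # g cancels against the previous letter
        else:
            out.append(g)
    return out


def inverse(word):
    """Inverse of a word: reverse the order and invert every generator."""
    return [-g for g in reversed(word)]


def substitute(side, x):
    """Replace X and X^-1 in one side of the equation by x, then freely reduce."""
    result = []
    for token in side:
        if token == "X":
            result.extend(x)
        elif token == "X^-1":
            result.extend(inverse(x))
        else:
            result.append(token)
    return reduce_word(result)


def is_solution(u, v, x):
    """True if substituting x for X makes U = V hold in the free group."""
    return substitute(u, x) == substitute(v, x)


# Hypothetical example: the equation X a = a X.
U = ["X", 1]
V = [1, "X"]
print(is_solution(U, V, [1, 1, 1]))  # x = a^3
print(is_solution(U, V, [2]))        # x = b
```

In this example the powers of $a$ satisfy the equation while $b$ does not, which matches the shape $\{\alpha w^i \beta \,:\, i \in \mathbb{Z}\}$ with empty $\alpha$, $\beta$ and $w = a$.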
-
Synchronizing Strongly Connected Partial DFAs
Authors:
Mikhail V. Berlinkov,
Robert Ferens,
Andrew Ryzhikov,
Marek Szykuła
Abstract:
We study synchronizing partial DFAs, which extend the classical concept of synchronizing complete DFAs and are a special case of synchronizing unambiguous NFAs. A partial DFA is called synchronizing if it has a word (called a reset word) whose action brings a non-empty subset of states to a unique state and is undefined for all other states. While in the general case the problem of checking whether a partial DFA is synchronizing is PSPACE-complete, we show that in the strongly connected case this problem can be efficiently reduced to the same problem for a complete DFA. Using combinatorial, algebraic, and formal languages methods, we develop techniques that relate main synchronization problems for strongly connected partial DFAs with the same problems for complete DFAs. In particular, this includes the Černý and the rank conjectures, the problem of finding a reset word, and upper bounds on the length of the shortest reset words of literal automata of finite prefix codes. We conclude that solving fundamental synchronization problems is equally hard in both models, as an essential improvement of the results for one model implies an improvement for the other.
Submitted 13 January, 2021;
originally announced January 2021.
-
Do We Really Need Scene-specific Pose Encoders?
Authors:
Yoli Shavit,
Ron Ferens
Abstract:
Visual pose regression models estimate the camera pose from a query image with a single forward pass. Current models learn pose encoding from an image using deep convolutional networks which are trained per scene. The resulting encoding is typically passed to a multi-layer perceptron in order to regress the pose. In this work, we propose that scene-specific pose encoders are not required for pose regression and that encodings trained for visual similarity can be used instead. In order to test our hypothesis, we take a shallow architecture of several fully connected layers and train it with pre-computed encodings from a generic image retrieval model. We find that these encodings are not only sufficient to regress the camera pose, but that, when provided to a branching fully connected architecture, a trained model can achieve competitive results and even surpass current \textit{state-of-the-art} pose regressors in some cases. Moreover, we show that for outdoor localization, the proposed architecture is the only pose regressor, to date, consistently localizing in under 2 meters and 5 degrees.
Submitted 22 December, 2020;
originally announced December 2020.
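The "under 2 meters and 5 degrees" criterion mentioned in this abstract can be sketched as follows. The abstract does not say how the errors are computed, so this is an assumption: position error is taken as Euclidean distance, and orientation error as the angle between unit quaternions, a convention often used for pose regressors. All function names and the `(w, x, y, z)` quaternion layout are hypothetical.

```python
import math


def position_error(t_est, t_gt):
    """Euclidean distance between estimated and ground-truth positions (same units)."""
    return math.dist(t_est, t_gt)


def orientation_error_deg(q_est, q_gt):
    """Angle in degrees between two orientations given as quaternions (w, x, y, z)."""
    def unit(q):
        norm = math.sqrt(sum(c * c for c in q))
        return [c / norm for c in q]

    a, b = unit(q_est), unit(q_gt)
    # abs() treats q and -q as the same rotation; min() guards acos against rounding.
    dot = min(1.0, abs(sum(x * y for x, y in zip(a, b))))
    return math.degrees(2.0 * math.acos(dot))


def is_localized(t_est, q_est, t_gt, q_gt, max_m=2.0, max_deg=5.0):
    """True if both errors fall under the thresholds (defaults: 2 meters, 5 degrees)."""
    return (position_error(t_est, t_gt) < max_m
            and orientation_error_deg(q_est, q_gt) < max_deg)


# Hypothetical example: 1 m position offset, identical orientation.
identity = (1.0, 0.0, 0.0, 0.0)
print(is_localized((1.0, 0.0, 0.0), identity, (0.0, 0.0, 0.0), identity))
```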
-
Introduction to Camera Pose Estimation with Deep Learning
Authors:
Yoli Shavit,
Ron Ferens
Abstract:
Over the last two decades, deep learning has transformed the field of computer vision. Deep convolutional networks were successfully applied to learn different vision tasks such as image classification, image segmentation, object detection and many more. By transferring the knowledge learned by deep models on large generic datasets, researchers were further able to create fine-tuned models for other, more specific tasks. Recently, this idea was applied to regressing the absolute camera pose from an RGB image. Although the resulting accuracy was sub-optimal compared to classic feature-based solutions, this effort led to a surge of learning-based pose estimation methods. Here, we review deep learning approaches for camera pose estimation. We describe key methods in the field and identify trends aiming at improving the original deep pose regression solution. We further provide an extensive cross-comparison of existing learning-based pose estimators, together with practical notes on their execution for reproducibility purposes. Finally, we discuss emerging solutions and potential future research directions.
Submitted 16 July, 2019; v1 submitted 8 July, 2019;
originally announced July 2019.
-
Preimage problems for deterministic finite automata
Authors:
Mikhail V. Berlinkov,
Robert Ferens,
Marek Szykuła
Abstract:
Given a subset of states $S$ of a deterministic finite automaton and a word $w$, the preimage is the subset of all states mapped to a state in $S$ by the action of $w$. We study three natural problems concerning words giving certain preimages. The first problem is whether, for a given subset, there exists a word \emph{extending} the subset (giving a larger preimage). The second problem is whether there exists a \emph{totally extending} word (giving the whole set of states as a preimage)---equivalently, whether there exists an \emph{avoiding} word for the complementary subset. The third problem is whether there exists a \emph{resizing} word. We also consider variants where the length of the word is upper bounded, where the size of the given subset is restricted, and where the automaton is strongly connected, synchronizing, or binary. We conclude with a summary of the complexities in all combinations of the cases.
Submitted 19 September, 2020; v1 submitted 26 April, 2017;
originally announced April 2017.
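The preimage and the notions of extending and totally extending words from this abstract can be sketched directly from their definitions. This only checks a given word; it does not decide whether such a word exists, which is the problem the paper studies. The transition-table layout (`delta[q][a]`), the function names, and the 3-state example automaton are assumptions made for illustration.

```python
def preimage(delta, subset, word):
    """States that the action of `word` maps into `subset`."""
    def run(q):
        for a in word:
            q = delta[q][a]
        return q

    return {q for q in range(len(delta)) if run(q) in subset}


def is_extending(delta, subset, word):
    """True if the preimage of `subset` under `word` is larger than `subset`."""
    return len(preimage(delta, subset, word)) > len(subset)


def is_totally_extending(delta, subset, word):
    """True if the preimage of `subset` under `word` is the whole set of states."""
    return len(preimage(delta, subset, word)) == len(delta)


# Hypothetical example: 3 states, letter 0 is a cyclic shift,
# letter 1 maps state 2 to 0 and fixes the other states.
dfa = [[1, 0], [2, 1], [0, 0]]
print(preimage(dfa, {0}, [1]))
print(is_totally_extending(dfa, {0}, [1, 0, 0, 1]))
```

In this example the single letter `1` extends the subset {0}, and the word `1 0 0 1` sends every state to 0, so it is totally extending for {0}.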
-
Complexity of regular bifix-free languages
Authors:
Robert Ferens,
Marek Szykuła
Abstract:
We study descriptive complexity properties of the class of regular bifix-free languages, which is the intersection of prefix-free and suffix-free regular languages. We show that there exists a single ternary universal (stream of) bifix-free languages that meets all the bounds for the state complexity of basic operations (Boolean operations, product, star, and reversal). This is in contrast with suffix-free languages, where it is known that there does not exist such a stream. Then we present a stream of bifix-free languages that is most complex in terms of all basic operations, syntactic complexity, and the number of atoms and their complexities, which requires a superexponential alphabet.
We also complete the previous results by characterizing state complexity of product, star, and reversal, and establishing tight upper bounds for atom complexities of bifix-free languages. We show that to meet the bound for reversal we require at least 3 letters and to meet the bound for atom complexities $n+1$ letters are sufficient and necessary. For the cases of product, star, and reversal we show that there are no gaps (magic numbers) in the interval of possible state complexities of the languages resulting from an operation; in particular, the state complexity of the product $L_m L_n$ is always $m+n-2$, while that of the star is either $n-1$ or $n-2$.
Submitted 13 January, 2017;
originally announced January 2017.
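As a small illustration of the definition used in this abstract (bifix-free means both prefix-free and suffix-free), the following sketch checks the property for a finite set of words. The paper deals with regular languages, which may be infinite, so this is only a toy instance; the function names are hypothetical.

```python
def is_prefix_free(words):
    """True if no word in the set is a proper prefix of another word in the set."""
    ws = set(words)
    return not any(u != v and v.startswith(u) for u in ws for v in ws)


def is_suffix_free(words):
    """True if no word in the set is a proper suffix of another word in the set."""
    ws = set(words)
    return not any(u != v and v.endswith(u) for u in ws for v in ws)


def is_bifix_free(words):
    """True if the finite language is both prefix-free and suffix-free."""
    return is_prefix_free(words) and is_suffix_free(words)


print(is_bifix_free({"ab", "ba"}))
print(is_bifix_free({"ab", "aab"}))  # "ab" is a suffix of "aab"
```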