-
The Parameterized Suffix Tray
Authors:
Noriki Fujisato,
Yuto Nakashima,
Shunsuke Inenaga,
Hideo Bannai,
Masayuki Takeda
Abstract:
Let $Σ$ and $Π$ be disjoint alphabets, respectively called the static alphabet and the parameterized alphabet. Two strings $x$ and $y$ over $Σ\cup Π$ of equal length are said to parameterized match (p-match) if there exists a renaming bijection $f$ on $Σ$ and $Π$ which is identity on $Σ$ and maps the characters of $x$ to those of $y$ so that the two strings become identical. The indexing version of the problem of finding p-matching occurrences of a given pattern in the text is a well-studied topic in string matching. In this paper, we present a state-of-the-art indexing structure for p-matching called the parameterized suffix tray of an input text $T$, denoted by $\mathsf{PSTray}(T)$. We show that $\mathsf{PSTray}(T)$ occupies $O(n)$ space and supports pattern matching queries in $O(m + \log (σ+π) + \mathit{occ})$ time, where $n$ is the length of $T$, $m$ is the length of a query pattern $P$, $π$ is the number of distinct symbols of $Π$ in $T$, $σ$ is the number of distinct symbols of $Σ$ in $T$, and $\mathit{occ}$ is the number of p-matching occurrences of $P$ in $T$. We also present how to build $\mathsf{PSTray}(T)$ in $O(n)$ time from the parameterized suffix tree of $T$.
Submitted 3 February, 2021; v1 submitted 18 December, 2020;
originally announced December 2020.
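The p-match definition in the abstract above can be illustrated with a short sketch. It uses the standard prev-encoding idea (each parameterized character is replaced by the distance to its previous occurrence, or 0 if there is none), under which two strings p-match exactly when their encodings are equal. The function names `prev_encode` and `p_match` and the way the parameterized alphabet is passed in are assumptions made for illustration; this is not the suffix tray itself, only a direct check of the definition.

```python
def prev_encode(s, params):
    """Prev-encoding: static characters are kept, each parameterized
    character becomes the distance to its previous occurrence (0 if none)."""
    last_seen = {}
    encoded = []
    for i, c in enumerate(s):
        if c in params:
            encoded.append(i - last_seen[c] if c in last_seen else 0)
            last_seen[c] = i
        else:
            encoded.append(c)
    return encoded


def p_match(x, y, params):
    """True if x and y have equal length and equal prev-encodings."""
    return len(x) == len(y) and prev_encode(x, params) == prev_encode(y, params)


if __name__ == "__main__":
    params = {"x", "y", "z"}  # hypothetical parameterized alphabet
    print(p_match("axbyx", "aybzy", params))  # renaming x->y, y->z works
    print(p_match("axbyx", "axbxx", params))  # no consistent renaming
```

This check takes time linear in the string length per pair (with hash-based lookups); an index such as the one described in the abstract exists precisely to avoid running it at every text position.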
-
Match Them Up: Visually Explainable Few-shot Image Classification
Authors:
Bowen Wang,
Liangzhi Li,
Manisha Verma,
Yuta Nakashima,
Ryo Kawasaki,
Hajime Nagahara
Abstract:
Few-shot learning (FSL) approaches are usually based on an assumption that the pre-trained knowledge can be obtained from base (seen) categories and can be well transferred to novel (unseen) categories. However, there is no guarantee, especially for the latter part. This issue leads to the unknown nature of the inference process in most FSL methods, which hampers its application in some risk-sensitive areas. In this paper, we reveal a new way to perform FSL for image classification, using visual representations from the backbone model and weights generated by a newly-emerged explainable classifier. The weighted representations only include a minimum number of distinguishable features and the visualized weights can serve as an informative hint for the FSL process. Finally, a discriminator will compare the representations of each pair of the images in the support set and the query set. Pairs with the highest scores will decide the classification results. Experimental results prove that the proposed method can achieve both good accuracy and satisfactory explainability on three mainstream datasets.
Submitted 25 November, 2020;
originally announced November 2020.
-
Automated Grading System of Retinal Arterio-venous Crossing Patterns: A Deep Learning Approach Replicating Ophthalmologist's Diagnostic Process of Arteriolosclerosis
Authors:
Liangzhi Li,
Manisha Verma,
Bowen Wang,
Yuta Nakashima,
Hajime Nagahara,
Ryo Kawasaki
Abstract:
The status of retinal arteriovenous crossing is of great significance for clinical evaluation of arteriolosclerosis and systemic hypertension. As an ophthalmological diagnostic criterion, Scheie's classification has been used to grade the severity of arteriolosclerosis. In this paper, we propose a deep learning approach to support the diagnosis process, which, to the best of our knowledge, is one of the earliest attempts in medical imaging. The proposed pipeline is three-fold. First, we adopt segmentation and classification models to automatically obtain vessels in a retinal image with the corresponding artery/vein labels and find candidate arteriovenous crossing points. Second, we use a classification model to validate the true crossing point. Finally, the grade of severity for the vessel crossings is classified. To better address the problem of label ambiguity and imbalanced label distribution, we propose a new model, named multi-diagnosis team network (MDTNet), in which the sub-models with different structures or different loss functions provide different decisions. MDTNet unifies these diverse theories to give the final decision with high accuracy. Our severity grading method was able to validate crossing points with precision and recall of 96.3% and 96.3%, respectively. Among correctly detected crossing points, the kappa value for the agreement between the grading by a retina specialist and the estimated score was 0.85, with an accuracy of 0.92. The numerical results demonstrate that our method can achieve good performance in both arteriovenous crossing validation and severity grading tasks. With the proposed models, we could build a pipeline reproducing a retina specialist's subjective grading without feature extractions. The code is available for reproducibility.
Submitted 1 December, 2022; v1 submitted 7 November, 2020;
originally announced November 2020.
-
Noisy-LSTM: Improving Temporal Awareness for Video Semantic Segmentation
Authors:
Bowen Wang,
Liangzhi Li,
Yuta Nakashima,
Ryo Kawasaki,
Hajime Nagahara,
Yasushi Yagi
Abstract:
Semantic video segmentation is a key challenge for various applications. This paper presents a new model named Noisy-LSTM, which is trainable in an end-to-end manner, with convolutional LSTMs (ConvLSTMs) to leverage the temporal coherency in video frames. We also present a simple yet effective training strategy, which replaces a frame in video sequence with noises. This strategy spoils the temporal coherency in video frames during training and thus makes the temporal links in ConvLSTMs unreliable, which may consequently improve feature extraction from video frames, as well as serve as a regularizer to avoid overfitting, without requiring extra data annotation or computational costs. Experimental results demonstrate that the proposed model can achieve state-of-the-art performances in both the CityScapes and EndoVis2018 datasets.
Submitted 19 October, 2020;
originally announced October 2020.
-
Constructing a Visual Relationship Authenticity Dataset
Authors:
Chenhui Chu,
Yuto Takebayashi,
Mishra Vipul,
Yuta Nakashima
Abstract:
A visual relationship denotes a relationship between two objects in an image, which can be represented as a triplet of (subject; predicate; object). Visual relationship detection is crucial for scene understanding in images. Existing visual relationship detection datasets only contain true relationships that correctly describe the content in an image. However, distinguishing false visual relationships from true ones is also crucial for image understanding and grounded natural language processing. In this paper, we construct a visual relationship authenticity dataset, where both true and false relationships among all objects that appear in the captions of the Flickr30k entities image caption dataset are annotated. The dataset is available at https://github.com/codecreator2053/VR_ClassifiedDataset. We hope that this dataset can promote the study of both vision and language understanding.
Submitted 11 October, 2020;
originally announced October 2020.
-
Demographic Influences on Contemporary Art with Unsupervised Style Embeddings
Authors:
Nikolai Huckle,
Noa Garcia,
Yuta Nakashima
Abstract:
Computational art analysis has, through its reliance on classification tasks, prioritised historical datasets in which the artworks are already well sorted with the necessary annotations. Art produced today, on the other hand, is numerous and easily accessible, through the internet and social networks that are used by professional and amateur artists alike to display their work. Although this art, yet unsorted in terms of style and genre, is less suited for supervised analysis, the data sources come with novel information that may help frame the visual content in equally novel ways. As a first step in this direction, we present contempArt, a multi-modal dataset of exclusively contemporary artworks. contempArt is a collection of paintings and drawings, a detailed graph network based on social connections on Instagram and additional socio-demographic information; all attached to 442 artists at the beginning of their career. We evaluate three methods suited for generating unsupervised style embeddings of images and correlate them with the remaining data. We find no connections between visual style on the one hand and social proximity, gender, and nationality on the other.
Submitted 1 December, 2020; v1 submitted 30 September, 2020;
originally announced September 2020.
-
SCOUTER: Slot Attention-based Classifier for Explainable Image Recognition
Authors:
Liangzhi Li,
Bowen Wang,
Manisha Verma,
Yuta Nakashima,
Ryo Kawasaki,
Hajime Nagahara
Abstract:
Explainable artificial intelligence has been gaining attention in the past few years. However, most existing methods are based on gradients or intermediate features, which are not directly involved in the decision-making process of the classifier. In this paper, we propose a slot attention-based classifier called SCOUTER for transparent yet accurate classification. Two major differences from other attention-based methods include: (a) SCOUTER's explanation is involved in the final confidence for each category, offering more intuitive interpretation, and (b) all the categories have their corresponding positive or negative explanation, which tells "why the image is of a certain category" or "why the image is not of a certain category." We design a new loss tailored for SCOUTER that controls the model's behavior to switch between positive and negative explanations, as well as the size of explanatory regions. Experimental results show that SCOUTER can give better visual explanations in terms of various metrics while keeping good accuracy on small and medium-sized datasets.
Submitted 20 August, 2021; v1 submitted 13 September, 2020;
originally announced September 2020.
-
Uncovering Hidden Challenges in Query-Based Video Moment Retrieval
Authors:
Mayu Otani,
Yuta Nakashima,
Esa Rahtu,
Janne Heikkilä
Abstract:
Query-based moment retrieval is the problem of localising a specific clip from an untrimmed video according to a query sentence. This is a challenging task that requires interpretation of both the natural language query and the video content. Like in many other areas in computer vision and machine learning, the progress in query-based moment retrieval is heavily driven by the benchmark datasets and, therefore, their quality has significant impact on the field. In this paper, we present a series of experiments assessing how well the benchmark results reflect the true progress in solving the moment retrieval task. Our results indicate substantial biases in the popular datasets and unexpected behaviour of the state-of-the-art models. Moreover, we present new sanity check experiments and approaches for visualising the results. Finally, we suggest possible directions to improve the temporal sentence grounding in the future. Our code for this paper is available at https://mayu-ot.github.io/hidden-challenges-MR .
Submitted 7 October, 2020; v1 submitted 1 September, 2020;
originally announced September 2020.
-
A Dataset and Baselines for Visual Question Answering on Art
Authors:
Noa Garcia,
Chentao Ye,
Zihua Liu,
Qingtao Hu,
Mayu Otani,
Chenhui Chu,
Yuta Nakashima,
Teruko Mitamura
Abstract:
Answering questions related to art pieces (paintings) is a difficult task, as it implies the understanding of not only the visual information that is shown in the picture, but also the contextual knowledge that is acquired through the study of the history of art. In this work, we introduce our first attempt towards building a new dataset, coined AQUA (Art QUestion Answering). The question-answer (QA) pairs are automatically generated using state-of-the-art question generation methods based on paintings and comments provided in an existing art understanding dataset. The QA pairs are cleansed by crowdsourcing workers with respect to their grammatical correctness, answerability, and answers' correctness. Our dataset inherently consists of visual (painting-based) and knowledge (comment-based) questions. We also present a two-branch model as baseline, where the visual and knowledge questions are handled independently. We extensively compare our baseline model against the state-of-the-art models for question answering, and we provide a comprehensive study about the challenges and potential future directions for visual question answering on art.
Submitted 28 August, 2020;
originally announced August 2020.
-
Depthwise Spatio-Temporal STFT Convolutional Neural Networks for Human Action Recognition
Authors:
Sudhakar Kumawat,
Manisha Verma,
Yuta Nakashima,
Shanmuganathan Raman
Abstract:
Conventional 3D convolutional neural networks (CNNs) are computationally expensive, memory intensive, prone to overfitting, and most importantly, there is a need to improve their feature learning capabilities. To address these issues, we propose spatio-temporal short-term Fourier transform (STFT) blocks, a new class of convolutional blocks that can serve as an alternative to the 3D convolutional layer and its variants in 3D CNNs. An STFT block consists of non-trainable convolution layers that capture spatially and/or temporally local Fourier information using an STFT kernel at multiple low frequency points, followed by a set of trainable linear weights for learning channel correlations. The STFT blocks significantly reduce the space-time complexity in 3D CNNs. In general, they use 3.5 to 4.5 times fewer parameters and 1.5 to 1.8 times lower computational cost when compared to the state-of-the-art methods. Furthermore, their feature learning capabilities are significantly better than the conventional 3D convolutional layer and its variants. Our extensive evaluation on seven action recognition datasets, including Something-something v1 and v2, Jester, Diving-48, Kinetics-400, UCF 101, and HMDB 51, demonstrates that STFT block-based 3D CNNs achieve on par or even better performance compared to the state-of-the-art methods.
Submitted 22 July, 2020;
originally announced July 2020.
-
Knowledge-Based Video Question Answering with Unsupervised Scene Descriptions
Authors:
Noa Garcia,
Yuta Nakashima
Abstract:
To understand movies, humans constantly reason over the dialogues and actions shown in specific scenes and relate them to the overall storyline already seen. Inspired by this behaviour, we design ROLL, a model for knowledge-based video story question answering that leverages three crucial aspects of movie understanding: dialog comprehension, scene reasoning, and storyline recalling. In ROLL, each of these tasks is in charge of extracting rich and diverse information by 1) processing scene dialogues, 2) generating unsupervised video scene descriptions, and 3) obtaining external knowledge in a weakly supervised fashion. To answer a given question correctly, the information generated by each inspired-cognitive task is encoded via Transformers and fused through a modality weighting mechanism, which balances the information from the different sources. Exhaustive evaluation demonstrates the effectiveness of our approach, which yields a new state-of-the-art on two challenging video question answering datasets: KnowIT VQA and TVQA+.
Submitted 17 July, 2020;
originally announced July 2020.
-
Lyndon Words, the Three Squares Lemma, and Primitive Squares
Authors:
Hideo Bannai,
Takuya Mieno,
Yuto Nakashima
Abstract:
We revisit the so-called "Three Squares Lemma" by Crochemore and Rytter [Algorithmica 1995] and, using arguments based on Lyndon words, derive a more general variant which considers three overlapping squares which do not necessarily share a common prefix. We also give an improved upper bound of $n\log_2 n$ on the maximum number of (occurrences of) primitively rooted squares in a string of length $n$, also using arguments based on Lyndon words. To the best of our knowledge, the only known upper bound was $n \log_φn \approx 1.441n\log_2 n$, where $φ$ is the golden ratio, reported by Fraenkel and Simpson [TCS 1999] obtained via the Three Squares Lemma.
Submitted 22 July, 2020; v1 submitted 24 June, 2020;
originally announced June 2020.
-
Palindromic Trees for a Sliding Window and Its Applications
Authors:
Takuya Mieno,
Kiichi Watanabe,
Yuto Nakashima,
Shunsuke Inenaga,
Hideo Bannai,
Masayuki Takeda
Abstract:
The palindromic tree (a.k.a. eertree) for a string $S$ of length $n$ is a tree-like data structure that represents the set of all distinct palindromic substrings of $S$, using $O(n)$ space [Rubinchik and Shur, 2018]. It is known that, when $S$ is over an alphabet of size $σ$ and is given in an online manner, then the palindromic tree of $S$ can be constructed in $O(n\logσ)$ time with $O(n)$ space. In this paper, we consider the sliding window version of the problem: For a sliding window of length at most $d$, we present two versions of an algorithm which maintains the palindromic tree of size $O(d)$ for every sliding window $S[i..j]$ over $S$, where $1 \leq j-i+1 \leq d$. The first version works in $O(n\logσ')$ time with $O(d)$ space where $σ' \leq d$ is the maximum number of distinct characters in the windows, and the second one works in $O(n + dσ)$ time with $(d+2)σ+ O(d)$ space. We also show how our algorithms can be applied to efficient computation of minimal unique palindromic substrings (MUPS) and minimal absent palindromic words (MAPW) for a sliding window.
Submitted 11 November, 2020; v1 submitted 3 June, 2020;
originally announced June 2020.
-
Joint Learning of Vessel Segmentation and Artery/Vein Classification with Post-processing
Authors:
Liangzhi Li,
Manisha Verma,
Yuta Nakashima,
Ryo Kawasaki,
Hajime Nagahara
Abstract:
Retinal imaging serves as a valuable tool for diagnosis of various diseases. However, reading retinal images is a difficult and time-consuming task even for experienced specialists. The fundamental step towards automated retinal image analysis is vessel segmentation and artery/vein classification, which provide various information on potential disorders. To improve the performance of the existing automated methods for retinal image analysis, we propose a two-step vessel classification. We adopt a UNet-based model, SeqNet, to accurately segment vessels from the background and make prediction on the vessel type. Our model does segmentation and classification sequentially, which alleviates the problem of label distribution bias and facilitates training. To further refine classification results, we post-process them considering the structural information among vessels to propagate highly confident prediction to surrounding vessels. Our experiments show that our method improves AUC to 0.98 for segmentation and the accuracy to 0.92 in classification over DRIVE dataset.
Submitted 27 May, 2020;
originally announced May 2020.
-
On repetitiveness measures of Thue-Morse words
Authors:
Kanaru Kutsukake,
Takuya Matsumoto,
Yuto Nakashima,
Shunsuke Inenaga,
Hideo Bannai,
Masayuki Takeda
Abstract:
We show that the size $γ(t_n)$ of the smallest string attractor of the $n$th Thue-Morse word $t_n$ is 4 for any $n\geq 4$, disproving the conjecture by Mantaci et al. [ICTCS 2019] that it is $n$. We also show that $δ(t_n) = \frac{10}{3+2^{4-n}}$ for $n \geq 3$, where $δ(w)$ is the maximum over all $k = 1,\ldots,|w|$, the number of distinct substrings of length $k$ in $w$ divided by $k$, which is a measure of repetitiveness recently studied by Kociumaka et al. [LATIN 2020]. Furthermore, we show that the number $z(t_n)$ of factors in the self-referencing Lempel-Ziv factorization of $t_n$ is exactly $2n$.
Submitted 12 August, 2020; v1 submitted 19 May, 2020;
originally announced May 2020.
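The measure $δ$ from the abstract above can be computed by brute force for small $n$, which makes the closed form easy to compare against. The sketch below assumes the convention $t_0 = \mathtt{a}$ and $t_n = t_{n-1}\overline{t_{n-1}}$ (so $|t_n| = 2^n$); that indexing convention, and the helper names `thue_morse`, `delta` and `closed_form`, are assumptions made for illustration and are not taken from the paper.

```python
from fractions import Fraction


def thue_morse(n):
    """n-th Thue-Morse word over {a, b}, assuming t_0 = 'a' and
    t_n = t_{n-1} followed by its letter-wise complement."""
    swap = str.maketrans("ab", "ba")
    w = "a"
    for _ in range(n):
        w += w.translate(swap)
    return w


def delta(w):
    """Maximum over k of (number of distinct length-k substrings of w) / k."""
    best = Fraction(0)
    for k in range(1, len(w) + 1):
        distinct = {w[i:i + k] for i in range(len(w) - k + 1)}
        best = max(best, Fraction(len(distinct), k))
    return best


def closed_form(n):
    """10 / (3 + 2^(4 - n)), written as an exact fraction."""
    return Fraction(10 * 2 ** n, 3 * 2 ** n + 16)


if __name__ == "__main__":
    for n in range(3, 7):
        print(n, delta(thue_morse(n)), closed_form(n))
```

The brute-force `delta` is quadratic or worse in the word length, so it is only meant for checking small cases by hand.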
-
Towards Efficient Interactive Computation of Dynamic Time Warping Distance
Authors:
Akihiro Nishi,
Yuto Nakashima,
Shunsuke Inenaga,
Hideo Bannai,
Masayuki Takeda
Abstract:
The dynamic time warping (DTW) is a widely-used method that allows us to efficiently compare two time series that can vary in speed. Given two strings $A$ and $B$ of respective lengths $m$ and $n$, there is a fundamental dynamic programming algorithm that computes the DTW distance for $A$ and $B$ together with an optimal alignment in $Θ(mn)$ time and space. In this paper, we tackle the problem of interactive computation of the DTW distance for dynamic strings, denoted $\mathrm{D^2TW}$, where character-wise edit operation (insertion, deletion, substitution) can be performed at an arbitrary position of the strings. Let $M$ and $N$ be the sizes of the run-length encoding (RLE) of $A$ and $B$, respectively. We present an algorithm for $\mathrm{D^2TW}$ that occupies $Θ(mN+nM)$ space and uses $O(m+n+\#_{\mathrm{chg}}) \subseteq O(mN + nM)$ time to update a compact differential representation $\mathit{DS}$ of the DP table per edit operation, where $\#_{\mathrm{chg}}$ denotes the number of cells in $\mathit{DS}$ whose values change after the edit operation. Our method is at least as efficient as the algorithm recently proposed by Froese et al. running in $Θ(mN + nM)$ time, and is faster when $\#_{\mathrm{chg}}$ is smaller than $O(mN + nM)$ which, as our preliminary experiments suggest, is likely to be the case in the majority of instances.
Submitted 29 July, 2020; v1 submitted 17 May, 2020;
originally announced May 2020.
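The "fundamental dynamic programming algorithm" mentioned in the abstract above is the textbook $Θ(mn)$ table fill, sketched below. It is the static baseline only, not the dynamic $\mathrm{D^2TW}$ algorithm of the paper. The function name `dtw_distance` and the choice of absolute difference between numeric symbols as the local cost are assumptions made for illustration.

```python
def dtw_distance(a, b):
    """Classic DTW distance between two numeric sequences.

    d[i][j] holds the cost of the best warping path aligning a[:i]
    with b[:j]; the local cost |a[i-1] - b[j-1]| is an assumption.
    """
    m, n = len(a), len(b)
    inf = float("inf")
    d = [[inf] * (n + 1) for _ in range(m + 1)]
    d[0][0] = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three predecessor alignments.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[m][n]


if __name__ == "__main__":
    print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
    print(dtw_distance([1, 2, 3], [2, 2, 2]))
```

Recomputing this whole table after every single-character edit is the cost that an interactive algorithm tries to avoid.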
-
Yoga-82: A New Dataset for Fine-grained Classification of Human Poses
Authors:
Manisha Verma,
Sudhakar Kumawat,
Yuta Nakashima,
Shanmuganathan Raman
Abstract:
Human pose estimation is a well-known problem in computer vision to locate joint positions. Existing datasets for the learning of poses are observed to be not challenging enough in terms of pose diversity, object occlusion, and viewpoints. This makes the pose annotation process relatively simple and restricts the application of the models that have been trained on them. To handle more variety in human poses, we propose the concept of fine-grained hierarchical pose classification, in which we formulate the pose estimation as a classification task, and propose a dataset, Yoga-82, for large-scale yoga pose recognition with 82 classes. Yoga-82 consists of complex poses where fine annotations may not be possible. To resolve this, we provide hierarchical labels for yoga poses based on the body configuration of the pose. The dataset contains a three-level hierarchy including body positions, variations in body positions, and the actual pose names. We present the classification accuracy of the state-of-the-art convolutional neural network architectures on Yoga-82. We also present several hierarchical variants of DenseNet in order to utilize the hierarchical labels.
Submitted 21 April, 2020;
originally announced April 2020.
-
Knowledge-Based Visual Question Answering in Videos
Authors:
Noa Garcia,
Mayu Otani,
Chenhui Chu,
Yuta Nakashima
Abstract:
We propose a novel video understanding task by fusing knowledge-based and video question answering. First, we introduce KnowIT VQA, a video dataset with 24,282 human-generated question-answer pairs about a popular sitcom. The dataset combines visual, textual and temporal coherence reasoning together with knowledge-based questions, which require experience obtained from viewing the series to be answered. Second, we propose a video understanding model by combining the visual and textual video content with specific knowledge about the show. Our main findings are: (i) the incorporation of knowledge produces outstanding improvements for VQA in video, and (ii) the performance on KnowIT VQA still lags well behind human accuracy, indicating its usefulness for studying current video modelling limitations.
Submitted 16 April, 2020;
originally announced April 2020.
-
Grammar-compressed Self-index with Lyndon Words
Authors:
Kazuya Tsuruta,
Dominik Köppl,
Yuto Nakashima,
Shunsuke Inenaga,
Hideo Bannai,
Masayuki Takeda
Abstract:
We introduce a new class of straight-line programs (SLPs), named the Lyndon SLP, inspired by the Lyndon trees (Barcelo, 1990). Based on this SLP, we propose a self-index data structure of $O(g)$ words of space that can be built from a string $T$ in $O(n \lg n)$ expected time, retrieving the starting positions of all occurrences of a pattern $P$ of length $m$ in $O(m + \lg m \lg n + occ \lg g)$ time, where $n$ is the length of $T$, $g$ is the size of the Lyndon SLP for $T$, and $occ$ is the number of occurrences of $P$ in $T$.
Submitted 27 April, 2020; v1 submitted 11 April, 2020;
originally announced April 2020.
-
Detecting $k$-(Sub-)Cadences and Equidistant Subsequence Occurrences
Authors:
Mitsuru Funakoshi,
Yuto Nakashima,
Shunsuke Inenaga,
Hideo Bannai,
Masayuki Takeda,
Ayumi Shinohara
Abstract:
The equidistant subsequence pattern matching problem is considered. Given a pattern string $P$ and a text string $T$, we say that $P$ is an \emph{equidistant subsequence} of $T$ if $P$ is a subsequence of the text such that consecutive symbols of $P$ in the occurrence are equally spaced. We can consider the problem of equidistant subsequences as generalizations of (sub-)cadences. We give bit-parallel algorithms that yield $o(n^2)$ time algorithms for finding $k$-(sub-)cadences and equidistant subsequences. Furthermore, $O(n\log^2 n)$ and $O(n\log n)$ time algorithms, respectively for equidistant and Abelian equidistant matching for the case $|P| = 3$, are shown. The algorithms make use of a technique that was recently introduced which can efficiently compute convolutions with linear constraints.
Submitted 17 February, 2020;
originally announced February 2020.
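
The definition of an equidistant subsequence in the abstract above can be illustrated with a naive quadratic-time checker. This is only a brute-force sketch for intuition, not the bit-parallel or convolution-based algorithms the paper describes; the function name `equidistant_occurrences` and the 0-based `(start, gap)` output convention are assumptions made for this example.

```python
def equidistant_occurrences(pattern, text):
    """Return all (start, gap) pairs, 0-based, with gap >= 1, such that
    text[start + k * gap] == pattern[k] for every index k of the pattern."""
    m, n = len(pattern), len(text)
    hits = []
    if m < 2:
        # A gap is only meaningful when the pattern has at least two symbols.
        return hits
    for start in range(n):
        # Largest gap that keeps the last pattern symbol inside the text.
        max_gap = (n - 1 - start) // (m - 1)
        for gap in range(1, max_gap + 1):
            if all(text[start + k * gap] == pattern[k] for k in range(m)):
                hits.append((start, gap))
    return hits


print(equidistant_occurrences("ac", "abcabc"))  # [(0, 2), (0, 5), (3, 2)]
```

Every candidate pair of start position and gap is tested explicitly, which makes the equal-spacing requirement easy to see but is far slower than the bounds stated in the abstract.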
-
Parameterized DAWGs: efficient constructions and bidirectional pattern searches
Authors:
Katsuhito Nakashima,
Noriki Fujisato,
Diptarama Hendrian,
Yuto Nakashima,
Ryo Yoshinaka,
Shunsuke Inenaga,
Hideo Bannai,
Ayumi Shinohara,
Masayuki Takeda
Abstract:
Two strings $x$ and $y$ over $Σ\cup Π$ of equal length are said to \emph{parameterized match} (\emph{p-match}) if there is a renaming bijection $f:Σ\cup Π\rightarrow Σ\cup Π$ that is identity on $Σ$ and transforms $x$ to $y$ (or vice versa). The \emph{p-matching} problem is to look for substrings in a text that p-match a given pattern. In this paper, we propose \emph{parameterized suffix automata} (\emph{p-suffix automata}) and \emph{parameterized directed acyclic word graphs} (\emph{PDAWGs}) which are the p-matching versions of suffix automata and DAWGs. While suffix automata and DAWGs are equivalent for standard strings, we show that p-suffix automata can have $Θ(n^2)$ nodes and edges but PDAWGs have only $O(n)$ nodes and edges, where $n$ is the length of an input string. We also give an $O(n |Π| \log (|Π| + |Σ|))$-time $O(n)$-space algorithm that builds the PDAWG in a left-to-right online manner. As a byproduct, it is shown that the \emph{parameterized suffix tree} for the reversed string can also be built in the same time and space, in a right-to-left online manner. This duality also leads us to two further efficient algorithms for p-matching: Given the parameterized suffix tree for the reversal of the input string $T$, one can build the PDAWG of $T$ in $O(n)$ time in an offline manner; One can perform \emph{bidirectional} p-matching in $O(m \log (|Π|+|Σ|) + \mathit{occ})$ time using $O(n)$ space, where $m$ denotes the pattern length and $\mathit{occ}$ is the number of pattern occurrences in the text $T$.
Submitted 16 September, 2022; v1 submitted 17 February, 2020;
originally announced February 2020.
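
The p-match relation defined in the abstract above can be tested with the prev-encoding that is standard in the parameterized matching literature (it is not mentioned in the abstract itself): each parameter symbol is replaced by the distance to its previous occurrence, or 0 if there is none, while static symbols are kept. Two strings of equal length p-match exactly when their encodings are equal. The sketch below is a naive illustration, not the PDAWG construction; the names `prev_encode`, `p_match` and `p_match_occurrences` are assumptions for this example.

```python
def prev_encode(s, params):
    """Replace each parameter symbol by the distance to its previous
    occurrence (0 for a first occurrence); keep static symbols unchanged."""
    last_seen = {}
    encoded = []
    for i, c in enumerate(s):
        if c in params:
            encoded.append(i - last_seen[c] if c in last_seen else 0)
            last_seen[c] = i
        else:
            encoded.append(c)
    return encoded


def p_match(x, y, params):
    """True if x and y have equal length and equal prev-encodings."""
    return len(x) == len(y) and prev_encode(x, params) == prev_encode(y, params)


def p_match_occurrences(text, pattern, params):
    """Naive scan: re-encode every window of the text and compare."""
    m = len(pattern)
    target = prev_encode(pattern, params)
    return [i for i in range(len(text) - m + 1)
            if prev_encode(text[i:i + m], params) == target]


params = set("xyz")                       # parameter alphabet; all else is static
print(p_match("axbxy", "aybyz", params))  # True
print(p_match_occurrences("xaxyay", "zaz", params))  # [0, 3]
```

Each window is re-encoded from scratch because the prev-encoding of a substring is not simply a slice of the encoding of the whole text. This is one reason indexing structures for p-matching need more care than their standard-string counterparts.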
-
Faster STR-EC-LCS Computation
Authors:
Kohei Yamada,
Yuto Nakashima,
Shunsuke Inenaga,
Hideo Bannai,
Masayuki Takeda
Abstract:
The longest common subsequence (LCS) problem is a central problem in stringology that finds the longest common subsequence of two given strings $A$ and $B$. More recently, a set of four constrained LCS problems (called the generalized constrained LCS problem) was proposed by Chen and Chao [J. Comb. Optim, 2011]. In this paper, we consider the substring-excluding constrained LCS (STR-EC-LCS) problem. A string $Z$ is said to be an STR-EC-LCS of two given strings $A$ and $B$ excluding $P$ if $Z$ is one of the longest common subsequences of $A$ and $B$ that does not contain $P$ as a substring. Wang et al. proposed a dynamic programming solution which computes an STR-EC-LCS in $O(mnr)$ time and space where $m = |A|, n = |B|, r = |P|$ [Inf. Process. Lett., 2013]. In this paper, we show a new solution for the STR-EC-LCS problem. Our algorithm computes an STR-EC-LCS in $O(n|Σ| + (L+1)(m-L+1)r)$ time where $|Σ| \leq \min\{m, n\}$ denotes the number of distinct characters occurring in both $A$ and $B$, and $L$ is the length of the STR-EC-LCS. This algorithm is faster than the $O(mnr)$-time algorithm for short/long STR-EC-LCS (namely, $L \in O(1)$ or $m-L \in O(1)$), and is at least as efficient as the $O(mnr)$-time algorithm for all cases.
Submitted 16 January, 2020;
originally announced January 2020.
-
IterNet: Retinal Image Segmentation Utilizing Structural Redundancy in Vessel Networks
Authors:
Liangzhi Li,
Manisha Verma,
Yuta Nakashima,
Hajime Nagahara,
Ryo Kawasaki
Abstract:
Retinal vessel segmentation is of great interest for diagnosis of retinal vascular diseases. To further improve the performance of vessel segmentation, we propose IterNet, a new model based on UNet, with the ability to find obscured details of the vessel from the segmented vessel image itself, rather than the raw input image. IterNet consists of multiple iterations of a mini-UNet, which can be 4$\times$ deeper than the common UNet. IterNet also adopts the weight-sharing and skip-connection features to facilitate training; therefore, even with such a large architecture, IterNet can still learn from merely 10$\sim$20 labeled images, without pre-training or any prior knowledge. IterNet achieves AUCs of 0.9816, 0.9851, and 0.9881 on three mainstream datasets, namely DRIVE, CHASE-DB1, and STARE, respectively, which currently are the best scores in the literature. The source code is available.
Submitted 11 December, 2019;
originally announced December 2019.
-
KnowIT VQA: Answering Knowledge-Based Questions about Videos
Authors:
Noa Garcia,
Mayu Otani,
Chenhui Chu,
Yuta Nakashima
Abstract:
We propose a novel video understanding task by fusing knowledge-based and video question answering. First, we introduce KnowIT VQA, a video dataset with 24,282 human-generated question-answer pairs about a popular sitcom. The dataset combines visual, textual and temporal coherence reasoning together with knowledge-based questions, which can only be answered with the experience gained from viewing the series. Second, we propose a video understanding model by combining the visual and textual video content with specific knowledge about the show. Our main findings are: (i) the incorporation of knowledge produces outstanding improvements for VQA in video, and (ii) the performance on KnowIT VQA still lags well behind human accuracy, indicating its usefulness for studying current video modelling limitations.
Submitted 23 December, 2019; v1 submitted 22 October, 2019;
originally announced October 2019.
-
BUDA.ART: A Multimodal Content-Based Analysis and Retrieval System for Buddha Statues
Authors:
Benjamin Renoust,
Matheus Oliveira Franca,
Jacob Chan,
Van Le,
Ayaka Uesaka,
Yuta Nakashima,
Hajime Nagahara,
Jueren Wang,
Yutaka Fujioka
Abstract:
We introduce BUDA.ART, a system designed to help researchers in Art History explore and analyze an archive of pictures of Buddha statues. The system combines different CBIR and classical retrieval techniques to assemble 2D pictures, 3D statue scans and metadata, with a focus on the Buddha facial characteristics. We build the system from an archive of 50,000 Buddhism pictures, identify unique Buddha statues, extract contextual information, and provide specific facial embedding to first index the archive. The system supports mobile, on-site search and the exploration of similarities between statues in the archive. In addition, we provide search visualization and 3D analysis of the statues.
Submitted 17 September, 2019;
originally announced September 2019.
-
Historical and Modern Features for Buddha Statue Classification
Authors:
Benjamin Renoust,
Matheus Oliveira Franca,
Jacob Chan,
Noa Garcia,
Van Le,
Ayaka Uesaka,
Yuta Nakashima,
Hajime Nagahara,
Jueren Wang,
Yutaka Fujioka
Abstract:
While Buddhism has spread along the Silk Roads, many pieces of art have been displaced. Only a few experts can identify these works, and their judgments depend on their own experience. The construction of Buddha statues was taught through the definition of canon rules, but the application of those rules varies greatly across time and space. Automatic art analysis aims at supporting these challenges. We propose to automatically recover the proportions induced by the construction guidelines, in order to use them and compare different deep learning features for several classification tasks, in a medium-sized but rich dataset of Buddha statues, collected with experts of Buddhism art history.
Submitted 6 October, 2019; v1 submitted 17 September, 2019;
originally announced September 2019.
-
Minimal Unique Substrings and Minimal Absent Words in a Sliding Window
Authors:
Takuya Mieno,
Yuki Kuhara,
Tooru Akagi,
Yuta Fujishige,
Yuto Nakashima,
Shunsuke Inenaga,
Hideo Bannai,
Masayuki Takeda
Abstract:
A substring $u$ of a string $T$ is called a minimal unique substring (MUS) of $T$ if $u$ occurs exactly once in $T$ and any proper substring of $u$ occurs at least twice in $T$. A string $w$ is called a minimal absent word (MAW) of $T$ if $w$ does not occur in $T$ and any proper substring of $w$ occurs in $T$. In this paper, we study the problems of computing MUSs and MAWs in a sliding window over a given string $T$. We first show how the set of MUSs can change in a sliding window over $T$, and present an $O(n\logσ)$-time and $O(d)$-space algorithm to compute MUSs in a sliding window of width $d$ over $T$, where $σ$ is the maximum number of distinct characters in every window. We then give tight upper and lower bounds on the maximum number of changes in the set of MAWs in a sliding window over $T$. Our bounds improve on the previous results in [Crochemore et al., 2017].
Submitted 13 September, 2019; v1 submitted 6 September, 2019;
originally announced September 2019.
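
The MUS definition in the abstract above can be checked directly with a brute-force enumerator. This is an illustrative sketch on a whole string, not the sliding-window algorithm of the paper; the names `count_occurrences` and `minimal_unique_substrings` are assumptions for this example. It uses the observation that a unique substring $T[i..j]$ is minimal exactly when both $T[i+1..j]$ and $T[i..j-1]$ occur at least twice, because every proper substring is contained in one of the two.

```python
def count_occurrences(t, u):
    """Number of (possibly overlapping) occurrences of u in t."""
    return sum(1 for i in range(len(t) - len(u) + 1) if t.startswith(u, i))


def minimal_unique_substrings(t):
    """Return all MUSs of t as inclusive 0-based intervals (i, j)."""
    n = len(t)
    result = []
    for i in range(n):
        for j in range(i + 1, n + 1):
            if count_occurrences(t, t[i:j]) != 1:
                continue
            # t[i:j] is the shortest unique substring starting at i.
            # It is minimal iff dropping either end makes it repeat.
            if (count_occurrences(t, t[i + 1:j]) >= 2
                    and count_occurrences(t, t[i:j - 1]) >= 2):
                result.append((i, j - 1))
            # Any longer substring starting at i contains this unique one.
            break
    return result


print(minimal_unique_substrings("abaab"))  # [(1, 2), (2, 3)]
```

In `"abaab"` the MUSs are `"ba"` and `"aa"`. The substring `"aba"` is unique but not minimal because it contains the unique substring `"ba"`.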
-
On Longest Common Property Preserved Substring Queries
Authors:
Kazuki Kai,
Yuto Nakashima,
Shunsuke Inenaga,
Hideo Bannai,
Masayuki Takeda,
Tomasz Kociumaka
Abstract:
We revisit the problem of longest common property preserving substring queries introduced by Ayad et al. (SPIRE 2018, arXiv 2018). We consider a generalized and unified on-line setting, where we are given a set $X$ of $k$ strings of total length $n$ that can be pre-processed so that, given a query string $y$ and a positive integer $k'\leq k$, we can determine the longest substring of $y$ that satisfies some specific property and is common to at least $k'$ strings in $X$. Ayad et al. considered the longest square-free substring in an on-line setting and the longest periodic and palindromic substring in an off-line setting. In this paper, we give efficient solutions in the on-line setting for finding the longest common square, periodic, palindromic, and Lyndon substrings. More precisely, we show that $X$ can be pre-processed in $O(n)$ time resulting in a data structure of $O(n)$ size that answers queries in $O(|y|\logσ)$ time and $O(1)$ working space, where $σ$ is the size of the alphabet, and the common substring must be a square, a periodic substring, a palindrome, or a Lyndon word.
Submitted 13 June, 2019;
originally announced June 2019.
-
Direct Linear Time Construction of Parameterized Suffix and LCP Arrays for Constant Alphabets
Authors:
Noriki Fujisato,
Yuto Nakashima,
Shunsuke Inenaga,
Hideo Bannai,
Masayuki Takeda
Abstract:
We present the first worst-case linear time algorithm that directly computes the parameterized suffix and LCP arrays for constant sized alphabets. Previous algorithms either required quadratic time or the parameterized suffix tree to be built first. More formally, for a string over static alphabet $Σ$ and parameterized alphabet $Π$, our algorithm runs in $O(nπ)$ time and $O(n)$ words of space, where $π$ is the number of distinct symbols of $Π$ in the string.
Submitted 3 June, 2019;
originally announced June 2019.
-
Space-Efficient Algorithms for Computing Minimal/Shortest Unique Substrings
Authors:
Takuya Mieno,
Dominik Köppl,
Yuto Nakashima,
Shunsuke Inenaga,
Hideo Bannai,
Masayuki Takeda
Abstract:
Given a string $T$ of length $n$, a substring $u = T[i..j]$ of $T$ is called a shortest unique substring (SUS) for an interval $[s,t]$ if (a) $u$ occurs exactly once in $T$, (b) $u$ contains the interval $[s,t]$ (i.e. $i \leq s \leq t \leq j$), and (c) every substring $v$ of $T$ with $|v| < |u|$ containing $[s,t]$ occurs at least twice in $T$. Given a query interval $[s, t] \subset [1, n]$, the interval SUS problem is to output all the SUSs for the interval $[s,t]$. In this article, we propose a $4n + o(n)$ bits data structure answering an interval SUS query in output-sensitive $O(\mathit{occ})$ time, where $\mathit{occ}$ is the number of returned SUSs. Additionally, we focus on the point SUS problem, which is the interval SUS problem for $s = t$. Here, we propose a $\lceil (\log_2{3} + 1)n \rceil + o(n)$ bits data structure answering a point SUS query in the same output-sensitive time. We also propose space-efficient algorithms for computing the minimal unique substrings of $T$.
Submitted 14 September, 2020; v1 submitted 30 May, 2019;
originally announced May 2019.
-
A Compact Low-Latency Systematic Successive Cancellation Polar Decoder for Visible Light Communication Systems
Authors:
Duc-Phuc Nguyen,
Dinh-Dung Le,
Thi-Hong Tran,
Takashi Nakada,
Yasuhiko Nakashima
Abstract:
Channel polarization and Polar code are widely considered as major breakthroughs in coding theory because they have shown promising features for future wireless standards. The main drawbacks of Polar code are high latency in decoding hardware, and unimpressive error-correction performance when a limited code length is used. These two disadvantages limit implementation of Polar code in low-throughput wireless communication systems. In this paper, we propose a low-complexity low-latency hardware architecture for the soft-decision compact (16,11) Systematic Successive Cancellation Polar Decoder (S-SCD). Experimental results have shown that the latency of the proposed S-SCD improves by 3.75 times and 2.75 times compared with conventional and 2b-SC architectures, respectively. In addition, it shows better BER/FER performance than the RS(15,11) code, which is widely applied in current VLC-based systems.
Submitted 6 May, 2019;
originally announced May 2019.
-
Understanding Art through Multi-Modal Retrieval in Paintings
Authors:
Noa Garcia,
Benjamin Renoust,
Yuta Nakashima
Abstract:
In computer vision, visual arts are often studied from a purely aesthetics perspective, mostly by analysing the visual appearance of an artistic reproduction to infer its style, its author, or its representative features. In this work, however, we explore art from both a visual and a language perspective. Our aim is to bridge the gap between the visual appearance of an artwork and its underlying meaning, by jointly analysing its aesthetics and its semantics. We introduce the use of multi-modal techniques in the field of automatic art analysis by 1) collecting a multi-modal dataset with fine-art paintings and comments, and 2) exploring robust visual and textual representations in artistic images.
Submitted 23 April, 2019;
originally announced April 2019.
-
c-trie++: A Dynamic Trie Tailored for Fast Prefix Searches
Authors:
Kazuya Tsuruta,
Dominik Köppl,
Shunsuke Kanda,
Yuto Nakashima,
Shunsuke Inenaga,
Hideo Bannai,
Masayuki Takeda
Abstract:
Given a dynamic set $K$ of $k$ strings of total length $n$ whose characters are drawn from an alphabet of size $σ$, a keyword dictionary is a data structure built on $K$ that provides locate, prefix search, and update operations on $K$. Under the assumption that $α= w / \lg σ$ characters fit into a single machine word $w$, we propose a keyword dictionary that represents $K$ in $n \lg σ+ Θ(k \lg n)$ bits of space, supporting all operations in $O(m / α+ \lg α)$ expected time on an input string of length $m$ in the word RAM model. This data structure is underlined with an exhaustive practical evaluation, highlighting the practical usefulness of the proposed data structure, especially for prefix searches - one of the most elementary keyword dictionary operations.
Submitted 7 October, 2020; v1 submitted 16 April, 2019;
originally announced April 2019.
-
Context-Aware Embeddings for Automatic Art Analysis
Authors:
Noa Garcia,
Benjamin Renoust,
Yuta Nakashima
Abstract:
Automatic art analysis aims to classify and retrieve artistic representations from a collection of images by using computer vision and machine learning techniques. In this work, we propose to enhance visual representations from neural networks with contextual artistic information. Whereas visual representations are able to capture information about the content and the style of an artwork, our proposed context-aware embeddings additionally encode relationships between different artistic attributes, such as author, school, or historical period. We design two different approaches for using context in automatic art analysis. In the first one, contextual data is obtained through a multi-task learning model, in which several attributes are trained together to find visual relationships between elements. In the second approach, context is obtained through an art-specific knowledge graph, which encodes relationships between artistic attributes. An exhaustive evaluation of both of our models in several art analysis problems, such as author identification, type classification, or cross-modal retrieval, shows that performance is improved by up to 7.3% in art classification and 37.24% in retrieval when context-aware embeddings are used.
Submitted 9 April, 2019;
originally announced April 2019.
-
Non-RLL DC-Balance based on a Pre-scrambled Polar Encoder for Beacon-based Visible Light Communication Systems
Authors:
Duc-Phuc Nguyen,
Dinh-Dung Le,
Thi-Hong Tran,
Yasuhiko Nakashima
Abstract:
Current flicker mitigation (or DC-balance) solutions based on run-length limited (RLL) decoding algorithms are high in complexity, suffer from reduced code rates, or are limited in application to hard-decoding forward error correction (FEC) decoders. Fortunately, non-RLL DC-balance solutions can overcome the drawbacks of RLL-based algorithms, but they meet some difficulties in system latency, low code rate or inferior error-correction performance. Recently, a non-RLL flicker mitigation solution based on Polar code has proved to be an optimal approach due to its natural equal probabilities of short runs of 1's and 0's with high error-correction performance. However, we found that this solution can maintain DC balance only when the data frame length is sufficiently long. Therefore, it is not suitable for use in beacon-based visible light communication (VLC) systems, which usually transmit ID information in small-size data frames. In this paper, we introduce a flicker mitigation solution designed for beacon-based VLC systems that combines a simple pre-scrambler with a (256;158) non-systematic polar encoder.
Submitted 29 March, 2019;
originally announced April 2019.
-
Rethinking the Evaluation of Video Summaries
Authors:
Mayu Otani,
Yuta Nakashima,
Esa Rahtu,
Janne Heikkilä
Abstract:
Video summarization is a technique to create a short skim of the original video while preserving the main stories/content. There exists a substantial interest in automatizing this process due to the rapid growth of the available material. The recent progress has been facilitated by public benchmark datasets, which enable easy and fair comparison of methods. Currently the established evaluation protocol is to compare the generated summary with respect to a set of reference summaries provided by the dataset. In this paper, we provide an in-depth assessment of this pipeline using two popular benchmark datasets. Surprisingly, we observe that randomly generated summaries achieve performance comparable to or better than the state of the art. In some cases, the random summaries outperform even the human-generated summaries in leave-one-out experiments. Moreover, it turns out that the video segmentation, which is often considered as a fixed pre-processing method, has the most significant impact on the performance measure. Based on our observations, we propose alternative approaches for assessing the importance scores as well as an intuitive visualization of correlation between the estimated scoring and human annotations.
Submitted 11 April, 2019; v1 submitted 27 March, 2019;
originally announced March 2019.
-
Fast Algorithms for the Shortest Unique Palindromic Substring Problem on Run-Length Encoded Strings
Authors:
Kiichi Watanabe,
Yuto Nakashima,
Shunsuke Inenaga,
Hideo Bannai,
Masayuki Takeda
Abstract:
For a string $S$, a palindromic substring $S[i..j]$ is said to be a \emph{shortest unique palindromic substring} ($\mathit{SUPS}$) for an interval $[s, t]$ in $S$, if $S[i..j]$ occurs exactly once in $S$, the interval $[i, j]$ contains $[s, t]$, and every palindromic substring containing $[s, t]$ which is shorter than $S[i..j]$ occurs at least twice in $S$. In this paper, we study the problem of answering $\mathit{SUPS}$ queries on run-length encoded strings. We show how to preprocess a given run-length encoded string $\mathit{RLE}_{S}$ of size $m$ in $O(m)$ space and $O(m \log σ_{\mathit{RLE}_{S}} + m \sqrt{\log m / \log\log m})$ time so that all $\mathit{SUPSs}$ for any subsequent query interval can be answered in $O(\sqrt{\log m / \log\log m} + α)$ time, where $α$ is the number of outputs, and $σ_{\mathit{RLE}_{S}}$ is the number of distinct runs of $\mathit{RLE}_{S}$. Additionally, we consider a variant of the SUPS problem where a query interval is also given in a run-length encoded form. For this variant of the problem, we present two alternative algorithms with faster queries. The first one answers queries in $O(\sqrt{\log\log m /\log\log\log m} + α)$ time and can be built in $O(m \log σ_{\mathit{RLE}_{S}} + m \sqrt{\log m / \log\log m})$ time, and the second one answers queries in $O(\log \log m + α)$ time and can be built in $O(m \log σ_{\mathit{RLE}_{S}})$ time. Both of these data structures require $O(m)$ space.
Submitted 23 March, 2020; v1 submitted 14 March, 2019;
originally announced March 2019.
-
The Parameterized Position Heap of a Trie
Authors:
Noriki Fujisato,
Yuto Nakashima,
Shunsuke Inenaga,
Hideo Bannai,
Masayuki Takeda
Abstract:
Let $Σ$ and $Π$ be disjoint alphabets of respective size $σ$ and $π$. Two strings over $Σ\cup Π$ of equal length are said to parameterized match (p-match) if there is a bijection $f:Σ\cup Π\rightarrow Σ\cup Π$ such that (1) $f$ is identity on $Σ$ and (2) $f$ maps the characters of one string to those of the other string so that the two strings become identical. We consider the p-matching problem on a (reversed) trie $\mathcal{T}$ and a string pattern $P$ such that every path that p-matches $P$ has to be reported. Let $N$ be the size of the given trie $\mathcal{T}$. In this paper, we propose the parameterized position heap for $\mathcal{T}$ that occupies $O(N)$ space and supports p-matching queries in $O(m \log (σ+ π) + m π+ \mathit{pocc})$ time, where $m$ is the length of a query pattern $P$ and $\mathit{pocc}$ is the number of paths in $\mathcal{T}$ to report. We also present an algorithm which constructs the parameterized position heap for a given trie $\mathcal{T}$ in $O(N (σ+ π))$ time and working space.
Submitted 14 March, 2019;
originally announced March 2019.
-
Computing longest palindromic substring after single-character or block-wise edits
Authors:
Mitsuru Funakoshi,
Yuto Nakashima,
Shunsuke Inenaga,
Hideo Bannai,
Masayuki Takeda
Abstract:
Palindromes are important objects in strings which have been extensively studied from combinatorial, algorithmic, and bioinformatics points of view. It is known that the length of the longest palindromic substrings (LPSs) of a given string $T$ of length $n$ can be computed in $O(n)$ time by Manacher's algorithm [J. ACM '75]. In this paper, we consider the problem of finding the LPS after the string is edited. We present an algorithm that uses $O(n)$ time and space for preprocessing, and answers the length of the LPSs in $O(\log (\min \{σ, \log n\}))$ time after a single character substitution, insertion, or deletion, where $σ$ denotes the number of distinct characters appearing in $T$. We also propose an algorithm that uses $O(n)$ time and space for preprocessing, and answers the length of the LPSs in $O(\ell + \log \log n)$ time, after an existing substring in $T$ is replaced by a string of arbitrary length $\ell$.
Submitted 8 January, 2021; v1 submitted 30 January, 2019;
originally announced January 2019.
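For reference, the $O(n)$-time baseline cited in this abstract, Manacher's algorithm, can be sketched as follows. This is a minimal version that only returns the LPS length; the function name is hypothetical, and the code simply recomputes from scratch after an edit, which is exactly the cost that the paper's preprocessing is designed to avoid.

```python
def longest_palindrome_length(t):
    """Length of the longest palindromic substring of t (Manacher's algorithm)."""
    # Interleave separators so odd- and even-length palindromes are treated alike.
    s = "#" + "#".join(t) + "#"
    n = len(s)
    radius = [0] * n          # radius[i]: palindrome radius centred at s[i]
    center = right = 0        # rightmost palindrome found so far is s[2*center-right .. right]
    for i in range(n):
        if i < right:
            # Reuse the radius of the mirror position inside the known palindrome.
            radius[i] = min(right - i, radius[2 * center - i])
        lo, hi = i - radius[i] - 1, i + radius[i] + 1
        while lo >= 0 and hi < n and s[lo] == s[hi]:
            radius[i] += 1
            lo -= 1
            hi += 1
        if i + radius[i] > right:
            center, right = i, i + radius[i]
    # A radius in the interleaved string equals a palindrome length in t.
    return max(radius)


text = "abcab"
print(longest_palindrome_length(text))            # before the edit
edited = text[:2] + "a" + text[3:]                # substitute one character
print(longest_palindrome_length(edited))          # naive recomputation after the edit
```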
-
Efficiently computing runs on a trie
Authors:
Ryo Sugahara,
Yuto Nakashima,
Shunsuke Inenaga,
Hideo Bannai,
Masayuki Takeda
Abstract:
A maximal repetition, or run, in a string, is a maximal periodic substring whose smallest period is at most half the length of the substring. In this paper, we consider runs that correspond to a path on a trie, or in other words, on a rooted edge-labeled tree where one endpoint of the path must be an ancestor of the other. For a trie with $n$ edges, we show that the number of runs is less than $n$. We also show an asymptotic lower bound on the maximum density of runs in tries: $\lim_{n\rightarrow\infty}ρ_{\mathcal{T}}(n)/n \geq 0.993238$ where $ρ_{\mathcal{T}}(n)$ is the maximum number of runs in a trie with $n$ edges. Furthermore, we also show an $O(n\log \log n)$ time and $O(n)$ space algorithm for finding all runs.
Submitted 20 April, 2021; v1 submitted 29 January, 2019;
originally announced January 2019.
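The definition of a run given in this abstract can be made concrete with a brute-force enumeration on a single string, which corresponds to the special case of a trie with no branching. The sketch below is only an illustration of the definition under that assumption; it is far slower than the $O(n \log\log n)$ algorithm of the paper, and the function names are hypothetical.

```python
def smallest_period(w):
    """Smallest p >= 1 such that w[i] == w[i + p] for every valid i."""
    for p in range(1, len(w) + 1):
        if all(w[i] == w[i + p] for i in range(len(w) - p)):
            return p
    return 0  # only reached for the empty string


def runs(s):
    """All runs of s as (start, end, period) with s[start:end] the run."""
    n = len(s)
    found = []
    for p in range(1, n // 2 + 1):
        i = 0
        while i + p < n:
            if s[i] != s[i + p]:
                i += 1
                continue
            start = i
            while i + p < n and s[i] == s[i + p]:
                i += 1
            end = i + p  # s[start:end] cannot be extended while keeping period p
            # Keep it only if it spans at least two periods and p is its smallest period.
            if end - start >= 2 * p and smallest_period(s[start:end]) == p:
                found.append((start, end, p))
    return found


for start, end, period in runs("mississippi"):
    print(start, end, period, "mississippi"[start:end])
```

On "mississippi" this reports the three squares "ss", "ss", "pp" and the run "ississi" with period 3, four runs for a string of length 11.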
-
MR-RePair: Grammar Compression based on Maximal Repeats
Authors:
Isamu Furuya,
Takuya Takagi,
Yuto Nakashima,
Shunsuke Inenaga,
Hideo Bannai,
Takuya Kida
Abstract:
We analyze the grammar generation algorithm of the RePair compression algorithm and show the relation between a grammar generated by RePair and maximal repeats. We reveal that RePair replaces step by step the most frequent pairs within the corresponding most frequent maximal repeats. Then, we design a novel variant of RePair, called MR-RePair, which substitutes the most frequent maximal repeats at once instead of substituting the most frequent pairs consecutively. We implemented MR-RePair and compared the size of the grammar generated by MR-RePair to that by RePair on several text corpora. Our experiments show that MR-RePair generates more compact grammars than RePair does, especially for highly repetitive texts.
Submitted 18 February, 2019; v1 submitted 12 November, 2018;
originally announced November 2018.
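The pair-replacement loop of plain RePair, which this abstract takes as its starting point, can be sketched as below. This is a naive quadratic-time illustration, not the MR-RePair variant proposed in the paper and not an efficient RePair implementation; the nonterminal naming `("N", k)` and the function names are assumptions made for the example.

```python
from collections import Counter


def pair_counts(seq):
    """Count non-overlapping occurrences of each adjacent pair (left to right)."""
    counts = Counter()
    last_start = {}
    for i in range(len(seq) - 1):
        pair = (seq[i], seq[i + 1])
        if last_start.get(pair) == i - 1:
            continue  # overlaps the previous counted occurrence, e.g. "aaa"
        counts[pair] += 1
        last_start[pair] = i
    return counts


def repair(text):
    """Repeatedly replace a most frequent pair until no pair occurs twice."""
    seq = list(text)
    rules = {}
    while True:
        counts = pair_counts(seq)
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:
            break
        name = ("N", len(rules))  # hypothetical fresh nonterminal
        rules[name] = pair
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(name)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(symbol, rules):
    """Expand a symbol back into the text it derives."""
    if symbol in rules:
        left, right = rules[symbol]
        return expand(left, rules) + expand(right, rules)
    return symbol


text = "abracadabra abracadabra"
seq, rules = repair(text)
print(len(seq), len(rules))
print("".join(expand(sym, rules) for sym in seq) == text)
```

Which pair is chosen among equally frequent ones is left to `Counter.most_common`, so the resulting grammar may differ from other RePair implementations while still deriving the same text.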
-
Right-to-left online construction of parameterized position heaps
Authors:
Noriki Fujisato,
Yuto Nakashima,
Shunsuke Inenaga,
Hideo Bannai,
Masayuki Takeda
Abstract:
Two strings of equal length are said to parameterized match if there is a bijection that maps the characters of one string to those of the other string, so that the two strings become identical. The parameterized pattern matching problem is, given two strings $T$ and $P$, to find the occurrences of substrings in $T$ that parameterized match $P$. Diptarama et al. [Position Heaps for Parameterized Strings, CPM 2017] proposed an indexing data structure called parameterized position heaps, and gave a left-to-right online construction algorithm. In this paper, we present a right-to-left online construction algorithm for parameterized position heaps. For a text string $T$ of length $n$ over two kinds of alphabets $Σ$ and $Π$ of respective size $σ$ and $π$, our construction algorithm runs in $O(n \log (σ+π))$ time with $O(n)$ space. Our right-to-left parameterized position heaps support pattern matching queries in $O(m \log (σ+π) + mπ + \mathit{pocc})$ time, where $m$ is the length of a query pattern $P$ and $\mathit{pocc}$ is the number of occurrences to report. Our construction and pattern matching algorithms are as efficient as Diptarama et al.'s algorithms.
Submitted 2 August, 2018;
originally announced August 2018.
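The parameterized pattern matching problem stated in this abstract can be solved without any index by testing every window of $T$ and building the renaming bijection explicitly. The sketch below is such a naive $O(nm)$ baseline, given only to make the problem statement concrete; it is not the position-heap algorithm of the paper, and the function names and the `static` argument are assumptions for the example.

```python
def p_matches(x, y, static=frozenset()):
    """True if a bijection that fixes `static` symbols renames x into y."""
    if len(x) != len(y):
        return False
    forward, backward = {}, {}
    for a, b in zip(x, y):
        if a in static or b in static:
            if a != b:          # static symbols must match exactly
                return False
            continue
        # setdefault records the first image seen and returns it afterwards,
        # so a conflicting image in either direction breaks the bijection.
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return False
    return True


def p_occurrences(text, pattern, static=frozenset()):
    """Start positions of all windows of text that p-match pattern."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1)
            if p_matches(text[i:i + m], pattern, static)]


print(p_occurrences("abacdc", "xyx"))
print(p_occurrences("abacdc", "aya", static=frozenset("a")))
```

With no static symbols, both "aba" and "cdc" are renamings of "xyx"; once "a" is declared static, only the window that literally contains "a" at both ends matches "aya".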
-
Representing a Partially Observed Non-Rigid 3D Human Using Eigen-Texture and Eigen-Deformation
Authors:
Ryosuke Kimura,
Akihiko Sayo,
Fabian Lorenzo Dayrit,
Yuta Nakashima,
Hiroshi Kawasaki,
Ambrosio Blanco,
Katsushi Ikeuchi
Abstract:
Reconstruction of the shape and motion of humans from RGB-D is a challenging problem, receiving much attention in recent years. Recent approaches for full-body reconstruction use a statistical shape model, which is built upon accurate full-body scans of people in skin-tight clothes, to complete invisible parts due to occlusion. Such a statistical model can still be fitted to an RGB-D measurement with loose clothes but cannot describe its deformations, such as clothing wrinkles. Observed surfaces may be reconstructed precisely from actual measurements, while we have no cues for unobserved surfaces. For full-body reconstruction with loose clothes, we propose to use lower dimensional embeddings of texture and deformation referred to as eigen-texturing and eigen-deformation, to reproduce views of even unobserved surfaces. Provided a full-body reconstruction from a sequence of partial measurements as 3D meshes, the texture and deformation of each triangle are then embedded using eigen-decomposition. Combined with neural-network-based coefficient regression, our method synthesizes the texture and deformation from arbitrary viewpoints. We evaluate our method using simulated data and visually demonstrate how our method works on real data.
Submitted 7 July, 2018;
originally announced July 2018.
-
$O(n \log n)$-time text compression by LZ-style longest first substitution
Authors:
Akihiro Nishi,
Yuto Nakashima,
Shunsuke Inenaga,
Hideo Bannai,
Masayuki Takeda
Abstract:
Mauer et al. [A Lempel-Ziv-style Compression Method for Repetitive Texts, PSC 2017] proposed a hybrid text compression method called LZ-LFS which has features of both Lempel-Ziv 77 factorization and longest first substitution. They showed that LZ-LFS can achieve a better compression ratio for repetitive texts, compared to some state-of-the-art compression algorithms. The drawback of Mauer et al.'s method is that their LZ-LFS compression algorithm takes $O(n^2)$ time on an input string of length $n$. In this paper, we show a faster LZ-LFS compression algorithm that works in $O(n \log n)$ time. We also propose a simpler version of LZ-LFS that can be computed in $O(n)$ time.
Submitted 13 June, 2018;
originally announced June 2018.
-
iParaphrasing: Extracting Visually Grounded Paraphrases via an Image
Authors:
Chenhui Chu,
Mayu Otani,
Yuta Nakashima
Abstract:
A paraphrase is a restatement of the meaning of a text in other words. Paraphrases have been studied to enhance the performance of many natural language processing tasks. In this paper, we propose a novel task, iParaphrasing, to extract visually grounded paraphrases (VGPs), which are different phrasal expressions describing the same visual concept in an image. These extracted VGPs have the potential to improve language and image multimodal tasks such as visual question answering and image captioning. How to model the similarity between VGPs is the key to iParaphrasing. We apply various existing methods as well as propose a novel neural network-based method with image attention, and report the results of the first attempt toward iParaphrasing.
Submitted 11 June, 2018;
originally announced June 2018.
-
VLSI Architecture of Compact Non-RLL Beacon-based Visible Light Communication Transmitter and Receiver
Authors:
Duc-Phuc Nguyen,
Dinh-Dung Le,
Thi-Hong Tran,
Huu-Thuan Huynh,
Yasuhiko Nakashima
Abstract:
In this paper, we introduce a couple of hardware implementations of a compact VLC transmitter and receiver for the first time. Compared with related works, our VLC transmitter is a non-RLL one, which means that flicker mitigation can be guaranteed even without RLL codes. In particular, we have utilized a centralized bit probability distribution of a prescrambler and a Polar encoder to create a non-RLL flicker mitigation solution. Moreover, at the receiver, a 3-bit soft-decision filter is proposed to analyze signals received from a real VLC channel, extract log-likelihood ratio (LLR) values, and feed them to the FEC decoder. Therefore, soft-decoding of the Polar decoder can be implemented to improve the bit-error-rate (BER) performance of the VLC system. Finally, we introduce a novel very large scale integration (VLSI) architecture for the compact VLC transmitter and receiver, and synthesize our design with FPGA/ASIC synthesis tools. Owing to its non-RLL basis, our system has an evidently good code rate and reduced complexity compared with other RLL-based receiver works. We also present FPGA and ASIC synthesis results of the proposed architecture with evaluations of power consumption, area, energy per bit, and so on.
Submitted 9 May, 2018;
originally announced May 2018.
-
Hardware Implementation of A Non-RLL Soft-decoding Beacon-based Visible Light Communication Receiver
Authors:
Duc-Phuc Nguyen,
Dinh-Dung Le,
Thi-Hong Tran,
Huu-Thuan Huynh,
Yasuhiko Nakashima
Abstract:
Visible light communication (VLC)-based beacon systems, which usually transmit identification (ID) information in small-size data frames, are applied widely in indoor localization applications. Flicker of LED light should be avoided in any VLC system. Current flicker mitigation solutions based on run-length limited (RLL) codes suffer from reduced code rates, or are limited to hard-decoding forward error correction (FEC) decoders. Recently, soft-decoding techniques for RLL codes have been proposed to support soft-decoding FEC algorithms, but they potentially involve high-complexity and time-consuming computations. Fortunately, non-RLL direct current (DC)-balance solutions can overcome the drawbacks of RLL-based algorithms; however, they meet some difficulties in system latency or inferior error-correction performance. Recently, a non-RLL flicker mitigation solution based on Polar codes has proved to be an optimal approach due to its natural equal probabilities of short runs of 1's and 0's with high error-correction performance. However, we found that this solution can maintain the DC balance only when the data frame length is sufficiently long. Accordingly, short beacon-based data frames might still be a big challenge for flicker mitigation in such non-RLL cases. In this paper, we introduce a flicker mitigation solution designed for VLC-based beacon systems that combines a simple pre-scrambler with a Polar encoder which has a codeword 8 times smaller than that of the previous work. We also propose a hardware architecture for the proposed compact non-RLL VLC receiver for the first time. Also, a 3-bit soft-decision filter is introduced to enable soft-decoding of the Polar decoder to improve the performance of the receiver.
Submitted 29 May, 2018; v1 submitted 27 April, 2018;
originally announced May 2018.
-
Summarization of User-Generated Sports Video by Using Deep Action Recognition Features
Authors:
Antonio Tejero-de-Pablos,
Yuta Nakashima,
Tomokazu Sato,
Naokazu Yokoya,
Marko Linna,
Esa Rahtu
Abstract:
Automatically generating a summary of sports video poses the challenge of detecting interesting moments, or highlights, of a game. Traditional sports video summarization methods leverage editing conventions of broadcast sports video that facilitate the extraction of high-level semantics. However, user-generated videos are not edited, and thus traditional methods are not suitable to generate a summary. In order to solve this problem, this work proposes a novel video summarization method that uses players' actions as a cue to determine the highlights of the original video. A deep neural network-based approach is used to extract two types of action-related features and to classify video segments into interesting or uninteresting parts. The proposed method can be applied to any sport in which games consist of a succession of actions. In particular, this work considers the case of Kendo (Japanese fencing) as an example of a sport to evaluate the proposed method. The method is trained using Kendo videos with ground truth labels that indicate the video highlights. The labels are provided by annotators possessing different experience with respect to Kendo to demonstrate how the proposed method adapts to different needs. The performance of the proposed method is compared with several combinations of different features, and the results show that it outperforms previous summarization methods.
Submitted 13 April, 2018; v1 submitted 25 September, 2017;
originally announced September 2017.
-
On the Size of Lempel-Ziv and Lyndon Factorizations
Authors:
Juha Kärkkäinen,
Dominik Kempa,
Yuto Nakashima,
Simon J. Puglisi,
Arseny M. Shur
Abstract:
Lyndon factorization and Lempel-Ziv (LZ) factorization are both important tools for analysing the structure and complexity of strings, but their combinatorial structure is very different. In this paper, we establish the first direct connection between the two by showing that while the Lyndon factorization can be bigger than the non-overlapping LZ factorization (which we demonstrate by describing a new, non-trivial family of strings) it is never more than twice the size.
Submitted 27 November, 2016;
originally announced November 2016.
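The two factorizations compared in this abstract can be computed on small inputs with the sketch below. The Lyndon factorization uses Duval's linear-time algorithm; the non-overlapping LZ factorization is a naive greedy version in which each factor is either a fresh single character or the longest prefix of the remaining text that occurs entirely inside the already factorized part. That reading of "non-overlapping LZ" is an assumption about the exact variant used in the paper, and the function names are hypothetical.

```python
def lyndon_factorization(s):
    """Duval's algorithm: factor s into a non-increasing sequence of Lyndon words."""
    factors = []
    i, n = 0, len(s)
    while i < n:
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            # A strictly larger character restarts the comparison at i;
            # an equal character extends the current periodic prefix.
            k = i if s[k] < s[j] else k + 1
            j += 1
        # s[i:j] is a power of a Lyndon word of length j - k, possibly followed
        # by a proper prefix of it; emit the full copies only.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


def lz_nonoverlapping(s):
    """Greedy LZ factorization whose sources lie entirely inside s[:i]."""
    factors = []
    i, n = 0, len(s)
    while i < n:
        length = 0
        # Grow the factor while it still occurs within the processed prefix s[:i].
        while i + length < n and s.find(s[i:i + length + 1], 0, i) != -1:
            length += 1
        length = max(length, 1)  # a character not seen before forms its own factor
        factors.append(s[i:i + length])
        i += length
    return factors


word = "banana"
print(lyndon_factorization(word))
print(lz_nonoverlapping(word))
```

On "banana" the sketch yields 4 Lyndon factors and 5 LZ factors; running both functions on other small strings gives a quick way to observe the relationship between the two sizes that the abstract describes.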
-
Video Summarization using Deep Semantic Features
Authors:
Mayu Otani,
Yuta Nakashima,
Esa Rahtu,
Janne Heikkilä,
Naokazu Yokoya
Abstract:
This paper presents a video summarization technique for an Internet video to provide a quick way to overview its content. This is a challenging problem because finding important or informative parts of the original video requires understanding its content. Furthermore, the content of Internet videos is very diverse, ranging from home videos to documentaries, which makes video summarization much tougher as prior knowledge is rarely available. To tackle this problem, we propose to use deep video features that can encode various levels of content semantics, including objects, actions, and scenes, improving the efficiency of standard video summarization techniques. For this, we design a deep neural network that maps videos as well as descriptions to a common semantic space and jointly train it with associated pairs of videos and descriptions. To generate a video summary, we extract the deep features from each segment of the original video and apply a clustering-based summarization technique to them. We evaluate our video summaries using the SumMe dataset as well as baseline approaches. The results demonstrate the advantages of incorporating our deep semantic features in a video summarization technique.
Submitted 27 September, 2016;
originally announced September 2016.