-
Deep convolutional demosaicking network for multispectral polarization filter array
Authors:
Tomoharu Ishiuchi,
Kazuma Shinoda
Abstract:
To address the demosaicking problem in multispectral polarization filter array (MSPFA) imaging, we propose a multispectral polarization demosaicking network (MSPDNet) that improves image reconstruction accuracy. Imaging with a multispectral polarization filter array acquires multispectral polarization information in a snapshot. The full-resolution multispectral polarization image must be reconstructed from a mosaic image. In the proposed method, a sparse image in which pixel values of the same channel are extracted from a mosaic image is used as input to MSPDNet. Missing pixels are interpolated by learning spatial and wavelength correlations from the observed pixels in the mosaic image. Moreover, by using 3D convolution, features are extracted at each convolution layer, and by deepening the network, even detailed features of the multispectral polarization image can be learned. Experimental results show that MSPDNet can reconstruct multi-wavelength and multi-polarization angle information with high accuracy in terms of peak signal-to-noise ratio (PSNR) evaluation and visual quality, indicating the effectiveness of the proposed method compared to other methods.
Submitted 7 June, 2024;
originally announced June 2024.
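The MSPDNet abstract above says the network input is a sparse image in which the pixel values of one channel are extracted from the mosaic. A minimal sketch of that extraction step, assuming a periodic filter layout given as a small integer array; the function name `split_mosaic` and the 2x2 example pattern are hypothetical and not taken from the paper:

```python
import numpy as np

def split_mosaic(mosaic, pattern):
    """Return an array of shape (C, H, W) holding one sparse image per channel.

    mosaic  : (H, W) raw sensor image.
    pattern : (p, q) integer array; pattern[i, j] is the channel index observed
              at every pixel (r, c) with r % p == i and c % q == j.
              The layout is an assumption made for illustration.
    """
    mosaic = np.asarray(mosaic)
    pattern = np.asarray(pattern)
    h, w = mosaic.shape
    p, q = pattern.shape
    # Tile the periodic pattern over the sensor, then crop to the image size.
    reps = (-(-h // p), -(-w // q))  # ceiling division
    channel_map = np.tile(pattern, reps)[:h, :w]
    n_channels = int(pattern.max()) + 1
    sparse = np.zeros((n_channels, h, w), dtype=mosaic.dtype)
    for c in range(n_channels):
        mask = channel_map == c
        # Keep observed pixels of channel c; all other positions stay zero.
        sparse[c][mask] = mosaic[mask]
    return sparse

# Example: a 4x4 mosaic with a hypothetical 2x2, four-channel pattern.
demo_mosaic = np.arange(1, 17).reshape(4, 4)
demo_pattern = np.array([[0, 1], [2, 3]])
demo_sparse = split_mosaic(demo_mosaic, demo_pattern)
```

Each sparse image keeps the observed pixels in place and leaves zeros elsewhere, so summing over the channel axis recovers the original mosaic. The network is then responsible for filling in the zeros.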
-
Orthogonal Series Estimation for the Ratio of Conditional Expectation Functions
Authors:
Kazuhiko Shinoda,
Takahiro Hoshino
Abstract:
In various fields of data science, researchers are often interested in estimating the ratio of conditional expectation functions (CEFR). Specifically in causal inference problems, it is sometimes natural to consider ratio-based treatment effects, such as odds ratios and hazard ratios, and even difference-based treatment effects are identified as CEFR in some empirically relevant settings. This chapter develops the general framework for estimation and inference on CEFR, which allows the use of flexible machine learning for infinite-dimensional nuisance parameters. In the first stage of the framework, the orthogonal signals are constructed using debiased machine learning techniques to mitigate the negative impacts of the regularization bias in the nuisance estimates on the target estimates. The signals are then combined with a novel series estimator tailored for CEFR. We derive the pointwise and uniform asymptotic results for estimation and inference on CEFR, including the validity of the Gaussian bootstrap, and provide low-level sufficient conditions to apply the proposed framework to some specific examples. We demonstrate the finite-sample performance of the series estimator constructed under the proposed framework by numerical simulations. Finally, we apply the proposed method to estimate the causal effect of the 401(k) program on household assets.
Submitted 26 December, 2022;
originally announced December 2022.
-
Which Shortcut Solution Do Question Answering Models Prefer to Learn?
Authors:
Kazutoshi Shinoda,
Saku Sugawara,
Akiko Aizawa
Abstract:
Question answering (QA) models for reading comprehension tend to learn shortcut solutions rather than the solutions intended by QA datasets. QA models that have learned shortcut solutions can achieve human-level performance in shortcut examples where shortcuts are valid, but these same behaviors degrade generalization potential on anti-shortcut examples where shortcuts are invalid. Various methods have been proposed to mitigate this problem, but they do not fully take the characteristics of shortcuts themselves into account. We assume that the learnability of shortcuts, i.e., how easy it is to learn a shortcut, is useful to mitigate the problem. Thus, we first examine the learnability of the representative shortcuts on extractive and multiple-choice QA datasets. Behavioral tests using biased training sets reveal that shortcuts that exploit answer positions and word-label correlations are preferentially learned for extractive and multiple-choice QA, respectively. We find that the more learnable a shortcut is, the flatter and deeper the loss landscape is around the shortcut solution in the parameter space. We also find that the availability of the preferred shortcuts tends to make the task easier to perform from an information-theoretic viewpoint. Lastly, we experimentally show that the learnability of shortcuts can be utilized to construct an effective QA training set; the more learnable a shortcut is, the smaller the proportion of anti-shortcut examples required to achieve comparable performance on shortcut and anti-shortcut examples. We claim that the learnability of shortcuts should be considered when designing mitigation methods.
Submitted 29 November, 2022;
originally announced November 2022.
-
Penalizing Confident Predictions on Largely Perturbed Inputs Does Not Improve Out-of-Distribution Generalization in Question Answering
Authors:
Kazutoshi Shinoda,
Saku Sugawara,
Akiko Aizawa
Abstract:
Question answering (QA) models are shown to be insensitive to large perturbations to inputs; that is, they make correct and confident predictions even when given largely perturbed inputs from which humans can not correctly derive answers. In addition, QA models fail to generalize to other domains and adversarial test sets, while humans maintain high accuracy. Based on these observations, we assume that QA models do not use intended features necessary for human reading but rely on spurious features, causing the lack of generalization ability. Therefore, we attempt to answer the question: If the overconfident predictions of QA models for various types of perturbations are penalized, will the out-of-distribution (OOD) generalization be improved? To prevent models from making confident predictions on perturbed inputs, we first follow existing studies and maximize the entropy of the output probability for perturbed inputs. However, we find that QA models trained to be sensitive to a certain perturbation type are often insensitive to unseen types of perturbations. Thus, we simultaneously maximize the entropy for the four perturbation types (i.e., word- and sentence-level shuffling and deletion) to further close the gap between models and humans. Contrary to our expectations, although models become sensitive to the four types of perturbations, we find that the OOD generalization is not improved. Moreover, the OOD generalization is sometimes degraded after entropy maximization. Making unconfident predictions on largely perturbed inputs per se may be beneficial to gaining human trust. However, our negative results suggest that researchers should pay attention to the side effect of entropy maximization.
Submitted 29 November, 2022;
originally announced November 2022.
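The abstract above describes maximizing the entropy of the output probability on perturbed inputs. A minimal sketch of such a regularizer, assuming the model exposes logits for the perturbed input; the function names and the `weight` parameter are hypothetical, and the exact loss used in the paper may differ:

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum for numerical stability.
    z = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy(probs, eps=1e-12):
    # Shannon entropy (in nats) of each row of probabilities.
    return -np.sum(probs * np.log(probs + eps), axis=-1)

def loss_with_entropy_penalty(task_loss, perturbed_logits, weight=1.0):
    # Subtracting the mean entropy means that minimizing this loss
    # pushes predictions on perturbed inputs toward the uniform distribution.
    return task_loss - weight * entropy(softmax(perturbed_logits)).mean()

# A confident prediction has low entropy; a uniform one has the maximum, log(4).
confident = np.array([[10.0, 0.0, 0.0, 0.0]])
uniform = np.zeros((1, 4))
```

With this sign convention, a model that stays confident on a shuffled or truncated input pays a higher loss than one whose output is close to uniform.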
-
Look to the Right: Mitigating Relative Position Bias in Extractive Question Answering
Authors:
Kazutoshi Shinoda,
Saku Sugawara,
Akiko Aizawa
Abstract:
Extractive question answering (QA) models tend to exploit spurious correlations to make predictions when a training set has unintended biases. This tendency results in models not being generalizable to examples where the correlations do not hold. Determining the spurious correlations QA models can exploit is crucial in building generalizable QA models in real-world applications; moreover, a method needs to be developed that prevents these models from learning the spurious correlations even when a training set is biased. In this study, we discovered that the relative position of an answer, which is defined as the relative distance from an answer span to the closest question-context overlap word, can be exploited by QA models as superficial cues for making predictions. Specifically, we find that when the relative positions in a training set are biased, the performance on examples with relative positions unseen during training is significantly degraded. To mitigate the performance degradation for unseen relative positions, we propose an ensemble-based debiasing method that does not require prior knowledge about the distribution of relative positions. We demonstrate that the proposed method mitigates the models' reliance on relative positions using the biased and full SQuAD dataset. We hope that this study can help enhance the generalization ability of QA models in real-world applications.
Submitted 26 October, 2022;
originally announced October 2022.
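The abstract above defines the relative position of an answer as the relative distance from the answer span to the closest question-context overlap word. A rough sketch of one way to compute it on whitespace tokens. The sign convention, the lowercasing, and the choice to ignore tokens inside the answer span are assumptions made here, and the paper may also filter stop words or tokenize differently:

```python
def relative_position(context_tokens, question_tokens, answer_start, answer_end):
    """Signed distance from the answer span to the nearest overlap word.

    Negative: the nearest overlap word lies to the left of the answer.
    Positive: it lies to the right.
    None: no overlap word exists outside the answer span.
    answer_start and answer_end are inclusive token indices into the context.
    """
    question_vocab = {t.lower() for t in question_tokens}
    best = None
    for i, token in enumerate(context_tokens):
        if answer_start <= i <= answer_end:
            continue  # skip tokens inside the answer span itself
        if token.lower() not in question_vocab:
            continue  # not a question-context overlap word
        # Measure the distance from the nearer edge of the answer span.
        dist = i - answer_start if i < answer_start else i - answer_end
        if best is None or abs(dist) < abs(best):
            best = dist
    return best

# Example: the answer "1889" is token 5; the nearest overlap word is "built" (token 3).
context = "the tower was built in 1889 by engineers".split()
question = "when was the tower built".split()
example_distance = relative_position(context, question, 5, 5)
```

A training set in which this value is always negative would be biased in the sense the abstract describes: the model only ever sees answers to the right of the overlap words.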
-
Development of Fast and Precise Scan Mirror Mechanism for an Airborne Solar Telescope
Authors:
Takayoshi Oba,
Toshifumi Shimizu,
Yukio Katsukawa,
Masahito Kubo,
Yusuke Kawabata,
Hirohisa Hara,
Fumihiro Uraguchi,
Toshihiro Tsuzuki,
Tomonori Tamura,
Kazuya Shinoda,
Kazuhide Kodeki,
Kazuhiko Fukushima,
José Miguel Morales Fernández,
Antonio Sánchez Gómez,
María Balaguer Jimenéz,
David Hernández Expósito,
Achim Gandorfer
Abstract:
We developed a scan mirror mechanism (SMM) that enables a slit-based spectrometer or spectropolarimeter to precisely and quickly map an astronomical object. The SMM, designed to be installed in the optical path preceding the entrance slit, tilts a folding mirror and then moves the reflected image laterally on the slit plane, thereby feeding a different one-dimensional image to be dispersed by the spectroscopic equipment. In general, the SMM is required to scan quickly and broadly while precisely placing the slit position across the field-of-view (FOV). These performances are highly in demand for near-future observations, such as studies on the magnetohydrodynamics of the photosphere and the chromosphere. Our SMM implements a closed-loop control system by installing electromagnetic actuators and gap-based capacitance sensors. Our optical test measurements confirmed that the SMM fulfils the following performance criteria: i) supreme scan-step uniformity (linearity of 0.08%) across the wide scan range ($\pm$1005 arcsec), ii) high stability (3$\sigma$ = 0.1 arcsec), where the angles are expressed in mechanical angle, and iii) fast stepping speed (26 ms). The excellent capability of the SMM will be demonstrated soon in actual use by installing the mechanism for a near-infrared spectropolarimeter onboard the balloon-borne solar observatory for the third launch, Sunrise III.
Submitted 27 July, 2022;
originally announced July 2022.
-
Implicit Neural Representations for Variable Length Human Motion Generation
Authors:
Pablo Cervantes,
Yusuke Sekikawa,
Ikuro Sato,
Koichi Shinoda
Abstract:
We propose an action-conditional human motion generation method using variational implicit neural representations (INR). The variational formalism enables action-conditional distributions of INRs, from which one can easily sample representations to generate novel human motion sequences. Our method offers variable-length sequence generation by construction because a part of INR is optimized for a whole sequence of arbitrary length with temporal embeddings. In contrast, previous works reported difficulties with modeling variable-length sequences. We confirm that our method with a Transformer decoder outperforms all relevant methods on HumanAct12, NTU-RGBD, and UESTC datasets in terms of realism and diversity of generated motions. Surprisingly, even our method with an MLP decoder consistently outperforms the state-of-the-art Transformer-based auto-encoder. In particular, we show that variable-length motions generated by our method are better than fixed-length motions generated by the state-of-the-art method in terms of realism and diversity. Code at https://github.com/PACerv/ImplicitMotion.
Submitted 15 July, 2022; v1 submitted 25 March, 2022;
originally announced March 2022.
-
Multimodal Emotion Recognition with High-level Speech and Text Features
Authors:
Mariana Rodrigues Makiuchi,
Kuniaki Uto,
Koichi Shinoda
Abstract:
Automatic emotion recognition is one of the central concerns of the Human-Computer Interaction field as it can bridge the gap between humans and machines. Current works train deep learning models on low-level data representations to solve the emotion recognition task. Since emotion datasets often have a limited amount of data, these approaches may suffer from overfitting, and they may learn based on superficial cues. To address these issues, we propose a novel cross-representation speech model, inspired by disentanglement representation learning, to perform emotion recognition on wav2vec 2.0 speech features. We also train a CNN-based model to recognize emotions from text features extracted with Transformer-based models. We further combine the speech-based and text-based results with a score fusion approach. Our method is evaluated on the IEMOCAP dataset in a 4-class classification problem, and it surpasses current works on speech-only, text-only, and multimodal emotion recognition.
Submitted 29 September, 2021;
originally announced November 2021.
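The abstract above combines speech-based and text-based results with a score fusion approach. A minimal sketch of late score fusion as a weighted average of per-class probabilities; the function name, the equal default weighting, and the example scores are assumptions, since the abstract does not specify the fusion rule:

```python
import numpy as np

def fuse_scores(speech_probs, text_probs, speech_weight=0.5):
    """Weighted average of two models' class probabilities.

    speech_probs, text_probs : arrays of shape (..., n_classes).
    speech_weight            : hypothetical mixing weight in [0, 1].
    Returns the fused scores and the predicted class index.
    """
    speech_probs = np.asarray(speech_probs, dtype=float)
    text_probs = np.asarray(text_probs, dtype=float)
    fused = speech_weight * speech_probs + (1.0 - speech_weight) * text_probs
    return fused, fused.argmax(axis=-1)

# Example with four emotion classes (made-up scores, not results from the paper).
speech = [0.4, 0.3, 0.2, 0.1]
text = [0.1, 0.6, 0.2, 0.1]
fused_scores, predicted_class = fuse_scores(speech, text)
```

In practice the mixing weight would be tuned on a validation set. Here the speech model alone would pick class 0, while the fused scores favor class 1.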
-
Improving the Robustness to Variations of Objects and Instructions with a Neuro-Symbolic Approach for Interactive Instruction Following
Authors:
Kazutoshi Shinoda,
Yuki Takezawa,
Masahiro Suzuki,
Yusuke Iwasawa,
Yutaka Matsuo
Abstract:
An interactive instruction following task has been proposed as a benchmark for learning to map natural language instructions and first-person vision into sequences of actions to interact with objects in 3D environments. We found that an existing end-to-end neural model for this task tends to fail to interact with objects of unseen attributes and follow various instructions. We assume that this problem is caused by the high sensitivity of neural feature extraction to small changes in vision and language inputs. To mitigate this problem, we propose a neuro-symbolic approach that utilizes high-level symbolic features, which are robust to small changes in raw inputs, as intermediate representations. We verify the effectiveness of our model with the subtask evaluation on the ALFRED benchmark. Our experiments show that our approach significantly outperforms the end-to-end neural model by 9, 46, and 74 points in the success rate on the ToggleObject, PickupObject, and SliceObject subtasks in unseen environments respectively.
Submitted 15 November, 2022; v1 submitted 13 October, 2021;
originally announced October 2021.
-
Can Question Generation Debias Question Answering Models? A Case Study on Question-Context Lexical Overlap
Authors:
Kazutoshi Shinoda,
Saku Sugawara,
Akiko Aizawa
Abstract:
Question answering (QA) models for reading comprehension have been demonstrated to exploit unintended dataset biases such as question-context lexical overlap. This hinders QA models from generalizing to under-represented samples such as questions with low lexical overlap. Question generation (QG), a method for augmenting QA datasets, can be a solution for such performance degradation if QG can properly debias QA datasets. However, we discover that recent neural QG models are biased towards generating questions with high lexical overlap, which can amplify the dataset bias. Moreover, our analysis reveals that data augmentation with these QG models frequently impairs the performance on questions with low lexical overlap, while improving that on questions with high lexical overlap. To address this problem, we use a synonym replacement-based approach to augment questions with low lexical overlap. We demonstrate that the proposed data augmentation approach is simple yet effective to mitigate the degradation problem with only 70k synthetic examples. Our data is publicly available at https://github.com/KazutoshiShinoda/Synonym-Replacement.
Submitted 23 September, 2021;
originally announced September 2021.
-
Estimation of Local Average Treatment Effect by Data Combination
Authors:
Kazuhiko Shinoda,
Takahiro Hoshino
Abstract:
It is important to estimate the local average treatment effect (LATE) when compliance with a treatment assignment is incomplete. The previously proposed methods for LATE estimation required all relevant variables to be jointly observed in a single dataset; however, it is sometimes difficult or even impossible to collect such data in many real-world problems for technical or privacy reasons. We consider a novel problem setting in which LATE, as a function of covariates, is nonparametrically identified from the combination of separately observed datasets. For estimation, we show that the direct least squares method, which was originally developed for estimating the average treatment effect under complete compliance, is applicable to our setting. However, model selection and hyperparameter tuning for the direct least squares estimator can be unstable in practice since it is defined as a solution to the minimax problem. We then propose a weighted least squares estimator that enables simpler model selection by avoiding the minimax objective formulation. Unlike the inverse probability weighted (IPW) estimator, the proposed estimator directly uses the pre-estimated weight without inversion, avoiding the problems caused by the IPW methods. We demonstrate the effectiveness of our method through experiments using synthetic and real-world datasets.
Submitted 21 March, 2022; v1 submitted 10 September, 2021;
originally announced September 2021.
-
MSR-DARTS: Minimum Stable Rank of Differentiable Architecture Search
Authors:
Kengo Machida,
Kuniaki Uto,
Koichi Shinoda,
Taiji Suzuki
Abstract:
In neural architecture search (NAS), differentiable architecture search (DARTS) has recently attracted much attention due to its high efficiency. It defines an over-parameterized network with mixed edges, each of which represents all operator candidates, and jointly optimizes the weights of the network and its architecture in an alternating manner. However, this method finds a model with the weights converging faster than the others, and such a model with fastest convergence often leads to overfitting. Accordingly, the resulting model cannot always be well-generalized. To overcome this problem, we propose a method called minimum stable rank DARTS (MSR-DARTS), for finding a model with the best generalization error by replacing architecture optimization with the selection process using the minimum stable rank criterion. Specifically, a convolution operator is represented by a matrix, and MSR-DARTS selects the one with the smallest stable rank. We evaluated MSR-DARTS on CIFAR-10 and ImageNet datasets. It achieves an error rate of 2.54% with 4.0M parameters within 0.3 GPU-days on CIFAR-10, and a top-1 error rate of 23.9% on ImageNet. The official code is available at https://github.com/mtaecchhi/msrdarts.git.
Submitted 15 March, 2021; v1 submitted 19 September, 2020;
originally announced September 2020.
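The MSR-DARTS abstract above selects, among candidate operators represented as matrices, the one with the smallest stable rank. A minimal sketch using the standard definition of stable rank, the squared Frobenius norm divided by the squared spectral norm. How the paper maps a convolution operator to a matrix is not stated in the abstract, so the candidate matrices and names below are purely illustrative:

```python
import numpy as np

def stable_rank(matrix):
    """Stable rank ||A||_F^2 / ||A||_2^2 of a non-zero 2-D matrix."""
    a = np.asarray(matrix, dtype=float)
    fro_sq = np.sum(a ** 2)           # squared Frobenius norm
    spectral = np.linalg.norm(a, 2)   # largest singular value
    return fro_sq / spectral ** 2

def select_min_stable_rank(candidates):
    """Return the name of the candidate matrix with the smallest stable rank.

    candidates : dict mapping a (hypothetical) operator name to its matrix.
    """
    return min(candidates, key=lambda name: stable_rank(candidates[name]))

# The identity has stable rank equal to its size; a rank-one matrix has stable rank 1.
candidates = {
    "op_identity": np.eye(3),
    "op_rank_one": np.outer([1.0, 2.0, 3.0], [1.0, 0.5, 0.25]),
}
chosen = select_min_stable_rank(candidates)
```

The stable rank never exceeds the ordinary rank and changes smoothly with the singular values, which is why it is often used as a continuous proxy for rank.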
-
Speech Paralinguistic Approach for Detecting Dementia Using Gated Convolutional Neural Network
Authors:
Mariana Rodrigues Makiuchi,
Tifani Warnita,
Nakamasa Inoue,
Koichi Shinoda,
Michitaka Yoshimura,
Momoko Kitazawa,
Kei Funaki,
Yoko Eguchi,
Taishiro Kishimoto
Abstract:
We propose a non-invasive and cost-effective method to automatically detect dementia by utilizing solely speech audio data. We extract paralinguistic features for a short speech segment and use Gated Convolutional Neural Networks (GCNN) to classify it into dementia or healthy. We evaluate our method on the Pitt Corpus and on our own dataset, the PROMPT Database. Our method yields the accuracy of 73.1% on the Pitt Corpus using an average of 114 seconds of speech data. In the PROMPT Database, our method yields the accuracy of 74.7% using 4 seconds of speech data and it improves to 80.8% when we use all the patient's speech data. Furthermore, we evaluate our method on a three-class classification problem in which we included the Mild Cognitive Impairment (MCI) class and achieved the accuracy of 60.6% with 40 seconds of speech data.
Submitted 6 October, 2020; v1 submitted 16 April, 2020;
originally announced April 2020.
-
Improving the Robustness of QA Models to Challenge Sets with Variational Question-Answer Pair Generation
Authors:
Kazutoshi Shinoda,
Saku Sugawara,
Akiko Aizawa
Abstract:
Question answering (QA) models for reading comprehension have achieved human-level accuracy on in-distribution test sets. However, they have been demonstrated to lack robustness to challenge sets, whose distribution is different from that of training sets. Existing data augmentation methods mitigate this problem by simply augmenting training sets with synthetic examples sampled from the same distribution as the challenge sets. However, these methods assume that the distribution of a challenge set is known a priori, making them less applicable to unseen challenge sets. In this study, we focus on question-answer pair generation (QAG) to mitigate this problem. While most existing QAG methods aim to improve the quality of synthetic examples, we conjecture that diversity-promoting QAG can mitigate the sparsity of training sets and lead to better robustness. We present a variational QAG model that generates multiple diverse QA pairs from a paragraph. Our experiments show that our method can improve the accuracy of 12 challenge sets, as well as the in-distribution accuracy. Our code and data are available at https://github.com/KazutoshiShinoda/VQAG.
Submitted 3 June, 2021; v1 submitted 7 April, 2020;
originally announced April 2020.
-
Binary Classification from Positive Data with Skewed Confidence
Authors:
Kazuhiko Shinoda,
Hirotaka Kaji,
Masashi Sugiyama
Abstract:
Positive-confidence (Pconf) classification [Ishida et al., 2018] is a promising weakly-supervised learning method which trains a binary classifier only from positive data equipped with confidence. However, in practice, the confidence may be skewed by bias arising in the annotation process. The Pconf classifier cannot be properly learned with skewed confidence, and consequently, the classification performance may deteriorate. In this paper, we introduce a parameterized model of the skewed confidence and propose a method for selecting the hyperparameter that cancels out the negative impact of skewed confidence, under the assumption that the misclassification rate of positive samples is available as prior knowledge. We demonstrate the effectiveness of the proposed method through a synthetic experiment with simple linear models and benchmark problems with neural network models. We also apply our method to drivers' drowsiness prediction to show that it works well with a real-world problem where confidence is obtained based on manual annotation.
Submitted 28 January, 2020;
originally announced January 2020.
-
I4U Submission to NIST SRE 2018: Leveraging from a Decade of Shared Experiences
Authors:
Kong Aik Lee,
Ville Hautamaki,
Tomi Kinnunen,
Hitoshi Yamamoto,
Koji Okabe,
Ville Vestman,
Jing Huang,
Guohong Ding,
Hanwu Sun,
Anthony Larcher,
Rohan Kumar Das,
Haizhou Li,
Mickael Rouvier,
Pierre-Michel Bousquet,
Wei Rao,
Qing Wang,
Chunlei Zhang,
Fahimeh Bahmaninezhad,
Hector Delgado,
Jose Patino,
Qiongqiong Wang,
Ling Guo,
Takafumi Koshinaka,
Jiacen Zhang,
Koichi Shinoda,
et al. (21 additional authors not shown)
Abstract:
The I4U consortium was established to facilitate a joint entry to NIST speaker recognition evaluations (SRE). The latest edition of such joint submission was in SRE 2018, in which the I4U submission was among the best-performing systems. SRE'18 also marks the 10-year anniversary of the I4U consortium's participation in the NIST SRE series of evaluations. The primary objective of the current paper is to summarize the results and lessons learned based on the twelve sub-systems and their fusion submitted to SRE'18. It is also our intention to present a shared view on the advancements, progress, and major paradigm shifts that we have witnessed as an SRE participant in the past decade from SRE'08 to SRE'18. In this regard, we have seen, among others, a paradigm shift from supervector representation to deep speaker embedding, and a switch of research challenge from channel compensation to domain adaptation.
Submitted 15 April, 2019;
originally announced April 2019.
-
Multi-style Generative Reading Comprehension
Authors:
Kyosuke Nishida,
Itsumi Saito,
Kosuke Nishida,
Kazutoshi Shinoda,
Atsushi Otsuka,
Hisako Asano,
Junji Tomita
Abstract:
This study tackles generative reading comprehension (RC), which consists of answering questions based on textual evidence and natural language generation (NLG). We propose a multi-style abstractive summarization model for question answering, called Masque. The proposed model has two key characteristics. First, unlike most studies on RC that have focused on extracting an answer span from the provided passages, our model instead focuses on generating a summary from the question and multiple passages. This serves to cover various answer styles required for real-world applications. Second, whereas previous studies built a specific model for each answer style because of the difficulty of acquiring one general model, our approach learns multi-style answers within a model to improve the NLG capability for all styles involved. This also enables our model to give an answer in the target style. Experiments show that our model achieves state-of-the-art performance on the Q&A task and the Q&A + NLG task of MS MARCO 2.1 and the summary task of NarrativeQA. We observe that the transfer of the style-independent NLG capability to the target style is the key to its success.
Submitted 27 May, 2019; v1 submitted 8 January, 2019;
originally announced January 2019.
-
Sequence-Level Knowledge Distillation for Model Compression of Attention-based Sequence-to-Sequence Speech Recognition
Authors:
Raden Mu'az Mun'im,
Nakamasa Inoue,
Koichi Shinoda
Abstract:
We investigate the feasibility of sequence-level knowledge distillation of Sequence-to-Sequence (Seq2Seq) models for Large Vocabulary Continuous Speech Recognition (LVCSR). We first use a pre-trained larger teacher model to generate multiple hypotheses per utterance with beam search. With the same input, we then train the student model using these hypotheses generated from the teacher as pseudo labels in place of the original ground truth labels. We evaluate our proposed method using the Wall Street Journal (WSJ) corpus. It achieved up to $9.8\times$ parameter reduction with an accuracy loss of up to 7.0% word-error rate (WER) increase.
Submitted 11 November, 2018;
originally announced November 2018.
-
Deep Learning Based Multi-modal Addressee Recognition in Visual Scenes with Utterances
Authors:
Thao Minh Le,
Nobuyuki Shimizu,
Takashi Miyazaki,
Koichi Shinoda
Abstract:
With the widespread use of intelligent systems, such as smart speakers, addressee recognition has become a concern in human-computer interaction, as more and more people expect such systems to understand complicated social scenes, including those outdoors, in cafeterias, and hospitals. Because previous studies typically focused only on pre-specified tasks with limited conversational situations such as controlling smart homes, we created a mock dataset called Addressee Recognition in Visual Scenes with Utterances (ARVSU) that contains a vast body of image variations in visual scenes with an annotated utterance and a corresponding addressee for each scenario. We also propose a multi-modal deep-learning-based model that takes different human cues, specifically eye gazes and transcripts of an utterance corpus, into account to predict the conversational addressee from a specific speaker's view in various real-life conversational scenarios. To the best of our knowledge, we are the first to introduce an end-to-end deep learning model that combines vision and transcripts of utterance for addressee recognition. As a result, our study suggests that future addressee recognition can reach the ability to understand human intention in many social situations previously unexplored, and our modality dataset is a first step in promoting research in this field.
Submitted 12 September, 2018;
originally announced September 2018.
-
Snapshot multispectral imaging using a filter array
Authors:
Kazuma Shinoda
Abstract:
A multispectral filter array (MSFA) is one solution for capturing a multispectral image (MSI) in a single shot at low cost. We introduce our optimization method of the spectral sensitivity of the MSFAs and demosaicking, and show a new prototype filter array for snapshot imaging based on a photonic crystal.
Submitted 28 August, 2018;
originally announced August 2018.
-
Deep demosaicking for multispectral filter arrays
Authors:
Kazuma Shinoda,
Shoichiro Yoshiba,
Madoka Hasegawa
Abstract:
We propose a novel demosaicking method for multispectral filter arrays based on a deep convolutional neural network. The proposed method first interpolates mosaicked multispectral images utilizing a bilinear approach, then applies a residual network to initial demosaicked images. The residual network consists of various three-dimensional convolutional layers and a rectified linear unit for describing the features of a multispectral data cube. Experimental results reveal that the proposed method outperforms conventional demosaicking methods.
Submitted 21 October, 2018; v1 submitted 24 August, 2018;
originally announced August 2018.
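The two-stage idea in the abstract above (an initial interpolation of the mosaic, later refined by a network) can be illustrated for the first stage only. The following is a minimal sketch, not the authors' implementation: the function names, the 2x2 four-band tile, and the use of normalized convolution with a triangle kernel as the "bilinear" step are all assumptions made for illustration.

```python
import numpy as np


def mosaic(cube, pattern):
    """Sample an (H, W, B) cube through a periodic filter array.

    `pattern` is a small integer array giving the band index seen by each
    pixel of one tile (hypothetical layout); it is repeated over the image.
    """
    height, width, _ = cube.shape
    p, q = pattern.shape
    band = np.tile(pattern, (height // p + 1, width // q + 1))[:height, :width]
    rows, cols = np.indices((height, width))
    return cube[rows, cols, band], band


def interpolate_band(raw, band, b, radius):
    """Fill in band `b` by normalized convolution with a triangle kernel.

    Assumption: `radius` is at least (tile size - 1), so every window
    contains at least one observed pixel of band `b`.
    """
    mask = (band == b).astype(float)
    taps = 1.0 - np.abs(np.arange(-radius, radius + 1)) / (radius + 1)
    kernel = np.outer(taps, taps)
    height, width = raw.shape

    def smooth(img):
        # Zero-padded 2D correlation with the (symmetric) kernel.
        padded = np.pad(img, radius)
        out = np.zeros((height, width))
        for dy in range(2 * radius + 1):
            for dx in range(2 * radius + 1):
                out += kernel[dy, dx] * padded[dy:dy + height, dx:dx + width]
        return out

    # Weighted average of the observed pixels only.
    return smooth(raw * mask) / smooth(mask)


def initial_demosaic(raw, band, n_bands, radius):
    """Stack the per-band interpolations into an (H, W, B) estimate."""
    planes = [interpolate_band(raw, band, b, radius) for b in range(n_bands)]
    return np.stack(planes, axis=-1)
```

In a pipeline like the one described, the output of `initial_demosaic` would be the input that a residual network then corrects; the network itself is not sketched here.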
-
Few-Shot Adaptation for Multimedia Semantic Indexing
Authors:
Nakamasa Inoue,
Koichi Shinoda
Abstract:
We propose a few-shot adaptation framework, which bridges zero-shot learning and supervised many-shot learning, for semantic indexing of image and video data. Few-shot adaptation provides robust parameter estimation with few training examples, by optimizing the parameters of zero-shot learning and supervised many-shot learning simultaneously. In this method, first we build a zero-shot detector, and then update it by using the few examples. Our experiments show the effectiveness of the proposed framework on three datasets: TRECVID Semantic Indexing 2010, 2014, and ImageNET. On the ImageNET dataset, we show that our method outperforms recent few-shot learning methods. On the TRECVID 2014 dataset, we achieve 15.19% and 35.98% in Mean Average Precision under the zero-shot condition and the supervised condition, respectively. To the best of our knowledge, these are the best results on this dataset.
Submitted 18 July, 2018;
originally announced July 2018.
-
Optimal Spectral Sensitivity of Multispectral Filter Array for Pathological Images
Authors:
Kazuma Shinoda,
Maru Kawase,
Madoka Hasegawa,
Masahiro Ishikawa,
Hideki Komagata,
Naoki Kobayashi
Abstract:
A capturing system with multispectral filter array (MSFA) technology has been researched to shorten the capturing time and reduce the cost. In this system, the mosaicked image captured by the MSFA is demosaicked to reconstruct multispectral images (MSIs). In this paper, we focus on the spectral sensitivity design of an MSFA and propose a pathology-specific MSFA. The proposed method optimizes the MSFA by minimizing the reconstruction error between training data of a pathological tissue and a demosaicked MSI under a cost function. First, the spectral sensitivities of the filter array are set randomly, and the mosaicked image is obtained from the training data and the filter array. Then, a reconstructed image is obtained by Wiener estimation. The spectral sensitivities of the filter array are optimized iteratively by an interior-point approach to minimize the reconstruction error. We show the effectiveness of the proposed MSFA by comparing the recovered spectrum and RGB image with those of a conventional method.
Submitted 3 July, 2018;
originally announced July 2018.
-
Joint optimization of multispectral filter arrays and demosaicking for pathological images
Authors:
Kazuma Shinoda,
Maru Kawase,
Madoka Hasegawa,
Masahiro Ishikawa,
Hideki Komagata,
Naoki Kobayashi
Abstract:
A capturing system with multispectral filter array (MSFA) technology is proposed for shortening the capture time and reducing costs. Therein, a mosaicked image captured using an MSFA is demosaicked to reconstruct multispectral images (MSIs). Joint optimization of the spectral sensitivity of the MSFAs and demosaicking is considered, and pathology-specific multispectral imaging is proposed. This optimizes the MSFA and the demosaicking matrix by minimizing the reconstruction error between the training data of a hematoxylin and eosin-stained pathological tissue and a demosaicked MSI using a cost function. Initially, the spectral sensitivity of the filter array is set randomly and the mosaicked image is obtained from the training data. Subsequently, a reconstructed image is obtained using Wiener estimation. To minimize the reconstruction error, the spectral sensitivity of the filter array and the Wiener estimation matrix are optimized iteratively through an interior-point approach. The effectiveness of the proposed MSFA and demosaicking is demonstrated by comparing the recovered spectrum and RGB image with those obtained using a conventional method.
Submitted 3 July, 2018;
originally announced July 2018.
-
A Fine-to-Coarse Convolutional Neural Network for 3D Human Action Recognition
Authors:
Thao Minh Le,
Nakamasa Inoue,
Koichi Shinoda
Abstract:
This paper presents a new framework for human action recognition from a 3D skeleton sequence. Previous studies do not fully utilize the temporal relationships between video segments in a human action. Some studies have successfully used very deep Convolutional Neural Network (CNN) models but often suffer from the data insufficiency problem. In this study, we first segment a skeleton sequence into distinct temporal segments in order to exploit the correlations between them. The temporal and spatial features of a skeleton sequence are then extracted simultaneously by utilizing a fine-to-coarse (F2C) CNN architecture optimized for human skeleton sequences. We evaluate our proposed method on the NTU RGB+D and SBU Kinect Interaction datasets. It achieves accuracies of 79.6% and 84.6% on NTU RGB+D with the cross-object and cross-view protocols, respectively, which are almost identical to the state-of-the-art performance. In addition, our method significantly improves the accuracy on actions in two-person interactions.
Submitted 18 August, 2018; v1 submitted 29 May, 2018;
originally announced May 2018.
-
I-vector Transformation Using Conditional Generative Adversarial Networks for Short Utterance Speaker Verification
Authors:
Jiacen Zhang,
Nakamasa Inoue,
Koichi Shinoda
Abstract:
I-vector based text-independent speaker verification (SV) systems often perform poorly with short utterances, as the biased phonetic distribution in a short utterance makes the extracted i-vector unreliable. This paper proposes an i-vector compensation method using a generative adversarial network (GAN), whose generator network is trained to generate a compensated i-vector from a short-utterance i-vector and whose discriminator network is trained to determine whether an i-vector was generated by the generator or extracted from a long utterance. Additionally, we assign two other learning tasks to the GAN to stabilize its training and to make the generated i-vector more speaker-specific. Speaker verification experiments on the NIST SRE 2008 "10sec-10sec" condition show that our method reduced the equal error rate by 11.3% from the conventional i-vector and PLDA system.
Submitted 1 April, 2018;
originally announced April 2018.
-
Detecting Alzheimer's Disease Using Gated Convolutional Neural Network from Audio Data
Authors:
Tifani Warnita,
Nakamasa Inoue,
Koichi Shinoda
Abstract:
We propose a method for automatically detecting Alzheimer's disease from speech data using a gated convolutional neural network (GCNN). This GCNN can be trained with a relatively small amount of data and can capture the temporal information in audio paralinguistic features. Since it does not utilize any linguistic features, it can be easily applied to any language. We evaluated our method using the Pitt Corpus. The proposed method achieved an accuracy of 73.6%, which is better than the conventional sequential minimal optimization (SMO) by 7.6 points.
Submitted 30 March, 2018;
originally announced March 2018.
-
Attentive Statistics Pooling for Deep Speaker Embedding
Authors:
Koji Okabe,
Takafumi Koshinaka,
Koichi Shinoda
Abstract:
This paper proposes attentive statistics pooling for deep speaker embedding in text-independent speaker verification. In conventional speaker embedding, frame-level features are averaged over all the frames of a single utterance to form an utterance-level feature. Our method utilizes an attention mechanism to give different weights to different frames and generates not only weighted means but also weighted standard deviations. In this way, it can capture long-term variations in speaker characteristics more effectively. An evaluation on the NIST SRE 2012 and the VoxCeleb data sets shows that it reduces equal error rates (EERs) from the conventional method by 7.5% and 8.1%, respectively.
Submitted 24 February, 2019; v1 submitted 29 March, 2018;
originally announced March 2018.
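The pooling step described in the abstract above, attention weights over frames followed by a weighted mean and a weighted standard deviation, can be sketched as follows. This is a minimal illustration and not the authors' code: the parameter names `W`, `b`, `v`, `k`, the single-hidden-layer scoring function, and the choice of `tanh` as its nonlinearity are assumptions.

```python
import numpy as np


def attentive_stats_pooling(h, W, b, v, k=0.0):
    """Pool (T, D) frame-level features into one (2 * D,) utterance vector.

    W: (D, A), b: (A,), v: (A,), k: scalar are hypothetical attention
    parameters; in practice they would be learned with the network.
    """
    # One scalar attention score per frame (tanh is an assumed activation).
    scores = np.tanh(h @ W + b) @ v + k
    # Softmax over frames, shifted for numerical stability.
    scores = scores - scores.max()
    alpha = np.exp(scores)
    alpha = alpha / alpha.sum()
    # Attention-weighted mean and standard deviation per feature dimension.
    mean = alpha @ h
    var = alpha @ (h * h) - mean * mean
    std = np.sqrt(np.clip(var, 0.0, None))
    return np.concatenate([mean, std])
```

With all scores equal (for example `v` set to zeros), the weights become uniform and the output reduces to the plain per-dimension mean and standard deviation, which corresponds to the unweighted statistics case.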
-
Mosaicked multispectral image compression based on inter- and intra-band correlation
Authors:
Kazuma Shinoda,
Madoka Hasegawa,
Masahiro Yamaguchi,
Antonio Ortega
Abstract:
Multispectral imaging has been utilized in many fields, but the cost of capturing and storing image data is still high. Single-sensor cameras with multispectral filter arrays can reduce the cost of capturing images at the expense of slightly lower image quality. When multispectral filter arrays are used, conventional multispectral image compression methods can be applied after interpolation, but the compressed image data after interpolation has some redundancy because the interpolated data are computed from the captured raw data. In this paper, we propose an efficient image compression method for single-sensor multispectral cameras. The proposed method encodes the captured multispectral data before interpolation. We also propose a new spectral transform method for the compression of mosaicked multispectral images. This transform is designed by considering the filter arrangement and the spectral sensitivities of a multispectral filter array. The experimental results show that the proposed method achieves a higher peak signal-to-noise ratio at higher bit rates than a conventional compression method that encodes a multispectral image after interpolation, e.g., a 3-dB gain over conventional compression when coding at rates of over 0.1 bit/pixel/band.
Submitted 10 January, 2018;
originally announced January 2018.
-
Chaotic Griffiths Phase with Anomalous Lyapunov Spectra in Coupled Map Networks
Authors:
Kenji Shinoda,
Kunihiko Kaneko
Abstract:
Dynamics of coupled chaotic oscillators on a network are studied using coupled maps. Within a broad range of parameter values representing the coupling strength or the degree of elements, the system repeats the formation and splitting of coherent clusters. The distribution of the cluster size follows a power law with exponent $α$, which changes with the parameter values. The number of positive Lyapunov exponents and their spectra scale anomalously as a power of the system size with exponent $β$, which also changes with the parameters. The scaling relation $α\sim 2(β+1)$ is uncovered, which seems to be universal, independent of parameters and networks.
Submitted 14 May, 2016; v1 submitted 7 May, 2016;
originally announced May 2016.
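For readers unfamiliar with coupled maps on a network, the kind of system the abstract above refers to can be sketched as below. This is an illustrative sketch under stated assumptions, not the model or parameters used in the paper: the logistic map f(x) = 1 - a x^2, the degree-normalized diffusive coupling, the ring-with-shortcuts network, and the tolerance-based definition of a cluster are all choices made here for the example.

```python
import numpy as np


def ring_with_shortcuts(n, n_extra, rng):
    """Symmetric adjacency matrix: a ring plus random extra links (assumed topology)."""
    adj = np.zeros((n, n))
    idx = np.arange(n)
    adj[idx, (idx + 1) % n] = 1.0
    adj[(idx + 1) % n, idx] = 1.0
    for _ in range(n_extra):
        i, j = rng.choice(n, size=2, replace=False)
        adj[i, j] = adj[j, i] = 1.0
    return adj


def step(x, adj, eps, a):
    """One update: x_i <- (1 - eps) f(x_i) + eps * mean of f(x_j) over neighbours j."""
    fx = 1.0 - a * x**2
    deg = adj.sum(axis=1)
    return (1.0 - eps) * fx + eps * (adj @ fx) / deg


def cluster_sizes(x, tol=1e-6):
    """Sizes of groups of elements whose sorted values differ by at most tol."""
    xs = np.sort(x)
    breaks = np.flatnonzero(np.diff(xs) > tol)
    edges = np.concatenate([[0], breaks + 1, [len(xs)]])
    return np.diff(edges)
```

Iterating `step` from random initial values and recording `cluster_sizes` over time gives a cluster-size histogram, which is the type of quantity whose power-law behaviour the abstract discusses; the coupling `eps` and nonlinearity `a` are the tunable parameters in this sketch.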
-
Classical and Quantum Cosmology of Multigravity
Authors:
Teruki Hanada,
Koichiro Kobayashi,
Kazuhiko Shinoda,
Kiyoshi Shiraishi
Abstract:
Recently, a multigraviton theory on a simple closed circuit graph corresponding to the discretization of $S^1$ compactification of the Kaluza-Klein (KK) theory has been considered. In the present paper, we extend this theory to that on a general graph and study what modes of particles are included. Furthermore, we generalize it in a possible nonlinear theory based on the vierbein formalism and study classical and quantum cosmological solutions in the theory. We found that scale factors in a solution for this theory repeat acceleration and deceleration.
Submitted 24 August, 2010; v1 submitted 29 April, 2010;
originally announced April 2010.
-
Cosmology of multigravity
Authors:
Teruki Hanada,
Kazuhiko Shinoda,
Kiyoshi Shiraishi
Abstract:
We have constructed a nonlinear multi-graviton theory. An application of this theory to cosmology is discussed. We found that scale factors in a solution for this theory repeat acceleration and deceleration.
Submitted 5 February, 2009; v1 submitted 31 January, 2009;
originally announced February 2009.
-
Multi-graviton theory in vierbein formalism
Authors:
Teruki Hanada,
Kazuhiko Shinoda,
Kiyoshi Shiraishi
Abstract:
Recently, multi-graviton theory on a simple closed circuit graph corresponding to the $S^1$ compactification of the Kaluza-Klein (KK) theory has been considered. In the present paper, we extend this theory to that on a general graph and study what modes of particles are included. Furthermore, we generalize it in a possible non-linear theory based on the vierbein formalism and study cosmological solutions.
Submitted 17 January, 2008;
originally announced January 2008.