Search | arXiv e-print repository

Deep convolutional demosaicking network for multispectral polarization filter array

Authors: Tomoharu Ishiuchi, Kazuma Shinoda

Abstract: To address the demosaicking problem in multispectral polarization filter array (MSPFA) imaging, we propose a multispectral polarization demosaicking network (MSPDNet) that improves image reconstruction accuracy. Imaging with a multispectral polarization filter array acquires multispectral polarization information in a snapshot. The full-resolution multispectral polarization image must be reconstructed from a mosaic image. In the proposed method, a sparse image in which pixel values of the same channel are extracted from a mosaic image is used as input to MSPDNet. Missing pixels are interpolated by learning spatial and wavelength correlations from the observed pixels in the mosaic image. Moreover, by using 3D convolution, features are extracted at each convolution layer, and by deepening the network, even detailed features of the multispectral polarization image can be learned. Experimental results show that MSPDNet can reconstruct multi-wavelength and multi-polarization angle information with high accuracy in terms of peak signal-to-noise ratio (PSNR) evaluation and visual quality, indicating the effectiveness of the proposed method compared to other methods.

Submitted 7 June, 2024; originally announced June 2024.
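
The sparse-image input described in the abstract above can be illustrated with a small sketch. This is not the authors' code: the function name `extract_sparse_channels`, the 2x2 filter pattern, and the choice of 0 for unobserved pixels are all assumptions made for illustration. It only shows how pixels of the same channel might be pulled out of a mosaic into one sparse image per channel.

```python
def extract_sparse_channels(mosaic, pattern):
    """Split a mosaic image into one sparse image per filter channel.

    mosaic  : 2D list of pixel values (H x W)
    pattern : 2D list of channel indices, tiled periodically over the mosaic
              (hypothetical layout; a real MSPFA pattern may differ)
    Returns a list of H x W images; unobserved pixels are set to 0 (assumption).
    """
    ph, pw = len(pattern), len(pattern[0])
    n_channels = max(max(row) for row in pattern) + 1
    height, width = len(mosaic), len(mosaic[0])
    # One all-zero image per channel.
    sparse = [[[0] * width for _ in range(height)] for _ in range(n_channels)]
    for y in range(height):
        for x in range(width):
            c = pattern[y % ph][x % pw]      # channel observed at this pixel
            sparse[c][y][x] = mosaic[y][x]   # copy only the observed value
    return sparse


# Toy 4x4 mosaic with a hypothetical 2x2 pattern of four channels.
mosaic = [[1, 2, 3, 4],
          [5, 6, 7, 8],
          [9, 10, 11, 12],
          [13, 14, 15, 16]]
pattern = [[0, 1],
           [2, 3]]
channels = extract_sparse_channels(mosaic, pattern)
```

In this toy case each channel keeps a quarter of the pixels; a network such as the one described would then be responsible for filling in the zeros.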

arXiv:2111.10202 [pdf, other]

Multimodal Emotion Recognition with High-level Speech and Text Features

Authors: Mariana Rodrigues Makiuchi, Kuniaki Uto, Koichi Shinoda

Abstract: Automatic emotion recognition is one of the central concerns of the Human-Computer Interaction field as it can bridge the gap between humans and machines. Current works train deep learning models on low-level data representations to solve the emotion recognition task. Since emotion datasets often have a limited amount of data, these approaches may suffer from overfitting, and they may learn based on superficial cues. To address these issues, we propose a novel cross-representation speech model, inspired by disentanglement representation learning, to perform emotion recognition on wav2vec 2.0 speech features. We also train a CNN-based model to recognize emotions from text features extracted with Transformer-based models. We further combine the speech-based and text-based results with a score fusion approach. Our method is evaluated on the IEMOCAP dataset in a 4-class classification problem, and it surpasses current works on speech-only, text-only, and multimodal emotion recognition.

Submitted 29 September, 2021; originally announced November 2021.

Comments: Accepted at ASRU 2021. Code available at https://github.com/mmakiuchi/multimodal_emotion_recognition
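
The score fusion step mentioned in the abstract above could look roughly like the following. The abstract does not say how the scores are combined, so the weighted average, the `speech_weight` parameter, the class labels, and the example probabilities are all assumptions for illustration, not the paper's method.

```python
def fuse_scores(speech_probs, text_probs, speech_weight=0.5):
    """Late fusion of two per-class score vectors by weighted averaging.

    speech_weight is a hypothetical tuning parameter in [0, 1];
    the text model receives weight 1 - speech_weight.
    Returns the fused scores and the index of the winning class.
    """
    if len(speech_probs) != len(text_probs):
        raise ValueError("score vectors must have the same number of classes")
    w = speech_weight
    fused = [w * s + (1.0 - w) * t for s, t in zip(speech_probs, text_probs)]
    best = max(range(len(fused)), key=fused.__getitem__)
    return fused, best


# Illustrative 4-class labels and made-up model outputs.
CLASSES = ["angry", "happy", "neutral", "sad"]
speech = [0.10, 0.20, 0.60, 0.10]
text = [0.05, 0.70, 0.15, 0.10]

fused, best = fuse_scores(speech, text)
predicted = CLASSES[best]
```

With equal weights the text model's confident vote decides the result here; setting `speech_weight=1.0` falls back to the speech-only prediction.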

arXiv:2004.07992 [pdf, other]

doi 10.1587/transinf.2020EDP7196

Speech Paralinguistic Approach for Detecting Dementia Using Gated Convolutional Neural Network

Authors: Mariana Rodrigues Makiuchi, Tifani Warnita, Nakamasa Inoue, Koichi Shinoda, Michitaka Yoshimura, Momoko Kitazawa, Kei Funaki, Yoko Eguchi, Taishiro Kishimoto

Abstract: We propose a non-invasive and cost-effective method to automatically detect dementia using speech audio data alone. We extract paralinguistic features for a short speech segment and use Gated Convolutional Neural Networks (GCNN) to classify it as dementia or healthy. We evaluate our method on the Pitt Corpus and on our own dataset, the PROMPT Database. Our method yields an accuracy of 73.1% on the Pitt Corpus using an average of 114 seconds of speech data. On the PROMPT Database, our method yields an accuracy of 74.7% using 4 seconds of speech data, and this improves to 80.8% when we use all of the patient's speech data. Furthermore, we evaluate our method on a three-class classification problem that includes the Mild Cognitive Impairment (MCI) class, achieving an accuracy of 60.6% with 40 seconds of speech data.

Submitted 6 October, 2020; v1 submitted 16 April, 2020; originally announced April 2020.

arXiv:1904.07386 [pdf, other]

I4U Submission to NIST SRE 2018: Leveraging from a Decade of Shared Experiences

Authors: Kong Aik Lee, Ville Hautamaki, Tomi Kinnunen, Hitoshi Yamamoto, Koji Okabe, Ville Vestman, Jing Huang, Guohong Ding, Hanwu Sun, Anthony Larcher, Rohan Kumar Das, Haizhou Li, Mickael Rouvier, Pierre-Michel Bousquet, Wei Rao, Qing Wang, Chunlei Zhang, Fahimeh Bahmaninezhad, Hector Delgado, Jose Patino, Qiongqiong Wang, Ling Guo, Takafumi Koshinaka, Jiacen Zhang, Koichi Shinoda, et al. (21 additional authors not shown)

Abstract: The I4U consortium was established to facilitate a joint entry to NIST speaker recognition evaluations (SRE). The latest edition of such a joint submission was in SRE 2018, in which the I4U submission was among the best-performing systems. SRE'18 also marks the 10-year anniversary of the I4U consortium's participation in the NIST SRE series of evaluations. The primary objective of the current paper is to summarize the results and lessons learned based on the twelve sub-systems and their fusion submitted to SRE'18. It is also our intention to present a shared view on the advancements, progress, and major paradigm shifts that we have witnessed as an SRE participant in the past decade from SRE'08 to SRE'18. In this regard, we have seen, among others, a paradigm shift from supervector representation to deep speaker embedding, and a switch of research challenge from channel compensation to domain adaptation.

Submitted 15 April, 2019; originally announced April 2019.

Comments: 5 pages

arXiv:1808.09106 [pdf]

Snapshot multispectral imaging using a filter array

Authors: Kazuma Shinoda

Abstract: A multispectral filter array (MSFA) is one solution for capturing a multispectral image (MSI) in a single shot at low cost. We introduce our optimization method of the spectral sensitivity of the MSFAs and demosaicking, and show a new prototype filter array for snapshot imaging based on a photonic crystal.

Submitted 28 August, 2018; originally announced August 2018.

Comments: This paper has been submitted to International Workshop on Image Sensors and Imaging Systems (IWISS2018) (Invited talk)

Journal ref: International Workshop on Image Sensors and Imaging Systems (IWISS2018)

arXiv:1808.08021 [pdf, other]

Deep demosaicking for multispectral filter arrays

Authors: Kazuma Shinoda, Shoichiro Yoshiba, Madoka Hasegawa

Abstract: We propose a novel demosaicking method for multispectral filter arrays based on a deep convolutional neural network. The proposed method first interpolates mosaicked multispectral images utilizing a bilinear approach, then applies a residual network to initial demosaicked images. The residual network consists of various three-dimensional convolutional layers and a rectified linear unit for describing the features of a multispectral data cube. Experimental results reveal that the proposed method outperforms conventional demosaicking methods.

Submitted 21 October, 2018; v1 submitted 24 August, 2018; originally announced August 2018.

arXiv:1807.01386 [pdf, other]

Optimal Spectral Sensitivity of Multispectral Filter Array for Pathological Images

Authors: Kazuma Shinoda, Maru Kawase, Madoka Hasegawa, Masahiro Ishikawa, Hideki Komagata, Naoki Kobayashi

Abstract: A capturing system with multispectral filter array (MSFA) technology has been researched to shorten the capturing time and reduce the cost. In this system, the mosaicked image captured by the MSFA is demosaicked to reconstruct multispectral images (MSIs). We focus on the spectral sensitivity design of an MSFA in this paper and propose a pathology-specific MSFA. The proposed method optimizes the MSFA by minimizing the reconstruction error between training data of a pathological tissue and a demosaicked MSI under a cost function. First, the spectral sensitivities of the filter array are set randomly, and the mosaicked image is obtained from the training data and the filter array. Then, a reconstructed image is obtained by Wiener estimation. The spectral sensitivities of the filter array are optimized iteratively by an interior-point approach to minimize the reconstruction error. We show the effectiveness of the proposed MSFA by comparing the recovered spectrum and RGB image with a conventional method.

Submitted 3 July, 2018; originally announced July 2018.

Journal ref: Image Electronics and Visual Computing Workshop (IEVC), 1P-10, Mar. 2017

arXiv:1807.01385 [pdf, other]

Joint optimization of multispectral filter arrays and demosaicking for pathological images

Authors: Kazuma Shinoda, Maru Kawase, Madoka Hasegawa, Masahiro Ishikawa, Hideki Komagata, Naoki Kobayashi

Abstract: A capturing system with multispectral filter array (MSFA) technology is proposed for shortening the capture time and reducing costs. Therein, a mosaicked image captured using an MSFA is demosaicked to reconstruct multispectral images (MSIs). Joint optimization of the spectral sensitivity of the MSFAs and demosaicking is considered, and pathology-specific multispectral imaging is proposed. This optimizes the MSFA and the demosaicking matrix by minimizing the reconstruction error between the training data of a hematoxylin and eosin-stained pathological tissue and a demosaicked MSI using a cost function. Initially, the spectral sensitivity of the filter array is set randomly and the mosaicked image is obtained from the training data. Subsequently, a reconstructed image is obtained using Wiener estimation. To minimize the reconstruction error, the spectral sensitivity of the filter array and the Wiener estimation matrix are optimized iteratively through an interior-point approach. The effectiveness of the proposed MSFA and demosaicking is demonstrated by comparing the recovered spectrum and RGB image with those obtained using a conventional method.

Submitted 3 July, 2018; originally announced July 2018.

Journal ref: IIEEJ Transactions on Image Electronics and Visual Computing, Vol. 6, No. 1, pp. 13-21, Jun. 2018
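
For readers unfamiliar with the Wiener estimation step named in the two abstracts above, the textbook linear form is sketched below. The symbols are assumptions introduced here, not notation from the papers: $\mathbf{g}$ is the observed mosaicked data, $\mathbf{H}$ the observation matrix determined by the filter array, $\mathbf{R}_s$ the autocorrelation matrix of the training spectra, and $\mathbf{R}_n$ the noise autocorrelation matrix. The papers' exact formulation may differ.

```latex
% Standard Wiener (linear minimum mean-square error) estimate; symbols are assumed.
\hat{\mathbf{s}} = \mathbf{W}\,\mathbf{g},
\qquad
\mathbf{W} = \mathbf{R}_s \mathbf{H}^{\mathsf{T}}
             \left( \mathbf{H}\,\mathbf{R}_s\,\mathbf{H}^{\mathsf{T}} + \mathbf{R}_n \right)^{-1}
```

Under this reading, changing the filter sensitivities changes $\mathbf{H}$ and therefore $\mathbf{W}$, which is what makes an iterative optimization over the sensitivities possible.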

arXiv:1804.00290 [pdf, other]

I-vector Transformation Using Conditional Generative Adversarial Networks for Short Utterance Speaker Verification

Authors: Jiacen Zhang, Nakamasa Inoue, Koichi Shinoda

Abstract: I-vector based text-independent speaker verification (SV) systems often have poor performance with short utterances, as the biased phonetic distribution in a short utterance makes the extracted i-vector unreliable. This paper proposes an i-vector compensation method using a generative adversarial network (GAN), where its generator network is trained to generate a compensated i-vector from a short-utterance i-vector and its discriminator network is trained to determine whether an i-vector was generated by the generator or extracted from a long utterance. Additionally, we assign two other learning tasks to the GAN to stabilize its training and to make the generated i-vector more speaker-specific. Speaker verification experiments on the NIST SRE 2008 "10sec-10sec" condition show that our method reduced the equal error rate by 11.3% from the conventional i-vector and PLDA system.

Submitted 1 April, 2018; originally announced April 2018.

arXiv:1803.11344 [pdf, other]

Detecting Alzheimer's Disease Using Gated Convolutional Neural Network from Audio Data

Authors: Tifani Warnita, Nakamasa Inoue, Koichi Shinoda

Abstract: We propose an automatic detection method for Alzheimer's disease using a gated convolutional neural network (GCNN) on speech data. This GCNN can be trained with a relatively small amount of data and can capture the temporal information in audio paralinguistic features. Since it does not utilize any linguistic features, it can be easily applied to any language. We evaluated our method using the Pitt Corpus. The proposed method achieved an accuracy of 73.6%, which is better than the conventional sequential minimal optimization (SMO) by 7.6 points.

Submitted 30 March, 2018; originally announced March 2018.

Comments: 5 pages, 3 figures, submitted to INTERSPEECH 2018

arXiv:1803.10963 [pdf, ps, other]

doi 10.21437/Interspeech.2018-993

Attentive Statistics Pooling for Deep Speaker Embedding

Authors: Koji Okabe, Takafumi Koshinaka, Koichi Shinoda

Abstract: This paper proposes attentive statistics pooling for deep speaker embedding in text-independent speaker verification. In conventional speaker embedding, frame-level features are averaged over all the frames of a single utterance to form an utterance-level feature. Our method utilizes an attention mechanism to give different weights to different frames and generates not only weighted means but also weighted standard deviations. In this way, it can capture long-term variations in speaker characteristics more effectively. An evaluation on the NIST SRE 2012 and the VoxCeleb data sets shows that it reduces equal error rates (EERs) from the conventional method by 7.5% and 8.1%, respectively.

Submitted 24 February, 2019; v1 submitted 29 March, 2018; originally announced March 2018.

Comments: Proc. Interspeech 2018, pp. 2252-2256. arXiv admin note: text overlap with arXiv:1809.09311
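
The pooling idea in the abstract above (attention-weighted means plus attention-weighted standard deviations) can be sketched in a few lines. This is a minimal illustration under assumptions: the attention scores are passed in directly rather than computed by a learned network, and the function name `attentive_stats_pooling` is hypothetical.

```python
import math


def attentive_stats_pooling(frames, scores):
    """Pool frame-level features into one utterance-level vector.

    frames : list of T feature vectors, each of length D
    scores : list of T unnormalized attention scores (assumed to be given;
             in a real system a small network would produce them)
    Returns a vector of length 2*D: weighted means, then weighted std devs.
    """
    # Softmax over frames (shifted by the max for numerical stability).
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(frames[0])
    mean = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    # Weighted variance: E_w[x^2] - (E_w[x])^2, clamped at 0 against round-off.
    var = [sum(w * f[d] ** 2 for w, f in zip(weights, frames)) - mean[d] ** 2
           for d in range(dim)]
    std = [math.sqrt(max(v, 0.0)) for v in var]
    return mean + std


# With equal scores the result reduces to the ordinary mean and std.
pooled = attentive_stats_pooling([[1.0, 2.0], [3.0, 6.0]], [0.0, 0.0])
```

Raising the score of one frame shifts both the mean and the spread toward that frame, which is how the weighting lets some frames count more than others.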

arXiv:1801.03577 [pdf, ps, other]

Mosaicked multispectral image compression based on inter- and intra-band correlation

Authors: Kazuma Shinoda, Madoka Hasegawa, Masahiro Yamaguchi, Antonio Ortega

Abstract: Multispectral imaging has been utilized in many fields, but the cost of capturing and storing image data is still high. Single-sensor cameras with multispectral filter arrays can reduce the cost of capturing images at the expense of slightly lower image quality. When multispectral filter arrays are used, conventional multispectral image compression methods can be applied after interpolation, but the compressed image data after interpolation has some redundancy because the interpolated data are computed from the captured raw data. In this paper, we propose an efficient image compression method for single-sensor multispectral cameras. The proposed method encodes the captured multispectral data before interpolation. We also propose a new spectral transform method for the compression of mosaicked multispectral images. This transform is designed by considering the filter arrangement and the spectral sensitivities of a multispectral filter array. The experimental results show that the proposed method achieves a higher peak signal-to-noise ratio at higher bit rates than a conventional compression method that encodes a multispectral image after interpolation, e.g., a 3-dB gain over conventional compression when coding at rates of over 0.1 bit/pixel/band.

Submitted 10 January, 2018; originally announced January 2018.

Showing 1–12 of 12 results for author: Shinoda, K