-
Two-stage dimensional emotion recognition by fusing predictions of acoustic and text networks using SVM
Authors:
Bagus Tris Atmaja,
Masato Akagi
Abstract:
Automatic speech emotion recognition (SER) by a computer is a critical component for more natural human-machine interaction. As in human-human interaction, the capability to perceive emotion correctly is essential to take further steps in a particular situation. One issue in SER is whether it is necessary to combine acoustic features with other data such as facial expressions, text, and motion capture. This research proposes to combine acoustic and text information by applying a late-fusion approach consisting of two steps. First, separate deep learning systems are trained on acoustic and text features. Second, the prediction results from the deep learning systems are fed into a support vector machine (SVM) to predict the final regression score. Furthermore, the task in this research is dimensional emotion modeling because it can enable a deeper analysis of affective states. Experimental results show that this two-stage, late-fusion approach obtains higher performance than any one-stage processing, with a linear correlation from one-stage to two-stage processing. This late-fusion approach improves on previous early-fusion results as measured by concordance correlation coefficient scores.
Submitted 26 October, 2022;
originally announced October 2022.
-
Speak Like a Professional: Increasing Speech Intelligibility by Mimicking Professional Announcer Voice with Voice Conversion
Authors:
Tuan Vu Ho,
Maori Kobayashi,
Masato Akagi
Abstract:
In most practical scenarios, an announcement system must deliver speech messages in a noisy environment in which the background noise cannot be cancelled out. The local noise reduces speech intelligibility and increases the listening effort of the listener, hence hampering the effectiveness of the announcement system. It has been reported that the voices of professional announcers are clearer and more comprehensible than those of non-expert speakers in noisy environments. This finding suggests that speech intelligibility might be related to the speaking style of professional announcers, which can be adapted using a voice conversion method. Motivated by this idea, this paper proposes enhancing speech intelligibility in noisy environments by applying a voice conversion method to non-professional voices. We discovered that professional announcers and non-professional speakers fall into different clusters on the speaker embedding plane. This implies that speech intelligibility can be controlled as an independent feature of speaker individuality. To examine the advantage of the converted voice in noisy environments, we experimented using test words masked in pink noise at different SNR levels. The results of objective and subjective evaluations confirm that the speech intelligibility of the converted voice is higher than that of the original voice in low-SNR conditions.
Submitted 26 June, 2022;
originally announced June 2022.
-
Deep Multilayer Perceptrons for Dimensional Speech Emotion Recognition
Authors:
Bagus Tris Atmaja,
Masato Akagi
Abstract:
Modern deep learning architectures are ordinarily run on high-performance computing facilities due to the large size of the input features and the complexity of their models. This paper proposes traditional multilayer perceptrons (MLP) with deep layers and a small input size to tackle that computational requirement. The results show that our proposed deep MLP outperformed modern deep learning architectures, i.e., LSTM and CNN, with the same number of layers and the same parameter values. The deep MLP exhibited the highest performance in both speaker-dependent and speaker-independent scenarios on the IEMOCAP and MSP-IMPROV corpora.
Submitted 5 April, 2020;
originally announced April 2020.
-
On The Differences Between Song and Speech Emotion Recognition: Effect of Feature Sets, Feature Types, and Classifiers
Authors:
Bagus Tris Atmaja,
Masato Akagi
Abstract:
In this paper, we evaluate different feature sets, feature types, and classifiers on both song and speech emotion recognition. Three feature sets (GeMAPS, pyAudioAnalysis, and LibROSA), two feature types (low-level descriptors and high-level statistical functions), and four classifiers (multilayer perceptron, LSTM, GRU, and convolutional neural networks) are examined on both song and speech data with the same parameter values. The results show no remarkable difference between song and speech data using the same method. In addition, high-level statistical functions of acoustic features gained higher performance scores than low-level descriptors in this classification task. This result strengthens the previous finding on the regression task, which reported the advantage of using high-level features.
Submitted 31 March, 2020;
originally announced April 2020.
-
Evaluation of Error and Correlation-Based Loss Functions For Multitask Learning Dimensional Speech Emotion Recognition
Authors:
Bagus Tris Atmaja,
Masato Akagi
Abstract:
The choice of a loss function is a critical part of machine learning. This paper evaluated two different types of loss function commonly used in regression-task dimensional speech emotion recognition: error-based and correlation-based. We found that using a correlation-based loss function with a concordance correlation coefficient (CCC) loss resulted in better performance than an error-based loss function with mean squared error (MSE) or mean absolute error (MAE) loss, in terms of the averaged CCC score. The results are consistent across two input feature sets and two datasets. The scatter plots of test predictions by those two loss functions also confirmed the results measured by CCC scores.
Submitted 18 November, 2020; v1 submitted 24 March, 2020;
originally announced March 2020.
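For reference, the CCC mentioned in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names `ccc` and `ccc_loss` are hypothetical, population (biased) variance and covariance are assumed, and the loss is assumed to be 1 - CCC, a common convention that the abstract does not spell out.

```python
from statistics import fmean


def ccc(y_true, y_pred):
    """Concordance correlation coefficient of two equal-length sequences.

    CCC = 2 * cov / (var_true + var_pred + (mean_true - mean_pred) ** 2)
    """
    if len(y_true) != len(y_pred) or len(y_true) < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    mean_t = fmean(y_true)
    mean_p = fmean(y_pred)
    # Population (biased) variance and covariance; an assumption of this sketch.
    var_t = fmean([(t - mean_t) ** 2 for t in y_true])
    var_p = fmean([(p - mean_p) ** 2 for p in y_pred])
    cov = fmean([(t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred)])
    denom = var_t + var_p + (mean_t - mean_p) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined when both inputs are the same constant")
    return 2 * cov / denom


def ccc_loss(y_true, y_pred):
    """Hypothetical correlation-based loss: 0 at perfect agreement, up to 2."""
    return 1.0 - ccc(y_true, y_pred)
```

A prediction that follows the labels perfectly but is shifted by a constant has a Pearson correlation of 1 and a CCC below 1, so a CCC-based loss penalizes bias as well as decorrelation, which an MSE or MAE loss does not measure in the same terms.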
-
The Effect of Silence Feature in Dimensional Speech Emotion Recognition
Authors:
Bagus Tris Atmaja,
Masato Akagi
Abstract:
Silence is a part of human-to-human communication and can be a clue for human emotion perception. For automatic emotion recognition by a computer, it is not clear whether silence is useful for determining human emotion within speech. This paper presents an investigation of the effect of using a silence feature in dimensional emotion recognition. Since the silence feature is extracted per utterance, we grouped the silence feature with high-level statistical functions from a set of acoustic features. The results reveal that the silence feature affects the arousal dimension more than other emotion dimensions. The proper choice of a threshold factor in the calculation of the silence feature improved the performance of dimensional speech emotion recognition in terms of the concordance correlation coefficient. On the other hand, an improper choice of that factor leads to a decrease in performance with the same architecture.
Submitted 21 April, 2020; v1 submitted 2 March, 2020;
originally announced March 2020.
-
Multitask Learning and Multistage Fusion for Dimensional Audiovisual Emotion Recognition
Authors:
Bagus Tris Atmaja,
Masato Akagi
Abstract:
Due to its ability to accurately predict emotional state using multimodal features, audiovisual emotion recognition has recently gained more interest from researchers. This paper proposes two methods to predict emotional attributes from audio and visual data using multitask learning and a fusion strategy. First, multitask learning is employed by adjusting three parameters for each attribute to improve the recognition rate. Second, a multistage fusion is proposed to combine the final predictions of various modalities. Our multitask learning approach, applied to unimodal and early-fusion methods, shows improvement over single-task learning, with an average CCC score of 0.431 compared to 0.297. The multistage method, applied in the late-fusion approach, significantly improved the agreement score between true and predicted values on the development set (from [0.537, 0.565, 0.083] to [0.68, 0.656, 0.443]) for arousal, valence, and liking.
Submitted 9 March, 2020; v1 submitted 26 February, 2020;
originally announced February 2020.
-
A ferroelectric-like structural transition in a metal
Authors:
Youguo Shi,
Yanfeng Guo,
Xia Wang,
Andrew J. Princep,
Dmitry Khalyavin,
Pascal Manuel,
Yuichi Michiue,
Akira Sato,
Kenji Tsuda,
Shan Yu,
Masao Arai,
Yuichi Shirako,
Masaki Akaogi,
Nanlin Wang,
Kazunari Yamaura,
Andrew T. Boothroyd
Abstract:
Metals cannot exhibit ferroelectricity because static internal electric fields are screened by conduction electrons, but in 1965, Anderson and Blount predicted the possibility of a ferroelectric metal, in which a ferroelectric-like structural transition occurs in the metallic state. Up to now, no clear example of such a material has been identified. Here we report on a centrosymmetric (R-3c) to non-centrosymmetric (R3c) transition in metallic LiOsO3 that is structurally equivalent to the ferroelectric transition of LiNbO3. The transition involves a continuous shift in the mean position of Li+ ions on cooling below 140K. Its discovery realizes the scenario described by Anderson and Blount, and establishes a new class of materials whose properties may differ from those of normal metals.
Submitted 6 September, 2015;
originally announced September 2015.
-
Superconductivity suppression of Ba0.5K0.5Fe2-2xM2xAs2 single crystals by substitution of transition-metal (M = Mn, Ru, Co, Ni, Cu, and Zn)
Authors:
Jun Li,
Yanfeng Guo,
Shoubao Zhang,
Jie Yuan,
Yoshihiro Tsujimoto,
Xia Wang,
C. I. Sathish,
Ying Sun,
Shan Yu,
Wei Yi,
Kazunari Yamaura,
Eiji Takayama-Muromachi,
Yuichi Shirako,
Masaki Akaogi,
Hiroshi Kontani
Abstract:
We investigated the doping effects of magnetic and nonmagnetic impurities on the single-crystalline p-type Ba0.5K0.5Fe2-2xM2xAs2 (M = Mn, Ru, Co, Ni, Cu, and Zn) superconductors. The superconductivity is robust against Ru impurities but weak against Mn, Co, Ni, Cu, and Zn impurities. However, the present Tc suppression rate of both magnetic and nonmagnetic impurities remains much lower than what was expected for the s±-wave model. The temperature dependence of the resistivity shows an obvious low-T upturn for the crystals doped with high impurity levels, which is due to the occurrence of localization. Thus, the relatively weak Tc suppression effect from Mn, Co, Ni, Cu, and Zn is considered to be a result of localization rather than of the pair-breaking effect in the s±-wave model.
Submitted 4 June, 2012;
originally announced June 2012.
-
Integer spin-chain antiferromagnetism of the 4d oxide CaRuO3 with post-perovskite structure
Authors:
Y. Shirako,
H. Satsukawa,
X. X. Wang,
J. J. Li,
Y. F. Guo,
M. Arai,
K. Yamaura,
M. Yoshida,
H. Kojitani,
T. Katsumata,
Y. Inaguma,
K. Hiraki,
T. Takahashi,
M. Akaogi
Abstract:
Quasi-one-dimensional magnetism was discovered in the post-perovskite CaRuO3 (Ru4+: 4d4, Cmcm), which is iso-compositional with the perovskite CaRuO3 (Pbnm). An antiferromagnetic spin-chain function with -J/kB = 350 K reproduces the experimental curve of the magnetic susceptibility vs. temperature well, suggesting long-range antiferromagnetic correlations. The anisotropic magnetism is probably owing to the dyz-2p-dzx and dzx-2p-dyz superexchange bonds along the a-axis. The Sommerfeld coefficient of the specific heat is fairly small, 0.16(2) mJ mol-1 K-2, indicating that the magnetism reflects the localized nature of the 4d electrons. As far as we know, this is the first observation of integer (S = 1) spin-chain antiferromagnetism in a 4d electron system.
Submitted 7 April, 2011;
originally announced April 2011.