Search | arXiv e-print repository

arXiv:2406.03299 [pdf, other]

The Good, the Bad, and the Hulk-like GPT: Analyzing Emotional Decisions of Large Language Models in Cooperation and Bargaining Games

Authors: Mikhail Mozikov, Nikita Severin, Valeria Bodishtianu, Maria Glushanina, Mikhail Baklashkin, Andrey V. Savchenko, Ilya Makarov

Abstract: Behavior study experiments are an important part of society modeling and understanding human interactions. In practice, many behavioral experiments encounter challenges related to internal and external validity, reproducibility, and social bias due to the complexity of social interactions and cooperation in human user studies. Recent advances in Large Language Models (LLMs) have provided researchers with a new promising tool for the simulation of human behavior. However, existing LLM-based simulations operate under the unproven hypothesis that LLM agents behave similarly to humans as well as ignore a crucial factor in human decision-making: emotions. In this paper, we introduce a novel methodology and the framework to study both, the decision-making of LLMs and their alignment with human behavior under emotional states. Experiments with GPT-3.5 and GPT-4 on four games from two different classes of behavioral game theory showed that emotions profoundly impact the performance of LLMs, leading to the development of more optimal strategies. While there is a strong alignment between the behavioral responses of GPT-3.5 and human participants, particularly evident in bargaining games, GPT-4 exhibits consistent behavior, ignoring induced emotions for rationality decisions. Surprisingly, emotional prompting, particularly with 'anger' emotion, can disrupt the "superhuman" alignment of GPT-4, resembling human emotional responses.

Submitted 5 June, 2024; originally announced June 2024.

ACM Class: I.2.7; J.4

arXiv:2403.11590 [pdf, other]

HSEmotion Team at the 6th ABAW Competition: Facial Expressions, Valence-Arousal and Emotion Intensity Prediction

Authors: Andrey V. Savchenko

Abstract: This article presents our results for the sixth Affective Behavior Analysis in-the-wild (ABAW) competition. To improve the trustworthiness of facial analysis, we study the possibility of using pre-trained deep models that extract reliable emotional features without the need to fine-tune the neural networks for a downstream task. In particular, we introduce several lightweight models based on MobileViT, MobileFaceNet, EfficientNet, and DDAMFN architectures trained in multi-task scenarios to recognize facial expressions, valence, and arousal on static photos. These neural networks extract frame-level features fed into a simple classifier, e.g., linear feed-forward neural network, to predict emotion intensity, compound expressions, action units, facial expressions, and valence/arousal. Experimental results for five tasks from the sixth ABAW challenge demonstrate that our approach lets us significantly improve quality metrics on validation sets compared to existing non-ensemble techniques.

Submitted 18 March, 2024; originally announced March 2024.

Comments: 10 pages, 1 figure, 8 tables

MSC Class: 68T10 ACM Class: I.4.9

arXiv:2303.09162 [pdf, other]

EmotiEffNet Facial Features in Uni-task Emotion Recognition in Video at ABAW-5 competition

Authors: Andrey V. Savchenko

Abstract: In this article, the results of our team for the fifth Affective Behavior Analysis in-the-wild (ABAW) competition are presented. The usage of the pre-trained convolutional networks from the EmotiEffNet family for frame-level feature extraction is studied. In particular, we propose an ensemble of a multi-layered perceptron and the LightAutoML-based classifier. The post-processing by smoothing the results for sequential frames is implemented. Experimental results for the large-scale Aff-Wild2 database demonstrate that our model achieves a much greater macro-averaged F1-score for facial expression recognition and action unit detection and concordance correlation coefficients for valence/arousal estimation when compared to baseline.

Submitted 16 March, 2023; originally announced March 2023.

Comments: 7 pages; 5 figures; 3 tables

MSC Class: 68T10 ACM Class: I.4.9

arXiv:2207.09508 [pdf, other]

HSE-NN Team at the 4th ABAW Competition: Multi-task Emotion Recognition and Learning from Synthetic Images

Authors: Andrey V. Savchenko

Abstract: In this paper, we present the results of the HSE-NN team in the 4th competition on Affective Behavior Analysis in-the-wild (ABAW). The novel multi-task EfficientNet model is trained for simultaneous recognition of facial expressions and prediction of valence and arousal on static photos. The resulting MT-EmotiEffNet extracts visual features that are fed into simple feed-forward neural networks in the multi-task learning challenge. We obtain performance measure 1.3 on the validation set, which is significantly greater when compared to either performance of baseline (0.3) or existing models that are trained only on the s-Aff-Wild2 database. In the learning from synthetic data challenge, the quality of the original synthetic training set is increased by using the super-resolution techniques, such as Real-ESRGAN. Next, the MT-EmotiEffNet is fine-tuned on the new training set. The final prediction is a simple blending ensemble of pre-trained and fine-tuned MT-EmotiEffNets. Our average validation F1 score is 18% greater than the baseline convolutional neural network.

Submitted 20 October, 2022; v1 submitted 19 July, 2022; originally announced July 2022.

Comments: accepted at ECCV Workshop ABAW4; 14 pages, 3 figures, 8 tables

MSC Class: 68T10 ACM Class: I.4.9

arXiv:2203.13436 [pdf, other]

Frame-level Prediction of Facial Expressions, Valence, Arousal and Action Units for Mobile Devices

Authors: Andrey V. Savchenko

Abstract: In this paper, we consider the problem of real-time video-based facial emotion analytics, namely, facial expression recognition, prediction of valence and arousal and detection of action unit points. We propose the novel frame-level emotion recognition algorithm by extracting facial features with the single EfficientNet model pre-trained on AffectNet. As a result, our approach may be implemented even for video analytics on mobile devices. Experimental results for the large scale Aff-Wild2 database from the third Affective Behavior Analysis in-the-wild (ABAW) Competition demonstrate that our simple model is significantly better when compared to the VggFace baseline. In particular, our method is characterized by 0.15-0.2 higher performance measures for validation sets in uni-task Expression Classification, Valence-Arousal Estimation and Expression Classification. Due to simplicity, our approach may be considered as a new baseline for all four sub-challenges.

Submitted 24 May, 2022; v1 submitted 24 March, 2022; originally announced March 2022.

Comments: accepted at CVPR Workshop ABAW3, 8 pages, 2 figures, 6 tables

MSC Class: 68T10 ACM Class: I.4.9

arXiv:2103.17107 [pdf, other]

doi 10.1109/SISY52375.2021.9582508

Facial expression and attributes recognition based on multi-task learning of lightweight neural networks

Authors: Andrey V. Savchenko

Abstract: In this paper, the multi-task learning of lightweight convolutional neural networks is studied for face identification and classification of facial attributes (age, gender, ethnicity) trained on cropped faces without margins. The necessity to fine-tune these networks to predict facial expressions is highlighted. Several models are presented based on MobileNet, EfficientNet and RexNet architectures. It was experimentally demonstrated that they lead to near state-of-the-art results in age, gender and race recognition on the UTKFace dataset and emotion classification on the AffectNet dataset. Moreover, it is shown that the usage of the trained models as feature extractors of facial regions in video frames leads to 4.5% higher accuracy than the previously known state-of-the-art single models for the AFEW and the VGAF datasets from the EmotiW challenges. The models and source code are publicly available at https://github.com/HSE-asavchenko/face-emotion-recognition.

Submitted 4 October, 2021; v1 submitted 31 March, 2021; originally announced March 2021.

Comments: 14 pages, 3 figures, accepted at IEEE SISY 2021

MSC Class: 68T10

arXiv:2010.04224 [pdf]

Gender domain adaptation for automatic speech recognition task

Authors: Sokolov Artem, Andrey V. Savchenko

Abstract: This paper is focused on the finetuning of acoustic models for speaker adaptation goals on a given gender. We pretrained the Transformer baseline model on Librispeech-960 and conduct experiments with finetuning on the gender-specific test subsets and. In general, we do not obtain essential WER reduction by finetuning techniques by this approach. We achieved up to ~5% lower word error rate on the male subset and 3% on the female subset if the layers in the encoder and decoder are not frozen, but the tuning is started from the last checkpoints. Moreover, we adapted our base model on the full L2 Arctic dataset of accented speech and fine-tuned it for particular speakers and male and female genders separately. The models trained on the gender subsets obtained 1-2% higher accuracy when compared to the model tuned on the whole L2 Arctic dataset. Finally, we tested the concatenation of the pretrained x-vector voice embeddings and embeddings from a conventional encoder, but its gain in accuracy is not significant.

Submitted 17 November, 2020; v1 submitted 8 October, 2020; originally announced October 2020.

Comments: Draft of paper for SAMI conference

arXiv:1911.11010 [pdf, other]

Event Recognition with Automatic Album Detection based on Sequential Processing, Neural Attention and Image Captioning

Authors: Andrey V. Savchenko

Abstract: In this paper a new formulation of event recognition task is examined: it is required to predict event categories in a gallery of images, for which albums (groups of photos corresponding to a single event) are unknown. We propose the novel two-stage approach. At first, features are extracted in each photo using the pre-trained convolutional neural network. These features are classified individually. The scores of the classifier are used to group sequential photos into several clusters. Finally, the features of photos in each group are aggregated into a single descriptor using neural attention mechanism. This algorithm is optionally extended to improve the accuracy for classification of each image in an album. In contrast to conventional fine-tuning of convolutional neural networks (CNN) we proposed to use image captioning, i.e., generative model that converts images to textual descriptions. They are one-hot encoded and summarized into sparse feature vector suitable for learning of arbitrary classifier. Experimental study with Photo Event Collection and Multi-Label Curation of Flickr Events Dataset demonstrates that our approach is 9-20% more accurate than event recognition on single photos. Moreover, proposed method has 13-16% lower error rate than classification of groups of photos obtained with hierarchical clustering. It is experimentally shown that the image captions trained on Conceptual Captions dataset can be classified more accurately than the features from object detector, though they both are obviously not as rich as the CNN-based features. However, it is possible to combine our approach with conventional CNNs in an ensemble to provide the state-of-the-art results for several event datasets.

Submitted 15 January, 2020; v1 submitted 25 November, 2019; originally announced November 2019.

Comments: 11 pages, 5 figures

MSC Class: 68T10 (Primary)

arXiv:1907.04519 [pdf, other]

doi 10.1016/j.patcog.2021.108248

Preferences Prediction using a Gallery of Mobile Device based on Scene Recognition and Object Detection

Authors: A. V. Savchenko, K. V. Demochkin, I. S. Grechikhin

Abstract: In this paper user modeling task is examined by processing a gallery of photos and videos on a mobile device. We propose novel engine for user preference prediction based on scene recognition, object detection and facial analysis. At first, all faces in a gallery are clustered and all private photos and videos with faces from large clusters are processed on the embedded system in offline mode. Other photos may be sent to the remote server to be analyzed by very deep models. The visual features of each photo are obtained from scene recognition and object detection models. These features are aggregated into a single user descriptor in the neural attention block. The proposed pipeline is implemented for the Android mobile platform. Experimental results with a subset of Photo Event Collection, Web Image Dataset for Event Recognition and Amazon Fashion datasets demonstrate the possibility to process images very efficiently without significant accuracy degradation. The source code of Android mobile application is publicly available at https://github.com/HSE-asavchenko/mobile-visual-preferences.

Submitted 18 April, 2021; v1 submitted 10 July, 2019; originally announced July 2019.

Comments: 19 pages; 9 figures, preprint submitted to Pattern Recognition journal

MSC Class: 68T10

arXiv:1902.02380 [pdf, ps, other]

doi 10.1016/j.asoc.2019.03.057

Compression of Recurrent Neural Networks for Efficient Language Modeling

Authors: Artem M. Grachev, Dmitry I. Ignatov, Andrey V. Savchenko

Abstract: Recurrent neural networks have proved to be an effective method for statistical language modeling. However, in practice their memory and run-time complexity are usually too large to be implemented in real-time offline mobile applications. In this paper we consider several compression techniques for recurrent neural networks including Long-Short Term Memory models. We make particular attention to the high-dimensional output problem caused by the very large vocabulary size. We focus on effective compression methods in the context of their exploitation on devices: pruning, quantization, and matrix decomposition approaches (low-rank factorization and tensor train decomposition, in particular). For each model we investigate the trade-off between its size, suitability for fast inference and perplexity. We propose a general pipeline for applying the most suitable methods to compress recurrent neural networks for language modeling. It has been shown in the experimental study with the Penn Treebank (PTB) dataset that the most efficient results in terms of speed and compression-perplexity balance are obtained by matrix decomposition techniques.

Submitted 6 February, 2019; originally announced February 2019.

Comments: 25 pages, 3 tables, 4 figures

arXiv:1807.07718 [pdf, other]

doi 10.7717/peerj-cs.197

Efficient Facial Representations for Age, Gender and Identity Recognition in Organizing Photo Albums using Multi-output CNN

Authors: Andrey V. Savchenko

Abstract: This paper is focused on the automatic extraction of persons and their attributes (gender, year of born) from album of photos and videos. We propose the two-stage approach, in which, firstly, the convolutional neural network simultaneously predicts age/gender from all photos and additionally extracts facial representations suitable for face identification. We modified the MobileNet, which is preliminarily trained to perform face recognition, in order to additionally recognize age and gender. In the second stage of our approach, extracted faces are grouped using hierarchical agglomerative clustering techniques. The born year and gender of a person in each cluster are estimated using aggregation of predictions for individual photos. We experimentally demonstrated that our facial clustering quality is competitive with the state-of-the-art neural networks, though our implementation is much computationally cheaper. Moreover, our approach is characterized by more accurate video-based age/gender recognition when compared to the publicly available models.

Submitted 13 June, 2019; v1 submitted 20 July, 2018; originally announced July 2018.

Comments: 19 pages, 2 figures, 8 tables

MSC Class: 68T10

Journal ref: PeerJ Computer Science 5:e197 (2019)

arXiv:1709.05675 [pdf, ps, other]

doi 10.1007/978-3-319-73013-4_20

Organizing Multimedia Data in Video Surveillance Systems Based on Face Verification with Convolutional Neural Networks

Authors: Anastasiia D. Sokolova, Angelina S. Kharchevnikova, Andrey V. Savchenko

Abstract: In this paper we propose the two-stage approach of organizing information in video surveillance systems. At first, the faces are detected in each frame and a video stream is split into sequences of frames with face region of one person. Secondly, these sequences (tracks) that contain identical faces are grouped using face verification algorithms and hierarchical agglomerative clustering. Gender and age are estimated for each cluster (person) in order to facilitate the usage of the organized video collection. The particular attention is focused on the aggregation of features extracted from each frame with the deep convolutional neural networks. The experimental results of the proposed approach using YTF and IJB-A datasets demonstrated that the most accurate and fast solution is achieved for matching of normalized average of feature vectors of all frames in a track.

Submitted 17 September, 2017; originally announced September 2017.

Comments: 8 pages; 1 figure, accepted for publication at AIST17

MSC Class: 68T10; 68T45 ACM Class: I.4.8; I.5.4

Journal ref: Proceedings of the International Conference on Analysis of Images, Social Networks and Texts (AIST), 2018, pp. 223-230

arXiv:1709.01688 [pdf]

doi 10.1145/3136755.3143007

Group-level Emotion Recognition using Transfer Learning from Face Identification

Authors: Alexandr G. Rassadin, Alexey S. Gruzdev, Andrey V. Savchenko

Abstract: In this paper, we describe our algorithmic approach, which was used for submissions in the fifth Emotion Recognition in the Wild (EmotiW 2017) group-level emotion recognition sub-challenge. We extracted feature vectors of detected faces using the Convolutional Neural Network trained for face identification task, rather than traditional pre-training on emotion recognition problems. In the final pipeline an ensemble of Random Forest classifiers was learned to predict emotion score using available training set. In case when the faces have not been detected, one member of our ensemble extracts features from the whole image. During our experimental study, the proposed approach showed the lowest error rate when compared to other explored techniques. In particular, we achieved 75.4% accuracy on the validation data, which is 20% higher than the handcrafted feature-based baseline. The source code using Keras framework is publicly available.

Submitted 30 October, 2017; v1 submitted 6 September, 2017; originally announced September 2017.

Comments: 5 pages, 3 figures, accepted for publication at ICMI17 (EmotiW Grand Challenge)

MSC Class: 68T10; 68T45 ACM Class: I.4.8; I.5.4

Journal ref: Proceedings of the 19th ACM International Conference on Multimodal Interaction (ICMI), 2017, pp. 544-548

arXiv:1708.07972 [pdf, ps, other]

doi 10.1016/j.eswa.2018.04.039

Maximum A Posteriori Estimation of Distances Between Deep Features in Still-to-Video Face Recognition

Authors: Andrey V. Savchenko, Natalya S. Belova

Abstract: The paper deals with the still-to-video face recognition for the small sample size problem based on computation of distances between high-dimensional deep bottleneck features. We present the novel statistical recognition method, in which the still-to-video recognition task is casted into Maximum A Posteriori estimation. In this method we maximize the joint probabilistic density of the distances to all reference still images. It is shown that this likelihood can be estimated with the known asymptotically normal distribution of the Kullback-Leibler discriminations between nonnegative features. The experimental study with the LFW (Labeled Faces in the Wild), YTF (YouTube Faces) and IJB-A (IARPA Janus Benchmark A) datasets has been provided. We demonstrated, that the proposed approach can be applied with the state-of-the-art deep features and dissimilarity measures. Our algorithm achieves 3-5% higher accuracy when compared with conventional aggregation of decisions obtained for all frames.

Submitted 26 August, 2017; originally announced August 2017.

Comments: 20 pages, 5 figures, 40 references

MSC Class: 68T10

arXiv:1708.05963 [pdf, ps, other]

doi 10.1007/978-3-319-69900-4_44

Neural Networks Compression for Language Modeling

Authors: Artem M. Grachev, Dmitry I. Ignatov, Andrey V. Savchenko

Abstract: In this paper, we consider several compression techniques for the language modeling problem based on recurrent neural networks (RNNs). It is known that conventional RNNs, e.g., LSTM-based networks in language modeling, are characterized with either high space complexity or substantial inference time. This problem is especially crucial for mobile applications, in which the constant interaction with the remote server is inappropriate. By using the Penn Treebank (PTB) dataset we compare pruning, quantization, low-rank factorization, tensor train decomposition for LSTM networks in terms of model size and suitability for fast inference.

Submitted 20 August, 2017; originally announced August 2017.

Comments: Keywords: LSTM, RNN, language modeling, low-rank factorization, pruning, quantization. Published by Springer in the LNCS series, 7th International Conference on Pattern Recognition and Machine Intelligence, 2017

MSC Class: 62M45; 68T50 ACM Class: I.2.7, I.2.6, I.5.1, I.5.4

Showing 1–15 of 15 results for author: Savchenko, A V