Search | arXiv e-print repository

Carrot and Stick: Inducing Self-Motivation with Positive & Negative Feedback

Authors: Jimin Sohn, Jeihee Cho, Junyong Lee, Songmu Heo, Ji-Eun Han, David R. Mortensen

Abstract: Positive thinking is thought to be an important component of self-motivation in various practical fields such as education and the workplace. Previous work, including sentiment transfer and positive reframing, has focused on the positive side of language. However, self-motivation that drives people to reach their goals has not yet been studied from a computational perspective. Moreover, negative feedback has not yet been explored, even though positive and negative feedback are both necessary to grow self-motivation. To facilitate self-motivation, we propose the CArrot and STICk (CASTIC) dataset, consisting of 12,590 sentences with 5 different strategies for enhancing self-motivation. Our data and code are publicly available.

Submitted 24 June, 2024; originally announced June 2024.

Comments: 10 pages, 8 figures

arXiv:2405.01857 [pdf, other]

doi 10.1145/3652032.3657576

TinySeg: Model Optimizing Framework for Image Segmentation on Tiny Embedded Systems

Authors: Byungchul Chae, Jiae Kim, Seonyeong Heo

Abstract: Image segmentation is one of the major computer vision tasks, which is applicable in a variety of domains, such as autonomous navigation of an unmanned aerial vehicle. However, image segmentation cannot easily materialize on tiny embedded systems because image segmentation models generally have high peak memory usage due to their architectural characteristics. This work finds that image segmentation models unnecessarily require large memory space with an existing tiny machine learning framework. That is, the existing framework cannot effectively manage the memory space for the image segmentation models. This work proposes TinySeg, a new model optimizing framework that enables memory-efficient image segmentation for tiny embedded systems. TinySeg analyzes the lifetimes of tensors in the target model and identifies long-living tensors. Then, TinySeg optimizes the memory usage of the target model mainly with two methods: (i) tensor spilling into local or remote storage and (ii) fused fetching of spilled tensors. This work implements TinySeg on top of the existing tiny machine learning framework and demonstrates that TinySeg can reduce the peak memory usage of an image segmentation model by 39.3% for tiny embedded systems.

Submitted 3 May, 2024; originally announced May 2024.

Comments: LCTES 2024

arXiv:2311.10792 [pdf]

Enhancing Data Efficiency and Feature Identification for Lithium-Ion Battery Lifespan Prediction by Deciphering Interpretation of Temporal Patterns and Cyclic Variability Using Attention-Based Models

Authors: Jaewook Lee, Seongmin Heo, Jay H. Lee

Abstract: Accurately predicting the lifespan of lithium-ion batteries is crucial for optimizing operational strategies and mitigating risks. While numerous studies have aimed at predicting battery lifespan, few have examined the interpretability of their models or how such insights could improve predictions. Addressing this gap, we introduce three innovative models that integrate shallow attention layers into a foundational model from our previous work, which combined elements of recurrent and convolutional neural networks. Utilizing a well-known public dataset, we showcase our methodology's effectiveness. Temporal attention is applied to identify critical timesteps and highlight differences among test cell batches, particularly underscoring the significance of the "rest" phase. Furthermore, by applying cyclic attention via self-attention to context vectors, our approach effectively identifies key cycles, enabling us to strategically decrease the input size for quicker predictions. Employing both single- and multi-head attention mechanisms, we have systematically minimized the required input from 100 to 50 and then to 30 cycles, refining this process based on cyclic attention scores. Our refined model exhibits strong regression capabilities, accurately forecasting the initiation of rapid capacity fade with an average deviation of only 58 cycles by analyzing just the initial 30 cycles of easily accessible input data.

Submitted 11 April, 2024; v1 submitted 16 November, 2023; originally announced November 2023.

arXiv:2311.07163 [pdf, other]

Enhancing Lightweight Neural Networks for Small Object Detection in IoT Applications

Authors: Liam Boyle, Nicolas Baumann, Seonyeong Heo, Michele Magno

Abstract: Advances in lightweight neural networks have revolutionized computer vision in a broad range of IoT applications, encompassing remote monitoring and process automation. However, the detection of small objects, which is crucial for many of these applications, remains an underexplored area in current computer vision research, particularly for embedded devices. To address this gap, the paper proposes a novel adaptive tiling method that can be used on top of any existing object detector including the popular FOMO network for object detection on microcontrollers. Our experimental results show that the proposed tiling method can boost the F1-score by up to 225% while reducing the average object count error by up to 76%. Furthermore, the findings of this work suggest that using a soft F1 loss over the popular binary cross-entropy loss can significantly reduce the negative impact of imbalanced data. Finally, we validate our approach by conducting experiments on the Sony Spresense microcontroller, showcasing the proposed method's ability to strike a balance between detection performance, low latency, and minimal memory consumption.

Submitted 13 November, 2023; originally announced November 2023.

arXiv:2211.00437 [pdf, other]

Disentangled representation learning for multilingual speaker recognition

Authors: Kihyun Nam, Youkyum Kim, Jaesung Huh, Hee Soo Heo, Jee-weon Jung, Joon Son Chung

Abstract: The goal of this paper is to learn robust speaker representations for bilingual speaking scenarios. The majority of the world's population speak at least two languages; however, most speaker recognition systems fail to recognise the same speaker when speaking in different languages. Popular speaker recognition evaluation sets do not consider the bilingual scenario, making it difficult to analyse the effect of bilingual speakers on speaker recognition performance. In this paper, we publish a large-scale evaluation set named VoxCeleb1-B derived from VoxCeleb that considers bilingual scenarios. We introduce an effective disentanglement learning strategy that combines adversarial and metric learning-based methods. This approach addresses the bilingual situation by disentangling language-related information from speaker representation while ensuring stable speaker representation learning. Our language-disentangled learning method only uses language pseudo-labels without manual information.

Submitted 6 June, 2023; v1 submitted 1 November, 2022; originally announced November 2022.

Comments: Interspeech 2023

arXiv:2210.01126 [pdf]

Wheel Impact Test by Deep Learning: Prediction of Location and Magnitude of Maximum Stress

Authors: Seungyeon Shin, Ah-hyeon **, Soyoung Yoo, Sunghee Lee, ChangGon Kim, Sungpil Heo, Namwoo Kang

Abstract: For ensuring vehicle safety, the impact performance of wheels during wheel development must be ensured through a wheel impact test. However, manufacturing and testing a real wheel requires significant time and money because developing an optimal wheel design requires numerous iterative processes to modify the wheel design and verify the safety performance. Accordingly, wheel impact tests have been replaced by computer simulations such as finite element analysis (FEA); however, it still incurs high computational costs for modeling and analysis, and requires FEA experts. In this study, we present an aluminum road wheel impact performance prediction model based on deep learning that replaces computationally expensive and time-consuming 3D FEA. For this purpose, 2D disk-view wheel image data, 3D wheel voxel data, and barrier mass values used for the wheel impact test were utilized as the inputs to predict the magnitude of the maximum von Mises stress, corresponding location, and the stress distribution of the 2D disk-view. The input data were first compressed into a latent space with a 3D convolutional variational autoencoder (cVAE) and 2D convolutional autoencoder (cAE). Subsequently, the fully connected layers were used to predict the impact performance, and a decoder was used to predict the stress distribution heatmap of the 2D disk-view. The proposed model can replace the impact test in the early wheel-development stage by predicting the impact performance in real-time and can be used without domain knowledge. The time required for the wheel development process can be reduced by using this mechanism.

Submitted 18 December, 2022; v1 submitted 3 October, 2022; originally announced October 2022.

arXiv:2208.13900 [pdf, other]

doi 10.1145/3543174.3546835

Enjoy the Ride Consciously with CAWA: Context-Aware Advisory Warnings for Automated Driving

Authors: Erfan Pakdamanian, Erzhen Hu, Shili Sheng, Sarit Kraus, Seongkook Heo, Lu Feng

Abstract: In conditionally automated driving, drivers decoupled from driving while immersed in non-driving-related tasks (NDRTs) could either miss the system-initiated takeover request (TOR) or be startled by a sudden TOR. To better prepare drivers for a safer takeover in an emergency, we propose novel context-aware advisory warnings (CAWA) for automated driving to gently inform drivers. This will help them stay vigilant while engaging in NDRTs. The key innovation is that CAWA adapts warning modalities according to the context of NDRTs. We conducted a user study to investigate the effectiveness of CAWA. The study results show that CAWA has statistically significant effects on safer takeover behavior, improved driver situational awareness, less attention demand, and more positive user feedback, compared with uniformly distributed speech-based warnings across all NDRTs.

Submitted 29 August, 2022; originally announced August 2022.

Comments: Proceeding of the 14th International Conference on Automotive User Interfaces and Interactive Vehicular Applications (AutomotiveUI '22)

arXiv:2207.06626 [pdf, other]

doi 10.1109/ACCESS.2022.3190089

Continuous Facial Motion Deblurring

Authors: Tae Bok Lee, Sujy Han, Yong Seok Heo

Abstract: We introduce a novel framework for continuous facial motion deblurring that restores the continuous sharp moment latent in a single motion-blurred face image via a moment control factor. Although a motion-blurred image is the accumulated signal of continuous sharp moments during the exposure time, most existing single image deblurring approaches aim to restore a fixed number of frames using multiple networks and training stages. To address this problem, we propose a continuous facial motion deblurring network based on GAN (CFMD-GAN), which is a novel framework for restoring the continuous moment latent in a single motion-blurred face image with a single network and a single training stage. To stabilize the network training, we train the generator to restore continuous moments in the order determined by our facial motion-based reordering process (FMR) utilizing domain-specific knowledge of the face. Moreover, we propose an auxiliary regressor that helps our generator produce more accurate images by estimating continuous sharp moments. Furthermore, we introduce a control-adaptive (ContAda) block that performs spatially deformable convolution and channel-wise attention as a function of the control factor. Extensive experiments on the 300VW datasets demonstrate that the proposed framework generates a varying number of continuous output frames by varying the moment control factor. Compared with the recent single-to-single image deblurring networks trained with the same 300VW training set, the proposed method shows superior performance in restoring the central sharp frame in terms of perceptual metrics, including LPIPS, FID and Arcface identity distance. The proposed method outperforms the existing single-to-video deblurring method for both qualitative and quantitative comparisons.

Submitted 13 July, 2022; originally announced July 2022.

Journal ref: IEEE Access (Early Access), 12 July 2022

arXiv:2204.03896 [pdf, other]

Advancing Semi-Supervised Learning for Automatic Post-Editing: Data-Synthesis by Mask-Infilling with Erroneous Terms

Authors: Wonkee Lee, Seong-Hwan Heo, Jong-Hyeok Lee

Abstract: Semi-supervised learning that leverages synthetic data for training has been widely adopted for developing automatic post-editing (APE) models due to the lack of training data. With this aim, we focus on data-synthesis methods to create high-quality synthetic data. Given that APE takes as input a machine-translation result that might include errors, we present a data-synthesis method by which the resulting synthetic data mimic the translation errors found in actual data. We introduce a noising-based data-synthesis method by adapting the masked language model approach, generating a noisy text from a clean text by infilling masked tokens with erroneous tokens. Moreover, we propose selective corpus interleaving that combines two separate synthetic datasets by taking only the advantageous samples to enhance the quality of the synthetic data further. Experimental results show that using the synthetic data created by our approach results in significantly better APE performance than other synthetic data created by existing methods.

Submitted 3 June, 2024; v1 submitted 8 April, 2022; originally announced April 2022.

Comments: Accepted to LREC-COLING 2024

arXiv:2203.12940 [pdf, other]

mcBERT: Momentum Contrastive Learning with BERT for Zero-Shot Slot Filling

Authors: Seong-Hwan Heo, WonKee Lee, Jong-Hyeok Lee

Abstract: Zero-shot slot filling has received considerable attention to cope with the problem of limited available data for the target domain. One of the important factors in zero-shot learning is to make the model learn generalized and reliable representations. For this purpose, we present mcBERT, which stands for momentum contrastive learning with BERT, to develop a robust zero-shot slot filling model. mcBERT uses BERT to initialize the two encoders, the query encoder and key encoder, and is trained by applying momentum contrastive learning. Our experimental results on the SNIPS benchmark show that mcBERT substantially outperforms the previous models, recording a new state-of-the-art. Besides, we also show that each component composing mcBERT contributes to the performance improvement.

Submitted 28 June, 2022; v1 submitted 24 March, 2022; originally announced March 2022.

Comments: Accepted to INTERSPEECH 2022

arXiv:2110.12555 [pdf, other]

doi 10.1007/978-3-030-87202-1_38

hSDB-instrument: Instrument Localization Database for Laparoscopic and Robotic Surgeries

Authors: Jihun Yoon, Jiwon Lee, Sunghwan Heo, Hayeong Yu, Jayeon Lim, Chi Hyun Song, SeulGi Hong, Seungbum Hong, Bokyung Park, SungHyun Park, Woo ** Hyung, Min-Kook Choi

Abstract: Automated surgical instrument localization is an important technology for understanding the surgical process and analyzing it to provide the surgeon with meaningful guidance during surgery or a surgical index after surgery. We introduce a new dataset that reflects the kinematic characteristics of surgical instruments for automated surgical instrument localization of surgical videos. The hSDB (hutom Surgery DataBase)-instrument dataset consists of instrument localization information from 24 cases of laparoscopic cholecystectomy and 24 cases of robotic gastrectomy. Localization information for all instruments is provided in the form of a bounding box for object detection. To handle the class imbalance problem between instruments, synthesized instruments modeled in Unity for 3D models are included as training data. Besides, for 3D instrument data, a polygon annotation is provided to enable instance segmentation of the tool. To reflect the kinematic characteristics of all instruments, they are annotated with head and body parts for laparoscopic instruments, and with head, wrist, and body parts for robotic instruments separately. Annotation data of assistive tools (specimen bag, needle, etc.) that are frequently used for surgery are also included. Moreover, we provide statistical information on the hSDB-instrument dataset and the baseline localization performances of the object detection networks trained by the MMDetection library and resulting analyses.

Submitted 25 October, 2021; v1 submitted 24 October, 2021; originally announced October 2021.

Comments: https://hsdb-instrument.github.io

Journal ref: MICCAI 2021 pp 393-402

arXiv:2110.12172 [pdf, other]

Scalable Smartphone Cluster for Deep Learning

Authors: Byunggook Na, Jaehee Jang, Seongsik Park, Seijoon Kim, Joonoo Kim, Moon Sik Jeong, Kwang Choon Kim, Seon Heo, Yoonsang Kim, Sungroh Yoon

Abstract: Various deep learning applications on smartphones have been rapidly rising, but training deep neural networks (DNNs) imposes too large a computational burden to be executed on a single smartphone. A portable cluster, which connects smartphones with a wireless network and supports parallel computation using them, can be a potential approach to resolve the issue. However, by our findings, the limitations of wireless communication restrict the cluster size to up to 30 smartphones. Such small-scale clusters have insufficient computational power to train DNNs from scratch. In this paper, we propose a scalable smartphone cluster enabling deep learning training by removing the portability to increase its computational efficiency. The cluster connects 138 Galaxy S10+ devices with a wired network using Ethernet. We implemented large-batch synchronous training of DNNs based on Caffe, a deep learning library. The smartphone cluster yielded 90% of the speed of a P100 when training ResNet-50, and approximately 43x speed-up of a V100 when training MobileNet-v1.

Submitted 23 October, 2021; originally announced October 2021.

Comments: 6 pages

arXiv:2012.15441 [pdf, other]

doi 10.1145/3411764.3445563

DeepTake: Prediction of Driver Takeover Behavior using Multimodal Data

Authors: Erfan Pakdamanian, Shili Sheng, Sonia Baee, Seongkook Heo, Sarit Kraus, Lu Feng

Abstract: Automated vehicles promise a future where drivers can engage in non-driving tasks without hands on the steering wheels for a prolonged period. Nevertheless, automated vehicles may still need to occasionally hand the control back to drivers due to technology limitations and legal requirements. While some systems determine the need for driver takeover using driver context and road condition to initiate a takeover request, studies show that the driver may not react to it. We present DeepTake, a novel deep neural network-based framework that predicts multiple aspects of takeover behavior to ensure that the driver is able to safely take over the control when engaged in non-driving tasks. Using features from vehicle data, driver biometrics, and subjective measurements, DeepTake predicts the driver's intention, time, and quality of takeover. We evaluate DeepTake performance using multiple evaluation metrics. Results show that DeepTake reliably predicts the takeover intention, time, and quality, with an accuracy of 96%, 93%, and 83%, respectively. Results also indicate that DeepTake outperforms previous state-of-the-art methods on predicting driver takeover time and quality. Our findings have implications for the algorithm development of driver monitoring and state detection.

Submitted 15 January, 2021; v1 submitted 30 December, 2020; originally announced December 2020.

Comments: Accepted to CHI 2021

ACM Class: I.2.6; J.4

arXiv:2011.14885 [pdf, ps, other]

Look who's not talking

Authors: Youngki Kwon, Hee Soo Heo, Jaesung Huh, Bong-** Lee, Joon Son Chung

Abstract: The objective of this work is speaker diarisation of speech recordings 'in the wild'. The ability to determine speech segments is a crucial part of diarisation systems, accounting for a large proportion of errors. In this paper, we present a simple but effective solution for speech activity detection based on the speaker embeddings. In particular, we discover that the norm of the speaker embedding is an extremely effective indicator of speech activity. The method does not require an independent model for speech activity detection, and therefore allows speaker diarisation to be performed using a unified representation for both speaker modelling and speech activity detection. We perform a number of experiments on in-house and public datasets, in which our method outperforms popular baselines.

Submitted 30 November, 2020; originally announced November 2020.

Comments: SLT 2021

arXiv:2009.14153 [pdf, other]

Clova Baseline System for the VoxCeleb Speaker Recognition Challenge 2020

Authors: Hee Soo Heo, Bong-** Lee, Jaesung Huh, Joon Son Chung

Abstract: This report describes our submission to the VoxCeleb Speaker Recognition Challenge (VoxSRC) at Interspeech 2020. We perform a careful analysis of speaker recognition models based on the popular ResNet architecture, and train a number of variants using a range of loss functions. Our results show significant improvements over most existing works without the use of model ensemble or post-processing. We release the training code and pre-trained models as unofficial baselines for this year's challenge.

Submitted 29 September, 2020; originally announced September 2020.

arXiv:2007.12085 [pdf, other]

Augmentation adversarial training for self-supervised speaker recognition

Authors: Jaesung Huh, Hee Soo Heo, **gu Kang, Shinji Watanabe, Joon Son Chung

Abstract: The goal of this work is to train robust speaker recognition models without speaker labels. Recent works on unsupervised speaker representations are based on contrastive learning in which they encourage within-utterance embeddings to be similar and across-utterance embeddings to be dissimilar. However, since the within-utterance segments share the same acoustic characteristics, it is difficult to separate the speaker information from the channel information. To this end, we propose an augmentation adversarial training strategy that trains the network to be discriminative for the speaker information, while invariant to the augmentation applied. Since the augmentation simulates the acoustic characteristics, training the network to be invariant to augmentation also encourages the network to be invariant to the channel information in general. Extensive experiments on the VoxCeleb and VOiCES datasets show significant improvements over previous works using self-supervision, and the performance of our self-supervised models far exceeds that of humans.

Submitted 30 October, 2020; v1 submitted 23 July, 2020; originally announced July 2020.

Comments: Workshop on Self-Supervised Learning for Speech and Audio Processing, NeurIPS

arXiv:2005.08606 [pdf, other]

End-to-End Lip Synchronisation Based on Pattern Classification

Authors: You ** Kim, Hee Soo Heo, Soo-Whan Chung, Bong-** Lee

Abstract: The goal of this work is to synchronise audio and video of a talking face using deep neural network models. Existing works have trained networks on proxy tasks such as cross-modal similarity learning, and then computed similarities between audio and video frames using a sliding window approach. While these methods demonstrate satisfactory performance, the networks are not trained directly on the task. To this end, we propose an end-to-end trained network that can directly predict the offset between an audio stream and the corresponding video stream. The similarity matrix between the two modalities is first computed from the features, then the inference of the offset can be considered to be a pattern recognition problem where the matrix is considered equivalent to an image. The feature extractor and the classifier are trained jointly. We demonstrate that the proposed approach outperforms the previous work by a large margin on LRS2 and LRS3 datasets.

Submitted 19 March, 2021; v1 submitted 18 May, 2020; originally announced May 2020.

Comments: Accepted to SLT 2021

arXiv:2003.11982 [pdf, ps, other]

doi 10.21437/Interspeech.2020-1064

In defence of metric learning for speaker recognition

Authors: Joon Son Chung, Jaesung Huh, Seongkyu Mun, Minjae Lee, Hee Soo Heo, Soyeon Choe, Chiheon Ham, Sunghwan Jung, Bong-** Lee, Icksang Han

Abstract: The objective of this paper is 'open-set' speaker recognition of unseen speakers, where ideal embeddings should be able to condense information into a compact utterance-level representation that has small intra-speaker and large inter-speaker distance. A popular belief in speaker recognition is that networks trained with classification objectives outperform metric learning methods. In this paper, we present an extensive evaluation of the most popular loss functions for speaker recognition on the VoxCeleb dataset. We demonstrate that the vanilla triplet loss shows competitive performance compared to classification-based losses, and those trained with our proposed metric learning objective outperform state-of-the-art methods.

Submitted 24 April, 2020; v1 submitted 26 March, 2020; originally announced March 2020.

Comments: The code can be found at https://github.com/clovaai/voxceleb_trainer

arXiv:1803.06032 [pdf]

doi 10.1145/3173574.3174040

You Watch, You Give, and You Engage: A Study of Live Streaming Practices in China

Authors: Zhicong Lu, Haijun Xia, Seongkook Heo, Daniel Wigdor

Abstract: Despite gaining traction in North America, live streaming has not reached the popularity it has in China, where live streaming has a tremendous impact on the social behaviors of users. To better understand this socio-technological phenomenon, we conducted a mixed methods study of live streaming practices in China. We present the results of an online survey of 527 live streaming users, focusing on their broadcasting or viewing practices and the experiences they find most engaging. We also interviewed 14 active users to explore their motivations and experiences. Our data revealed the different categories of content that were broadcast and how varying aspects of this content engaged viewers. We also gained insight into the role reward systems and fan group-chat play in engaging users, while also finding evidence that both viewers and streamers desire deeper channels and mechanisms for interaction in addition to the commenting, gifting, and fan groups that are available today.

Submitted 15 March, 2018; originally announced March 2018.

Comments: Published at ACM CHI Conference on Human Factors in Computing Systems (CHI 2018). Please cite the CHI version

ACM Class: H.5.m

Journal ref: Zhicong Lu, Haijun Xia, Seongkook Heo, and Daniel Wigdor. 2018. You Watch, You Give, and You Engage: A Study of Live Streaming Practices in China. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (CHI '18)

Showing 1–19 of 19 results for author: Heo, S