Search | arXiv e-print repository

EHR-Based Mobile and Web Platform for Chronic Disease Risk Prediction Using Large Language Multimodal Models

Authors: Chun-Chieh Liao, Wei-Ting Kuo, I-Hsuan Hu, Yen-Chen Shih, Jun-En Ding, Feng Liu, Fang-Ming Hung

Abstract: Traditional diagnosis of chronic diseases involves in-person consultations with physicians to identify the disease. However, there is a lack of research focused on predicting and develo** application systems using clinical notes and blood test values. We collected five years of Electronic Health Records (EHRs) from Taiwan's hospital database between 2017 and 2021 as an AI database. Furthermore,… ▽ More Traditional diagnosis of chronic diseases involves in-person consultations with physicians to identify the disease. However, there is a lack of research focused on predicting and develo** application systems using clinical notes and blood test values. We collected five years of Electronic Health Records (EHRs) from Taiwan's hospital database between 2017 and 2021 as an AI database. Furthermore, we developed an EHR-based chronic disease prediction platform utilizing Large Language Multimodal Models (LLMMs), successfully integrating with frontend web and mobile applications for prediction. This prediction platform can also connect to the hospital's backend database, providing physicians with real-time risk assessment diagnostics. The demonstration link can be found at https://www.youtube.com/watch?v=oqmL9DEDFgA. △ Less

Submitted 26 June, 2024; originally announced June 2024.

arXiv:2406.13578 [pdf, other]

Enhancing Distractor Generation for Multiple-Choice Questions with Retrieval Augmented Pretraining and Knowledge Graph Integration

Authors: Han-Cheng Yu, Yu-An Shih, Kin-Man Law, Kai-Yu Hsieh, Yu-Chen Cheng, Hsin-Chih Ho, Zih-An Lin, Wen-Chuan Hsu, Yao-Chung Fan

Abstract: In this paper, we tackle the task of distractor generation (DG) for multiple-choice questions. Our study introduces two key designs. First, we propose \textit{retrieval augmented pretraining}, which involves refining the language model pretraining to align it more closely with the downstream task of DG. Second, we explore the integration of knowledge graphs to enhance the performance of DG. Throug… ▽ More In this paper, we tackle the task of distractor generation (DG) for multiple-choice questions. Our study introduces two key designs. First, we propose \textit{retrieval augmented pretraining}, which involves refining the language model pretraining to align it more closely with the downstream task of DG. Second, we explore the integration of knowledge graphs to enhance the performance of DG. Through experiments with benchmarking datasets, we show that our models significantly outperform the state-of-the-art results. Our best-performing model advances the F1@3 score from 14.80 to 16.47 in MCQ dataset and from 15.92 to 16.50 in Sciq dataset. △ Less

Submitted 19 June, 2024; originally announced June 2024.

Comments: Findings at ACL 2024

arXiv:2406.12209 [pdf, other]

Interface Design for Self-Supervised Speech Models

Authors: Yi-Jen Shih, David Harwath

Abstract: Self-supervised speech (SSL) models have recently become widely adopted for many downstream speech processing tasks. The general usage pattern is to employ SSL models as feature extractors, and then train a downstream prediction head to solve a specific task. However, different layers of SSL models have been shown to capture different types of information, and the methods of combining them are not… ▽ More Self-supervised speech (SSL) models have recently become widely adopted for many downstream speech processing tasks. The general usage pattern is to employ SSL models as feature extractors, and then train a downstream prediction head to solve a specific task. However, different layers of SSL models have been shown to capture different types of information, and the methods of combining them are not well studied. To this end, we extend the general framework for SSL model utilization by proposing the interface that connects the upstream and downstream. Under this view, the dominant technique of combining features via a layerwise weighted sum can be regarded as a specific interface. We propose several alternative interface designs and demonstrate that the weighted sum interface is suboptimal for many tasks. In particular, we show that a convolutional interface whose depth scales logarithmically with the depth of the upstream model consistently outperforms many other interface designs. △ Less

Submitted 17 June, 2024; originally announced June 2024.

Comments: Accepted to Interspeech2024

arXiv:2406.00936 [pdf, other]

A Survey of Useful LLM Evaluation

Authors: Ji-Lun Peng, Sijia Cheng, Egil Diau, Yung-Yu Shih, Po-Heng Chen, Yen-Ting Lin, Yun-Nung Chen

Abstract: LLMs have gotten attention across various research domains due to their exceptional performance on a wide range of complex tasks. Therefore, refined methods to evaluate the capabilities of LLMs are needed to determine the tasks and responsibility they should undertake. Our study mainly discussed how LLMs, as useful tools, should be effectively assessed. We proposed the two-stage framework: from ``… ▽ More LLMs have gotten attention across various research domains due to their exceptional performance on a wide range of complex tasks. Therefore, refined methods to evaluate the capabilities of LLMs are needed to determine the tasks and responsibility they should undertake. Our study mainly discussed how LLMs, as useful tools, should be effectively assessed. We proposed the two-stage framework: from ``core ability'' to ``agent'', clearly explaining how LLMs can be applied based on their specific capabilities, along with the evaluation methods in each stage. Core ability refers to the capabilities that LLMs need in order to generate high-quality natural language texts. After confirming LLMs possess core ability, they can solve real-world and complex tasks as agent. In the "core ability" stage, we discussed the reasoning ability, societal impact, and domain knowledge of LLMs. In the ``agent'' stage, we demonstrated embodied action, planning, and tool learning of LLMs agent applications. Finally, we examined the challenges currently confronting the evaluation methods for LLMs, as well as the directions for future development. △ Less

Submitted 2 June, 2024; originally announced June 2024.

arXiv:2405.02260 [pdf, other]

Leveraging Large Language Models to Enhance Domain Expert Inclusion in Data Science Workflows

Authors: Jasmine Y. Shih, Vishal Mohanty, Yannis Katsis, Hariharan Subramonyam

Abstract: Domain experts can play a crucial role in guiding data scientists to optimize machine learning models while ensuring contextual relevance for downstream use. However, in current workflows, such collaboration is challenging due to differing expertise, abstract documentation practices, and lack of access and visibility into low-level implementation artifacts. To address these challenges and enable d… ▽ More Domain experts can play a crucial role in guiding data scientists to optimize machine learning models while ensuring contextual relevance for downstream use. However, in current workflows, such collaboration is challenging due to differing expertise, abstract documentation practices, and lack of access and visibility into low-level implementation artifacts. To address these challenges and enable domain expert participation, we introduce CellSync, a collaboration framework comprising (1) a Jupyter Notebook extension that continuously tracks changes to dataframes and model metrics and (2) a Large Language Model powered visualization dashboard that makes those changes interpretable to domain experts. Through CellSync's cell-level dataset visualization with code summaries, domain experts can interactively examine how individual data and modeling operations impact different data segments. The chat features enable data-centric conversations and targeted feedback to data scientists. Our preliminary evaluation shows that CellSync provides transparency and promotes critical discussions about the intents and implications of data operations. △ Less

Submitted 3 May, 2024; originally announced May 2024.

arXiv:2403.14711 [pdf, other]

Human-in-the-Loop AI for Cheating Ring Detection

Authors: Yong-Siang Shih, Manqian Liao, Ruidong Liu, Mirza Basim Baig

Abstract: Online exams have become popular in recent years due to their accessibility. However, some concerns have been raised about the security of the online exams, particularly in the context of professional cheating services aiding malicious test takers in passing exams, forming so-called "cheating rings". In this paper, we introduce a human-in-the-loop AI cheating ring detection system designed to dete… ▽ More Online exams have become popular in recent years due to their accessibility. However, some concerns have been raised about the security of the online exams, particularly in the context of professional cheating services aiding malicious test takers in passing exams, forming so-called "cheating rings". In this paper, we introduce a human-in-the-loop AI cheating ring detection system designed to detect and deter these cheating rings. We outline the underlying logic of this human-in-the-loop AI system, exploring its design principles tailored to achieve its objectives of detecting cheaters. Moreover, we illustrate the methodologies used to evaluate its performance and fairness, aiming to mitigate the unintended risks associated with the AI system. The design and development of the system adhere to Responsible AI (RAI) standards, ensuring that ethical considerations are integrated throughout the entire development process. △ Less

Submitted 18 March, 2024; originally announced March 2024.

Comments: Accepted to the AI4Ed Workshop at AAAI 2024 as a short paper

arXiv:2403.03218 [pdf, other]

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

Authors: Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew B. Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Adam Khoja, Zhenqi Zhao, Ariel Herbert-Voss, Cort B. Breuer , et al. (32 additional authors not shown)

Abstract: The White House Executive Order on Artificial Intelligence highlights the risks of large language models (LLMs) empowering malicious actors in develo** biological, cyber, and chemical weapons. To measure these risks of malicious use, government institutions and major AI labs are develo** evaluations for hazardous capabilities in LLMs. However, current evaluations are private, preventing furthe… ▽ More The White House Executive Order on Artificial Intelligence highlights the risks of large language models (LLMs) empowering malicious actors in develo** biological, cyber, and chemical weapons. To measure these risks of malicious use, government institutions and major AI labs are develo** evaluations for hazardous capabilities in LLMs. However, current evaluations are private, preventing further research into mitigating risk. Furthermore, they focus on only a few, highly specific pathways for malicious use. To fill these gaps, we publicly release the Weapons of Mass Destruction Proxy (WMDP) benchmark, a dataset of 3,668 multiple-choice questions that serve as a proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security. WMDP was developed by a consortium of academics and technical consultants, and was stringently filtered to eliminate sensitive information prior to public release. WMDP serves two roles: first, as an evaluation for hazardous knowledge in LLMs, and second, as a benchmark for unlearning methods to remove such hazardous knowledge. To guide progress on unlearning, we develop RMU, a state-of-the-art unlearning method based on controlling model representations. RMU reduces model performance on WMDP while maintaining general capabilities in areas such as biology and computer science, suggesting that unlearning may be a concrete path towards reducing malicious use from LLMs. We release our benchmark and code publicly at https://wmdp.ai △ Less

Submitted 15 May, 2024; v1 submitted 5 March, 2024; originally announced March 2024.

Comments: See the project page at https://wmdp.ai

arXiv:2402.06959 [pdf, other]

SpeechCLIP+: Self-supervised multi-task representation learning for speech via CLIP and speech-image data

Authors: Hsuan-Fu Wang, Yi-Jen Shih, Heng-Jui Chang, Layne Berry, Puyuan Peng, Hung-yi Lee, Hsin-Min Wang, David Harwath

Abstract: The recently proposed visually grounded speech model SpeechCLIP is an innovative framework that bridges speech and text through images via CLIP without relying on text transcription. On this basis, this paper introduces two extensions to SpeechCLIP. First, we apply the Continuous Integrate-and-Fire (CIF) module to replace a fixed number of CLS tokens in the cascaded architecture. Second, we propos… ▽ More The recently proposed visually grounded speech model SpeechCLIP is an innovative framework that bridges speech and text through images via CLIP without relying on text transcription. On this basis, this paper introduces two extensions to SpeechCLIP. First, we apply the Continuous Integrate-and-Fire (CIF) module to replace a fixed number of CLS tokens in the cascaded architecture. Second, we propose a new hybrid architecture that merges the cascaded and parallel architectures of SpeechCLIP into a multi-task learning framework. Our experimental evaluation is performed on the Flickr8k and SpokenCOCO datasets. The results show that in the speech keyword extraction task, the CIF-based cascaded SpeechCLIP model outperforms the previous cascaded SpeechCLIP model using a fixed number of CLS tokens. Furthermore, through our hybrid architecture, cascaded task learning boosts the performance of the parallel branch in image-speech retrieval tasks. △ Less

Submitted 10 February, 2024; originally announced February 2024.

Comments: Accepted to ICASSP 2024, Self-supervision in Audio, Speech, and Beyond (SASB) workshop

arXiv:2402.05819 [pdf, other]

Integrating Self-supervised Speech Model with Pseudo Word-level Targets from Visually-grounded Speech Model

Authors: Hung-Chieh Fang, Nai-Xuan Ye, Yi-Jen Shih, Puyuan Peng, Hsuan-Fu Wang, Layne Berry, Hung-yi Lee, David Harwath

Abstract: Recent advances in self-supervised speech models have shown significant improvement in many downstream tasks. However, these models predominantly centered on frame-level training objectives, which can fall short in spoken language understanding tasks that require semantic comprehension. Existing works often rely on additional speech-text data as intermediate targets, which is costly in the real-wo… ▽ More Recent advances in self-supervised speech models have shown significant improvement in many downstream tasks. However, these models predominantly centered on frame-level training objectives, which can fall short in spoken language understanding tasks that require semantic comprehension. Existing works often rely on additional speech-text data as intermediate targets, which is costly in the real-world setting. To address this challenge, we propose Pseudo-Word HuBERT (PW-HuBERT), a framework that integrates pseudo word-level targets into the training process, where the targets are derived from a visually-ground speech model, notably eliminating the need for speech-text paired data. Our experimental results on four spoken language understanding (SLU) benchmarks suggest the superiority of our model in capturing semantic information. △ Less

Submitted 8 February, 2024; originally announced February 2024.

Comments: Accepted to ICASSP 2024 workshop on Self-supervision in Audio, Speech, and Beyond (SASB)

arXiv:2401.01461 [pdf, other]

Efficient Hybrid Zoom using Camera Fusion on Mobile Phones

Authors: Xiaotong Wu, Wei-Sheng Lai, YiChang Shih, Charles Herrmann, Michael Krainin, Deqing Sun, Chia-Kai Liang

Abstract: DSLR cameras can achieve multiple zoom levels via shifting lens distances or swap** lens types. However, these techniques are not possible on smartphone devices due to space constraints. Most smartphone manufacturers adopt a hybrid zoom system: commonly a Wide (W) camera at a low zoom level and a Telephoto (T) camera at a high zoom level. To simulate zoom levels between W and T, these systems cr… ▽ More DSLR cameras can achieve multiple zoom levels via shifting lens distances or swap** lens types. However, these techniques are not possible on smartphone devices due to space constraints. Most smartphone manufacturers adopt a hybrid zoom system: commonly a Wide (W) camera at a low zoom level and a Telephoto (T) camera at a high zoom level. To simulate zoom levels between W and T, these systems crop and digitally upsample images from W, leading to significant detail loss. In this paper, we propose an efficient system for hybrid zoom super-resolution on mobile devices, which captures a synchronous pair of W and T shots and leverages machine learning models to align and transfer details from T to W. We further develop an adaptive blending method that accounts for depth-of-field mismatches, scene occlusion, flow uncertainty, and alignment errors. To minimize the domain gap, we design a dual-phone camera rig to capture real-world inputs and ground-truths for supervised training. Our method generates a 12-megapixel image in 500ms on a mobile platform and compares favorably against state-of-the-art methods under extensive evaluation on real-world scenarios. △ Less

Submitted 2 January, 2024; originally announced January 2024.

Comments: Accepted to SIGGRAPH Asia 2023 (ACM TOG). Project website: https://www.wslai.net/publications/fusion_zoom

arXiv:2311.18695 [pdf, other]

Seg2Reg: Differentiable 2D Segmentation to 1D Regression Rendering for 360 Room Layout Reconstruction

Authors: Cheng Sun, Wei-En Tai, Yu-Lin Shih, Kuan-Wei Chen, Yong-**g Syu, Kent Selwyn The, Yu-Chiang Frank Wang, Hwann-Tzong Chen

Abstract: State-of-the-art single-view 360-degree room layout reconstruction methods formulate the problem as a high-level 1D (per-column) regression task. On the other hand, traditional low-level 2D layout segmentation is simpler to learn and can represent occluded regions, but it requires complex post-processing for the targeting layout polygon and sacrifices accuracy. We present Seg2Reg to render 1D layo… ▽ More State-of-the-art single-view 360-degree room layout reconstruction methods formulate the problem as a high-level 1D (per-column) regression task. On the other hand, traditional low-level 2D layout segmentation is simpler to learn and can represent occluded regions, but it requires complex post-processing for the targeting layout polygon and sacrifices accuracy. We present Seg2Reg to render 1D layout depth regression from the 2D segmentation map in a differentiable and occlusion-aware way, marrying the merits of both sides. Specifically, our model predicts floor-plan density for the input equirectangular 360-degree image. Formulating the 2D layout representation as a density field enables us to employ `flattened' volume rendering to form 1D layout depth regression. In addition, we propose a novel 3D war** augmentation on layout to improve generalization. Finally, we re-implement recent room layout reconstruction methods into our codebase for benchmarking and explore modern backbones and training techniques to serve as the strong baseline. Our model significantly outperforms previous arts. The code will be made available upon publication. △ Less

Submitted 30 November, 2023; originally announced November 2023.

arXiv:2311.05477 [pdf, other]

Using ResNet to Utilize 4-class T2-FLAIR Slice Classification Based on the Cholinergic Pathways Hyperintensities Scale for Pathological Aging

Authors: Wei-Chun Kevin Tsai, Yi-Chien Liu, Ming-Chun Yu, Chia-Ju Chou, Sui-Hing Yan, Yang-Teng Fan, Yan-Hsiang Huang, Yen-Ling Chiu, Yi-Fang Chuang, Ran-Zan Wang, Yao-Chia Shih

Abstract: The Cholinergic Pathways Hyperintensities Scale (CHIPS) is a visual rating scale used to assess the extent of cholinergic white matter hyperintensities in T2-FLAIR images, serving as an indicator of dementia severity. However, the manual selection of four specific slices for rating throughout the entire brain is a time-consuming process. Our goal was to develop a deep learning-based model capable… ▽ More The Cholinergic Pathways Hyperintensities Scale (CHIPS) is a visual rating scale used to assess the extent of cholinergic white matter hyperintensities in T2-FLAIR images, serving as an indicator of dementia severity. However, the manual selection of four specific slices for rating throughout the entire brain is a time-consuming process. Our goal was to develop a deep learning-based model capable of automatically identifying the four slices relevant to CHIPS. To achieve this, we trained a 4-class slice classification model (BSCA) using the ADNI T2-FLAIR dataset (N=150) with the assistance of ResNet. Subsequently, we tested the model's performance on a local dataset (N=30). The results demonstrated the efficacy of our model, with an accuracy of 99.82% and an F1-score of 99.83%. This achievement highlights the potential impact of BSCA as an automatic screening tool, streamlining the selection of four specific T2-FLAIR slices that encompass white matter landmarks along the cholinergic pathways. Clinicians can leverage this tool to assess the risk of clinical dementia development efficiently. △ Less

Submitted 9 November, 2023; originally announced November 2023.

Comments: 8 pages, 2 figures, 2 tables

arXiv:2309.10787 [pdf, other]

AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models

Authors: Yuan Tseng, Layne Berry, Yi-Ting Chen, I-Hsiang Chiu, Hsuan-Hao Lin, Max Liu, Puyuan Peng, Yi-Jen Shih, Hung-Yu Wang, Haibin Wu, Po-Yao Huang, Chun-Mao Lai, Shang-Wen Li, David Harwath, Yu Tsao, Shinji Watanabe, Abdelrahman Mohamed, Chi-Luen Feng, Hung-yi Lee

Abstract: Audio-visual representation learning aims to develop systems with human-like perception by utilizing correlation between auditory and visual information. However, current models often focus on a limited set of tasks, and generalization abilities of learned representations are unclear. To this end, we propose the AV-SUPERB benchmark that enables general-purpose evaluation of unimodal audio/visual a… ▽ More Audio-visual representation learning aims to develop systems with human-like perception by utilizing correlation between auditory and visual information. However, current models often focus on a limited set of tasks, and generalization abilities of learned representations are unclear. To this end, we propose the AV-SUPERB benchmark that enables general-purpose evaluation of unimodal audio/visual and bimodal fusion representations on 7 datasets covering 5 audio-visual tasks in speech and audio processing. We evaluate 5 recent self-supervised models and show that none of these models generalize to all tasks, emphasizing the need for future study on improving universal model performance. In addition, we show that representations may be improved with intermediate-task fine-tuning and audio event classification with AudioSet serves as a strong intermediate task. We release our benchmark with evaluation code and a model submission platform to encourage further research in audio-visual learning. △ Less

Submitted 19 March, 2024; v1 submitted 19 September, 2023; originally announced September 2023.

Comments: Accepted to ICASSP 2024; Evaluation Code: https://github.com/roger-tseng/av-superb Submission Platform: https://av.superbbenchmark.org

arXiv:2307.08230 [pdf, other]

Image-based Regularization for Action Smoothness in Autonomous Miniature Racing Car with Deep Reinforcement Learning

Authors: Hoang-Giang Cao, I Lee, Bo-Jiun Hsu, Zheng-Yi Lee, Yu-Wei Shih, Hsueh-Cheng Wang, I-Chen Wu

Abstract: Deep reinforcement learning has achieved significant results in low-level controlling tasks. However, for some applications like autonomous driving and drone flying, it is difficult to control behavior stably since the agent may suddenly change its actions which often lowers the controlling system's efficiency, induces excessive mechanical wear, and causes uncontrollable, dangerous behavior to the… ▽ More Deep reinforcement learning has achieved significant results in low-level controlling tasks. However, for some applications like autonomous driving and drone flying, it is difficult to control behavior stably since the agent may suddenly change its actions which often lowers the controlling system's efficiency, induces excessive mechanical wear, and causes uncontrollable, dangerous behavior to the vehicle. Recently, a method called conditioning for action policy smoothness (CAPS) was proposed to solve the problem of jerkiness in low-dimensional features for applications such as quadrotor drones. To cope with high-dimensional features, this paper proposes image-based regularization for action smoothness (I-RAS) for solving jerky control in autonomous miniature car racing. We also introduce a control based on impact ratio, an adaptive regularization weight to control the smoothness constraint, called IR control. In the experiment, an agent with I-RAS and IR control significantly improves the success rate from 59% to 95%. In the real-world-track experiment, the agent also outperforms other methods, namely reducing the average finish lap time, while also improving the completion rate even without real world training. This is also justified by an agent based on I-RAS winning the 2022 AWS DeepRacer Final Championship Cup. △ Less

Submitted 10 August, 2023; v1 submitted 17 July, 2023; originally announced July 2023.

Comments: Accepted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)2023

arXiv:2307.05564 [pdf, other]

Augmenters at SemEval-2023 Task 1: Enhancing CLIP in Handling Compositionality and Ambiguity for Zero-Shot Visual WSD through Prompt Augmentation and Text-To-Image Diffusion

Authors: Jie S. Li, Yow-Ting Shiue, Yong-Siang Shih, Jonas Gei**

Abstract: This paper describes our zero-shot approaches for the Visual Word Sense Disambiguation (VWSD) Task in English. Our preliminary study shows that the simple approach of matching candidate images with the phrase using CLIP suffers from the many-to-many nature of image-text pairs. We find that the CLIP text encoder may have limited abilities in capturing the compositionality in natural language. Conve… ▽ More This paper describes our zero-shot approaches for the Visual Word Sense Disambiguation (VWSD) Task in English. Our preliminary study shows that the simple approach of matching candidate images with the phrase using CLIP suffers from the many-to-many nature of image-text pairs. We find that the CLIP text encoder may have limited abilities in capturing the compositionality in natural language. Conversely, the descriptive focus of the phrase varies from instance to instance. We address these issues in our two systems, Augment-CLIP and Stable Diffusion Sampling (SD Sampling). Augment-CLIP augments the text prompt by generating sentences that contain the context phrase with the help of large language models (LLMs). We further explore CLIP models in other languages, as the an ambiguous word may be translated into an unambiguous one in the other language. SD Sampling uses text-to-image Stable Diffusion to generate multiple images from the given phrase, increasing the likelihood that a subset of images match the one that paired with the text. △ Less

Submitted 9 July, 2023; originally announced July 2023.

Comments: Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

arXiv:2211.01180 [pdf, other]

M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval

Authors: Layne Berry, Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Hung-yi Lee, David Harwath

Abstract: This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval. For non-English image-speech retrieval, we outperform the current state-of-the-art performance by a wide margin both when training separate models for each language, and with a single model which processes speech in all three languages. We identify key differenc… ▽ More This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval. For non-English image-speech retrieval, we outperform the current state-of-the-art performance by a wide margin both when training separate models for each language, and with a single model which processes speech in all three languages. We identify key differences in model behavior and performance between English and non-English settings, attributable to the English-only pre-training of CLIP and HuBERT, and investigate how fine-tuning the pre-trained models impacts these differences. Finally, we show that our models can be used for mono- and cross-lingual speech-text retrieval and cross-lingual speech-speech retrieval, despite never having seen any parallel speech-text or speech-speech data during training. △ Less

Submitted 10 April, 2023; v1 submitted 2 November, 2022; originally announced November 2022.

Comments: Accepted to ICASSP 2023

arXiv:2210.00705 [pdf, other]

SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model

Authors: Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Layne Berry, Hung-yi Lee, David Harwath

Abstract: Data-driven speech processing models usually perform well with a large amount of text supervision, but collecting transcribed speech data is costly. Therefore, we propose SpeechCLIP, a novel framework bridging speech and text through images to enhance speech models without transcriptions. We leverage state-of-the-art pre-trained HuBERT and CLIP, aligning them via paired images and spoken captions… ▽ More Data-driven speech processing models usually perform well with a large amount of text supervision, but collecting transcribed speech data is costly. Therefore, we propose SpeechCLIP, a novel framework bridging speech and text through images to enhance speech models without transcriptions. We leverage state-of-the-art pre-trained HuBERT and CLIP, aligning them via paired images and spoken captions with minimal fine-tuning. SpeechCLIP outperforms prior state-of-the-art on image-speech retrieval and performs zero-shot speech-text retrieval without direct supervision from transcriptions. Moreover, SpeechCLIP can directly retrieve semantically related keywords from speech. △ Less

Submitted 25 October, 2022; v1 submitted 3 October, 2022; originally announced October 2022.

Comments: Accepted to IEEE SLT 2022

arXiv:2207.11617 [pdf, other]

Face Deblurring using Dual Camera Fusion on Mobile Phones

Authors: Wei-Sheng Lai, YiChang Shih, Lun-Cheng Chu, Xiaotong Wu, Sung-Fang Tsai, Michael Krainin, Deqing Sun, Chia-Kai Liang

Abstract: Motion blur of fast-moving subjects is a longstanding problem in photography and very common on mobile phones due to limited light collection efficiency, particularly in low-light conditions. While we have witnessed great progress in image deblurring in recent years, most methods require significant computational power and have limitations in processing high-resolution photos with severe local mot… ▽ More Motion blur of fast-moving subjects is a longstanding problem in photography and very common on mobile phones due to limited light collection efficiency, particularly in low-light conditions. While we have witnessed great progress in image deblurring in recent years, most methods require significant computational power and have limitations in processing high-resolution photos with severe local motions. To this end, we develop a novel face deblurring system based on the dual camera fusion technique for mobile phones. The system detects subject motion to dynamically enable a reference camera, e.g., ultrawide angle camera commonly available on recent premium phones, and captures an auxiliary photo with faster shutter settings. While the main shot is low noise but blurry, the reference shot is sharp but noisy. We learn ML models to align and fuse these two shots and output a clear photo without motion blur. Our algorithm runs efficiently on Google Pixel 6, which takes 463 ms overhead per shot. Our experiments demonstrate the advantage and robustness of our system against alternative single-image, multi-frame, face-specific, and video deblurring algorithms as well as commercial products. To the best of our knowledge, our work is the first mobile solution for face motion deblurring that works reliably and robustly over thousands of images in diverse motion and lighting conditions. △ Less

Submitted 23 July, 2022; originally announced July 2022.

Comments: Accepted to SIGGRAPH 2022 (ACM TOG). Project websit: https://www.wslai.net/publications/fusion_deblur/

arXiv:2207.05736 [pdf, other]

Vision Transformer for NeRF-Based View Synthesis from a Single Input Image

Authors: Kai-En Lin, Lin Yen-Chen, Wei-Sheng Lai, Tsung-Yi Lin, Yi-Chang Shih, Ravi Ramamoorthi

Abstract: Although neural radiance fields (NeRF) have shown impressive advances for novel view synthesis, most methods typically require multiple input images of the same scene with accurate camera poses. In this work, we seek to substantially reduce the inputs to a single unposed image. Existing approaches condition on local image features to reconstruct a 3D object, but often render blurry predictions at… ▽ More Although neural radiance fields (NeRF) have shown impressive advances for novel view synthesis, most methods typically require multiple input images of the same scene with accurate camera poses. In this work, we seek to substantially reduce the inputs to a single unposed image. Existing approaches condition on local image features to reconstruct a 3D object, but often render blurry predictions at viewpoints that are far away from the source view. To address this issue, we propose to leverage both the global and local features to form an expressive 3D representation. The global features are learned from a vision transformer, while the local features are extracted from a 2D convolutional network. To synthesize a novel view, we train a multilayer perceptron (MLP) network conditioned on the learned 3D representation to perform volume rendering. This novel 3D representation allows the network to reconstruct unseen regions without enforcing constraints like symmetry or canonical coordinate systems. Our method can render novel views from only a single input image and generalize across multiple object categories using a single model. Quantitative and qualitative evaluations demonstrate that the proposed method achieves state-of-the-art performance and renders richer details than existing approaches. △ Less

Submitted 13 October, 2022; v1 submitted 12 July, 2022; originally announced July 2022.

Comments: WACV 2023 Project website: https://cseweb.ucsd.edu/~viscomp/projects/VisionNeRF/

arXiv:2203.04674 [pdf]

Deep learning-based reconstruction of highly accelerated 3D MRI

Authors: Sangtae Ahn, Uri Wollner, Graeme McKinnon, Isabelle Heukensfeldt Jansen, Rafi Brada, Dan Rettmann, Ty A. Cashen, John Huston, J. Kevin DeMarco, Robert Y. Shih, Joshua D. Trzasko, Christopher J. Hardy, Thomas K. F. Foo

Abstract: Purpose: To accelerate brain 3D MRI scans by using a deep learning method for reconstructing images from highly-undersampled multi-coil k-space data Methods: DL-Speed, an unrolled optimization architecture with dense skip-layer connections, was trained on 3D T1-weighted brain scan data to reconstruct complex-valued images from highly-undersampled k-space data. The trained model was evaluated on… ▽ More Purpose: To accelerate brain 3D MRI scans by using a deep learning method for reconstructing images from highly-undersampled multi-coil k-space data Methods: DL-Speed, an unrolled optimization architecture with dense skip-layer connections, was trained on 3D T1-weighted brain scan data to reconstruct complex-valued images from highly-undersampled k-space data. The trained model was evaluated on 3D MPRAGE brain scan data retrospectively-undersampled with a 10-fold acceleration, compared to a conventional parallel imaging method with a 2-fold acceleration. Scores of SNR, artifacts, gray/white matter contrast, resolution/sharpness, deep gray-matter, cerebellar vermis, anterior commissure, and overall quality, on a 5-point Likert scale, were assessed by experienced radiologists. In addition, the trained model was tested on retrospectively-undersampled 3D T1-weighted LAVA (Liver Acquisition with Volume Acceleration) abdominal scan data, and prospectively-undersampled 3D MPRAGE and LAVA scans in three healthy volunteers and one, respectively. Results: The qualitative scores for DL-Speed with a 10-fold acceleration were higher than or equal to those for the parallel imaging with 2-fold acceleration. DL-Speed outperformed a compressed sensing method in quantitative metrics on retrospectively-undersampled LAVA data. DL-Speed was demonstrated to perform reasonably well on prospectively-undersampled scan data, realizing a 2-5 times reduction in scan time. Conclusion: DL-Speed was shown to accelerate 3D MPRAGE and LAVA with up to a net 10-fold acceleration, achieving 2-5 times faster scans compared to conventional parallel imaging and acceleration, while maintaining diagnostic image quality and real-time reconstruction. The brain scan data-trained DL-Speed also performed well when reconstructing abdominal LAVA scan data, demonstrating versatility of the network. △ Less

Submitted 9 March, 2022; originally announced March 2022.

Comments: 8 pages, 8 figures

ACM Class: I.2.6; J.2

arXiv:2203.01581 [pdf, other]

A shallow physics-informed neural network for solving partial differential equations on surfaces

Authors: Wei-Fan Hu, Yi-Jun Shih, Te-Sheng Lin, Ming-Chih Lai

Abstract: In this paper, we introduce a shallow (one-hidden-layer) physics-informed neural network for solving partial differential equations on static and evolving surfaces. For the static surface case, with the aid of level set function, the surface normal and mean curvature used in the surface differential expressions can be computed easily. So instead of imposing the normal extension constraints used in… ▽ More In this paper, we introduce a shallow (one-hidden-layer) physics-informed neural network for solving partial differential equations on static and evolving surfaces. For the static surface case, with the aid of level set function, the surface normal and mean curvature used in the surface differential expressions can be computed easily. So instead of imposing the normal extension constraints used in literature, we write the surface differential operators in the form of traditional Cartesian differential operators and use them in the loss function directly. We perform a series of performance study for the present methodology by solving Laplace-Beltrami equation and surface diffusion equation on complex static surfaces. With just a moderate number of neurons used in the hidden layer, we are able to attain satisfactory prediction results. Then we extend the present methodology to solve the advection-diffusion equation on an evolving surface with given velocity. To track the surface, we additionally introduce a prescribed hidden layer to enforce the topological structure of the surface and use the network to learn the homeomorphism between the surface and the prescribed topology. The proposed network structure is designed to track the surface and solve the equation simultaneously. Again, the numerical results show comparable accuracy as the static cases. As an application, we simulate the surfactant transport on the droplet surface under shear flow and obtain some physically plausible results. △ Less

Submitted 20 January, 2023; v1 submitted 3 March, 2022; originally announced March 2022.

arXiv:2201.05706 [pdf, other]

Perspective Transformation Layer

Authors: Nishan Khatri, Agnibh Dasgupta, Yucong Shen, Xin Zhong, Frank Y. Shih

Abstract: Incorporating geometric transformations that reflect the relative position changes between an observer and an object into computer vision and deep learning models has attracted much attention in recent years. However, the existing proposals mainly focus on the affine transformation that is insufficient to reflect such geometric position changes. Furthermore, current solutions often apply a neural… ▽ More Incorporating geometric transformations that reflect the relative position changes between an observer and an object into computer vision and deep learning models has attracted much attention in recent years. However, the existing proposals mainly focus on the affine transformation that is insufficient to reflect such geometric position changes. Furthermore, current solutions often apply a neural network module to learn a single transformation matrix, which not only ignores the importance of multi-view analysis but also includes extra training parameters from the module apart from the transformation matrix parameters that increase the model complexity. In this paper, a perspective transformation layer is proposed in the context of deep learning. The proposed layer can learn homography, therefore reflecting the geometric positions between observers and objects. In addition, by directly training its transformation matrices, a single proposed layer can learn an adjustable number of multiple viewpoints without considering module parameters. The experiments and evaluations confirm the superiority of the proposed layer. △ Less

Submitted 30 October, 2022; v1 submitted 14 January, 2022; originally announced January 2022.

Comments: This paper has been accepted for publication by the 2022 International Conference on Computational Science & Computational Intelligence (CSCI'22), Research Track on Signal & Image Processing, Computer Vision & Pattern Recognition

arXiv:2111.09950 [pdf, other]

doi 10.1109/TIP.2021.3131047

Correcting Face Distortion in Wide-Angle Videos

Authors: Wei-Sheng Lai, YiChang Shih, Chia-Kai Liang, Ming-Hsuan Yang

Abstract: Video blogs and selfies are popular social media formats, which are often captured by wide-angle cameras to show human subjects and expanded background. Unfortunately, due to perspective projection, faces near corners and edges exhibit apparent distortions that stretch and squish the facial features, resulting in poor video quality. In this work, we present a video war** algorithm to correct the… ▽ More Video blogs and selfies are popular social media formats, which are often captured by wide-angle cameras to show human subjects and expanded background. Unfortunately, due to perspective projection, faces near corners and edges exhibit apparent distortions that stretch and squish the facial features, resulting in poor video quality. In this work, we present a video war** algorithm to correct these distortions. Our key idea is to apply stereographic projection locally on the facial regions. We formulate a mesh warp problem using spatial-temporal energy minimization and minimize background deformation using a line-preservation term to maintain the straight edges in the background. To address temporal coherency, we constrain the temporal smoothness on the war** meshes and facial trajectories through the latent variables. For performance evaluation, we develop a wide-angle video dataset with a wide range of focal lengths. The user study shows that 83.9% of users prefer our algorithm over other alternatives based on perspective projection. △ Less

Submitted 18 November, 2021; originally announced November 2021.

Comments: Project website: https://www.wslai.net/publications/video_face_correction/

arXiv:2111.04093 [pdf, other]

Theme Transformer: Symbolic Music Generation with Theme-Conditioned Transformer

Authors: Yi-Jen Shih, Shih-Lun Wu, Frank Zalkow, Meinard Müller, Yi-Hsuan Yang

Abstract: Attention-based Transformer models have been increasingly employed for automatic music generation. To condition the generation process of such a model with a user-specified sequence, a popular approach is to take that conditioning sequence as a priming sequence and ask a Transformer decoder to generate a continuation. However, this prompt-based conditioning cannot guarantee that the conditioning s… ▽ More Attention-based Transformer models have been increasingly employed for automatic music generation. To condition the generation process of such a model with a user-specified sequence, a popular approach is to take that conditioning sequence as a priming sequence and ask a Transformer decoder to generate a continuation. However, this prompt-based conditioning cannot guarantee that the conditioning sequence would develop or even simply repeat itself in the generated continuation. In this paper, we propose an alternative conditioning approach, called theme-based conditioning, that explicitly trains the Transformer to treat the conditioning sequence as a thematic material that has to manifest itself multiple times in its generation result. This is achieved with two main technical contributions. First, we propose a deep learning-based approach that uses contrastive representation learning and clustering to automatically retrieve thematic materials from music pieces in the training data. Second, we propose a novel gated parallel attention module to be used in a sequence-to-sequence (seq2seq) encoder/decoder architecture to more effectively account for a given conditioning thematic material in the generation process of the Transformer decoder. We report on objective and subjective evaluations of variants of the proposed Theme Transformer and the conventional prompt-based baseline, showing that our best model can generate, to some extent, polyphonic pop piano music with repetition and plausible variations of a given condition. △ Less

Submitted 21 March, 2022; v1 submitted 7 November, 2021; originally announced November 2021.

Comments: to be published at IEEE Transactions on Multimedia

arXiv:2107.02314 [pdf, other]

The RSNA-ASNR-MICCAI BraTS 2021 Benchmark on Brain Tumor Segmentation and Radiogenomic Classification

Authors: Ujjwal Baid, Satyam Ghodasara, Suyash Mohan, Michel Bilello, Evan Calabrese, Errol Colak, Keyvan Farahani, Jayashree Kalpathy-Cramer, Felipe C. Kitamura, Sarthak Pati, Luciano M. Prevedello, Jeffrey D. Rudie, Chiharu Sako, Russell T. Shinohara, Timothy Bergquist, Rong Chai, James Eddy, Julia Elliott, Walter Reade, Thomas Schaffter, Thomas Yu, Jiaxin Zheng, Ahmed W. Moawad, Luiz Otavio Coelho, Olivia McDonnell , et al. (78 additional authors not shown)

Abstract: The BraTS 2021 challenge celebrates its 10th anniversary and is jointly organized by the Radiological Society of North America (RSNA), the American Society of Neuroradiology (ASNR), and the Medical Image Computing and Computer Assisted Interventions (MICCAI) society. Since its inception, BraTS has been focusing on being a common benchmarking venue for brain glioma segmentation algorithms, with wel… ▽ More The BraTS 2021 challenge celebrates its 10th anniversary and is jointly organized by the Radiological Society of North America (RSNA), the American Society of Neuroradiology (ASNR), and the Medical Image Computing and Computer Assisted Interventions (MICCAI) society. Since its inception, BraTS has been focusing on being a common benchmarking venue for brain glioma segmentation algorithms, with well-curated multi-institutional multi-parametric magnetic resonance imaging (mpMRI) data. Gliomas are the most common primary malignancies of the central nervous system, with varying degrees of aggressiveness and prognosis. The RSNA-ASNR-MICCAI BraTS 2021 challenge targets the evaluation of computational algorithms assessing the same tumor compartmentalization, as well as the underlying tumor's molecular characterization, in pre-operative baseline mpMRI data from 2,040 patients. Specifically, the two tasks that BraTS 2021 focuses on are: a) the segmentation of the histologically distinct brain tumor sub-regions, and b) the classification of the tumor's O[6]-methylguanine-DNA methyltransferase (MGMT) promoter methylation status. The performance evaluation of all participating algorithms in BraTS 2021 will be conducted through the Sage Bionetworks Synapse platform (Task 1) and Kaggle (Task 2), concluding in distributing to the top ranked participants monetary awards of $60,000 collectively. △ Less

Submitted 12 September, 2021; v1 submitted 5 July, 2021; originally announced July 2021.

Comments: 19 pages, 2 figures, 1 table

arXiv:2107.00820 [pdf, other]

Robust multigrid techniques for augmented Lagrangian preconditioning of incompressible Stokes equations with extreme viscosity variations

Authors: Yu-hsuan Shih, Georg Stadler, Florian Wechsung

Abstract: We present augmented Lagrangian Schur complement preconditioners and robust multigrid methods for incompressible Stokes problems with extreme viscosity variations. Such Stokes systems arise, for instance, upon linearization of nonlinear viscous flow problems, and they can have severely inhomogeneous and anisotropic coefficients. Using an augmented Lagrangian formulation for the incompressibility c… ▽ More We present augmented Lagrangian Schur complement preconditioners and robust multigrid methods for incompressible Stokes problems with extreme viscosity variations. Such Stokes systems arise, for instance, upon linearization of nonlinear viscous flow problems, and they can have severely inhomogeneous and anisotropic coefficients. Using an augmented Lagrangian formulation for the incompressibility constraint makes the Schur complement easier to approximate, but results in a nearly singular (1,1)-block in the Stokes system. We present eigenvalue estimates for the quality of the Schur complement approximation. To cope with the near-singularity of the (1,1)-block, we extend a multigrid scheme with a discretization-dependent smoother and transfer operators from triangular/tetrahedral to the quadrilateral/hexahedral finite element discretizations $[\mathbb{Q}_k]^d\times \mathbb{P}_{k-1}^{\text{disc}}$, $k\geq 2$, $d=2,3$. Using numerical examples with scalar and with anisotropic fourth-order tensor viscosity arising from linearization of a viscoplastic constitutive relation, we confirm the robustness of the multigrid scheme and the overall efficiency of the solver. We present scalability results using up to 28,672 parallel tasks for problems with up to 1.6 billion unknowns and a viscosity contrast up to ten orders of magnitude. △ Less

Submitted 2 November, 2021; v1 submitted 2 July, 2021; originally announced July 2021.

Comments: 27 pages, 6 figures

MSC Class: 65F08; 65F10; 65N55; 65Y05; 76D07

arXiv:2102.08463 [pdf, other]

cuFINUFFT: a load-balanced GPU library for general-purpose nonuniform FFTs

Authors: Yu-hsuan Shih, Garrett Wright, Joakim Andén, Johannes Blaschke, Alex H. Barnett

Abstract: Nonuniform fast Fourier transforms dominate the computational cost in many applications including image reconstruction and signal processing. We thus present a general-purpose GPU-based CUDA library for type 1 (nonuniform to uniform) and type 2 (uniform to nonuniform) transforms in dimensions 2 and 3, in single or double precision. It achieves high performance for a given user-requested accuracy,… ▽ More Nonuniform fast Fourier transforms dominate the computational cost in many applications including image reconstruction and signal processing. We thus present a general-purpose GPU-based CUDA library for type 1 (nonuniform to uniform) and type 2 (uniform to nonuniform) transforms in dimensions 2 and 3, in single or double precision. It achieves high performance for a given user-requested accuracy, regardless of the distribution of nonuniform points, via cache-aware point reordering, and load-balanced blocked spreading in shared memory. At low accuracies, this gives on-GPU throughputs around $10^9$ nonuniform points per second, and (even including host-device transfer) is typically 4-10$\times$ faster than the latest parallel CPU code FINUFFT (at 28 threads). It is competitive with two established GPU codes, being up to 90$\times$ faster at high accuracy and/or type 1 clustered point distributions. Finally we demonstrate a 5-12$\times$ speedup versus CPU in an X-ray diffraction 3D iterative reconstruction task at $10^{-12}$ accuracy, observing excellent multi-GPU weak scaling up to one rank per GPU. △ Less

Submitted 25 March, 2021; v1 submitted 16 February, 2021; originally announced February 2021.

Comments: 10 pages, 9 figures

arXiv:2012.05903 [pdf, other]

Portrait Neural Radiance Fields from a Single Image

Authors: Chen Gao, Yichang Shih, Wei-Sheng Lai, Chia-Kai Liang, Jia-Bin Huang

Abstract: We present a method for estimating Neural Radiance Fields (NeRF) from a single headshot portrait. While NeRF has demonstrated high-quality view synthesis, it requires multiple images of static scenes and thus impractical for casual captures and moving subjects. In this work, we propose to pretrain the weights of a multilayer perceptron (MLP), which implicitly models the volumetric density and colo… ▽ More We present a method for estimating Neural Radiance Fields (NeRF) from a single headshot portrait. While NeRF has demonstrated high-quality view synthesis, it requires multiple images of static scenes and thus impractical for casual captures and moving subjects. In this work, we propose to pretrain the weights of a multilayer perceptron (MLP), which implicitly models the volumetric density and colors, with a meta-learning framework using a light stage portrait dataset. To improve the generalization to unseen faces, we train the MLP in the canonical coordinate space approximated by 3D face morphable models. We quantitatively evaluate the method using controlled captures and demonstrate the generalization to real portrait images, showing favorable results against state-of-the-arts. △ Less

Submitted 16 April, 2021; v1 submitted 10 December, 2020; originally announced December 2020.

Comments: Project webpage: https://portrait-nerf.github.io/

arXiv:2009.08083 [pdf, other]

doi 10.1145/3394171.3413624

Crossing You in Style: Cross-modal Style Transfer from Music to Visual Arts

Authors: Cheng-Che Lee, Wan-Yi Lin, Yen-Ting Shih, Pei-Yi Patricia Kuo, Li Su

Abstract: Music-to-visual style transfer is a challenging yet important cross-modal learning problem in the practice of creativity. Its major difference from the traditional image style transfer problem is that the style information is provided by music rather than images. Assuming that musical features can be properly mapped to visual contents through semantic links between the two domains, we solve the mu… ▽ More Music-to-visual style transfer is a challenging yet important cross-modal learning problem in the practice of creativity. Its major difference from the traditional image style transfer problem is that the style information is provided by music rather than images. Assuming that musical features can be properly mapped to visual contents through semantic links between the two domains, we solve the music-to-visual style transfer problem in two steps: music visualization and style transfer. The music visualization network utilizes an encoder-generator architecture with a conditional generative adversarial network to generate image-based music representations from music data. This network is integrated with an image style transfer method to accomplish the style transfer process. Experiments are conducted on WikiArt-IMSLP, a newly compiled dataset including Western music recordings and paintings listed by decades. By utilizing such a label to learn the semantic connection between paintings and music, we demonstrate that the proposed framework can generate diverse image style representations from a music piece, and these representations can unveil certain art forms of the same era. Subjective testing results also emphasize the role of the era label in improving the perceptual quality on the compatibility between music and visual content. △ Less

Submitted 17 September, 2020; originally announced September 2020.

arXiv:2007.02460 [pdf, other]

An Automated and Robust Image Watermarking Scheme Based on Deep Neural Networks

Authors: Xin Zhong, Pei-Chi Huang, Spyridon Mastorakis, Frank Y. Shih

Abstract: Digital image watermarking is the process of embedding and extracting a watermark covertly on a cover-image. To dynamically adapt image watermarking algorithms, deep learning-based image watermarking schemes have attracted increased attention during recent years. However, existing deep learning-based watermarking methods neither fully apply the fitting ability to learn and automate the embedding a… ▽ More Digital image watermarking is the process of embedding and extracting a watermark covertly on a cover-image. To dynamically adapt image watermarking algorithms, deep learning-based image watermarking schemes have attracted increased attention during recent years. However, existing deep learning-based watermarking methods neither fully apply the fitting ability to learn and automate the embedding and extracting algorithms, nor achieve the properties of robustness and blindness simultaneously. In this paper, a robust and blind image watermarking scheme based on deep learning neural networks is proposed. To minimize the requirement of domain knowledge, the fitting ability of deep neural networks is exploited to learn and generalize an automated image watermarking algorithm. A deep learning architecture is specially designed for image watermarking tasks, which will be trained in an unsupervised manner to avoid human intervention and annotation. To facilitate flexible applications, the robustness of the proposed scheme is achieved without requiring any prior knowledge or adversarial examples of possible attacks. A challenging case of watermark extraction from phone camera-captured images demonstrates the robustness and practicality of the proposal. The experiments, evaluation, and application cases confirm the superiority of the proposed scheme. △ Less

Submitted 5 July, 2020; originally announced July 2020.

Comments: This paper has been accepted for publication by the IEEE Transactions on Multimedia. The copyright is with the IEEE. DOI: 10.1109/TMM.2020.3006415

arXiv:1910.10479 [pdf, ps, other]

XL-Editor: Post-editing Sentences with XLNet

Authors: Yong-Siang Shih, Wei-Cheng Chang, Yiming Yang

Abstract: While neural sequence generation models achieve initial success for many NLP applications, the canonical decoding procedure with left-to-right generation order (i.e., autoregressive) in one-pass can not reflect the true nature of human revising a sentence to obtain a refined result. In this work, we propose XL-Editor, a novel training framework that enables state-of-the-art generalized autoregress… ▽ More While neural sequence generation models achieve initial success for many NLP applications, the canonical decoding procedure with left-to-right generation order (i.e., autoregressive) in one-pass can not reflect the true nature of human revising a sentence to obtain a refined result. In this work, we propose XL-Editor, a novel training framework that enables state-of-the-art generalized autoregressive pretraining methods, XLNet specifically, to revise a given sentence by the variable-length insertion probability. Concretely, XL-Editor can (1) estimate the probability of inserting a variable-length sequence into a specific position of a given sentence; (2) execute post-editing operations such as insertion, deletion, and replacement based on the estimated variable-length insertion probability; (3) complement existing sequence-to-sequence models to refine the generated sequences. Empirically, we first demonstrate better post-editing capabilities of XL-Editor over XLNet on the text insertion and deletion tasks, which validates the effectiveness of our proposed framework. Furthermore, we extend XL-Editor to the unpaired text style transfer task, where transferring the target style onto a given sentence can be naturally viewed as post-editing the sentence into the target style. XL-Editor achieves significant improvement in style transfer accuracy and also maintains coherent semantic of the original sentence, showing the broad applicability of our method. △ Less

Submitted 19 October, 2019; originally announced October 2019.

Comments: Under review

arXiv:1910.07800 [pdf, other]

Organ At Risk Segmentation with Multiple Modality

Authors: Kuan-Lun Tseng, Winston Hsu, Chun-ting Wu, Ya-Fang Shih, Fan-Yun Sun

Abstract: With the development of image segmentation in computer vision, biomedical image segmentation have achieved remarkable progress on brain tumor segmentation and Organ At Risk (OAR) segmentation. However, most of the research only uses single modality such as Computed Tomography (CT) scans while in real world scenario doctors often use multiple modalities to get more accurate result. To better levera… ▽ More With the development of image segmentation in computer vision, biomedical image segmentation have achieved remarkable progress on brain tumor segmentation and Organ At Risk (OAR) segmentation. However, most of the research only uses single modality such as Computed Tomography (CT) scans while in real world scenario doctors often use multiple modalities to get more accurate result. To better leverage different modalities, we have collected a large dataset consists of 136 cases with CT and MR images which diagnosed with nasopharyngeal cancer. In this paper, we propose to use Generative Adversarial Network to perform CT to MR transformation to synthesize MR images instead of aligning two modalities. The synthesized MR can be jointly trained with CT to achieve better performance. In addition, we use instance segmentation model to extend the OAR segmentation task to segment both organs and tumor region. The collected dataset will be made public soon. △ Less

Submitted 17 October, 2019; originally announced October 2019.

arXiv:1909.03380 [pdf]

doi 10.1142/S0218001421540057

Automatic Image Pixel Clustering based on Mussels Wandering Optimiz

Authors: Xin Zhong, Frank Y. Shih, Xiwang Guo

Abstract: Image segmentation as a clustering problem is to identify pixel groups on an image without any preliminary labels available. It remains a challenge in machine vision because of the variations in size and shape of image segments. Furthermore, determining the segment number in an image is NP-hard without prior knowledge of the image content. This paper presents an automatic color image pixel cluster… ▽ More Image segmentation as a clustering problem is to identify pixel groups on an image without any preliminary labels available. It remains a challenge in machine vision because of the variations in size and shape of image segments. Furthermore, determining the segment number in an image is NP-hard without prior knowledge of the image content. This paper presents an automatic color image pixel clustering scheme based on mussels wandering optimization. By applying an activation variable to determine the number of clusters along with the cluster centers optimization, an image is segmented with minimal prior knowledge and human intervention. By revising the within- and between-class sum of squares ratio for random natural image contents, we provide a novel fitness function for image pixel clustering tasks. Comprehensive empirical studies of the proposed scheme against other state-of-the-art competitors on synthetic data and the ASD dataset have demonstrated the promising performance of the proposed scheme. △ Less

Submitted 7 September, 2019; originally announced September 2019.

arXiv:1909.01532 [pdf]

Deep Morphological Neural Networks

Authors: Yucong Shen, Xin Zhong, Frank Y. Shih

Abstract: Mathematical morphology is a theory and technique to collect features like geometric and topological structures in digital images. Given a target image, determining suitable morphological operations and structuring elements is a cumbersome and time-consuming task. In this paper, a morphological neural network is proposed to address this problem. Serving as a nonlinear feature extracting layer in d… ▽ More Mathematical morphology is a theory and technique to collect features like geometric and topological structures in digital images. Given a target image, determining suitable morphological operations and structuring elements is a cumbersome and time-consuming task. In this paper, a morphological neural network is proposed to address this problem. Serving as a nonlinear feature extracting layer in deep learning frameworks, the efficiency of the proposed morphological layer is confirmed analytically and empirically. With a known target, a single-filter morphological layer learns the structuring element correctly, and an adaptive layer can automatically select appropriate morphological operations. For practical applications, the proposed morphological neural networks are tested on several classification datasets related to shape or geometric image features, and the experimental results have confirmed the high computational efficiency and high accuracy. △ Less

Submitted 3 September, 2019; originally announced September 2019.

arXiv:1908.11331 [pdf]

doi 10.1109/TMM.2020.3006415

A Robust Image Watermarking System Based on Deep Neural Networks

Authors: Xin Zhong, Frank Y. Shih

Abstract: Digital image watermarking is the process of embedding and extracting watermark covertly on a carrier image. Incorporating deep learning networks with image watermarking has attracted increasing attention during recent years. However, existing deep learning-based watermarking systems cannot achieve robustness, blindness, and automated embedding and extraction simultaneously. In this paper, a fully… ▽ More Digital image watermarking is the process of embedding and extracting watermark covertly on a carrier image. Incorporating deep learning networks with image watermarking has attracted increasing attention during recent years. However, existing deep learning-based watermarking systems cannot achieve robustness, blindness, and automated embedding and extraction simultaneously. In this paper, a fully automated image watermarking system based on deep neural networks is proposed to generalize the image watermarking processes. An unsupervised deep learning structure and a novel loss computation are proposed to achieve high capacity and high robustness without any prior knowledge of possible attacks. Furthermore, a challenging application of watermark extraction from camera-captured images is provided to validate the practicality as well as the robustness of the proposed system. Experimental results show the superiority performance of the proposed system as comparing against several currently available techniques. △ Less

Submitted 29 August, 2019; originally announced August 2019.

arXiv:1712.01262 [pdf, other]

Compatibility Family Learning for Item Recommendation and Generation

Authors: Yong-Siang Shih, Kai-Yueh Chang, Hsuan-Tien Lin, Min Sun

Abstract: Compatibility between items, such as clothes and shoes, is a major factor among customer's purchasing decisions. However, learning "compatibility" is challenging due to (1) broader notions of compatibility than those of similarity, (2) the asymmetric nature of compatibility, and (3) only a small set of compatible and incompatible items are observed. We propose an end-to-end trainable system to emb… ▽ More Compatibility between items, such as clothes and shoes, is a major factor among customer's purchasing decisions. However, learning "compatibility" is challenging due to (1) broader notions of compatibility than those of similarity, (2) the asymmetric nature of compatibility, and (3) only a small set of compatible and incompatible items are observed. We propose an end-to-end trainable system to embed each item into a latent vector and project a query item into K compatible prototypes in the same space. These prototypes reflect the broad notions of compatibility. We refer to both the embedding and prototypes as "Compatibility Family". In our learned space, we introduce a novel Projected Compatibility Distance (PCD) function which is differentiable and ensures diversity by aiming for at least one prototype to be close to a compatible item, whereas none of the prototypes are close to an incompatible item. We evaluate our system on a toy dataset, two Amazon product datasets, and Polyvore outfit dataset. Our method consistently achieves state-of-the-art performance. Finally, we show that we can visualize the candidate compatible prototypes using a Metric-regularized Conditional Generative Adversarial Network (MrCGAN), where the input is a projected prototype and the output is a generated image of a compatible item. We ask human evaluators to judge the relative compatibility between our generated images and images generated by CGANs conditioned directly on query items. Our generated images are significantly preferred, with roughly twice the number of votes as others. △ Less

Submitted 1 December, 2017; originally announced December 2017.

Comments: 9 pages, accepted to AAAI 2018

arXiv:0802.3071 [pdf]

Simulation of valveless micropump and mode analysis

Authors: W. P. Lan, J. S. Chang, K. C. Wu, Y. C. Shih

Abstract: In this work, a 3-D simulation is performed to study for the solid-fluid coupling effect driven by piezoelectric materials and utilizes asymmetric obstacles to control the flow direction. The result of simulation is also verified. For a micropump, it is crucial to find the optimal working frequency which produce maximum net flow rate. The PZT plate vibrates under the first mode, which is symmetr… ▽ More In this work, a 3-D simulation is performed to study for the solid-fluid coupling effect driven by piezoelectric materials and utilizes asymmetric obstacles to control the flow direction. The result of simulation is also verified. For a micropump, it is crucial to find the optimal working frequency which produce maximum net flow rate. The PZT plate vibrates under the first mode, which is symmetric. Adjusting the working frequency, the maximum flow rate can be obtained. For the micrpump we studied, the optimal working frequency is 3.2K Hz. At higher working frequency, say 20K Hz, the fluid-solid membrane may come out a intermediate mode, which is different from the first mode and the second mode. It is observed that the center of the mode drifts. Meanwhile, the result shows that a phase shift lagging when the excitation force exists in the vibration response. Finally, at even higher working frequency, say 30K Hz, a second vibration mode is observed. △ Less

Submitted 21 February, 2008; originally announced February 2008.

Comments: Submitted on behalf of EDA Publishing Association (http://irevues.inist.fr/EDA-Publishing)

Journal ref: Dans Symposium on Design, Test, Integration and Packaging of MEMS/MOEMS - DTIP 2007, Stresa, Lago Maggiore : Italie (2007)

Showing 1–37 of 37 results for author: Shih, Y