Search | arXiv e-print repository

On the Role of Visual Grounding in VQA

Abstract: Visual Grounding (VG) in VQA refers to a model's proclivity to infer answers based on question-relevant image regions. Conceptually, VG identifies as an axiomatic requirement of the VQA task. In practice, however, DNN-based VQA models are notorious for bypassing VG by way of shortcut (SC) learning without suffering obvious performance losses in standard benchmarks. To uncover the impact of SC lear… ▽ More Visual Grounding (VG) in VQA refers to a model's proclivity to infer answers based on question-relevant image regions. Conceptually, VG identifies as an axiomatic requirement of the VQA task. In practice, however, DNN-based VQA models are notorious for bypassing VG by way of shortcut (SC) learning without suffering obvious performance losses in standard benchmarks. To uncover the impact of SC learning, Out-of-Distribution (OOD) tests have been proposed that expose a lack of VG with low accuracy. These tests have since been at the center of VG research and served as basis for various investigations into VG's impact on accuracy. However, the role of VG in VQA still remains not fully understood and has not yet been properly formalized. In this work, we seek to clarify VG's role in VQA by formalizing it on a conceptual level. We propose a novel theoretical framework called "Visually Grounded Reasoning" (VGR) that uses the concepts of VG and Reasoning to describe VQA inference in ideal OOD testing. By consolidating fundamental insights into VG's role in VQA, VGR helps to reveal rampant VG-related SC exploitation in OOD testing, which explains why the relationship between VG and OOD accuracy has been difficult to define. Finally, we propose an approach to create OOD tests that properly emphasize a requirement for VG, and show how to improve performance on them. △ Less

Submitted 26 June, 2024; originally announced June 2024.

arXiv:2406.15119 [pdf, other]

Speech Emotion Recognition under Resource Constraints with Data Distillation

Authors: Yi Chang, Zhao Ren, Zhonghao Zhao, Thanh Tam Nguyen, Kun Qian, Tanja Schultz, Björn W. Schuller

Abstract: Speech emotion recognition (SER) plays a crucial role in human-computer interaction. The emergence of edge devices in the Internet of Things (IoT) presents challenges in constructing intricate deep learning models due to constraints in memory and computational resources. Moreover, emotional speech data often contains private information, raising concerns about privacy leakage during the deployment… ▽ More Speech emotion recognition (SER) plays a crucial role in human-computer interaction. The emergence of edge devices in the Internet of Things (IoT) presents challenges in constructing intricate deep learning models due to constraints in memory and computational resources. Moreover, emotional speech data often contains private information, raising concerns about privacy leakage during the deployment of SER models. To address these challenges, we propose a data distillation framework to facilitate efficient development of SER models in IoT applications using a synthesised, smaller, and distilled dataset. Our experiments demonstrate that the distilled dataset can be effectively utilised to train SER models with fixed initialisation, achieving performances comparable to those developed using the original full emotional speech dataset. △ Less

Submitted 21 June, 2024; originally announced June 2024.

arXiv:2405.08021 [pdf, other]

Diff-ETS: Learning a Diffusion Probabilistic Model for Electromyography-to-Speech Conversion

Authors: Zhao Ren, Kevin Scheck, Qinhan Hou, Stefano van Gogh, Michael Wand, Tanja Schultz

Abstract: Electromyography-to-Speech (ETS) conversion has demonstrated its potential for silent speech interfaces by generating audible speech from Electromyography (EMG) signals during silent articulations. ETS models usually consist of an EMG encoder which converts EMG signals to acoustic speech features, and a vocoder which then synthesises the speech signals. Due to an inadequate amount of available dat… ▽ More Electromyography-to-Speech (ETS) conversion has demonstrated its potential for silent speech interfaces by generating audible speech from Electromyography (EMG) signals during silent articulations. ETS models usually consist of an EMG encoder which converts EMG signals to acoustic speech features, and a vocoder which then synthesises the speech signals. Due to an inadequate amount of available data and noisy signals, the synthesised speech often exhibits a low level of naturalness. In this work, we propose Diff-ETS, an ETS model which uses a score-based diffusion probabilistic model to enhance the naturalness of synthesised speech. The diffusion model is applied to improve the quality of the acoustic features predicted by an EMG encoder. In our experiments, we evaluated fine-tuning the diffusion model on predictions of a pre-trained EMG encoder, and training both models in an end-to-end fashion. We compared Diff-ETS with a baseline ETS model without diffusion using objective metrics and a listening test. The results indicated the proposed Diff-ETS significantly improved speech naturalness over the baseline. △ Less

Submitted 11 May, 2024; originally announced May 2024.

Comments: Accepted by EMBC 2024

arXiv:2402.12298 [pdf, other]

Is Open-Source There Yet? A Comparative Study on Commercial and Open-Source LLMs in Their Ability to Label Chest X-Ray Reports

Authors: Felix J. Dorfner, Liv Jürgensen, Leonhard Donle, Fares Al Mohamad, Tobias R. Bodenmann, Mason C. Cleveland, Felix Busch, Lisa C. Adams, James Sato, Thomas Schultz, Albert E. Kim, Jameson Merkow, Keno K. Bressem, Christopher P. Bridge

Abstract: Introduction: With the rapid advances in large language models (LLMs), there have been numerous new open source as well as commercial models. While recent publications have explored GPT-4 in its application to extracting information of interest from radiology reports, there has not been a real-world comparison of GPT-4 to different leading open-source models. Materials and Methods: Two different… ▽ More Introduction: With the rapid advances in large language models (LLMs), there have been numerous new open source as well as commercial models. While recent publications have explored GPT-4 in its application to extracting information of interest from radiology reports, there has not been a real-world comparison of GPT-4 to different leading open-source models. Materials and Methods: Two different and independent datasets were used. The first dataset consists of 540 chest x-ray reports that were created at the Massachusetts General Hospital between July 2019 and July 2021. The second dataset consists of 500 chest x-ray reports from the ImaGenome dataset. We then compared the commercial models GPT-3.5 Turbo and GPT-4 from OpenAI to the open-source models Mistral-7B, Mixtral-8x7B, Llama2-13B, Llama2-70B, QWEN1.5-72B and CheXbert and CheXpert-labeler in their ability to accurately label the presence of multiple findings in x-ray text reports using different prompting techniques. Results: On the ImaGenome dataset, the best performing open-source model was Llama2-70B with micro F1-scores of 0.972 and 0.970 for zero- and few-shot prompts, respectively. GPT-4 achieved micro F1-scores of 0.975 and 0.984, respectively. On the institutional dataset, the best performing open-source model was QWEN1.5-72B with micro F1-scores of 0.952 and 0.965 for zero- and few-shot prompting, respectively. GPT-4 achieved micro F1-scores of 0.975 and 0.973, respectively. Conclusion: In this paper, we show that while GPT-4 is superior to open-source models in zero-shot report labeling, the implementation of few-shot prompting can bring open-source models on par with GPT-4. This shows that open-source models could be a performant and privacy preserving alternative to GPT-4 for the task of radiology report classification. △ Less

Submitted 19 February, 2024; originally announced February 2024.

arXiv:2402.01227 [pdf, other]

STAA-Net: A Sparse and Transferable Adversarial Attack for Speech Emotion Recognition

Authors: Yi Chang, Zhao Ren, Zixing Zhang, Xin **g, Kun Qian, Xi Shao, Bin Hu, Tanja Schultz, Björn W. Schuller

Abstract: Speech contains rich information on the emotions of humans, and Speech Emotion Recognition (SER) has been an important topic in the area of human-computer interaction. The robustness of SER models is crucial, particularly in privacy-sensitive and reliability-demanding domains like private healthcare. Recently, the vulnerability of deep neural networks in the audio domain to adversarial attacks has… ▽ More Speech contains rich information on the emotions of humans, and Speech Emotion Recognition (SER) has been an important topic in the area of human-computer interaction. The robustness of SER models is crucial, particularly in privacy-sensitive and reliability-demanding domains like private healthcare. Recently, the vulnerability of deep neural networks in the audio domain to adversarial attacks has become a popular area of research. However, prior works on adversarial attacks in the audio domain primarily rely on iterative gradient-based techniques, which are time-consuming and prone to overfitting the specific threat model. Furthermore, the exploration of sparse perturbations, which have the potential for better stealthiness, remains limited in the audio domain. To address these challenges, we propose a generator-based attack method to generate sparse and transferable adversarial examples to deceive SER models in an end-to-end and efficient manner. We evaluate our method on two widely-used SER datasets, Database of Elicited Mood in Speech (DEMoS) and Interactive Emotional dyadic MOtion CAPture (IEMOCAP), and demonstrate its ability to generate successful sparse adversarial examples in an efficient manner. Moreover, our generated adversarial examples exhibit model-agnostic transferability, enabling effective adversarial attacks on advanced victim models. △ Less

Submitted 2 February, 2024; originally announced February 2024.

arXiv:2401.07803 [pdf, other]

Uncovering the Full Potential of Visual Grounding Methods in VQA

Authors: Daniel Reich, Tanja Schultz

Abstract: Visual Grounding (VG) methods in Visual Question Answering (VQA) attempt to improve VQA performance by strengthening a model's reliance on question-relevant visual information. The presence of such relevant information in the visual input is typically assumed in training and testing. This assumption, however, is inherently flawed when dealing with imperfect image representations common in large-sc… ▽ More Visual Grounding (VG) methods in Visual Question Answering (VQA) attempt to improve VQA performance by strengthening a model's reliance on question-relevant visual information. The presence of such relevant information in the visual input is typically assumed in training and testing. This assumption, however, is inherently flawed when dealing with imperfect image representations common in large-scale VQA, where the information carried by visual features frequently deviates from expected ground-truth contents. As a result, training and testing of VG-methods is performed with largely inaccurate data, which obstructs proper assessment of their potential benefits. In this study, we demonstrate that current evaluation schemes for VG-methods are problematic due to the flawed assumption of availability of relevant visual information. Our experiments show that these methods can be much more effective when evaluation conditions are corrected. Code is provided on GitHub. △ Less

Submitted 15 February, 2024; v1 submitted 15 January, 2024; originally announced January 2024.

arXiv:2305.15015 [pdf, other]

doi 10.18653/v1/2023.findings-emnlp.206

Measuring Faithful and Plausible Visual Grounding in VQA

Authors: Daniel Reich, Felix Putze, Tanja Schultz

Abstract: Metrics for Visual Grounding (VG) in Visual Question Answering (VQA) systems primarily aim to measure a system's reliance on relevant parts of the image when inferring an answer to the given question. Lack of VG has been a common problem among state-of-the-art VQA systems and can manifest in over-reliance on irrelevant image parts or a disregard for the visual modality entirely. Although inference… ▽ More Metrics for Visual Grounding (VG) in Visual Question Answering (VQA) systems primarily aim to measure a system's reliance on relevant parts of the image when inferring an answer to the given question. Lack of VG has been a common problem among state-of-the-art VQA systems and can manifest in over-reliance on irrelevant image parts or a disregard for the visual modality entirely. Although inference capabilities of VQA models are often illustrated by a few qualitative illustrations, most systems are not quantitatively assessed for their VG properties. We believe, an easily calculated criterion for meaningfully measuring a system's VG can help remedy this shortcoming, as well as add another valuable dimension to model evaluations and analysis. To this end, we propose a new VG metric that captures if a model a) identifies question-relevant objects in the scene, and b) actually relies on the information contained in the relevant objects when producing its answer, i.e., if its visual grounding is both "faithful" and "plausible". Our metric, called "Faithful and Plausible Visual Grounding" (FPVG), is straightforward to determine for most VQA model designs. We give a detailed description of FPVG and evaluate several reference systems spanning various VQA architectures. Code to support the metric calculations on the GQA data set is available on GitHub. △ Less

Submitted 14 October, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

Journal ref: EMNLP 2023 Findings

arXiv:2211.08086 [pdf, other]

Visually Grounded VQA by Lattice-based Retrieval

Authors: Daniel Reich, Felix Putze, Tanja Schultz

Abstract: Visual Grounding (VG) in Visual Question Answering (VQA) systems describes how well a system manages to tie a question and its answer to relevant image regions. Systems with strong VG are considered intuitively interpretable and suggest an improved scene understanding. While VQA accuracy performances have seen impressive gains over the past few years, explicit improvements to VG performance and ev… ▽ More Visual Grounding (VG) in Visual Question Answering (VQA) systems describes how well a system manages to tie a question and its answer to relevant image regions. Systems with strong VG are considered intuitively interpretable and suggest an improved scene understanding. While VQA accuracy performances have seen impressive gains over the past few years, explicit improvements to VG performance and evaluation thereof have often taken a back seat on the road to overall accuracy improvements. A cause of this originates in the predominant choice of learning paradigm for VQA systems, which consists of training a discriminative classifier over a predetermined set of answer options. In this work, we break with the dominant VQA modeling paradigm of classification and investigate VQA from the standpoint of an information retrieval task. As such, the developed system directly ties VG into its core search procedure. Our system operates over a weighted, directed, acyclic graph, a.k.a. "lattice", which is derived from the scene graph of a given image in conjunction with region-referring expressions extracted from the question. We give a detailed analysis of our approach and discuss its distinctive properties and limitations. Our approach achieves the strongest VG performance among examined systems and exhibits exceptional generalization capabilities in a number of scenarios. △ Less

Submitted 15 November, 2022; originally announced November 2022.

arXiv:2106.14476 [pdf, other]

Adventurer's Treasure Hunt: A Transparent System for Visually Grounded Compositional Visual Question Answering based on Scene Graphs

Authors: Daniel Reich, Felix Putze, Tanja Schultz

Abstract: With the expressed goal of improving system transparency and visual grounding in the reasoning process in VQA, we present a modular system for the task of compositional VQA based on scene graphs. Our system is called "Adventurer's Treasure Hunt" (or ATH), named after an analogy we draw between our model's search procedure for an answer and an adventurer's search for treasure. We developed ATH with… ▽ More With the expressed goal of improving system transparency and visual grounding in the reasoning process in VQA, we present a modular system for the task of compositional VQA based on scene graphs. Our system is called "Adventurer's Treasure Hunt" (or ATH), named after an analogy we draw between our model's search procedure for an answer and an adventurer's search for treasure. We developed ATH with three characteristic features in mind: 1. By design, ATH allows us to explicitly quantify the impact of each of the sub-components on overall VQA performance, as well as their performance on their individual sub-task. 2. By modeling the search task after a treasure hunt, ATH inherently produces an explicit, visually grounded inference path for the processed question. 3. ATH is the first GQA-trained VQA system that dynamically extracts answers by querying the visual knowledge base directly, instead of selecting one from a specially learned classifier's output distribution over a pre-fixed answer vocabulary. We report detailed results on all components and their contributions to overall VQA performance on the GQA dataset and show that ATH achieves the highest visual grounding score among all examined systems. △ Less

Submitted 28 June, 2021; originally announced June 2021.

arXiv:2103.03621 [pdf, other]

Low-latency auditory spatial attention detection based on spectro-spatial features from EEG

Authors: Siqi Cai, Pengcheng Sun, Tanja Schultz, Haizhou Li

Abstract: Detecting auditory attention based on brain signals enables many everyday applications, and serves as part of the solution to the cocktail party effect in speech processing. Several studies leverage the correlation between brain signals and auditory stimuli to detect the auditory attention of listeners. Recently, studies show that the alpha band (8-13 Hz) EEG signals enable the localization of aud… ▽ More Detecting auditory attention based on brain signals enables many everyday applications, and serves as part of the solution to the cocktail party effect in speech processing. Several studies leverage the correlation between brain signals and auditory stimuli to detect the auditory attention of listeners. Recently, studies show that the alpha band (8-13 Hz) EEG signals enable the localization of auditory stimuli. We believe that it is possible to detect auditory spatial attention without the need of auditory stimuli as references. In this work, we use alpha power signals for automatic auditory spatial attention detection. To the best of our knowledge, this is the first attempt to detect spatial attention based on alpha power neural signals. We propose a spectro-spatial feature extraction technique to detect the auditory spatial attention (left/right) based on the topographic specificity of alpha power. Experiments show that the proposed neural approach achieves 81.7% and 94.6% accuracy for 1-second and 10-second decision windows, respectively. Our comparative results show that this neural approach outperforms other competitive models by a large margin in all test cases. △ Less

Submitted 4 July, 2021; v1 submitted 5 March, 2021; originally announced March 2021.

Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2102.03684 [pdf, other]

Linking Labs: Interconnecting Experimental Environments

Authors: Tanja Schultz, Felix Putze, Thorsten Fehr, Moritz Meier, Celeste Mason, Florian Ahrens, Manfred Herrmann

Abstract: We introduce the concept of LabLinking: a technology-based interconnection of experimental laboratories across institutions, disciplines, cultures, languages, and time zones - in other words experiments without borders. In particular, we introduce LabLinking levels (LLL), which define the degree of tightness of empirical interconnection between labs. We describe the technological infrastructure in… ▽ More We introduce the concept of LabLinking: a technology-based interconnection of experimental laboratories across institutions, disciplines, cultures, languages, and time zones - in other words experiments without borders. In particular, we introduce LabLinking levels (LLL), which define the degree of tightness of empirical interconnection between labs. We describe the technological infrastructure in terms of hard- and software required for the respective LLLs and present examples of linked laboratories along with insights about the challenges and benefits. In sum, we argue that linked labs provide a unique platform for a continuous exchange between scientists and experimenters, thereby enabling a time synchronous execution of experiments performed with and by decentralized user and researchers, improving outreach and ease of subject recruitment, allowing to establish new experimental designs and to incorporate a panoply of complementary biosensors, devices, hard- and software solutions. △ Less

Submitted 6 February, 2021; originally announced February 2021.

arXiv:2009.01871 [pdf, other]

doi 10.1007/978-3-030-60548-3_18

Federated Learning for Breast Density Classification: A Real-World Implementation

Authors: Holger R. Roth, Ken Chang, Praveer Singh, Nir Neumark, Wenqi Li, Vikash Gupta, Sharut Gupta, Liangqiong Qu, Alvin Ihsani, Bernardo C. Bizzo, Yuhong Wen, Varun Buch, Meesam Shah, Felipe Kitamura, Matheus Mendonça, Vitor Lavor, Ahmed Harouni, Colin Compas, Jesse Tetreault, Prerna Dogra, Yan Cheng, Selnur Erdal, Richard White, Behrooz Hashemian, Thomas Schultz , et al. (18 additional authors not shown)

Abstract: Building robust deep learning-based models requires large quantities of diverse training data. In this study, we investigate the use of federated learning (FL) to build medical imaging classification models in a real-world collaborative setting. Seven clinical institutions from across the world joined this FL effort to train a model for breast density classification based on Breast Imaging, Report… ▽ More Building robust deep learning-based models requires large quantities of diverse training data. In this study, we investigate the use of federated learning (FL) to build medical imaging classification models in a real-world collaborative setting. Seven clinical institutions from across the world joined this FL effort to train a model for breast density classification based on Breast Imaging, Reporting & Data System (BI-RADS). We show that despite substantial differences among the datasets from all sites (mammography system, class distribution, and data set size) and without centralizing data, we can successfully train AI models in federation. The results show that models trained using FL perform 6.3% on average better than their counterparts trained on an institute's local data alone. Furthermore, we show a 45.8% relative improvement in the models' generalizability when evaluated on the other participating sites' testing data. △ Less

Submitted 20 October, 2020; v1 submitted 3 September, 2020; originally announced September 2020.

Comments: Accepted at the 1st MICCAI Workshop on "Distributed And Collaborative Learning"; add citation to Fig. 1 & 2 and update Fig. 5; fix typo in affiliations

Journal ref: In: Albarqouni S. et al. (eds) Domain Adaptation and Representation Transfer, and Distributed and Collaborative Learning. DART 2020, DCL 2020. Lecture Notes in Computer Science, vol 12444. Springer, Cham

arXiv:2006.10406 [pdf, other]

Fourth-Order Anisotropic Diffusion for Inpainting and Image Compression

Authors: Ikram Jumakulyyev, Thomas Schultz

Abstract: Edge-enhancing diffusion (EED) can reconstruct a close approximation of an original image from a small subset of its pixels. This makes it an attractive foundation for PDE based image compression. In this work, we generalize second-order EED to a fourth-order counterpart. It involves a fourth-order diffusion tensor that is constructed from the regularized image gradient in a similar way as in trad… ▽ More Edge-enhancing diffusion (EED) can reconstruct a close approximation of an original image from a small subset of its pixels. This makes it an attractive foundation for PDE based image compression. In this work, we generalize second-order EED to a fourth-order counterpart. It involves a fourth-order diffusion tensor that is constructed from the regularized image gradient in a similar way as in traditional second-order EED, permitting diffusion along edges, while applying a non-linear diffusivity function across them. We show that our fourth-order diffusion tensor formalism provides a unifying framework for all previous anisotropic fourth-order diffusion based methods, and that it provides additional flexibility. We achieve an efficient implementation using a fast semi-iterative scheme. Experimental results on natural and medical images suggest that our novel fourth-order method produces more accurate reconstructions compared to the existing second-order EED. △ Less

Submitted 18 June, 2020; originally announced June 2020.

Comments: Accepted for publication in Springer book "Anisotropy Across Fields and Scales"

ACM Class: I.4.2; I.4.4

arXiv:1711.08279 [pdf, other]

Visual Analytics of Group Differences in Tensor Fields: Application to Clinical DTI

Authors: Amin Abbasloo, Vitalis Wiens, Tobias Schmidt-Wilcke, Pia Sundgren, Reinhard Klein, Thomas Schultz

Abstract: We present a visual analytics system for exploring group differences in tensor fields with respect to all six degrees of freedom that are inherent in symmetric second-order tensors. Our framework closely integrates quantitative analysis, based on multivariate hypothesis testing and spatial cluster enhancement, with suitable visualization tools that facilitate interpretation of results, and forming… ▽ More We present a visual analytics system for exploring group differences in tensor fields with respect to all six degrees of freedom that are inherent in symmetric second-order tensors. Our framework closely integrates quantitative analysis, based on multivariate hypothesis testing and spatial cluster enhancement, with suitable visualization tools that facilitate interpretation of results, and forming of new hypotheses. Carefully chosen and linked spatial and abstract views show clusters of strong differences, and allow the analyst to relate them to the affected structures, to reveal the exact nature of the differences, and to investigate potential correlations. A mechanism for visually comparing the results of different tests or levels of smoothing is also provided. We carefully justify the need for such a visual analytics tool from a practical and theoretical point of view. In close collaboration with our clinical co-authors, we apply it to the results of a diffusion tensor imaging study of systemic lupus erythematosus, in which it revealed previously unknown group differences. △ Less

Submitted 22 November, 2017; originally announced November 2017.

Comments: Submitted to IEEE Transactions on Visualization and Computer Graphics

arXiv:1710.08878 [pdf, ps, other]

Classification on Large Networks: A Quantitative Bound via Motifs and Graphons

Authors: Andreas Haupt, Mohammad Khatami, Thomas Schultz, Ngoc Mai Tran

Abstract: When each data point is a large graph, graph statistics such as densities of certain subgraphs (motifs) can be used as feature vectors for machine learning. While intuitive, motif counts are expensive to compute and difficult to work with theoretically. Via graphon theory, we give an explicit quantitative bound for the ability of motif homomorphisms to distinguish large networks under both generat… ▽ More When each data point is a large graph, graph statistics such as densities of certain subgraphs (motifs) can be used as feature vectors for machine learning. While intuitive, motif counts are expensive to compute and difficult to work with theoretically. Via graphon theory, we give an explicit quantitative bound for the ability of motif homomorphisms to distinguish large networks under both generative and sampling noise. Furthermore, we give similar bounds for the graph spectrum and connect it to homomorphism densities of cycles. This results in an easily computable classifier on graph data with theoretical performance guarantee. Our method yields competitive results on classification tasks for the autoimmune disease Lupus Erythematosus. △ Less

Submitted 24 October, 2017; originally announced October 2017.

Comments: 17 pages, 2 figures, 1 table

MSC Class: 68T05; 05C80; 62G99

arXiv:1710.01809 [pdf, other]

doi 10.1109/TASLP.2015.2389622

Syntactic and Semantic Features For Code-Switching Factored Language Models

Authors: Heike Adel, Ngoc Thang Vu, Katrin Kirchhoff, Dominic Telaar, Tanja Schultz

Abstract: This paper presents our latest investigations on different features for factored language models for Code-Switching speech and their effect on automatic speech recognition (ASR) performance. We focus on syntactic and semantic features which can be extracted from Code-Switching text data and integrate them into factored language models. Different possible factors, such as words, part-of-speech tags… ▽ More This paper presents our latest investigations on different features for factored language models for Code-Switching speech and their effect on automatic speech recognition (ASR) performance. We focus on syntactic and semantic features which can be extracted from Code-Switching text data and integrate them into factored language models. Different possible factors, such as words, part-of-speech tags, Brown word clusters, open class words and clusters of open class word embeddings are explored. The experimental results reveal that Brown word clusters, part-of-speech tags and open-class words are the most effective at reducing the perplexity of factored language models on the Mandarin-English Code-Switching corpus SEAME. In ASR experiments, the model containing Brown word clusters and part-of-speech tags and the model also including clusters of open class word embeddings yield the best mixed error rate results. In summary, the best language model can significantly reduce the perplexity on the SEAME evaluation set by up to 10.8% relative and the mixed error rate by up to 3.4% relative. △ Less

Submitted 4 October, 2017; originally announced October 2017.

Comments: IEEE/ACM Transactions on Audio, Speech, and Language Processing (Volume: 23, Issue: 3, March 2015)

arXiv:1611.06906 [pdf, other]

doi 10.1007/s10851-017-0729-1

Multi-Scale Anisotropic Fourth-Order Diffusion Improves Ridge and Valley Localization

Authors: Shekoufeh Gorgi Zadeh, Stephan Didas, Maximilian W. M. Wintergerst, Thomas Schultz

Abstract: Ridge and valley enhancing filters are widely used in applications such as vessel detection in medical image computing. When images are degraded by noise or include vessels at different scales, such filters are an essential step for meaningful and stable vessel localization. In this work, we propose a novel multi-scale anisotropic fourth-order diffusion equation that allows us to smooth along vess… ▽ More Ridge and valley enhancing filters are widely used in applications such as vessel detection in medical image computing. When images are degraded by noise or include vessels at different scales, such filters are an essential step for meaningful and stable vessel localization. In this work, we propose a novel multi-scale anisotropic fourth-order diffusion equation that allows us to smooth along vessels, while sharpening them in the orthogonal direction. The proposed filter uses a fourth order diffusion tensor whose eigentensors and eigenvalues are determined from the local Hessian matrix, at a scale that is automatically selected for each pixel. We discuss efficient implementation using a Fast Explicit Diffusion scheme and demonstrate results on synthetic images and vessels in fundus images. Compared to previous isotropic and anisotropic fourth-order filters, as well as established second-order vessel enhancing filters, our newly proposed one better restores the centerlines in all cases. △ Less

Submitted 18 April, 2017; v1 submitted 21 November, 2016; originally announced November 2016.

Comments: 12 pages, 8 figures, 1 table

Journal ref: Journal of Mathematical Imaging and Vision, 1-13, 2017

arXiv:1307.3271 [pdf, other]

Fuzzy Fibers: Uncertainty in dMRI Tractography

Authors: Thomas Schultz, Anna Vilanova, Ralph Brecheisen, Gordon Kindlmann

Abstract: Fiber tracking based on diffusion weighted Magnetic Resonance Imaging (dMRI) allows for noninvasive reconstruction of fiber bundles in the human brain. In this chapter, we discuss sources of error and uncertainty in this technique, and review strategies that afford a more reliable interpretation of the results. This includes methods for computing and rendering probabilistic tractograms, which esti… ▽ More Fiber tracking based on diffusion weighted Magnetic Resonance Imaging (dMRI) allows for noninvasive reconstruction of fiber bundles in the human brain. In this chapter, we discuss sources of error and uncertainty in this technique, and review strategies that afford a more reliable interpretation of the results. This includes methods for computing and rendering probabilistic tractograms, which estimate precision in the face of measurement noise and artifacts. However, we also address aspects that have received less attention so far, such as model selection, partial voluming, and the impact of parameters, both in preprocessing and in fiber tracking itself. We conclude by giving impulses for future research. △ Less

Submitted 11 July, 2013; originally announced July 2013.

Showing 1–18 of 18 results for author: Schultz, T