Search | arXiv e-print repository

Ultra-lightweight Neural Differential DSP Vocoder For High Quality Speech Synthesis

Authors: Prabhav Agrawal, Thilo Koehler, Prashant Serai, Qing He

Abstract: Neural vocoders model the raw audio waveform and synthesize high-quality audio, but even the highly efficient ones, like MB-MelGAN and LPCNet, fail to run in real time on a low-end device like a smartglass. A pure digital signal processing (DSP) based vocoder can be implemented via lightweight fast Fourier transforms (FFT), and is therefore an order of magnitude faster than any neural vocoder. A DSP vocoder often yields lower audio quality because it consumes over-smoothed acoustic model predictions of approximate representations for the vocal tract. In this paper, we propose an ultra-lightweight differential DSP (DDSP) vocoder that uses a jointly optimized acoustic model with a DSP vocoder, and learns without an extracted spectral feature for the vocal tract. The model achieves audio quality comparable to neural vocoders with a high average MOS of 4.36 while being as efficient as a DSP vocoder. Our C++ implementation, without any hardware-specific optimization, is at 15 MFLOPS, surpasses MB-MelGAN by 340 times in terms of FLOPS, and achieves a vocoder-only RTF of 0.003 and overall RTF of 0.044 while running single-threaded on a 2 GHz Intel Xeon CPU.

Submitted 18 January, 2024; originally announced January 2024.

Comments: Accepted for ICASSP 2024
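
The efficiency figures quoted in the abstract above can be related with simple arithmetic. The sketch below assumes the common definition of real-time factor (RTF) as processing time divided by the duration of the audio produced, and reads "surpasses MB-MelGAN by 340 times" as a plain ratio of FLOPS. The function and constant names are illustrative and are not taken from the paper.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF under the assumed definition: compute time per second of audio."""
    return processing_seconds / audio_seconds


def processing_time(rtf, audio_seconds):
    """Compute time (seconds) needed to synthesize `audio_seconds` of audio."""
    return rtf * audio_seconds


# Figures quoted in the abstract.
VOCODER_RTF = 0.003
OVERALL_RTF = 0.044
DDSP_MFLOPS = 15
FLOPS_RATIO_VS_MBMELGAN = 340

# Milliseconds of compute per second of synthesized speech.
vocoder_ms_per_audio_second = processing_time(VOCODER_RTF, 1.0) * 1000
overall_ms_per_audio_second = processing_time(OVERALL_RTF, 1.0) * 1000

# MB-MelGAN cost implied by the quoted ratio (an inference, not a quoted number).
implied_mbmelgan_mflops = DDSP_MFLOPS * FLOPS_RATIO_VS_MBMELGAN
```

Under these assumptions, one second of speech costs about 3 ms of vocoder compute and about 44 ms end to end, and the implied MB-MelGAN cost is about 5,100 MFLOPS.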

arXiv:2309.09112 [pdf, other]

Rewriting History: Repurposing Domain-Specific CGRAs

Authors: Jackson Woodruff, Thomas Koehler, Alexander Brauckmann, Chris Cummins, Sam Ainsworth, Michael F. P. O'Boyle

Abstract: Coarse-grained reconfigurable arrays (CGRAs) are domain-specific devices promising both the flexibility of FPGAs and the performance of ASICs. However, with restricted domains comes a danger: designing chips that cannot accelerate enough current and future software to justify the hardware cost. We introduce FlexC, the first flexible CGRA compiler, which allows CGRAs to be adapted to operations they do not natively support. FlexC uses dataflow rewriting, replacing unsupported regions of code with equivalent operations that are supported by the CGRA. We use equality saturation, a technique enabling efficient exploration of a large space of rewrite rules, to effectively search through the program-space for supported programs. We applied FlexC to over 2,000 loop kernels, compiling to four different research CGRAs and 300 generated CGRAs and demonstrate a 2.2$\times$ increase in the number of loop kernels accelerated leading to 3$\times$ speedup compared to an Arm A5 CPU on kernels that would otherwise be unsupported by the accelerator.

Submitted 16 September, 2023; originally announced September 2023.

arXiv:2308.12584 [pdf, other]

LORD: Leveraging Open-Set Recognition with Unknown Data

Authors: Tobias Koch, Christian Riess, Thomas Köhler

Abstract: Handling entirely unknown data is a challenge for any deployed classifier. Classification models are typically trained on a static pre-defined dataset and are kept in the dark for the open unassigned feature space. As a result, they struggle to deal with out-of-distribution data during inference. Addressing this task on the class-level is termed open-set recognition (OSR). However, most OSR methods are inherently limited, as they train closed-set classifiers and only adapt the downstream predictions to OSR. This work presents LORD, a framework to Leverage Open-set Recognition by exploiting unknown Data. LORD explicitly models open space during classifier training and provides a systematic evaluation for such approaches. We identify three model-agnostic training strategies that exploit background data and apply them to well-established classifiers. Due to LORD's extensive evaluation protocol, we consistently demonstrate improved recognition of unknown data. The benchmarks facilitate in-depth analysis across various requirement levels. To mitigate dependency on extensive and costly background datasets, we explore mixup as an off-the-shelf data generation technique. Our experiments highlight mixup's effectiveness as a substitute for background datasets. Lightweight constraints on mixup synthesis further improve OSR performance.

Submitted 24 August, 2023; originally announced August 2023.

Comments: Accepted at ICCV 2023 Workshop (Out-Of-Distribution Generalization in Computer Vision)
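
The abstract above mentions mixup as an off-the-shelf way to generate substitute background data. A minimal sketch of standard mixup follows, assuming feature vectors stored as plain Python lists and a Beta-distributed mixing weight. How LORD labels or constrains the mixed samples is not stated in the abstract, so `mixup` and its parameters are illustrative only.

```python
import random


def mixup(x_a, x_b, alpha=1.0, rng=random):
    """Return a convex combination of two equally sized feature vectors.

    The mixing weight is drawn from Beta(alpha, alpha), as in standard mixup.
    """
    if len(x_a) != len(x_b):
        raise ValueError("inputs must have the same length")
    lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
    mixed = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    return mixed, lam


# Mix two toy samples from different known classes; in an open-set setting the
# result could serve as a synthetic "background" sample (an assumption here).
rng = random.Random(0)
sample, lam = mixup([0.0, 1.0, 2.0], [2.0, 1.0, 0.0], alpha=1.0, rng=rng)
```

Each coordinate of `sample` lies between the corresponding coordinates of the two inputs, which is what makes the mixed point fall in the space between known classes.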

arXiv:2212.12035 [pdf, other]

doi 10.5525/gla.thesis.83323

A Domain-Extensible Compiler with Controllable Automation of Optimisations

Authors: Thomas Koehler

Abstract: In high performance domains like image processing, physics simulation or machine learning, program performance is critical. Programmers called performance engineers are responsible for the challenging task of optimising programs. Two major challenges prevent modern compilers targeting heterogeneous architectures from reliably automating optimisation. First, domain-specific compilers such as Halide for image processing and TVM for machine learning are difficult to extend with the new optimisations required by new algorithms and hardware. Second, automatic optimisation is often unable to achieve the required performance, and performance engineers often fall back to painstaking manual optimisation. This thesis shows the potential of the Shine compiler to achieve domain-extensibility, controllable automation, and generate high performance code. Domain-extensibility facilitates adapting compilers to new algorithms and hardware. Controllable automation enables performance engineers to gradually take control of the optimisation process. The first research contribution is to add 3 code generation features to Shine, namely: synchronisation barrier insertion, kernel execution, and storage folding. The second research contribution is to demonstrate how extensibility and controllability are exploited to optimise a standard image processing pipeline for corner detection. The final research contribution is to introduce sketch-guided equality saturation, a semi-automated technique that allows performance engineers to guide program rewriting by specifying rewrite goals as sketches: program patterns that leave details unspecified.

Submitted 22 December, 2022; originally announced December 2022.

Comments: PhD Thesis made at the University of Glasgow, 163 pages

arXiv:2210.16045 [pdf, other]

Towards zero-shot Text-based voice editing using acoustic context conditioning, utterance embeddings, and reference encoders

Authors: Jason Fong, Yun Wang, Prabhav Agrawal, Vimal Manohar, Jilong Wu, Thilo Köhler, Qing He

Abstract: Text-based voice editing (TBVE) uses synthetic output from text-to-speech (TTS) systems to replace words in an original recording. Recent work has used neural models to produce edited speech that is similar to the original speech in terms of clarity, speaker identity, and prosody. However, one limitation of prior work is the usage of finetuning to optimise performance: this requires further model training on data from the target speaker, which is a costly process that may incorporate potentially sensitive data into server-side models. In contrast, this work focuses on the zero-shot approach which avoids finetuning altogether, and instead uses pretrained speaker verification embeddings together with a jointly trained reference encoder to encode utterance-level information that helps capture aspects such as speaker identity and prosody. Subjective listening tests find that both utterance embeddings and a reference encoder improve the continuity of speaker identity and prosody between the edited synthetic speech and unedited original recording in the zero-shot setting.

Submitted 28 October, 2022; originally announced October 2022.

Comments: Submitted to ICASSP 2023

arXiv:2205.14892 [pdf, other]

Exploring the Open World Using Incremental Extreme Value Machines

Authors: Tobias Koch, Felix Liebezeit, Christian Riess, Vincent Christlein, Thomas Köhler

Abstract: Dynamic environments require adaptive applications. One particular machine learning problem in dynamic environments is open world recognition. It characterizes a continuously changing domain where only some classes are seen in one batch of the training data and such batches can only be learned incrementally. Open world recognition is a demanding task that is, to the best of our knowledge, addressed by only a few methods. This work introduces a modification of the widely known Extreme Value Machine (EVM) to enable open world recognition. Our proposed method extends the EVM with a partial model fitting function by neglecting unaffected space during an update. This reduces the training time by a factor of 28. In addition, we provide a modified model reduction using weighted maximum K-set cover to strictly bound the model complexity and reduce the computational effort by a factor of 3.5 from 2.1 s to 0.6 s. In our experiments, we rigorously evaluate openness with two novel evaluation protocols. The proposed method achieves superior accuracy of about 12 % and computational efficiency in the tasks of image classification and face recognition.

Submitted 30 May, 2022; originally announced May 2022.

Comments: Accepted at ICPR 2022

arXiv:2205.11952 [pdf, other]

3D helical CT Reconstruction with a Memory Efficient Learned Primal-Dual Architecture

Authors: Jevgenija Rudzusika, Buda Bajić, Thomas Koehler, Ozan Öktem

Abstract: Deep learning based computed tomography (CT) reconstruction has demonstrated outstanding performance on simulated 2D low-dose CT data. This applies in particular to domain adapted neural networks, which incorporate a handcrafted physics model for CT imaging. Empirical evidence shows that employing such architectures reduces the demand for training data and improves upon generalisation. However, their training requires large computational resources that quickly become prohibitive in 3D helical CT, which is the most common acquisition geometry used for medical imaging. Furthermore, clinical data also comes with other challenges not accounted for in simulations, like errors in flux measurement, resolution mismatch and, most importantly, the absence of the real ground truth. The necessity to have a computationally feasible training combined with the need to address these issues has made it difficult to evaluate deep learning based reconstruction on clinical 3D helical CT. This paper modifies a domain adapted neural network architecture, the Learned Primal-Dual (LPD), so that it can be trained and applied to reconstruction in this setting. We achieve this by splitting the helical trajectory into sections and applying the unrolled LPD iterations to those sections sequentially. To the best of our knowledge, this work is the first to apply an unrolled deep learning architecture for reconstruction on full-sized clinical data, like those in the Low dose CT image and projection data set (LDCT). Moreover, training and testing are done on a single GPU card with 24 GB of memory.

Submitted 28 November, 2023; v1 submitted 24 May, 2022; originally announced May 2022.

arXiv:2201.03611 [pdf, other]

RISE & Shine: Language-Oriented Compiler Design

Authors: Michel Steuwer, Thomas Koehler, Bastian Köpcke, Federico Pizzuti

Abstract: The trend towards specialization of software and hardware - fuelled by the end of Moore's law and the still accelerating interest in domain-specific computing, such as machine learning - forces us to radically rethink our compiler designs. The era of a universal compiler framework built around a single one-size-fits-all intermediate representation (IR) is over. This realization has sparked the creation of the MLIR compiler framework that empowers compiler engineers to design and integrate IRs capturing specific abstractions. MLIR provides a generic framework for SSA-based IRs, but it doesn't help us to decide how we should design IRs that are easy to develop, to work with and to combine into working compilers. To address the challenge of IR design, we advocate for a language-oriented compiler design that understands IRs as formal programming languages and enforces their correct use via an accompanying type system. We argue that programming language techniques directly guide extensible IR designs and provide a formal framework to reason about transforming between multiple IRs. In this paper, we discuss the design of the Shine compiler that compiles the high-level functional pattern-based data-parallel language RISE via a hybrid functional-imperative intermediate language to C, OpenCL, and OpenMP. We compare our work directly with the closely related pattern-based Lift IR and compiler. We demonstrate that our language-oriented compiler design results in a more robust and predictable compiler that is extensible at various abstraction levels. Our experimental evaluation shows that this compiler design is able to generate high-performance GPU code.

Submitted 10 January, 2022; originally announced January 2022.

arXiv:2112.05876 [pdf, other]

The Past as a Stochastic Process

Authors: David H. Wolpert, Michael H. Price, Stefani A. Crabtree, Timothy A. Kohler, Jurgen Jost, James Evans, Peter F. Stadler, Hajime Shimao, Manfred D. Laubichler

Abstract: Historical processes manifest remarkable diversity. Nevertheless, scholars have long attempted to identify patterns and categorize historical actors and influences with some success. A stochastic process framework provides a structured approach for the analysis of large historical datasets that allows for detection of sometimes surprising patterns, identification of relevant causal actors both endogenous and exogenous to the process, and comparison between different historical cases. The combination of data, analytical tools and the organizing theoretical framework of stochastic processes complements traditional narrative approaches in history and archaeology.

Submitted 10 December, 2021; originally announced December 2021.

Comments: 20 pages, 4 figures

arXiv:2111.13040 [pdf, other]

Sketch-Guided Equality Saturation: Scaling Equality Saturation to Complex Optimizations of Functional Programs

Authors: Thomas Koehler, Phil Trinder, Michel Steuwer

Abstract: Generating high-performance code for diverse hardware and application domains is challenging. Functional array programming languages with patterns like map and reduce have been successfully combined with term rewriting to define and explore optimization spaces. However, deciding what sequence of rewrites to apply is hard and has a huge impact on the performance of the rewritten program. Equality saturation avoids the issue by exploring many possible ways to apply rewrites, efficiently representing many equivalent programs in an e-graph data structure. Equality saturation has some limitations when rewriting functional language terms, as currently naive encodings of the lambda calculus are used. We present new techniques for encoding polymorphically typed lambda calculi, and show that the efficient encoding reduces the runtime and memory consumption of equality saturation by orders of magnitude. Moreover, equality saturation does not yet scale to complex compiler optimizations. These emerge from long rewrite sequences of thousands of rewrite steps, and may use pathological combinations of rewrite rules that cause the e-graph to quickly grow too large. This paper introduces sketch-guided equality saturation, a semi-automatic technique that allows programmers to provide program sketches to guide rewriting. Sketch-guided equality saturation is evaluated for seven complex matrix multiplication optimizations, including loop blocking, vectorization, and multi-threading. Even with efficient lambda calculus encoding, unguided equality saturation can locate only the two simplest of these optimizations; the remaining five are undiscovered even with an hour of compilation time and 60 GB of RAM. By specifying three or fewer sketch guides, all seven optimizations are found in seconds of compilation time, using under 1 GB of RAM, and generating high performance code.

Submitted 3 June, 2022; v1 submitted 25 November, 2021; originally announced November 2021.

Comments: 23 pages excluding references, submitted to OOPSLA 2022

arXiv:2108.11730 [pdf, other]

doi 10.1137/21M1445697

Deep learning based dictionary learning and tomographic image reconstruction

Authors: Jevgenija Rudzusika, Thomas Koehler, Ozan Öktem

Abstract: This work presents an approach for image reconstruction in clinical low-dose tomography that combines principles from sparse signal processing with ideas from deep learning. First, we describe sparse signal representation in terms of dictionaries from a statistical perspective and interpret dictionary learning as a process of aligning the distribution that arises from a generative model with the empirical distribution of true signals. As a result, we can see that sparse coding with learned dictionaries resembles a specific variational autoencoder, where the decoder is a linear function and the encoder is a sparse coding algorithm. Next, we show that dictionary learning can also benefit from computational advancements introduced in the context of deep learning, such as parallelism and stochastic optimization. Finally, we show that regularization by dictionaries achieves competitive performance in computed tomography (CT) reconstruction compared to state-of-the-art model based and data driven approaches.

Submitted 26 August, 2021; originally announced August 2021.

Comments: 34 pages, 5 figures

Journal ref: SIAM Journal on Imaging Sciences, Vol 15, Iss 4. (2022)
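
The autoencoder reading in the abstract above can be written out using the standard sparse-coding formulation. This is the textbook $\ell_1$-regularized objective, given here as an assumption: the abstract does not state the paper's exact objective, and the symbols $x$ (signal), $D$ (dictionary), $z$ (sparse code), and $\lambda$ (regularization weight) are introduced for illustration.

```latex
\hat{z}(x) = \arg\min_{z} \; \tfrac{1}{2}\,\lVert x - D z \rVert_2^2 + \lambda \lVert z \rVert_1,
\qquad
\hat{x} = D\,\hat{z}(x)
```

In this reading, the map $x \mapsto \hat{z}(x)$ plays the role of the encoder (a sparse coding algorithm) and the linear map $z \mapsto D z$ plays the role of the decoder.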

arXiv:2104.00705 [pdf, other]

Multi-rate attention architecture for fast streamable Text-to-speech spectrum modeling

Authors: Qing He, Thilo Koehler, Jilong Wu

Abstract: Typical high quality text-to-speech (TTS) systems today use a two-stage architecture, with a spectrum model stage that generates spectral frames and a vocoder stage that generates the actual audio. High-quality spectrum models usually incorporate the encoder-decoder architecture with self-attention or bi-directional long short-term memory (BLSTM) units. While these models can produce high quality speech, they often incur O($L$) increase in both latency and real-time factor (RTF) with respect to input length $L$. In other words, longer inputs lead to longer delay and slower synthesis speed, limiting their use in real-time applications. In this paper, we propose a multi-rate attention architecture that breaks the latency and RTF bottlenecks by computing a compact representation during encoding and recurrently generating the attention vector in a streaming manner during decoding. The proposed architecture achieves high audio quality (MOS of 4.31 compared to groundtruth 4.48), low latency, and low RTF at the same time. Meanwhile, both latency and RTF of the proposed system stay constant regardless of input lengths, making it ideal for real-time applications.

Submitted 1 April, 2021; originally announced April 2021.

arXiv:2011.12985 [pdf, other]

FBWave: Efficient and Scalable Neural Vocoders for Streaming Text-To-Speech on the Edge

Authors: Bichen Wu, Qing He, Peizhao Zhang, Thilo Koehler, Kurt Keutzer, Peter Vajda

Abstract: Nowadays more and more applications can benefit from edge-based text-to-speech (TTS). However, most existing TTS models are too computationally expensive and are not flexible enough to be deployed on the diverse variety of edge devices with their equally diverse computational capacities. To address this, we propose FBWave, a family of efficient and scalable neural vocoders that can achieve optimal performance-efficiency trade-offs for different edge devices. FBWave is a hybrid flow-based generative model that combines the advantages of autoregressive and non-autoregressive models. It produces high quality audio and supports streaming during inference while remaining highly computationally efficient. Our experiments show that FBWave can achieve similar audio quality to WaveRNN while reducing MACs by 40x. More efficient variants of FBWave can achieve up to 109x fewer MACs while still delivering acceptable audio quality. Audio demos are available at https://bichenwu09.github.io/vocoder_demos.

Submitted 25 November, 2020; originally announced November 2020.

arXiv:2011.05003 [pdf, other]

doi 10.1109/JPHOTOV.2021.3072229

Joint Super-Resolution and Rectification for Solar Cell Inspection

Authors: Mathis Hoffmann, Thomas Köhler, Bernd Doll, Frank Schebesch, Florian Talkenberg, Ian Marius Peters, Christoph J. Brabec, Andreas Maier, Vincent Christlein

Abstract: Visual inspection of solar modules is an important monitoring facility in photovoltaic power plants. Since a single measurement of fast CMOS sensors is limited in spatial resolution and often not sufficient to reliably detect small defects, we apply multi-frame super-resolution (MFSR) to a sequence of low resolution measurements. In addition, the rectification and removal of lens distortion simplifies subsequent analysis. Therefore, we propose to fuse this pre-processing with standard MFSR algorithms. This is advantageous, because we omit a separate processing step, the motion estimation becomes more stable and the spacing of high-resolution (HR) pixels on the rectified module image becomes uniform with respect to the module plane, regardless of perspective distortion. We present a comprehensive user study showing that MFSR is beneficial for defect recognition by human experts and that the proposed method performs better than the state of the art. Furthermore, we apply automated crack segmentation and show that the proposed method performs 3x better than bicubic upsampling and 2x better than the state of the art for automated inspection.

Submitted 7 April, 2021; v1 submitted 10 November, 2020; originally announced November 2020.

arXiv:2002.06758 [pdf, other]

Interactive Text-to-Speech System via Joint Style Analysis

Authors: Yang Gao, Weiyi Zheng, Zhaojun Yang, Thilo Kohler, Christian Fuegen, Qing He

Abstract: While modern TTS technologies have made significant advancements in audio quality, there is still a lack of behavior naturalness compared to conversing with people. We propose a style-embedded TTS system that generates styled responses based on the speech query style. To achieve this, the system includes a style extraction model that extracts a style embedding from the speech query, which is then used by the TTS to produce a matching response. We faced two main challenges: 1) only a small portion of the TTS training dataset has style labels, which are needed to train a multi-style TTS that respects different style embeddings during inference. 2) The TTS system and the style extraction model have disjoint training datasets. We need consistent style labels across these two datasets so that the TTS can learn to respect the labels produced by the style extraction model during inference. To solve these, we adopted a semi-supervised approach that uses the style extraction model to create style labels for the TTS dataset and applied transfer learning to learn the style embedding jointly. Our experimental results show user preference for the styled TTS responses and demonstrate the style-embedded TTS system's capability of mimicking the speech query style.

Submitted 21 September, 2020; v1 submitted 16 February, 2020; originally announced February 2020.

Comments: Accepted by Interspeech 2020

arXiv:2002.02268 [pdf, other]

A Language for Describing Optimization Strategies

Authors: Bastian Hagedorn, Johannes Lenfers, Thomas Koehler, Sergei Gorlatch, Michel Steuwer

Abstract: Optimizing programs to run efficiently on modern parallel hardware is hard but crucial for many applications. The predominantly used imperative languages - like C or OpenCL - force the programmer to intertwine the code describing functionality and optimizations. This results in a nightmare for portability which is particularly problematic given the accelerating trend towards specialized hardware devices to further increase efficiency. Many emerging DSLs used in performance demanding domains such as deep learning, automatic differentiation, or image processing attempt to simplify or even fully automate the optimization process. Using a high-level - often functional - language, programmers focus on describing functionality in a declarative way. In some systems such as Halide or TVM, a separate schedule specifies how the program should be optimized. Unfortunately, these schedules are not written in well-defined programming languages. Instead, they are implemented as a set of ad-hoc predefined APIs that the compiler writers have exposed. In this paper, we present Elevate: a functional language for describing optimization strategies. Elevate follows a tradition of prior systems used in different contexts that express optimization strategies as composition of rewrites. In contrast to systems with scheduling APIs, in Elevate programmers are not restricted to a set of built-in optimizations but define their own optimization strategies freely in a composable way. We show how user-defined optimization strategies in Elevate enable the effective optimization of programs expressed in a functional data-parallel language demonstrating competitive performance with Halide and TVM.

Submitted 6 February, 2020; originally announced February 2020.

Comments: https://elevate-lang.org/ https://github.com/elevate-lang

arXiv:1911.04762 [pdf, other]

Merging-ISP: Multi-Exposure High Dynamic Range Image Signal Processing

Authors: Prashant Chaudhari, Franziska Schirrmacher, Andreas Maier, Christian Riess, Thomas Köhler

Abstract: High dynamic range (HDR) imaging combines multiple images with different exposure times into a single high-quality image. The image signal processing pipeline (ISP) is a core component in digital cameras to perform these operations. It includes demosaicing of raw color filter array (CFA) data at different exposure times, alignment of the exposures, conversion to HDR domain, and exposure merging into an HDR image. Traditionally, such pipelines cascade algorithms that address these individual subtasks. However, cascaded designs suffer from error propagation, since simply combining multiple steps is not necessarily optimal for the entire imaging task. This paper proposes a multi-exposure HDR image signal processing pipeline (Merging-ISP) to jointly solve all these subtasks. Our pipeline is modeled by a deep neural network architecture. As such, it is end-to-end trainable, circumvents the use of hand-crafted and potentially complex algorithms, and mitigates error propagation. Merging-ISP enables direct reconstructions of HDR images of dynamic scenes from multiple raw CFA images with different exposures. We compare Merging-ISP against several state-of-the-art cascaded pipelines. The proposed method provides HDR reconstructions of high perceptual quality and it quantitatively outperforms competing ISPs by more than 1 dB in terms of PSNR.

Submitted 4 October, 2021; v1 submitted 12 November, 2019; originally announced November 2019.

Comments: Computational Photography, DAGM GCPR 2021

arXiv:1910.12612 [pdf, other]

G2G: TTS-Driven Pronunciation Learning for Graphemic Hybrid ASR

Authors: Duc Le, Thilo Koehler, Christian Fuegen, Michael L. Seltzer

Abstract: Grapheme-based acoustic modeling has recently been shown to outperform phoneme-based approaches in both hybrid and end-to-end automatic speech recognition (ASR), even on non-phonemic languages like English. However, graphemic ASR still has problems with rare long-tail words that do not follow the standard spelling conventions seen in training, such as entity names. In this work, we present a novel method to train a statistical grapheme-to-grapheme (G2G) model on text-to-speech data that can rewrite an arbitrary character sequence into more phonetically consistent forms. We show that using G2G to provide alternative pronunciations during decoding reduces Word Error Rate by 3% to 11% relative over a strong graphemic baseline and bridges the gap on rare name recognition with an equivalent phonetic setup. Unlike many previously proposed methods, our method does not require any change to the acoustic model training procedure. This work reaffirms the efficacy of grapheme-based modeling and shows that specialized linguistic knowledge, when available, can be leveraged to improve graphemic ASR.

Submitted 13 February, 2020; v1 submitted 22 October, 2019; originally announced October 2019.

Comments: To appear at ICASSP 2020

arXiv:1812.09375 [pdf, other]

Multi-Frame Super-Resolution Reconstruction with Applications to Medical Imaging

Authors: Thomas Köhler

Abstract: The optical resolution of a digital camera is one of its most crucial parameters with broad relevance for consumer electronics, surveillance systems, remote sensing, or medical imaging. However, resolution is physically limited by the optics and sensor characteristics. In addition, practical and economic reasons often stipulate the use of out-dated or low-cost hardware. Super-resolution is a class of retrospective techniques that aims at high-resolution imagery by means of software. Multi-frame algorithms approach this task by fusing multiple low-resolution frames to reconstruct high-resolution images. This work covers novel super-resolution methods along with new applications in medical imaging.

Submitted 21 December, 2018; originally announced December 2018.

Comments: Ph.D. thesis at the Friedrich-Alexander-Universität (FAU) Erlangen-Nürnberg; source code is available at https://www5.cs.fau.de/de/forschung/software/multi-frame-super-resolution-toolbox/ . https://opus4.kobv.de/opus4-fau/frontdoor/index/index/docId/9145

arXiv:1809.06420 [pdf, other]

Toward Bridging the Simulated-to-Real Gap: Benchmarking Super-Resolution on Real Data

Authors: Thomas Köhler, Michel Bätz, Farzad Naderi, André Kaup, Andreas Maier, Christian Riess

Abstract: Capturing ground truth data to benchmark super-resolution (SR) is challenging. Therefore, current quantitative studies are mainly evaluated on simulated data artificially sampled from ground truth images. We argue that such evaluations overestimate the actual performance of SR methods compared to their behavior on real images. Toward bridging this simulated-to-real gap, we introduce the Super-Resolution Erlangen (SupER) database, the first comprehensive laboratory SR database of all-real acquisitions with pixel-wise ground truth. It consists of more than 80k images of 14 scenes combining different facets: CMOS sensor noise, real sampling at four resolution levels, nine scene motion types, two photometric conditions, and lossy video coding at five levels. As such, the database exceeds existing benchmarks by an order of magnitude in quality and quantity. This paper also benchmarks 19 popular single-image and multi-frame algorithms on our data. The benchmark comprises a quantitative study by exploiting ground truth data and qualitative evaluations in a large-scale observer study. We also rigorously investigate agreements between both evaluations from a statistical perspective. One interesting result is that top-performing methods on simulated data may be surpassed by others on real data. Our insights can spur further algorithm development, and the publicly available dataset can foster future evaluations.

Submitted 16 June, 2019; v1 submitted 17 September, 2018; originally announced September 2018.

Comments: To appear in IEEE Transactions on Pattern Analysis and Machine Intelligence; data and source code available at https://superresolution.tf.fau.de/

arXiv:1806.04385 [pdf, other]

doi 10.1145/3229591.3229593

P4CEP: Towards In-Network Complex Event Processing

Authors: Thomas Kohler, Ruben Mayer, Frank Dürr, Marius Maaß, Sukanya Bhowmik, Kurt Rothermel

Abstract: In-network computing using programmable networking hardware is a strong trend in networking that promises to reduce latency and consumption of server resources through offloading to network elements (programmable switches and smart NICs). In particular, the data plane programming language P4 together with powerful P4 networking hardware has spawned projects offloading services into the network, e.g., consensus services or caching services. In this paper, we present a novel case for in-network computing, namely, Complex Event Processing (CEP). CEP processes streams of basic events, e.g., stemming from networked sensors, into meaningful complex events. Traditionally, CEP processing has been performed on servers or overlay networks. However, we argue in this paper that CEP is a good candidate for in-network computing along the communication path avoiding detouring streams to distant servers to minimize communication latency while also exploiting processing capabilities of novel networking hardware. We show that it is feasible to express CEP operations in P4 and also present a tool to compile CEP operations, formulated in our P4CEP rule specification language, to P4 code. Moreover, we identify challenges and problems that we have encountered to show future research directions for implementing full-fledged in-network CEP systems.

Submitted 12 June, 2018; originally announced June 2018.

Comments: 6 pages. Author's version

arXiv:1802.05518 [pdf, ps, other]

Learning from a Handful Volumes: MRI Resolution Enhancement with Volumetric Super-Resolution Forests

Authors: Aline Sindel, Katharina Breininger, Johannes Käßer, Andreas Hess, Andreas Maier, Thomas Köhler

Abstract: Magnetic resonance imaging (MRI) enables 3-D imaging of anatomical structures. However, the acquisition of MR volumes with high spatial resolution leads to long scan times. To this end, we propose volumetric super-resolution forests (VSRF) to enhance MRI resolution retrospectively. Our method learns a locally linear mapping between low-resolution and high-resolution volumetric image patches by employing random forest regression. We customize features suitable for volumetric MRI to train the random forest and propose a median tree ensemble for robust regression. VSRF outperforms state-of-the-art example-based super-resolution in terms of image quality and efficiency for model training and inference in different MRI datasets. It is also superior to unsupervised methods with just a handful or even a single volume to assemble training data.

Submitted 15 February, 2018; originally announced February 2018.

Comments: Preprint submitted to ICIP 2018

arXiv:1802.03943 [pdf, other]

Temporal and volumetric denoising via quantile sparse image prior

Authors: Franziska Schirrmacher, Thomas Köhler, Tobias Lindenberger, Lennart Husvogt, Jürgen Endres, James G. Fujimoto, Joachim Hornegger, Arnd Dörfler, Philip Hoelter, Andreas K. Maier

Abstract: This paper introduces a universal and structure-preserving regularization term, called quantile sparse image (QuaSI) prior. The prior is suitable for denoising images from various medical imaging modalities. We demonstrate its effectiveness on volumetric optical coherence tomography (OCT) and computed tomography (CT) data, which show different noise and image characteristics. OCT offers high-resolution scans of the human retina but is inherently impaired by speckle noise. CT on the other hand has a lower resolution and shows high-frequency noise. For the purpose of denoising, we propose a variational framework based on the QuaSI prior and a Huber data fidelity model that can handle 3-D and 3-D+t data. Efficient optimization is facilitated through the use of an alternating direction method of multipliers (ADMM) scheme and the linearization of the quantile filter. Experiments on multiple datasets emphasize the excellent performance of the proposed method.

Submitted 17 June, 2019; v1 submitted 12 February, 2018; originally announced February 2018.

Comments: Accepted for MICCAI2017 special issue

arXiv:1709.04881 [pdf, other]

Benchmarking Super-Resolution Algorithms on Real Data

Authors: Thomas Köhler, Michel Bätz, Farzad Naderi, André Kaup, Andreas K. Maier, Christian Riess

Abstract: Over the past decades, various super-resolution (SR) techniques have been developed to enhance the spatial resolution of digital images. Despite the great number of methodical contributions, there is still a lack of comparative validations of SR under practical conditions, as capturing real ground truth data is a challenging task. Therefore, current studies are either evaluated 1) on simulated data or 2) on real data without a pixel-wise ground truth. To facilitate comprehensive studies, this paper introduces the publicly available Super-Resolution Erlangen (SupER) database that includes real low-resolution images along with high-resolution ground truth data. Our database comprises image sequences with more than 20k images captured from 14 scenes under various types of motions and photometric conditions. The datasets cover four spatial resolution levels using camera hardware binning. With this database, we benchmark 15 single-image and multi-frame SR algorithms. Our experiments quantitatively analyze SR accuracy and robustness under realistic conditions including independent object and camera motion or photometric variations.

Submitted 8 September, 2017; originally announced September 2017.

arXiv:1703.02942 [pdf, other]

QuaSI: Quantile Sparse Image Prior for Spatio-Temporal Denoising of Retinal OCT Data

Authors: Franziska Schirrmacher, Thomas Köhler, Lennart Husvogt, James G. Fujimoto, Joachim Hornegger, Andreas K. Maier

Abstract: Optical coherence tomography (OCT) enables high-resolution and non-invasive 3D imaging of the human retina but is inherently impaired by speckle noise. This paper introduces a spatio-temporal denoising algorithm for OCT data on a B-scan level using a novel quantile sparse image (QuaSI) prior. To remove speckle noise while preserving image structures of diagnostic relevance, we implement our QuaSI prior via median filter regularization coupled with a Huber data fidelity model in a variational approach. For efficient energy minimization, we develop an alternating direction method of multipliers (ADMM) scheme using a linearization of median filtering. Our spatio-temporal method can handle both denoising of single B-scans and temporally consecutive B-scans, to gain volumetric OCT data with enhanced signal-to-noise ratio. Our algorithm based on 4 B-scans only achieved comparable performance to averaging 13 B-scans and outperformed other current denoising methods.

Submitted 8 March, 2017; originally announced March 2017.

Comments: submitted to MICCAI'17

arXiv:1702.04449 [pdf, other]

Modeling Social Organizations as Communication Networks

Authors: David Wolpert, Justin Grana, Brendan Tracey, Tim Kohler, Artemy Kolchinsky

Abstract: We identify the "organization" of a human social group as the communication network(s) within that group. We then introduce three theoretical approaches to analyzing what determines the structures of human organizations. All three approaches adopt a group-selection perspective, so that the group's network structure is (approximately) optimal, given the information-processing limitations of agents within the social group, and the exogenous welfare function of the overall group. In the first approach we use a new sub-field of telecommunications theory called network coding, and focus on a welfare function that involves the ability of the organization to convey information among the agents. In the second approach we focus on a scenario where agents within the organization must allocate their future communication resources when the state of the future environment is uncertain. We show how this formulation can be solved with a linear program. In the third approach, we introduce an information synthesis problem in which agents within an organization receive information from various sources and must decide how to transform such information and transmit the results to other agents in the organization. We propose leveraging the computational power of neural networks to solve such problems. These three approaches formalize and synthesize work in fields including anthropology, archeology, economics and psychology that deal with organization structure, theory of the firm, span of control and cognitive limits on communication.

Submitted 14 February, 2017; originally announced February 2017.

arXiv:1610.04421 [pdf, other]

ZeroSDN: A Message Bus for Flexible and Light-weight Network Control Distribution in SDN

Authors: Frank Dürr, Thomas Kohler, Jonas Grunert, Andre Kutzleb

Abstract: Recent years have seen an evolution of SDN control plane architectures, starting from simple monolithic controllers, over modular monolithic controllers, to distributed controllers. We observe, however, that today's distributed controllers still exhibit inflexibility with respect to the distribution of control logic. Therefore, we propose a novel architecture of a distributed SDN controller in this paper, providing maximum flexibility with respect to distribution. Our architecture splits control logic into light-weight control modules, called controllets, based on a micro-kernel approach, reducing common controllet functionality to a bare minimum and factoring out all higher-level functionality. Light-weight controllets also allow for pushing control logic onto switches to minimize latency and communication overhead. Controllets are interconnected through a message bus supporting the publish/subscribe communication paradigm with specific extensions for content-based OpenFlow message filtering. Publish/subscribe allows for complete decoupling of controllets to further facilitate control plane distribution.

Submitted 14 October, 2016; originally announced October 2016.

Report number: TR-2016-06

ACM Class: C.2.1; C.2.3

arXiv:1609.01524 [pdf, other]

doi 10.1109/ICIP.2016.7532535

Confidence-aware Levenberg-Marquardt optimization for joint motion estimation and super-resolution

Authors: Cosmin Bercea, Andreas Maier, Thomas Köhler

Abstract: Motion estimation across low-resolution frames and the reconstruction of high-resolution images are two coupled subproblems of multi-frame super-resolution. This paper introduces a new joint optimization approach for motion estimation and image reconstruction to address this interdependence. Our method is formulated via non-linear least squares optimization and combines two principles of robust super-resolution. First, to enhance the robustness of the joint estimation, we propose a confidence-aware energy minimization framework augmented with sparse regularization. Second, we develop a tailor-made Levenberg-Marquardt iteration scheme to jointly estimate motion parameters and the high-resolution image along with the corresponding model confidence parameters. Our experiments on simulated and real images confirm that the proposed approach outperforms decoupled motion estimation and image reconstruction as well as related state-of-the-art joint estimation algorithms.

Submitted 6 September, 2016; originally announced September 2016.

Comments: accepted for ICIP 2016

Journal ref: 2016 IEEE International Conference on Image Processing (ICIP)

arXiv:1602.03458 [pdf, other]

Super-Resolved Retinal Image Mosaicing

Authors: Thomas Köhler, Axel Heinrich, Andreas Maier, Joachim Hornegger, Ralf P. Tornow

Abstract: The acquisition of high-resolution retinal fundus images with a large field of view (FOV) is challenging due to technological, physiological and economic reasons. This paper proposes a fully automatic framework to reconstruct retinal images of high spatial resolution and increased FOV from multiple low-resolution images captured with non-mydriatic, mobile and video-capable but low-cost cameras. Within the scope of one examination, we scan different regions on the retina by exploiting eye motion conducted by a patient guidance. Appropriate views for our mosaicing method are selected based on optic disk tracking to trace eye movements. For each view, one super-resolved image is reconstructed by fusion of multiple video frames. Finally, all super-resolved views are registered to a common reference using a novel polynomial registration scheme and combined by means of image mosaicing. We evaluated our framework for a mobile and low-cost video fundus camera. In our experiments, we reconstructed retinal images of up to 30° FOV from 10 complementary views of 15° FOV. An evaluation of the mosaics by human experts as well as a quantitative comparison to conventional color fundus images encourage the clinical usability of our framework.

Submitted 10 February, 2016; originally announced February 2016.

Comments: accepted for 2016 IEEE 13th International Symposium on Biomedical Imaging (ISBI 2016)

Showing 1–29 of 29 results for author: Kohler, T