
GFN: A graph feedforward network for resolution-invariant reduced operator learning in multifidelity applications

Authors: Oisín M. Morrison, Federico Pichi, Jan S. Hesthaven

Abstract: This work presents a novel resolution-invariant model order reduction strategy for multifidelity applications. We base our architecture on a novel neural network layer developed in this work, the graph feedforward network, which extends the concept of feedforward networks to graph-structured data by creating a direct link between the weights of a neural network and the nodes of a mesh, enhancing the interpretability of the network. We exploit the method's capability of training and testing on different mesh sizes in an autoencoder-based reduction strategy for parametrised partial differential equations. We show that this extension comes with provable guarantees on the performance via error bounds. The capabilities of the proposed methodology are tested on three challenging benchmarks, including advection-dominated phenomena and problems with a high-dimensional parameter space. The method results in a more lightweight and highly flexible strategy when compared to state-of-the-art models, while showing excellent generalisation performance in both single fidelity and multifidelity scenarios.

Submitted 5 June, 2024; originally announced June 2024.

arXiv:2402.17735 [pdf, other]

High-Fidelity Neural Phonetic Posteriorgrams

Authors: Cameron Churchwell, Max Morrison, Bryan Pardo

Abstract: A phonetic posteriorgram (PPG) is a time-varying categorical distribution over acoustic units of speech (e.g., phonemes). PPGs are a popular representation in speech generation due to their ability to disentangle pronunciation features from speaker identity, allowing accurate reconstruction of pronunciation (e.g., voice conversion) and coarse-grained pronunciation editing (e.g., foreign accent conversion). In this paper, we demonstrably improve the quality of PPGs to produce a state-of-the-art interpretable PPG representation. We train an off-the-shelf speech synthesizer using our PPG representation and show that high-quality PPGs yield independent control over pitch and pronunciation. We further demonstrate novel uses of PPGs, such as an acoustic pronunciation distance and fine-grained pronunciation control.

Submitted 27 February, 2024; originally announced February 2024.

Comments: Accepted to ICASSP 2024 Workshop on Explainable Machine Learning for Speech and Audio

arXiv:2402.11151 [pdf]

A Landscape Study of Open Source and Proprietary Tools for Software Bill of Materials (SBOM)

Authors: Mehdi Mirakhorli, Derek Garcia, Schuyler Dillon, Kevin Laporte, Matthew Morrison, Henry Lu, Viktoria Koscinski, Christopher Enoch

Abstract: Modern software applications heavily rely on diverse third-party components, libraries, and frameworks sourced from various vendors and open source repositories, presenting a complex challenge for securing the software supply chain. To address this complexity, the adoption of a Software Bill of Materials (SBOM) has emerged as a promising solution, offering a centralized repository that inventories all third-party components and dependencies used in an application. Recent supply chain breaches, exemplified by the SolarWinds attack, underscore the urgent need to enhance software security and mitigate vulnerability risks, with SBOMs playing a pivotal role in this endeavor by revealing potential vulnerabilities, outdated components, and unsupported elements. This research paper conducts an extensive empirical analysis to assess the current landscape of open-source and proprietary tools related to SBOM. We investigate emerging use cases in software supply chain security and identify gaps in SBOM technologies. Our analysis encompasses 84 tools, providing a snapshot of the current market and highlighting areas for improvement.

Submitted 16 February, 2024; originally announced February 2024.

arXiv:2401.11042 [pdf]

Does Using ChatGPT Result in Human Cognitive Augmentation?

Authors: Ron Fulbright, Miranda Morrison

Abstract: Human cognitive performance is enhanced by the use of tools. For example, a human can produce a much greater, and more accurate, volume of mathematical calculation in a unit of time using a calculator or a spreadsheet application on a computer. Such tools have taken over the burden of lower level cognitive grunt work but the human still serves the role of the expert performing higher level thinking and reasoning. Recently, however, unsupervised, deep, machine learning has produced cognitive systems able to outperform humans in several domains. When humans use these tools in a human cog ensemble, the cognitive ability of the human is augmented. In some cases, even non experts can achieve, and even exceed, the performance of experts in a particular domain, synthetic expertise. A new cognitive system, ChatGPT, has burst onto the scene during the past year. This paper investigates human cognitive augmentation due to using ChatGPT by presenting the results of two experiments comparing responses created using ChatGPT with results created not using ChatGPT. We find using ChatGPT does not always result in cognitive augmentation and does not yet replace human judgement, discernment, and evaluation in certain types of tasks. In fact, ChatGPT was observed to result in misleading users resulting in negative cognitive augmentation.

Submitted 19 January, 2024; originally announced January 2024.

Comments: 12 pages, 5 figures

arXiv:2310.08464 [pdf, other]

Crowdsourced and Automatic Speech Prominence Estimation

Authors: Max Morrison, Pranav Pawar, Nathan Pruyne, Jennifer Cole, Bryan Pardo

Abstract: The prominence of a spoken word is the degree to which an average native listener perceives the word as salient or emphasized relative to its context. Speech prominence estimation is the process of assigning a numeric value to the prominence of each word in an utterance. These prominence labels are useful for linguistic analysis, as well as training automated systems to perform emphasis-controlled text-to-speech or emotion recognition. Manually annotating prominence is time-consuming and expensive, which motivates the development of automated methods for speech prominence estimation. However, developing such an automated system using machine-learning methods requires human-annotated training data. Using our system for acquiring such human annotations, we collect and open-source crowdsourced annotations of a portion of the LibriTTS dataset. We use these annotations as ground truth to train a neural speech prominence estimator that generalizes to unseen speakers, datasets, and speaking styles. We investigate design decisions for neural prominence estimation as well as how neural prominence estimation improves as a function of two key factors of annotation cost: dataset size and the number of annotations per utterance.

Submitted 22 December, 2023; v1 submitted 12 October, 2023; originally announced October 2023.

Comments: Published as a conference paper at ICASSP 2024

arXiv:2301.12258 [pdf, other]

Cross-domain Neural Pitch and Periodicity Estimation

Authors: Max Morrison, Caedon Hsieh, Nathan Pruyne, Bryan Pardo

Abstract: Pitch is a foundational aspect of our perception of audio signals. Pitch contours are commonly used to analyze speech and music signals and as input features for many audio tasks, including music transcription, singing voice synthesis, and prosody editing. In this paper, we describe a set of techniques for improving the accuracy of widely-used neural pitch and periodicity estimators to achieve state-of-the-art performance on both speech and music. We also introduce a novel entropy-based method for extracting periodicity and per-frame voiced-unvoiced classifications from statistical inference-based pitch estimators (e.g., neural networks), and show how to train a neural pitch estimator to simultaneously handle both speech and music data (i.e., cross-domain estimation) without performance degradation. While neural pitch trackers have historically been significantly slower than signal processing based pitch trackers, our estimator implementations approach the speed of state-of-the-art DSP-based pitch estimators on a standard CPU, but with significantly more accurate pitch and periodicity estimation. Our experiments show that an accurate, cross-domain pitch and periodicity estimator written in PyTorch with a hopsize of ten milliseconds can run 11.2x faster than real-time on an Intel i9-9820X 10-core 3.30 GHz CPU or 408x faster than real-time on an NVIDIA GeForce RTX 3090 GPU, without hardware optimization. We release all of our code and models as Pitch-Estimating Neural Networks (penn), an open-source, pip-installable Python module for training, evaluating, and performing inference with pitch- and periodicity-estimating neural networks. The code for penn is available at https://github.com/interactiveaudiolab/penn.

Submitted 9 June, 2023; v1 submitted 28 January, 2023; originally announced January 2023.

arXiv:2208.12387 [pdf, other]

Music Separation Enhancement with Generative Modeling

Authors: Noah Schaffer, Boaz Cogan, Ethan Manilow, Max Morrison, Prem Seetharaman, Bryan Pardo

Abstract: Despite phenomenal progress in recent years, state-of-the-art music separation systems produce source estimates with significant perceptual shortcomings, such as adding extraneous noise or removing harmonics. We propose a post-processing model (the Make it Sound Good (MSG) post-processor) to enhance the output of music source separation systems. We apply our post-processing model to state-of-the-art waveform-based and spectrogram-based music source separators, including a separator unseen by MSG during training. Our analysis of the errors produced by source separators shows that waveform models tend to introduce more high-frequency noise, while spectrogram models tend to lose transients and high frequency content. We introduce objective measures to quantify both kinds of errors and show MSG improves the source reconstruction of both kinds of errors. Crowdsourced subjective evaluations demonstrate that human listeners prefer source estimates of bass and drums that have been post-processed by MSG.

Submitted 25 August, 2022; originally announced August 2022.

Comments: Accepted to ISMIR 2022

arXiv:2203.04451 [pdf, other]

Transitions between peace and systemic war as bifurcations in a signed network dynamical system

Authors: Megan Morrison, J. Nathan Kutz, Michael Gabbay

Abstract: We investigate structural features and processes associated with the onset of systemic conflict using an approach which integrates complex systems theory with network modeling and analysis. We present a signed network model of cooperation and conflict dynamics in the context of international relations between states. The model evolves ties between nodes under the influence of a structural balance force and a dyad-specific force. Model simulations exhibit a sharp bifurcation from peace to systemic war as structural balance pressures increase, a bistable regime in which both peace and war stable equilibria exist, and a hysteretic reverse bifurcation from war to peace. We show how the analytical expression we derive for the peace-to-war bifurcation condition implies that polarized network structure increases susceptibility to systemic war. We develop a framework for identifying patterns of relationship perturbations that are most destabilizing and apply it to the network of European great powers before World War I. We also show that the model exhibits critical slowing down, in which perturbations to the peace equilibrium take longer to decay as the system draws closer to the bifurcation. We discuss how our results relate to international relations theories on the causes and catalysts of systemic war.

Submitted 8 March, 2022; originally announced March 2022.

MSC Class: 91D30; 37G99; 37N99; 91C20; 34H20 ACM Class: J.4

arXiv:2203.04444 [pdf, other]

Reproducible Subjective Evaluation

Authors: Max Morrison, Brian Tang, Gefei Tan, Bryan Pardo

Abstract: Human perceptual studies are the gold standard for the evaluation of many research tasks in machine learning, linguistics, and psychology. However, these studies require significant time and cost to perform. As a result, many researchers use objective measures that can correlate poorly with human evaluation. When subjective evaluations are performed, they are often not reported with sufficient detail to ensure reproducibility. We propose Reproducible Subjective Evaluation (ReSEval), an open-source framework for quickly deploying crowdsourced subjective evaluations directly from Python. ReSEval lets researchers launch A/B, ABX, Mean Opinion Score (MOS) and MUltiple Stimuli with Hidden Reference and Anchor (MUSHRA) tests on audio, image, text, or video data from a command-line interface or using one line of Python, making it as easy to run as objective evaluation. With ReSEval, researchers can reproduce each other's subjective evaluations by sharing a configuration file and the audio, image, text, or video files.

Submitted 8 March, 2022; originally announced March 2022.

Comments: Submitted to ICLR 2022 Workshop on Setting up ML Evaluation Standards to Accelerate Progress

arXiv:2110.10139 [pdf, other]

Chunked Autoregressive GAN for Conditional Waveform Synthesis

Authors: Max Morrison, Rithesh Kumar, Kundan Kumar, Prem Seetharaman, Aaron Courville, Yoshua Bengio

Abstract: Conditional waveform synthesis models learn a distribution of audio waveforms given conditioning such as text, mel-spectrograms, or MIDI. These systems employ deep generative models that model the waveform via either sequential (autoregressive) or parallel (non-autoregressive) sampling. Generative adversarial networks (GANs) have become a common choice for non-autoregressive waveform synthesis. However, state-of-the-art GAN-based models produce artifacts when performing mel-spectrogram inversion. In this paper, we demonstrate that these artifacts correspond with an inability for the generator to learn accurate pitch and periodicity. We show that simple pitch and periodicity conditioning is insufficient for reducing this error relative to using autoregression. We discuss the inductive bias that autoregression provides for learning the relationship between instantaneous frequency and phase, and show that this inductive bias holds even when autoregressively sampling large chunks of the waveform during each forward pass. Relative to prior state-of-the-art GAN-based models, our proposed model, Chunked Autoregressive GAN (CARGAN) reduces pitch error by 40-60%, reduces training time by 58%, maintains a fast generation speed suitable for real-time or interactive applications, and maintains or improves subjective quality.

Submitted 3 March, 2022; v1 submitted 19 October, 2021; originally announced October 2021.

Comments: Published as a conference paper at ICLR 2022

arXiv:2110.02360 [pdf, other]

Neural Pitch-Shifting and Time-Stretching with Controllable LPCNet

Authors: Max Morrison, Zeyu Jin, Nicholas J. Bryan, Juan-Pablo Caceres, Bryan Pardo

Abstract: Modifying the pitch and timing of an audio signal are fundamental audio editing operations with applications in speech manipulation, audio-visual synchronization, and singing voice editing and synthesis. Thus far, methods for pitch-shifting and time-stretching that use digital signal processing (DSP) have been favored over deep learning approaches due to their speed and relatively higher quality. However, even existing DSP-based methods for pitch-shifting and time-stretching induce artifacts that degrade audio quality. In this paper, we propose Controllable LPCNet (CLPCNet), an improved LPCNet vocoder capable of pitch-shifting and time-stretching of speech. For objective evaluation, we show that CLPCNet performs pitch-shifting of speech on unseen datasets with high accuracy relative to prior neural methods. For subjective evaluation, we demonstrate that the quality and naturalness of pitch-shifting and time-stretching with CLPCNet on unseen datasets meets or exceeds competitive neural- or DSP-based approaches.

Submitted 5 October, 2021; originally announced October 2021.

Comments: Submitted to ICASSP 2022

arXiv:2102.08328 [pdf, other]

Context-Aware Prosody Correction for Text-Based Speech Editing

Authors: Max Morrison, Lucas Rencker, Zeyu Jin, Nicholas J. Bryan, Juan-Pablo Caceres, Bryan Pardo

Abstract: Text-based speech editors expedite the process of editing speech recordings by permitting editing via intuitive cut, copy, and paste operations on a speech transcript. A major drawback of current systems, however, is that edited recordings often sound unnatural because of prosody mismatches around edited regions. In our work, we propose a new context-aware method for more natural sounding text-based editing of speech. To do so, we 1) use a series of neural networks to generate salient prosody features that are dependent on the prosody of speech surrounding the edit and amenable to fine-grained user control 2) use the generated features to control a standard pitch-shift and time-stretch method and 3) apply a denoising neural network to remove artifacts induced by the signal manipulation to yield a high-fidelity result. We evaluate our approach using a subjective listening test, provide a detailed comparative analysis, and conclude several interesting insights.

Submitted 16 February, 2021; originally announced February 2021.

Comments: To appear in proceedings of ICASSP 2021

arXiv:2010.03660 [pdf, other]

Fast Stencil-Code Computation on a Wafer-Scale Processor

Authors: Kamil Rocki, Dirk Van Essendelft, Ilya Sharapov, Robert Schreiber, Michael Morrison, Vladimir Kibardin, Andrey Portnoy, Jean Francois Dietiker, Madhava Syamlal, Michael James

Abstract: The performance of CPU-based and GPU-based systems is often low for PDE codes, where large, sparse, and often structured systems of linear equations must be solved. Iterative solvers are limited by data movement, both between caches and memory and between nodes. Here we describe the solution of such systems of equations on the Cerebras Systems CS-1, a wafer-scale processor that has the memory bandwidth and communication latency to perform well. We achieve 0.86 PFLOPS on a single wafer-scale system for the solution by BiCGStab of a linear system arising from a 7-point finite difference stencil on a 600 X 595 X 1536 mesh, achieving about one third of the machine's peak performance. We explain the system, its architecture and programming, and its performance on this problem and related problems. We discuss issues of memory capacity and floating point precision. We outline plans to extend this work towards full applications.

Submitted 7 October, 2020; originally announced October 2020.

Comments: SC 20: The International Conference for High Performance Computing, Networking, Storage, and Analysis, to appear

arXiv:2008.03388 [pdf, other]

Controllable Neural Prosody Synthesis

Authors: Max Morrison, Zeyu Jin, Justin Salamon, Nicholas J. Bryan, Gautham J. Mysore

Abstract: Speech synthesis has recently seen significant improvements in fidelity, driven by the advent of neural vocoders and neural prosody generators. However, these systems lack intuitive user controls over prosody, making them unable to rectify prosody errors (e.g., misplaced emphases and contextually inappropriate emotions) or generate prosodies with diverse speaker excitement levels and emotions. We address these limitations with a user-controllable, context-aware neural prosody generator. Given a real or synthesized speech recording, our model allows a user to input prosody constraints for certain time frames and generates the remaining time frames from input text and contextual prosody. We also propose a pitch-shifting neural vocoder to modify input speech to match the synthesized prosody. Through objective and subjective evaluations we show that we can successfully incorporate user control into our prosody generation model without sacrificing the overall naturalness of the synthesized speech.

Submitted 11 August, 2020; v1 submitted 7 August, 2020; originally announced August 2020.

Comments: To appear in proceedings of INTERSPEECH 2020

arXiv:1912.07772 [pdf, other]

doi 10.1103/PhysRevE.102.012304

Community detectability and structural balance dynamics in signed networks

Authors: Megan Morrison, Michael Gabbay

Abstract: We investigate signed networks with community structure with respect to their spectrum and their evolution under a dynamical model of structural balance, a prominent theory of signed social networks. The spectrum of the adjacency matrix generated by a stochastic block model with two equal size communities shows detectability transitions in which the community structure becomes manifest when its signal eigenvalue appears outside the main spectral band. The spectrum also exhibits "sociality" transitions involving the homogeneous structure representing the average tie value. We derive expressions for the eigenvalues associated with the community and homogeneous structure as well as the transition boundaries, all in good agreement with numerical results. Using the stochastically-generated networks as initial conditions for a simple model of structural balance dynamics yields three outcome regimes: two hostile factions that correspond with the initial communities, two hostile factions uncorrelated with those communities, and a single harmonious faction of all nodes. The detectability transition predicts the boundary between the assortative and mixed two-faction states and the sociality transition predicts that between the mixed and harmonious states. Our results may yield insight into the dynamics of cooperation and conflict among actors with distinct social identities.

Submitted 16 December, 2019; originally announced December 2019.

MSC Class: 91C20; 15B52; 91D30

Journal ref: Phys. Rev. E 102, 012304 (2020)

arXiv:1911.02073 [pdf, other]

OtoMechanic: Auditory Automobile Diagnostics via Query-by-Example

Authors: Max Morrison, Bryan Pardo

Abstract: Early detection and repair of failing components in automobiles reduces the risk of vehicle failure in life-threatening situations. Many automobile components in need of repair produce characteristic sounds. For example, loose drive belts emit a high-pitched squeaking sound, and bad starter motors have a characteristic whirring or clicking noise. Often drivers can tell that the sound of their car is not normal, but may not be able to identify the cause. To mitigate this knowledge gap, we have developed OtoMechanic, a web application to detect and diagnose vehicle component issues from their corresponding sounds. It compares a user's recording of a problematic sound to a database of annotated sounds caused by failing automobile components. OtoMechanic returns the most similar sounds, and provides weblinks for more information on the diagnosis associated with each sound, along with an estimate of the similarity of each retrieved sound. In user studies, we find that OtoMechanic significantly increases diagnostic accuracy relative to a baseline accuracy of consumer performance.

Submitted 5 November, 2019; originally announced November 2019.

Comments: Submitted to Workshop on Detection and Classification of Acoustic Scenes and Events 2019 (DCASE2019)

Showing 1–16 of 16 results for author: Morrison, M