-
LSTM-LM with Long-Term History for First-Pass Decoding in Conversational Speech Recognition
Authors:
Xie Chen,
Sarangarajan Parthasarathy,
William Gale,
Shuangyu Chang,
Michael Zeng
Abstract:
LSTM language models (LSTM-LMs) have proven powerful and have yielded significant performance improvements over count-based n-gram LMs in modern speech recognition systems. Due to their infinite history states and computational load, most previous studies have focused on applying LSTM-LMs in the second pass for rescoring. Recent work shows that it is feasible and computationally affordable to adopt LSTM-LMs in first-pass decoding within a dynamic (or tree-based) decoder framework. In this work, the LSTM-LM is composed with a WFST decoder on the fly for first-pass decoding. Furthermore, motivated by the long-term history nature of LSTM-LMs, the use of context beyond the current utterance is explored for first-pass decoding in conversational speech recognition. The context information is captured by the hidden states of LSTM-LMs across utterances and can be used to guide the first-pass search effectively. The experimental results on our internal meeting transcription system show that significant performance improvements can be obtained by incorporating the contextual information with LSTM-LMs in first-pass decoding, compared to applying the contextual information in second-pass rescoring.
Submitted 21 October, 2020;
originally announced October 2020.
-
Long-span language modeling for speech recognition
Authors:
Sarangarajan Parthasarathy,
William Gale,
Xie Chen,
George Polovets,
Shuangyu Chang
Abstract:
We explore neural language modeling for speech recognition where the context spans multiple sentences. Rather than encode history beyond the current sentence using a cache of words or document-level features, we focus our study on the ability of LSTM and Transformer language models to implicitly learn to carry over context across sentence boundaries. We introduce a new architecture that incorporates an attention mechanism into LSTM to combine the benefits of recurrent and attention architectures. We conduct language modeling and speech recognition experiments on the publicly available LibriSpeech corpus. We show that conventional training on a paragraph-level corpus results in significant reductions in perplexity compared to training on a sentence-level corpus. We also describe speech recognition experiments using long-span language models in second-pass re-ranking, and provide insights into the ability of such models to take advantage of context beyond the current sentence.
Submitted 11 November, 2019;
originally announced November 2019.
-
Deep Learning Predicts Hip Fracture using Confounding Patient and Healthcare Variables
Authors:
Marcus A. Badgeley,
John R. Zech,
Luke Oakden-Rayner,
Benjamin S. Glicksberg,
Manway Liu,
William Gale,
Michael V. McConnell,
Beth Percha,
Thomas M. Snyder,
Joel T. Dudley
Abstract:
Hip fractures are a leading cause of death and disability among older adults. Hip fractures are also the most commonly missed diagnosis on pelvic radiographs. Computer-Aided Diagnosis (CAD) algorithms have shown promise for helping radiologists detect fractures, but the image features underpinning their predictions are notoriously difficult to understand. In this study, we trained deep learning models on 17,587 radiographs to classify fracture, five patient traits, and 14 hospital process variables. All 20 variables could be predicted from a radiograph (p < 0.05), with the best performances on scanner model (AUC=1.00), scanner brand (AUC=0.98), and whether the order was marked "priority" (AUC=0.79). Fracture was predicted moderately well from the image (AUC=0.78) and better when combining image features with patient data (AUC=0.86, p=2e-9) or patient data plus hospital process features (AUC=0.91, p=1e-21). The model performance on a test set with matched patient variables was significantly lower than a random test set (AUC=0.67, p=0.003); and when the test set was matched on patient and image acquisition variables, the model performed randomly (AUC=0.52, 95% CI 0.46-0.58), indicating that these variables were the main source of the model's predictive ability overall. We also used Naive Bayes to combine evidence from image models with patient and hospital data and found their inclusion improved performance, but that this approach was nevertheless inferior to directly modeling all variables. If CAD algorithms are inexplicably leveraging patient and process variables in their predictions, it is unclear how radiologists should interpret their predictions in the context of other known patient data. Further research is needed to illuminate deep learning decision processes so that computers and clinicians can effectively cooperate.
Submitted 8 November, 2018;
originally announced November 2018.
-
Producing radiologist-quality reports for interpretable artificial intelligence
Authors:
William Gale,
Luke Oakden-Rayner,
Gustavo Carneiro,
Andrew P Bradley,
Lyle J Palmer
Abstract:
Current approaches to explaining the decisions of deep learning systems for medical tasks have focused on visualising the elements that have contributed to each decision. We argue that such approaches are not enough to "open the black box" of medical decision making systems because they are missing a key component that has been used as a standard communication tool between doctors for centuries: language. We propose a model-agnostic interpretability method that involves training a simple recurrent neural network model to produce descriptive sentences to clarify the decision of deep learning classifiers.
We test our method on the task of detecting hip fractures from frontal pelvic x-rays. This process requires minimal additional labelling despite producing text containing elements that the original deep learning classification model was not specifically trained to detect.
The experimental results show that: 1) the sentences produced by our method consistently contain the desired information, 2) the generated sentences are preferred by doctors compared to current tools that create saliency maps, and 3) the combination of visualisations and generated text is better than either alone.
Submitted 1 June, 2018;
originally announced June 2018.
-
Detecting hip fractures with radiologist-level performance using deep neural networks
Authors:
William Gale,
Luke Oakden-Rayner,
Gustavo Carneiro,
Andrew P. Bradley,
Lyle J. Palmer
Abstract:
We developed an automated deep learning system to detect hip fractures from frontal pelvic x-rays, an important and common radiological task. Our system was trained on a decade of clinical x-rays (~53,000 studies) and can be applied to clinical data, automatically excluding inappropriate and technically unsatisfactory studies. We demonstrate diagnostic performance equivalent to a human radiologist and an area under the ROC curve of 0.994. Translated to clinical practice, such a system has the potential to increase the efficiency of diagnosis, reduce the need for expensive additional testing, expand access to expert level medical image interpretation, and improve overall patient outcomes.
Submitted 17 November, 2017;
originally announced November 2017.
-
A Dynamical Model of Decision-Making Behaviour in a Network of Consumers with Applications to Energy Choices
Authors:
Nick. J. McCullen,
Mikhail. V. Ivanchenko,
Vladimir. D. Shalfeev,
William. F. Gale
Abstract:
A consumer behaviour model is considered in the context of a network of interacting individuals in an energy market. We propose and analyse a simple dynamical model of an ensemble of coupled active elements mimicking consumers' behaviour, where "word-of-mouth" interactions between individuals are important. A single element is modelled using the automatic control system framework. Assuming local (nearest neighbour) coupling we study the evolution of chains and lattices of the model consumers on variation of the coupling strength and initial conditions. The results are interpreted as the dynamics of the decision-making process by the energy-market consumers. We demonstrate that a pitchfork bifurcation to the homogeneous solution leads to bistability of stationary regimes, while the autonomous system is always monostable. In the presence of inhomogeneities this results in the formation of clusters of sharply positive and negative opinions. We also find that, depending on the coupling strength, the perturbations caused by inhomogeneities can be exponentially localised in space or de-localised. In the latter case the coarse-graining of opinion clusters occurs.
Submitted 28 January, 2014;
originally announced January 2014.
-
Multi-parameter models of innovation diffusion on complex networks
Authors:
Nicholas J. McCullen,
Alastair M. Rucklidge,
Catherine S. E. Bale,
Tim J. Foxon,
William F. Gale
Abstract:
A model, applicable to a range of innovation diffusion applications with a strong peer to peer component, is developed and studied, along with methods for its investigation and analysis. A particular application is to individual households deciding whether to install an energy efficiency measure in their home. The model represents these individuals as nodes on a network, each with a variable representing their current state of adoption of the innovation. The motivation to adopt is composed of three terms, representing personal preference, an average of each individual's network neighbours' states and a system average, which is a measure of the current social trend. The adoption state of a node changes if a weighted linear combination of these factors exceeds some threshold. Numerical simulations have been carried out, computing the average uptake after a sufficient number of time-steps over many realisations at a range of model parameter values, on various network topologies, including random (Erdos-Renyi), small world (Watts-Strogatz) and (Newman's) highly clustered, community-based networks. An analytical and probabilistic approach has been developed to account for the observed behaviour, which explains the results of the numerical calculations.
Submitted 20 July, 2012;
originally announced July 2012.
-
Tagging French Without Lexical Probabilities -- Combining Linguistic Knowledge And Statistical Learning
Authors:
Evelyne Tzoukermann,
Dragomir R. Radev,
William A. Gale
Abstract:
This paper explores morpho-syntactic ambiguities for French to develop a strategy for part-of-speech disambiguation that a) reflects the complexity of French as an inflected language, b) optimizes the estimation of probabilities, and c) allows the user flexibility in choosing a tagset. The problem in extracting lexical probabilities from a limited training corpus is that the statistical model may not necessarily represent the use of a particular word in a particular context. In a highly morphologically inflected language, this argument is particularly serious since a word can be tagged with a large number of parts of speech. Due to the lack of sufficient training data, we argue against estimating lexical probabilities to disambiguate parts of speech in unrestricted texts. Instead, we use the strength of contextual probabilities along with a feature we call "genotype", a set of tags associated with a word. Using this knowledge, we have built a part-of-speech tagger that combines linguistic and statistical approaches: contextual information is disambiguated by linguistic rules and n-gram probabilities on parts of speech only are estimated in order to disambiguate the remaining ambiguous tags.
Submitted 10 October, 1997;
originally announced October 1997.
-
A Sequential Algorithm for Training Text Classifiers
Authors:
David D. Lewis,
William A. Gale
Abstract:
The ability to cheaply train text classifiers is critical to their use in information retrieval, content analysis, natural language processing, and other tasks involving data which is partly or fully textual. An algorithm for sequential sampling during machine learning of statistical classifiers was developed and tested on a newswire text categorization task. This method, which we call uncertainty sampling, reduced by as much as 500-fold the amount of training data that would have to be manually classified to achieve a given level of effectiveness.
Submitted 24 July, 1994; v1 submitted 24 July, 1994;
originally announced July 1994.
-
A Stochastic Finite-State Word-Segmentation Algorithm for Chinese
Authors:
Richard Sproat,
Chilin Shih,
William Gale,
Nancy Chang
Abstract:
We present a stochastic finite-state model for segmenting Chinese text into dictionary entries and productively derived words, and providing pronunciations for these words; the method incorporates a class-based model in its treatment of personal names. We also evaluate the system's performance, taking into account the fact that people often do not agree on a single segmentation.
Submitted 5 May, 1994; v1 submitted 3 May, 1994;
originally announced May 1994.