Search | arXiv e-print repository

AstroPT: Scaling Large Observation Models for Astronomy

Authors: Michael J. Smith, Ryan J. Roberts, Eirini Angeloudi, Marc Huertas-Company

Abstract: This work presents AstroPT, an autoregressive pretrained transformer developed with astronomical use-cases in mind. The AstroPT models presented here have been pretrained on 8.6 million $512 \times 512$ pixel $grz$-band galaxy postage stamp observations from the DESI Legacy Survey DR8. We train a selection of foundation models of increasing size from 1 million to 2.1 billion parameters, and find that AstroPT follows a similar saturating log-log scaling law to textual models. We also find that the models' performance on downstream tasks, as measured by linear probing, improves with model size up to the model parameter saturation point. We believe that collaborative community development paves the best route towards realising an open source `Large Observation Model' -- a model trained on data taken from the observational sciences at the scale seen in natural language processing. To this end, we release the source code, weights, and dataset for AstroPT under the MIT license, and invite potential collaborators to join us in collectively building and researching these models.

Submitted 23 May, 2024; originally announced May 2024.

Comments: 12 pages, 4 figures, 1 table. Code available at https://github.com/Smith42/astroPT

arXiv:2401.01916

AstroLLaMA-Chat: Scaling AstroLLaMA with Conversational and Diverse Datasets

Authors: Ernest Perkowski, Rui Pan, Tuan Dung Nguyen, Yuan-Sen Ting, Sandor Kruk, Tong Zhang, Charlie O'Neill, Maja Jablonska, Zechang Sun, Michael J. Smith, Huiling Liu, Kevin Schawinski, Kartheik Iyer, Ioana Ciucă for UniverseTBD

Abstract: We explore the potential of enhancing LLM performance in astronomy-focused question-answering through targeted, continual pre-training. By employing a compact 7B-parameter LLaMA-2 model and focusing exclusively on a curated set of astronomy corpora -- comprising abstracts, introductions, and conclusions -- we achieve notable improvements in specialized topic comprehension. While general LLMs like GPT-4 excel in broader question-answering scenarios due to superior reasoning capabilities, our findings suggest that continual pre-training with limited resources can still enhance model performance on specialized topics. Additionally, we present an extension of AstroLLaMA: the fine-tuning of the 7B LLaMA model on a domain-specific conversational dataset, culminating in the release of the chat-enabled AstroLLaMA for community use. Comprehensive quantitative benchmarking is currently in progress and will be detailed in an upcoming full paper. The model, AstroLLaMA-Chat, is now available at https://huggingface.co/universeTBD, providing the first open-source conversational AI tool tailored for the astronomy community.

Submitted 5 January, 2024; v1 submitted 2 January, 2024; originally announced January 2024.

Comments: 4 pages, 1 figure, model is available at https://huggingface.co/universeTBD, published in RNAAS

arXiv:2309.07207

EarthPT: a time series foundation model for Earth Observation

Authors: Michael J. Smith, Luke Fleming, James E. Geach

Abstract: We introduce EarthPT -- an Earth Observation (EO) pretrained transformer. EarthPT is a 700 million parameter decoding transformer foundation model trained in an autoregressive self-supervised manner and developed specifically with EO use-cases in mind. We demonstrate that EarthPT is an effective forecaster that can accurately predict future pixel-level surface reflectances across the 400-2300 nm range well into the future. For example, forecasts of the evolution of the Normalised Difference Vegetation Index (NDVI) have a typical error of approximately 0.05 (over a natural range of -1 to 1) at the pixel level over a five month test set horizon, out-performing simple phase-folded models based on historical averaging. We also demonstrate that embeddings learnt by EarthPT hold semantically meaningful information and could be exploited for downstream tasks such as highly granular, dynamic land use classification. Excitingly, we note that the abundance of EO data provides us with -- in theory -- quadrillions of training tokens. Therefore, if we assume that EarthPT follows neural scaling laws akin to those derived for Large Language Models (LLMs), there is currently no data-imposed limit to scaling EarthPT and other similar `Large Observation Models.'

Submitted 11 January, 2024; v1 submitted 13 September, 2023; originally announced September 2023.

Comments: 7 pages, 4 figures, accepted to NeurIPS CCAI workshop at https://www.climatechange.ai/papers/neurips2023/2 . Code available at https://github.com/aspiaspace/EarthPT
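
For readers unfamiliar with the NDVI quantity quoted in the EarthPT abstract, here is a minimal sketch of the standard definition, NDVI = (NIR - Red) / (NIR + Red), together with a mean absolute error between forecast and observed values. The function names and the zero-signal convention are illustrative assumptions and are not taken from the EarthPT code base.

```python
def ndvi(nir, red):
    """Normalised Difference Vegetation Index for one pixel.

    nir and red are surface reflectances in the near-infrared and red bands.
    The result lies in the range -1 to 1 for non-negative reflectances.
    """
    total = nir + red
    if total == 0:
        # Assumption for this sketch: report 0.0 when both bands are empty.
        return 0.0
    return (nir - red) / total


def mean_absolute_error(predicted, observed):
    """Average absolute difference between two equal-length sequences."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must have the same length")
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)


if __name__ == "__main__":
    # Hypothetical reflectances for a vegetated pixel.
    print(ndvi(0.5, 0.1))
    # Hypothetical forecast versus observed NDVI values.
    print(mean_absolute_error([0.60, 0.70], [0.55, 0.75]))
```

An error of about 0.05 on this scale, as reported in the abstract, corresponds to roughly 2.5% of the full -1 to 1 range.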

arXiv:2302.12537

Why Target Networks Stabilise Temporal Difference Methods

Authors: Mattie Fellows, Matthew J. A. Smith, Shimon Whiteson

Abstract: Integral to recent successes in deep reinforcement learning has been a class of temporal difference methods that use infrequently updated target values for policy evaluation in a Markov Decision Process. Yet a complete theoretical explanation for the effectiveness of target networks remains elusive. In this work, we provide an analysis of this popular class of algorithms, to finally answer the question: `why do target networks stabilise TD learning'? To do so, we formalise the notion of a partially fitted policy evaluation method, which describes the use of target networks and bridges the gap between fitted methods and semigradient temporal difference algorithms. Using this framework we are able to uniquely characterise the so-called deadly triad - the use of TD updates with (nonlinear) function approximation and off-policy data - which often leads to nonconvergent algorithms. This insight leads us to conclude that the use of target networks can mitigate the effects of poor conditioning in the Jacobian of the TD update. Instead, we show that under mild regularity conditions and a well tuned target network update frequency, convergence can be guaranteed even in the extremely challenging off-policy sampling and nonlinear function approximation setting.

Submitted 11 August, 2023; v1 submitted 24 February, 2023; originally announced February 2023.

Comments: Found a small error in Appendix (Proposition 1, Appendix B3, penultimate line) that affects results presented in the original submission. These have been fixed and this version is the one accepted at ICML 2023

Journal ref: ICML 2023
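
To make the phrase "infrequently updated target values" concrete, the following is a minimal sketch of semi-gradient TD(0) policy evaluation with a linear value function and a periodically copied target weight vector. It is a generic textbook-style illustration rather than the partially fitted method analysed in the paper; the function name, the parameters (`alpha`, `gamma`, `target_period`), and the two-state chain are all assumptions made for this example.

```python
def td_with_target_weights(transitions, n_features, alpha=0.1, gamma=0.9,
                           target_period=10, n_steps=2000):
    """Linear semi-gradient TD(0) using a lagged copy of the weights.

    transitions: list of (phi, reward, phi_next) tuples, where phi and
    phi_next are feature vectors (lists of floats). A terminal next state
    is encoded as an all-zero phi_next.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    w = [0.0] * n_features      # online weights, updated every step
    w_target = list(w)          # target weights, refreshed every target_period steps

    for step in range(n_steps):
        phi, reward, phi_next = transitions[step % len(transitions)]
        # Bootstrap from the lagged target weights instead of the online ones.
        td_target = reward + gamma * dot(w_target, phi_next)
        td_error = td_target - dot(w, phi)
        w = [wi + alpha * td_error * xi for wi, xi in zip(w, phi)]
        if (step + 1) % target_period == 0:
            w_target = list(w)  # infrequent target update
    return w


if __name__ == "__main__":
    # Hypothetical two-state chain with one-hot features:
    # state 0 -> state 1 (reward 0), state 1 -> terminal (reward 1).
    chain = [
        ([1.0, 0.0], 0.0, [0.0, 1.0]),
        ([0.0, 1.0], 1.0, [0.0, 0.0]),
    ]
    print(td_with_target_weights(chain, n_features=2))
```

In this easy on-policy, tabular-style setting the estimates approach V(1) = 1 and V(0) = gamma * V(1) = 0.9; the paper's interest is in the much harder off-policy, nonlinear case.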

arXiv:2302.08091

Do We Still Need Clinical Language Models?

Authors: Eric Lehman, Evan Hernandez, Diwakar Mahajan, Jonas Wulff, Micah J. Smith, Zachary Ziegler, Daniel Nadler, Peter Szolovits, Alistair Johnson, Emily Alsentzer

Abstract: Although recent advances in scaling large language models (LLMs) have resulted in improvements on many NLP tasks, it remains unclear whether these models trained primarily with general web text are the right tool in highly specialized, safety critical domains such as clinical text. Recent results have suggested that LLMs encode a surprising amount of medical knowledge. This raises an important question regarding the utility of smaller domain-specific language models. With the success of general-domain LLMs, is there still a need for specialized clinical models? To investigate this question, we conduct an extensive empirical analysis of 12 language models, ranging from 220M to 175B parameters, measuring their performance on 3 different clinical tasks that test their ability to parse and reason over electronic health records. As part of our experiments, we train T5-Base and T5-Large models from scratch on clinical notes from MIMIC III and IV to directly investigate the efficiency of clinical tokens. We show that relatively small specialized clinical models substantially outperform all in-context learning approaches, even when finetuned on limited annotated data. Further, we find that pretraining on clinical tokens allows for smaller, more parameter-efficient models that either match or outperform much larger language models trained on general text. We release the code and the models used under the PhysioNet Credentialed Health Data license and data use agreement.

Submitted 16 February, 2023; originally announced February 2023.

arXiv:2211.03796

doi 10.1098/rsos.221454

Astronomia ex machina: a history, primer, and outlook on neural networks in astronomy

Authors: Michael J. Smith, James E. Geach

Abstract: In this review, we explore the historical development and future prospects of artificial intelligence (AI) and deep learning in astronomy. We trace the evolution of connectionism in astronomy through its three waves, from the early use of multilayer perceptrons, to the rise of convolutional and recurrent neural networks, and finally to the current era of unsupervised and generative deep learning methods. With the exponential growth of astronomical data, deep learning techniques offer an unprecedented opportunity to uncover valuable insights and tackle previously intractable problems. As we enter the anticipated fourth wave of astronomical connectionism, we argue for the adoption of GPT-like foundation models fine-tuned for astronomical applications. Such models could harness the wealth of high-quality, multimodal astronomical data to serve state-of-the-art downstream tasks. To keep pace with advancements driven by Big Tech, we propose a collaborative, open-source approach within the astronomy community to develop and maintain these foundation models, fostering a symbiotic relationship between AI and astronomy that capitalizes on the unique strengths of both fields.

Submitted 12 May, 2023; v1 submitted 7 November, 2022; originally announced November 2022.

Comments: 75 pages, 327 references, 32 figures. Review accepted in Royal Society Open Science

arXiv:2111.01713

doi 10.1093/mnras/stac130

Realistic galaxy image simulation via score-based generative models

Authors: Michael J. Smith, James E. Geach, Ryan A. Jackson, Nikhil Arora, Connor Stone, Stéphane Courteau

Abstract: We show that a Denoising Diffusion Probabilistic Model (DDPM), a class of score-based generative model, can be used to produce realistic mock images that mimic observations of galaxies. Our method is tested with Dark Energy Spectroscopic Instrument (DESI) grz imaging of galaxies from the Photometry and Rotation curve OBservations from Extragalactic Surveys (PROBES) sample and galaxies selected from the Sloan Digital Sky Survey. Subjectively, the generated galaxies are highly realistic when compared with samples from the real dataset. We quantify the similarity by borrowing from the deep generative learning literature, using the `Fréchet Inception Distance' to test for subjective and morphological similarity. We also introduce the `Synthetic Galaxy Distance' metric to compare the emergent physical properties (such as total magnitude, colour and half light radius) of a ground truth parent and synthesised child dataset. We argue that the DDPM approach produces sharper and more realistic images than other generative methods such as Adversarial Networks (with the downside of more costly inference), and could be used to produce large samples of synthetic observations tailored to a specific imaging survey. We demonstrate two potential uses of the DDPM: (1) accurate in-painting of occluded data, such as satellite trails, and (2) domain transfer, where new input images can be processed to mimic the properties of the DDPM training set. Here we `DESI-fy' cartoon images as a proof of concept for domain transfer. Finally, we suggest potential applications for score-based approaches that could motivate further research on this topic within the astronomical community.

Submitted 31 January, 2022; v1 submitted 2 November, 2021; originally announced November 2021.

Comments: 11 pages, 8 figures. Code: https://github.com/smith42/astroddpm . Follow the Twitter bot @ThisIsNotAnApod for DDPM-generated APODs

arXiv:2103.15787

Meeting in the notebook: a notebook-based environment for micro-submissions in data science collaborations

Authors: Micah J. Smith, Jürgen Cito, Kalyan Veeramachaneni

Abstract: Developers in data science and other domains frequently use computational notebooks to create exploratory analyses and prototype models. However, they often struggle to incorporate existing software engineering tooling into these notebook-based workflows, leading to fragile development processes. We introduce Assemblé, a new development environment for collaborative data science projects, in which promising code fragments of data science pipelines can be contributed as pull requests to an upstream repository entirely from within JupyterLab, abstracting away low-level version control tool usage. We describe the design and implementation of Assemblé and report on a user study of 23 data scientists.

Submitted 29 March, 2021; originally announced March 2021.

arXiv:2012.07816

doi 10.1145/3479575

Enabling Collaborative Data Science Development with the Ballet Framework

Authors: Micah J. Smith, Jürgen Cito, Kelvin Lu, Kalyan Veeramachaneni

Abstract: While the open-source software development model has led to successful large-scale collaborations in building software systems, data science projects are frequently developed by individuals or small teams. We describe challenges to scaling data science collaborations and present a conceptual framework and ML programming model to address them. We instantiate these ideas in Ballet, a lightweight framework for collaborative, open-source data science through a focus on feature engineering, and an accompanying cloud-based development environment. Using our framework, collaborators incrementally propose feature definitions to a repository which are each subjected to an ML performance evaluation and can be automatically merged into an executable feature engineering pipeline. We leverage Ballet to conduct a case study analysis of an income prediction problem with 27 collaborators, and discuss implications for future designers of collaborative projects.

Submitted 22 October, 2021; v1 submitted 14 December, 2020; originally announced December 2020.

Journal ref: Proc. ACM Hum.-Comput. Interact. 5, CSCW2, Article 431 (October 2021), 39 pages

arXiv:2010.10777

AutoML to Date and Beyond: Challenges and Opportunities

Authors: Shubhra Kanti Karmaker Santu, Md. Mahadi Hassan, Micah J. Smith, Lei Xu, ChengXiang Zhai, Kalyan Veeramachaneni

Abstract: As big data becomes ubiquitous across domains, and more and more stakeholders aspire to make the most of their data, demand for machine learning tools has spurred researchers to explore the possibilities of automated machine learning (AutoML). AutoML tools aim to make machine learning accessible for non-machine learning experts (domain experts), to improve the efficiency of machine learning, and to accelerate machine learning research. But although automation and efficiency are among AutoML's main selling points, the process still requires human involvement at a number of vital steps, including understanding the attributes of domain-specific data, defining prediction problems, creating a suitable training data set, and selecting a promising machine learning technique. These steps often require a prolonged back-and-forth that makes this process inefficient for domain experts and data scientists alike, and keeps so-called AutoML systems from being truly automatic. In this review article, we introduce a new classification system for AutoML systems, using a seven-tiered schematic to distinguish these systems based on their level of autonomy. We begin by describing what an end-to-end machine learning pipeline actually looks like, and which subtasks of the machine learning pipeline have been automated so far. We highlight those subtasks which are still done manually - generally by a data scientist - and explain how this limits domain experts' access to machine learning. Next, we introduce our novel level-based taxonomy for AutoML systems and define each level according to the scope of automation support provided. Finally, we lay out a roadmap for the future, pinpointing the research required to further automate the end-to-end machine learning pipeline and discussing important challenges that stand in the way of this ambitious goal.

Submitted 19 May, 2021; v1 submitted 21 October, 2020; originally announced October 2020.

Comments: 35 pages, survey article, 3 figures

ACM Class: I.2

arXiv:2010.00622

doi 10.1093/mnras/stab424

Pix2Prof: fast extraction of sequential information from galaxy imagery via a deep natural language 'captioning' model

Authors: Michael J. Smith, Nikhil Arora, Connor Stone, Stéphane Courteau, James E. Geach

Abstract: We present 'Pix2Prof', a deep learning model that can eliminate any manual steps taken when extracting galaxy profiles. We argue that a galaxy profile of any sort is conceptually similar to a natural language image caption. This idea allows us to leverage image captioning methods from the field of natural language processing, and so we design Pix2Prof as a float sequence 'captioning' model suitable for galaxy profile inference. We demonstrate the technique by approximating a galaxy surface brightness (SB) profile fitting method that contains several manual steps. Pix2Prof processes $\sim$1 image per second on an Intel Xeon E5 2650 v3 CPU, improving on the speed of the manual interactive method by more than two orders of magnitude. Crucially, Pix2Prof requires no manual interaction, and since galaxy profile estimation is an embarrassingly parallel problem, we can further increase the throughput by running many Pix2Prof instances simultaneously. In perspective, Pix2Prof would take under an hour to infer profiles for $10^5$ galaxies on a single NVIDIA DGX-2 system. A single human expert would take approximately two years to complete the same task. Automated methodology such as this will accelerate the analysis of the next generation of large area sky surveys expected to yield hundreds of millions of targets. In such instances, all manual approaches -- even those involving a large number of experts -- will be impractical.

Submitted 28 April, 2021; v1 submitted 1 October, 2020; originally announced October 2020.

Comments: Accepted for publication in MNRAS. 10 pages, and 8 figures. Code: https://github.com/Smith42/pix2prof

arXiv:1905.08942

doi 10.1145/3318464.3386146

The Machine Learning Bazaar: Harnessing the ML Ecosystem for Effective System Development

Authors: Micah J. Smith, Carles Sala, James Max Kanter, Kalyan Veeramachaneni

Abstract: As machine learning is applied more widely, data scientists often struggle to find or create end-to-end machine learning systems for specific tasks. The proliferation of libraries and frameworks and the complexity of the tasks have led to the emergence of "pipeline jungles" - brittle, ad hoc ML systems. To address these problems, we introduce the Machine Learning Bazaar, a new framework for developing machine learning and automated machine learning software systems. First, we introduce ML primitives, a unified API and specification for data processing and ML components from different software libraries. Next, we compose primitives into usable ML pipelines, abstracting away glue code, data flow, and data storage. We further pair these pipelines with a hierarchy of AutoML strategies - Bayesian optimization and bandit learning. We use these components to create a general-purpose, multi-task, end-to-end AutoML system that provides solutions to a variety of data modalities (image, text, graph, tabular, relational, etc.) and problem types (classification, regression, anomaly detection, graph matching, etc.). We demonstrate 5 real-world use cases and 2 case studies of our approach. Finally, we present an evaluation suite of 456 real-world ML tasks and describe the characteristics of 2.5 million pipelines searched over this task suite.

Submitted 7 April, 2020; v1 submitted 21 May, 2019; originally announced May 2019.

Comments: To appear in SIGMOD '20

Journal ref: In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (SIGMOD '20). Association for Computing Machinery, New York, NY, USA, 785-800

arXiv:1902.05009

doi 10.1145/3290605.3300911

ATMSeer: Increasing Transparency and Controllability in Automated Machine Learning

Authors: Qianwen Wang, Yao Ming, Zhihua Jin, Qiaomu Shen, Dongyu Liu, Micah J. Smith, Kalyan Veeramachaneni, Huamin Qu

Abstract: To relieve the pain of manually selecting machine learning algorithms and tuning hyperparameters, automated machine learning (AutoML) methods have been developed to automatically search for good models. Due to the huge model search space, it is impossible to try all models. Users tend to distrust automatic results and increase the search budget as much as they can, thereby undermining the efficiency of AutoML. To address these issues, we design and implement ATMSeer, an interactive visualization tool that supports users in refining the search space of AutoML and analyzing the results. To guide the design of ATMSeer, we derive a workflow of using AutoML based on interviews with machine learning experts. A multi-granularity visualization is proposed to enable users to monitor the AutoML process, analyze the searched models, and refine the search space in real time. We demonstrate the utility and usability of ATMSeer through two case studies, expert interviews, and a user study with 13 end users.

Submitted 13 February, 2019; originally announced February 2019.

Comments: Published in the ACM Conference on Human Factors in Computing Systems (CHI), 2019, Glasgow, Scotland UK

Journal ref: In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (CHI '19). Association for Computing Machinery, New York, NY, USA, Paper 681, 1-12

arXiv:1610.08171

doi 10.4204/EPTCS.227.6

MELA: Modelling in Ecology with Location Attributes

Authors: Ludovica Luisa Vissat, Jane Hillston, Glenn Marion, Matthew J. Smith

Abstract: Ecology studies the interactions between individuals, species and the environment. The ability to predict the dynamics of ecological systems would support the design and monitoring of control strategies and would help to address pressing global environmental issues. It is also important to plan for efficient use of natural resources and maintenance of critical ecosystem services. The mathematical modelling of ecological systems often includes nontrivial specifications of processes that influence the birth, death, development and movement of individuals in the environment, that take into account both biotic and abiotic interactions. To assist in the specification of such models, we introduce MELA, a process algebra for Modelling in Ecology with Location Attributes. Process algebras allow the modeller to describe concurrent systems in a high-level language. A key feature of concurrent systems is that they are composed of agents that can progress simultaneously but also interact - a good match to ecological systems. MELA aims to provide ecologists with a straightforward yet flexible tool for modelling ecological systems, with particular emphasis on the description of space and the environment. Here we present four example MELA models, illustrating the different spatial arrangements which can be accommodated and demonstrating the use of MELA in epidemiological and predator-prey scenarios.

Submitted 26 October, 2016; originally announced October 2016.

Comments: In Proceedings QAPL'16, arXiv:1610.07696

Journal ref: EPTCS 227, 2016, pp. 82-97

arXiv:1602.08132

doi 10.1109/CISP.2011.6100685

Adaptive Frequency Cepstral Coefficients for Word Mispronunciation Detection

Authors: Zhenhao Ge, Sudhendu R. Sharma, Mark J. T. Smith

Abstract: Systems based on automatic speech recognition (ASR) technology can provide important functionality in computer assisted language learning applications. This is a young but growing area of research motivated by the large number of students studying foreign languages. Here we propose a Hidden Markov Model (HMM)-based method to detect mispronunciations. Exploiting the specific dialog scripting employed in language learning software, HMMs are trained for different pronunciations. New adaptive features have been developed and obtained through an adaptive warping of the frequency scale prior to computing the cepstral coefficients. The optimization criterion used for the warping function is to maximize separation of two major groups of pronunciations (native and non-native) in terms of classification rate. Experimental results show that the adaptive frequency scale yields a better coefficient representation leading to higher classification rates in comparison with conventional HMMs using Mel-frequency cepstral coefficients.

Submitted 25 February, 2016; originally announced February 2016.

Comments: 4th International Congress on Image and Signal Processing (CISP) 2011

arXiv:1602.08128

doi 10.1117/12.884155

PCA Method for Automated Detection of Mispronounced Words

Authors: Zhenhao Ge, Sudhendu R. Sharma, Mark J. T. Smith

Abstract: This paper presents a method for detecting mispronunciations with the aim of improving Computer Assisted Language Learning (CALL) tools used by foreign language learners. The algorithm is based on Principal Component Analysis (PCA). It is hierarchical, with each successive step refining the estimate to classify the test word as being either mispronounced or correct. Preprocessing before detection, like normalization and time-scale modification, is implemented to guarantee uniformity of the feature vectors input to the detection system. The performance of various features, including spectrograms and Mel-Frequency Cepstral Coefficients (MFCCs), is compared and evaluated. Best results were obtained using MFCCs, achieving up to 99% accuracy in word verification and 93% in native/non-native classification. Compared with Hidden Markov Models (HMMs), which are used pervasively in recognition applications, this particular approach is computationally efficient and effective when training data is limited.

Submitted 25 February, 2016; originally announced February 2016.

Comments: SPIE Defense, Security, and Sensing

arXiv:1602.08045 [pdf, other]

doi 10.1117/12.919235

PCA/LDA Approach for Text-Independent Speaker Recognition

Authors: Zhenhao Ge, Sudhendu R. Sharma, Mark J. T. Smith

Abstract: Various algorithms for text-independent speaker recognition have been developed through the decades, aiming to improve both accuracy and efficiency. This paper presents a novel PCA/LDA-based approach that is faster than traditional statistical model-based methods and achieves competitive results. First, the performance based on only PCA and only LDA is measured; then a mixed model, taking advantage of both methods, is introduced. A subset of the TIMIT corpus, composed of 200 male speakers, is used for enrollment, validation and testing. The best results achieve 100%, 96% and 95% classification rates at population levels of 50, 100 and 200, using 39-dimensional MFCC features with delta and double delta. These results are based on 12-second text-independent speech for training and 4-second data for testing. They are comparable to the conventional MFCC-GMM methods, but require significantly less time to train and operate.

Submitted 25 February, 2016; originally announced February 2016.

Comments: Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series

arXiv:1602.05292 [pdf, other]

Authorship Attribution Using a Neural Network Language Model

Authors: Zhenhao Ge, Yufang Sun, Mark J. T. Smith

Abstract: In practice, training language models for individual authors is often expensive because of limited data resources. In such cases, Neural Network Language Models (NNLMs) generally outperform the traditional non-parametric N-gram models. Here we investigate the performance of a feed-forward NNLM on an authorship attribution problem, with moderate author set size and relatively limited data. We also consider how the text topics impact performance. Compared with a well-constructed N-gram baseline method with Kneser-Ney smoothing, the proposed method achieves nearly 2.5% reduction in perplexity and increases author classification accuracy by 3.43% on average, given as few as 5 test sentences. The performance is very competitive with the state of the art in terms of accuracy and demand on test data. The source code, preprocessed datasets, a detailed description of the methodology and results are available at https://github.com/zge/authorship-attribution.

Submitted 16 February, 2016; originally announced February 2016.

Comments: Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI'16)

arXiv:1312.6122 [pdf, other]

Shadow networks: Discovering hidden nodes with models of information flow

Authors: James P. Bagrow, Suma Desu, Morgan R. Frank, Narine Manukyan, Lewis Mitchell, Andrew Reagan, Eric E. Bloedorn, Lashon B. Booker, Luther K. Branting, Michael J. Smith, Brian F. Tivnan, Christopher M. Danforth, Peter S. Dodds, Joshua C. Bongard

Abstract: Complex, dynamic networks underlie many systems, and understanding these networks is the concern of a great span of important scientific and engineering problems. Quantitative description is crucial for this understanding yet, due to a range of measurement problems, many real network datasets are incomplete. Here we explore how accidentally missing or deliberately hidden nodes may be detected in networks by the effect of their absence on predictions of the speed with which information flows through the network. We use Symbolic Regression (SR) to learn models relating information flow to network topology. These models show localized, systematic, and non-random discrepancies when applied to test networks with intentionally masked nodes, demonstrating the ability to detect the presence of missing nodes and where in the network those nodes are likely to reside.

Submitted 20 December, 2013; originally announced December 2013.

Comments: 12 pages, 3 figures

arXiv:1209.6578 [pdf, other]

Roadmap Document on Stochastic Analysis

Authors: Bo Friis Nielsen, Flemming Nielson, Henrik Pilegaard, Michael James Andrew Smith, Ender Yüksel, Kebin Zeng, Lijun Zhang

Abstract: This document was prepared as part of the MT-LAB research centre. The research centre studies the Modelling of Information Technology and is a VKR Centre of Excellence funded for five years by the VILLUM Foundation. You can read more about MT-LAB at its webpage www.MT-LAB.dk. The goal of the document is to serve as an introduction to new PhD students addressing the research goals of MT-LAB. As such it aims to provide an overview of a number of selected approaches to the modelling of stochastic systems. It should be readable not only by computer scientists with a background in formal methods but also by PhD students in stochastics who are interested in understanding the computer science approach to stochastic model checking. We have no intention of being encyclopedic in our treatment of the approaches or the literature. Rather we have made the selection of material based on the competences of the groups involved in or closely affiliated to MT-LAB, so as to ease the task of the PhD students in navigating an otherwise vast amount of literature. We have decided to publish the document in case other young researchers may find it helpful. The list of authors reflects those who have at times played a significant role in the production of the document.

Submitted 27 September, 2012; originally announced September 2012.

Comments: This work has been supported by MT-LAB, a VKR Centre of Excellence for the Modelling of Information Technology

Showing 1–20 of 20 results for author: Smith, M J