
Representing and Computing Uncertainty in Phonological Reconstruction

Authors: Johann-Mattis List, Nathan W. Hill, Robert Forkel, Frederic Blum

Abstract: Despite the inherently fuzzy nature of reconstructions in historical linguistics, most scholars do not represent their uncertainty when proposing proto-forms. With the increasing success of recently proposed approaches to automating certain aspects of the traditional comparative method, the formal representation of proto-forms has also improved. This formalization makes it possible to address both the representation and the computation of uncertainty. Building on recent advances in supervised phonological reconstruction, during which an algorithm learns how to reconstruct words in a given proto-language relying on previously annotated data, and inspired by improved methods for automated word prediction from cognate sets, we present a new framework that allows for the representation of uncertainty in linguistic reconstruction and also includes a workflow for the computation of fuzzy reconstructions from linguistic data.

Submitted 19 October, 2023; originally announced October 2023.

Comments: To appear in: Proceedings of the 4th Workshop on Computational Approaches to Historical Language Change

arXiv:2208.02362 [pdf, other]

Bayesian regularization of empirical MDPs

Authors: Samarth Gupta, Daniel N. Hill, Lexing Ying, Inderjit Dhillon

Abstract: In most applications of model-based Markov decision processes, the parameters for the unknown underlying model are often estimated from the empirical data. Due to noise, the policy learned from the estimated model is often far from the optimal policy of the underlying model. When applied to the environment of the underlying model, the learned policy results in suboptimal performance, thus calling for solutions with better generalization performance. In this work we take a Bayesian perspective and regularize the objective function of the Markov decision process with prior information in order to obtain more robust policies. Two approaches are proposed, one based on $L^1$ regularization and the other on relative entropic regularization. We evaluate our proposed algorithms on synthetic simulations and on real-world search logs of a large-scale online shopping store. Our results demonstrate the robustness of regularized MDP policies against the noise present in the models.

Submitted 20 September, 2022; v1 submitted 3 August, 2022; originally announced August 2022.

arXiv:2207.14624 [pdf, other]

Post-processing of coronary and myocardial spatial data

Authors: Jay Aodh Mackenzie, Megan Jeanne Miller, Nicholas Hill, Mette Olufsen

Abstract: Numerical simulations of real-world phenomena are implemented with at least two parts: the computational scheme and the computational domain. In the context of hemodynamics, the computational domain of a simulation represents the blood vessel network through which blood flows. Such blood vessel networks can contain millions of individual vessels that are joined together in series and parallel to form the network. It is computationally infeasible to explicitly simulate blood flow in all blood vessels. Here, from imaged data of a single porcine left coronary arterial tree, we develop a data pipeline to obtain computational domains for hemodynamic simulations from a graph representing the coronary vascular tree. Further, we develop a method to ascertain which subregions of the left ventricle are most likely to be perfused via a given artery, using a comparison with the American Heart Association division of the left ventricle as a sense check.

Submitted 15 April, 2024; v1 submitted 29 July, 2022; originally announced July 2022.

Comments: 21 pages, 22 figures

arXiv:2204.10936 [pdf, other]

doi 10.1145/3477495.3531958

Counterfactual Learning To Rank for Utility-Maximizing Query Autocompletion

Authors: Adam Block, Rahul Kidambi, Daniel N. Hill, Thorsten Joachims, Inderjit S. Dhillon

Abstract: Conventional methods for query autocompletion aim to predict which completed query a user will select from a list. A shortcoming of this approach is that users often do not know which query will provide the best retrieval performance on the current information retrieval system, meaning that any query autocompletion method trained to mimic user behavior can lead to suboptimal query suggestions. To overcome this limitation, we propose a new approach that explicitly optimizes the query suggestions for downstream retrieval performance. We formulate this as a problem of ranking a set of rankings, where each query suggestion is represented by the downstream item ranking it produces. We then present a learning method that ranks query suggestions by the quality of their item rankings. The algorithm is based on a counterfactual learning approach that is able to leverage feedback on the items (e.g., clicks, purchases) to evaluate query suggestions through an unbiased estimator, thus avoiding the assumption that users write or select optimal queries. We establish theoretical support for the proposed approach and provide learning-theoretic guarantees. We also present empirical results on publicly available datasets, and demonstrate real-world applicability using data from an online shopping store.

Submitted 22 April, 2022; originally announced April 2022.

arXiv:2204.04619 [pdf, other]

A New Framework for Fast Automated Phonological Reconstruction Using Trimmed Alignments and Sound Correspondence Patterns

Authors: Johann-Mattis List, Robert Forkel, Nathan W. Hill

Abstract: Computational approaches in historical linguistics have been increasingly applied during the past decade and many new methods that implement parts of the traditional comparative method have been proposed. Despite these increased efforts, there are not many easy-to-use and fast approaches for the task of phonological reconstruction. Here we present a new framework that combines state-of-the-art techniques for automated sequence comparison with novel techniques for phonetic alignment analysis and sound correspondence pattern detection to allow for the supervised reconstruction of word forms in ancestral languages. We test the method on a new dataset covering six groups from three different language families. The results show that our method yields promising results while at the same time being not only fast but also easy to apply and expand.

Submitted 10 April, 2022; originally announced April 2022.

Comments: To appear at the 3rd Workshop on Computational Approaches to Historical Language Change, co-located with the ACL 2022 conference. https://www.aclweb.org/portal/content/3rd-workshop-computational-approaches-historical-language-change

arXiv:2101.03027 [pdf, other]

User-friendly automatic transcription of low-resource languages: Plugging ESPnet into Elpis

Authors: Oliver Adams, Benjamin Galliot, Guillaume Wisniewski, Nicholas Lambourne, Ben Foley, Rahasya Sanders-Dwyer, Janet Wiles, Alexis Michaud, Séverine Guillaume, Laurent Besacier, Christopher Cox, Katya Aplonova, Guillaume Jacques, Nathan Hill

Abstract: This paper reports on progress integrating the speech recognition toolkit ESPnet into Elpis, a web front-end originally designed to provide access to the Kaldi automatic speech recognition toolkit. The goal of this work is to make end-to-end speech recognition models available to language workers via a user-friendly graphical interface. Encouraging results are reported on (i) development of an ESPnet recipe for use in Elpis, with preliminary results on data sets previously used for training acoustic models with the Persephone toolkit along with a new data set that had not previously been used in speech recognition, and (ii) incorporating ESPnet into Elpis along with UI enhancements and a CUDA-supported Dockerfile.

Submitted 22 February, 2021; v1 submitted 15 December, 2020; originally announced January 2021.

arXiv:2012.07654 [pdf, other]

doi 10.1145/3447548.3467087

Session-Aware Query Auto-completion using Extreme Multi-label Ranking

Authors: Nishant Yadav, Rajat Sen, Daniel N. Hill, Arya Mazumdar, Inderjit S. Dhillon

Abstract: Query auto-completion (QAC) is a fundamental feature in search engines where the task is to suggest plausible completions of a prefix typed in the search bar. Previous queries in the user session can provide useful context for the user's intent and can be leveraged to suggest auto-completions that are more relevant while adhering to the user's prefix. Such session-aware QACs can be generated by recent sequence-to-sequence deep learning models; however, these generative approaches often do not meet the stringent latency requirements of responding to each user keystroke. Moreover, these generative approaches pose the risk of showing nonsensical queries. In this paper, we provide a solution to this problem: we take the novel approach of modeling session-aware QAC as an eXtreme Multi-Label Ranking (XMR) problem where the input is the previous query in the session and the user's current prefix, while the output space is the set of tens of millions of queries entered by users in the recent past. We adapt a popular XMR algorithm for this purpose by proposing several modifications to the key steps in the algorithm. The proposed modifications yield a 3.9x improvement in terms of Mean Reciprocal Rank (MRR) over the baseline XMR approach on a public search logs dataset. We are able to maintain an inference latency of less than 10 ms while still using session context. When compared against baseline models of acceptable latency, we observed a 33% improvement in MRR for short prefixes of up to 3 characters. Moreover, our model yielded a statistically significant improvement of 2.81% over a production QAC system in terms of suggestion acceptance rate, when deployed on the search bar of an online shopping store as part of an A/B test.

Submitted 21 August, 2021; v1 submitted 9 December, 2020; originally announced December 2020.

Comments: Accepted in KDD 2021. Updated results for baseline XMR

arXiv:1908.11322 [pdf, other]

doi 10.1145/3357384.3357980

A Zero Attention Model for Personalized Product Search

Authors: Qingyao Ai, Daniel N. Hill, S. V. N. Vishwanathan, W. Bruce Croft

Abstract: Product search is one of the most popular methods for people to discover and purchase products on e-commerce websites. Because personal preferences often have an important influence on the purchase decision of each customer, it is intuitive that personalization should be beneficial for product search engines. While synthetic experiments from previous studies show that purchase histories are useful for identifying the individual intent of each product search session, the effect of personalization on product search in practice, however, remains mostly unknown. In this paper, we formulate the problem of personalized product search and conduct large-scale experiments with search logs sampled from a commercial e-commerce search engine. Results from our preliminary analysis show that the potential of personalization depends on query characteristics, interactions between queries, and user purchase histories. Based on these observations, we propose a Zero Attention Model for product search that automatically determines when and how to personalize a user-query pair via a novel attention mechanism. Empirical results on commercial product search logs show that the proposed model not only significantly outperforms state-of-the-art personalized product retrieval models, but also provides important information on the potential of personalization in each product search session.

Submitted 29 August, 2019; originally announced August 2019.

arXiv:1810.09558 [pdf, other]

doi 10.1145/3097983.3098184

An Efficient Bandit Algorithm for Realtime Multivariate Optimization

Authors: Daniel N Hill, Houssam Nassif, Yi Liu, Anand Iyer, S V N Vishwanathan

Abstract: Optimization is commonly employed to determine the content of web pages, such as to maximize conversions on landing pages or click-through rates on search engine result pages. Often the layout of these pages can be decoupled into several separate decisions. For example, the composition of a landing page may involve deciding which image to show, which wording to use, what color background to display, etc. Such optimization is a combinatorial problem over an exponentially large decision space. Randomized experiments do not scale well to this setting, and therefore, in practice, one is typically limited to optimizing a single aspect of a web page at a time. This represents a missed opportunity in both the speed of experimentation and the exploitation of possible interactions between layout decisions. Here we focus on multivariate optimization of interactive web pages. We formulate an approach where the possible interactions between different components of the page are modeled explicitly. We apply bandit methodology to explore the layout space efficiently and use hill-climbing to select optimal content in realtime. Our algorithm also extends to contextualization and personalization of layout selection. Simulation results show the suitability of our approach to large decision spaces with strong interactions between content. We further apply our algorithm to optimize a message that promotes adoption of an Amazon service. After only a single week of online optimization, we saw a 21% conversion increase compared to the median layout. Our technique is currently being deployed to optimize content across several locations at Amazon.com.

Submitted 22 October, 2018; originally announced October 2018.

Comments: KDD'17 Audience Appreciation Award

Journal ref: Daniel N. Hill, Houssam Nassif, Yi Liu, Anand Iyer, and S. V. N. Vishwanathan. 2017. An Efficient Bandit Algorithm for Realtime Multivariate Optimization. In Proceedings of KDD'17, Halifax, NS, Canada, pp. 1813-1821, 2017

arXiv:1007.0050 [pdf, other]

Cloud Scheduler: a resource manager for distributed compute clouds

Authors: P. Armstrong, A. Agarwal, A. Bishop, A. Charbonneau, R. Desmarais, K. Fransham, N. Hill, I. Gable, S. Gaudet, S. Goliath, R. Impey, C. Leavett-Brown, J. Ouellete, M. Paterson, C. Pritchet, D. Penfold-Brown, W. Podaima, D. Schade, R. J. Sobie

Abstract: The availability of Infrastructure-as-a-Service (IaaS) computing clouds gives researchers access to a large set of new resources for running complex scientific applications. However, exploiting cloud resources for large numbers of jobs requires significant effort and expertise. In order to make it simple and transparent for researchers to deploy their applications, we have developed a virtual machine resource manager (Cloud Scheduler) for distributed compute clouds. Cloud Scheduler boots and manages the user-customized virtual machines in response to a user's job submission. We describe the motivation and design of the Cloud Scheduler and present results on its use on both science and commercial clouds.

Submitted 30 June, 2010; originally announced July 2010.

Comments: 10 pages, 1 figure

Showing 1–10 of 10 results for author: Hill, N