-
A Framework for Automated Measurement of Responsible AI Harms in Generative AI Applications
Authors:
Ahmed Magooda,
Alec Helyar,
Kyle Jackson,
David Sullivan,
Chad Atalla,
Emily Sheng,
Dan Vann,
Richard Edgar,
Hamid Palangi,
Roman Lutz,
Hongliang Kong,
Vincent Yun,
Eslam Kamal,
Federico Zarfati,
Hanna Wallach,
Sarah Bird,
Mei Chen
Abstract:
We present a framework for the automated measurement of responsible AI (RAI) metrics for large language models (LLMs) and associated products and services. Our framework for automatically measuring harms from LLMs builds on existing technical and sociotechnical expertise and leverages the capabilities of state-of-the-art LLMs, such as GPT-4. We use this framework to run through several case studies investigating how different LLMs may violate a range of RAI-related principles. The framework may be employed alongside domain-specific sociotechnical expertise to create measurements for new harm areas in the future. By implementing this framework, we aim to enable more advanced harm measurement efforts and further the responsible use of LLMs.
Submitted 26 October, 2023;
originally announced October 2023.
-
The CAMELS project: Expanding the galaxy formation model space with new ASTRID and 28-parameter TNG and SIMBA suites
Authors:
Yueying Ni,
Shy Genel,
Daniel Anglés-Alcázar,
Francisco Villaescusa-Navarro,
Yongseok Jo,
Simeon Bird,
Tiziana Di Matteo,
Rupert Croft,
Nianyi Chen,
Natalí S. M. de Santi,
Matthew Gebhardt,
Helen Shao,
Shivam Pandey,
Lars Hernquist,
Romeel Dave
Abstract:
We present CAMELS-ASTRID, the third suite of hydrodynamical simulations in the Cosmology and Astrophysics with MachinE Learning (CAMELS) project, along with new simulation sets that extend the model parameter space based on the previous frameworks of CAMELS-TNG and CAMELS-SIMBA, to provide broader training sets and testing grounds for machine-learning algorithms designed for cosmological studies. CAMELS-ASTRID employs the galaxy formation model following the ASTRID simulation and contains 2,124 hydrodynamic simulation runs that vary 3 cosmological parameters ($Ω_m$, $σ_8$, $Ω_b$) and 4 parameters controlling stellar and AGN feedback. Compared to the existing TNG and SIMBA simulation suites in CAMELS, the fiducial model of ASTRID features the mildest AGN feedback and predicts the least baryonic effect on the matter power spectrum. The training set of ASTRID covers a broader variation in the galaxy populations and the baryonic impact on the matter power spectrum compared to its TNG and SIMBA counterparts, which can make machine-learning models trained on the ASTRID suite exhibit better extrapolation performance when tested on other hydrodynamic simulation sets. We also introduce extension simulation sets in CAMELS that widely explore 28 parameters in the TNG and SIMBA models, demonstrating the enormity of the overall galaxy formation model parameter space and the complex non-linear interplay between cosmology and astrophysical processes. With the new simulation suites, we show that building robust machine-learning models favors training and testing on the largest possible diversity of galaxy formation models. We also demonstrate that it is possible to train accurate neural networks to infer cosmological parameters using the high-dimensional TNG-SB28 simulation set.
Submitted 4 April, 2023;
originally announced April 2023.
-
Socio-Technological Challenges and Opportunities: Paths Forward
Authors:
Carole-Jean Wu,
Srilatha Manne,
Parthasarathy Ranganathan,
Sarah Bird,
Shane Greenstein
Abstract:
Advancements in digital technologies have a bootstrapping effect. The past fifty years of technological innovations from the computer architecture community have brought innovations and orders-of-magnitude efficiency improvements that engender use cases that were not previously possible -- stimulating novel application domains and increasing uses and deployments at an ever-faster pace. Consequently, computing technologies have fueled significant economic growth, creating education opportunities, enabling access to a wider and more diverse spectrum of information, and, at the same time, connecting people of differing needs in the world together. Technology must be offered that is inclusive of the world's physical, cultural, and economic diversity, and which is manufactured, used, and recycled with environmental sustainability at the forefront. For the next decades to come, we envision significant cross-disciplinary efforts to build a circular development cycle by placing pervasive connectivity, sustainability, and demographic inclusion at the design forefront in order to sustain and expand the benefits of a technologically rich society. We hope this work will inspire our computing community to take broader and more holistic approaches when developing technological solutions to serve people from different parts of the world.
Submitted 15 August, 2021;
originally announced August 2021.
-
Spoken Term Detection Methods for Sparse Transcription in Very Low-resource Settings
Authors:
Éric Le Ferrand,
Steven Bird,
Laurent Besacier
Abstract:
We investigate the efficiency of two very different spoken term detection approaches for transcription when the available data is insufficient to train a robust ASR system. This work is grounded in a very low-resource language documentation scenario where only a few minutes of recording have been transcribed for a given language so far. Experiments on two oral languages show that a pretrained universal phone recognizer, fine-tuned with only a few minutes of target language speech, can be used for spoken term detection with a better overall performance than a dynamic time warping approach. In addition, we show that representing phoneme recognition ambiguity in a graph structure can further boost the recall while maintaining high precision in the low resource spoken term detection task.
Submitted 11 June, 2021;
originally announced June 2021.
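Illustrative note on the dynamic time warping baseline named in the abstract above: the sketch below shows the classic DTW recurrence on toy 1-D sequences. It is not the authors' implementation; the function name `dtw_distance` and the absolute-difference local cost are assumptions made for illustration, and a real spoken term detection system would compare frames of acoustic features rather than scalars.

```python
from math import inf


def dtw_distance(a, b):
    """Cumulative cost of the best monotonic alignment between sequences a and b.

    Hypothetical helper for illustration only; uses |x - y| as the local cost.
    """
    n, m = len(a), len(b)
    # cost[i][j] holds the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or a diagonal match.
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


if __name__ == "__main__":
    # A time-stretched copy of the same pattern aligns at zero cost.
    print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

Because the alignment may repeat elements, a query spoken more slowly than the reference can still match closely, which is the property that makes DTW a common template-matching baseline.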
-
AI-assisted super-resolution cosmological simulations II: Halo substructures, velocities and higher order statistics
Authors:
Yueying Ni,
Yin Li,
Patrick Lachance,
Rupert A. C. Croft,
Tiziana Di Matteo,
Simeon Bird,
Yu Feng
Abstract:
In this work, we expand and test the capabilities of our recently developed super-resolution (SR) model to generate high-resolution (HR) realizations of the full phase-space matter distribution, including both displacement and velocity, from computationally cheap low-resolution (LR) cosmological N-body simulations. The SR model enhances the simulation resolution by generating 512 times more tracer particles, extending into the deeply non-linear regime where complex structure formation processes take place. We validate the SR model by deploying the model in 10 test simulations of box size 100 Mpc/h, and examine the matter power spectra, bispectra and 2D power spectra in redshift space. We find the generated SR field matches the true HR result at percent level down to scales of k ~ 10 h/Mpc. We also identify and inspect dark matter halos and their substructures. Our SR model generates visually authentic small-scale structures that cannot be resolved by the LR input and are in good statistical agreement with the real HR results. The SR model performs satisfactorily on the halo occupation distribution, halo correlations in both real and redshift space, and the pairwise velocity distribution, matching the HR results with comparable scatter, thus demonstrating its potential in making mock halo catalogs. The SR technique can be a powerful and promising tool for modelling small-scale galaxy formation physics in large cosmological volumes.
Submitted 17 September, 2021; v1 submitted 3 May, 2021;
originally announced May 2021.
-
Enabling Interactive Transcription in an Indigenous Community
Authors:
Éric Le Ferrand,
Steven Bird,
Laurent Besacier
Abstract:
We propose a novel transcription workflow which combines spoken term detection and human-in-the-loop, together with a pilot experiment. This work is grounded in an almost zero-resource scenario where only a few terms have so far been identified, involving two endangered languages. We show that in the early stages of transcription, when the available data is insufficient to train a robust ASR system, it is possible to take advantage of the transcription of a small number of isolated words in order to bootstrap the transcription of a speech collection.
Submitted 11 November, 2020;
originally announced November 2020.
-
AI-assisted super-resolution cosmological simulations
Authors:
Yin Li,
Yueying Ni,
Rupert A. C. Croft,
Tiziana Di Matteo,
Simeon Bird,
Yu Feng
Abstract:
Cosmological simulations of galaxy formation are limited by finite computational resources. We draw from the ongoing rapid advances in Artificial Intelligence (specifically Deep Learning) to address this problem. Neural networks have been developed to learn from high-resolution (HR) image data, and then make accurate super-resolution (SR) versions of different low-resolution (LR) images. We apply such techniques to LR cosmological N-body simulations, generating SR versions. Specifically, we are able to enhance the simulation resolution by generating 512 times more particles and predicting their displacements from the initial positions. Therefore our results can be viewed as new simulation realizations themselves rather than projections, e.g., to their density fields. Furthermore, the generation process is stochastic, enabling us to sample the small-scale modes conditioning on the large-scale environment. Our model learns from only 16 pairs of small-volume LR-HR simulations, and is then able to generate SR simulations that successfully reproduce the HR matter power spectrum to percent level up to $16\,h^{-1}\mathrm{Mpc}$, and the HR halo mass function to within $10 \%$ down to $10^{11} \, M_\odot$. We successfully deploy the model in a box 1000 times larger than the training simulation box, showing that high-resolution mock surveys can be generated rapidly. We conclude that AI assistance has the potential to revolutionize modeling of small-scale galaxy formation physics in large cosmological volumes.
Submitted 4 May, 2021; v1 submitted 13 October, 2020;
originally announced October 2020.
-
Tracking Measurement Obfuscations from SourceURL
Authors:
Sarah Bird
Abstract:
Tracking scripts can use the sourceURL directive to mask their origin from developer tools and tools that use the same JS call stack and network stack information. Firefox and Chromium appear to be affected. Firefox 78 now includes a preference to disable this behavior. This short paper describes the effect when using the OpenWPM measurement platform along with details of discovery.
Submitted 20 May, 2020;
originally announced May 2020.
-
Bootstrapping Techniques for Polysynthetic Morphological Analysis
Authors:
William Lane,
Steven Bird
Abstract:
Polysynthetic languages have exceptionally large and sparse vocabularies, thanks to the number of morpheme slots and combinations in a word. This complexity, together with a general scarcity of written data, poses a challenge to the development of natural language technologies. To address this challenge, we offer linguistically-informed approaches for bootstrapping a neural morphological analyzer, and demonstrate its application to Kunwinjku, a polysynthetic Australian language. We generate data from a finite state transducer to train an encoder-decoder model. We improve the model by "hallucinating" missing linguistic structure into the training data, and by resampling from a Zipf distribution to simulate a more natural distribution of morphemes. The best model accounts for all instances of reduplication in the test set and achieves an accuracy of 94.7% overall, a 10 percentage point improvement over the FST baseline. This process demonstrates the feasibility of bootstrapping a neural morph analyzer from minimal resources.
Submitted 2 May, 2020;
originally announced May 2020.
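Illustrative note on the Zipf resampling step named in the abstract above: one plausible reading is that generated training items are redrawn with probability proportional to 1/rank, so that a few items dominate as in natural text. The sketch below shows that idea only. The helper names `zipf_weights` and `resample_zipf`, the exponent `s`, and the toy item list are assumptions, not details taken from the paper.

```python
import random


def zipf_weights(n, s=1.0):
    """Normalized Zipf weights for ranks 1..n (weight proportional to 1 / rank**s)."""
    raw = [1.0 / (rank ** s) for rank in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def resample_zipf(items, k, seed=0):
    """Draw k items with replacement; items earlier in the list are treated as higher rank."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    return rng.choices(items, weights=zipf_weights(len(items)), k=k)


if __name__ == "__main__":
    # Hypothetical placeholder items; a real pipeline would rank morphemes or stems.
    sample = resample_zipf(["item_a", "item_b", "item_c", "item_d"], k=20)
    print(sample)
```

How the ranking is assigned (for example, by an external frequency estimate) is left open here, since the abstract does not specify it.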
-
Actions speak louder than words: Semi-supervised learning for browser fingerprinting detection
Authors:
Sarah Bird,
Vikas Mishra,
Steven Englehardt,
Rob Willoughby,
David Zeber,
Walter Rudametkin,
Martin Lopatka
Abstract:
As online tracking continues to grow, existing anti-tracking and fingerprinting detection techniques that require significant manual input must be augmented. Heuristic approaches to fingerprinting detection are precise but must be carefully curated. Supervised machine learning techniques proposed for detecting tracking require manually generated label-sets. Seeking to overcome these challenges, we present a semi-supervised machine learning approach for detecting fingerprinting scripts. Our approach is based on the core insight that fingerprinting scripts have similar patterns of API access when generating their fingerprints, even though their access patterns may not match exactly. Using this insight, we group scripts by their JavaScript (JS) execution traces and apply a semi-supervised approach to detect new fingerprinting scripts. We detail our methodology and demonstrate its ability to identify the majority of scripts ($\geqslant$94.9%) identified by existing heuristic techniques. We also show that the approach expands beyond detecting known scripts by surfacing candidate scripts that are likely to include fingerprinting. Through an analysis of these candidate scripts we discovered fingerprinting scripts that were missed by heuristics and for which there are no heuristics. In particular, we identified over one hundred device-class fingerprinting scripts present on hundreds of domains. To the best of our knowledge, this is the first time device-class fingerprinting has been measured in the wild. These successes illustrate the power of a sparse vector representation and semi-supervised learning to complement and extend existing tracking detection techniques.
Submitted 9 March, 2020;
originally announced March 2020.
-
MLSys: The New Frontier of Machine Learning Systems
Authors:
Alexander Ratner,
Dan Alistarh,
Gustavo Alonso,
David G. Andersen,
Peter Bailis,
Sarah Bird,
Nicholas Carlini,
Bryan Catanzaro,
Jennifer Chayes,
Eric Chung,
Bill Dally,
Jeff Dean,
Inderjit S. Dhillon,
Alexandros Dimakis,
Pradeep Dubey,
Charles Elkan,
Grigori Fursin,
Gregory R. Ganger,
Lise Getoor,
Phillip B. Gibbons,
Garth A. Gibson,
Joseph E. Gonzalez,
Justin Gottschlich,
Song Han,
Kim Hazelwood, et al. (44 additional authors not shown)
Abstract:
Machine learning (ML) techniques are enjoying rapidly increasing adoption. However, designing and implementing the systems that support ML models in real-world deployments remains a significant obstacle, in large part due to the radically different development and deployment profile of modern ML methods, and the range of practical concerns that come with broader adoption. We propose to foster a new systems machine learning research community at the intersection of the traditional systems and ML communities, focused on topics such as hardware systems for ML, software systems for ML, and ML optimized for metrics beyond predictive accuracy. To do this, we describe a new conference, MLSys, that explicitly targets research at the intersection of systems and machine learning with a program committee split evenly between experts in systems and ML, and an explicit focus on topics at the intersection of the two.
Submitted 1 December, 2019; v1 submitted 29 March, 2019;
originally announced April 2019.
-
Abstractions for AI-Based User Interfaces and Systems
Authors:
Alex Renda,
Harrison Goldstein,
Sarah Bird,
Chris Quirk,
Adrian Sampson
Abstract:
Novel user interfaces based on artificial intelligence, such as natural-language agents, present new categories of engineering challenges. These systems need to cope with uncertainty and ambiguity, interface with machine learning algorithms, and compose information from multiple users to make decisions. We propose to treat these challenges as language-design problems. We describe three programming language abstractions for three core problems in intelligent system design. First, hypothetical worlds support nondeterministic search over spaces of alternative actions. Second, a feature type system abstracts the interaction between applications and learning algorithms. Finally, constructs for collaborative execution extend hypothetical worlds across multiple machines while controlling access to private data. We envision these features as first steps toward a complete language for implementing AI-based interfaces and applications.
Submitted 14 September, 2017;
originally announced September 2017.
-
Learning Crosslingual Word Embeddings without Bilingual Corpora
Authors:
Long Duong,
Hiroshi Kanayama,
Tengfei Ma,
Steven Bird,
Trevor Cohn
Abstract:
Crosslingual word embeddings represent lexical items from different languages in the same vector space, enabling transfer of NLP tools. However, previous attempts had expensive resource requirements, difficulty incorporating monolingual data, or were unable to handle polysemy. We address these drawbacks in our method, which takes advantage of a high coverage dictionary in an EM style training algorithm over monolingual corpora in two languages. Our model achieves state-of-the-art performance on the bilingual lexicon induction task, exceeding models using large bilingual corpora, and competitive results on the monolingual word similarity and cross-lingual document classification tasks.
Submitted 30 June, 2016;
originally announced June 2016.
-
Making Contextual Decisions with Low Technical Debt
Authors:
Alekh Agarwal,
Sarah Bird,
Markus Cozowicz,
Luong Hoang,
John Langford,
Stephen Lee,
Jiaji Li,
Dan Melamed,
Gal Oshri,
Oswaldo Ribas,
Siddhartha Sen,
Alex Slivkins
Abstract:
Applications and systems are constantly faced with decisions that require picking from a set of actions based on contextual information. Reinforcement-based learning algorithms such as contextual bandits can be very effective in these settings, but applying them in practice is fraught with technical debt, and no general system exists that supports them completely. We address this and create the first general system for contextual learning, called the Decision Service.
Existing systems often suffer from technical debt that arises from issues like incorrect data collection and weak debuggability, issues we systematically address through our ML methodology and system abstractions. The Decision Service enables all aspects of contextual bandit learning using four system abstractions which connect together in a loop: explore (the decision space), log, learn, and deploy. Notably, our new explore and log abstractions ensure the system produces correct, unbiased data, which our learner uses for online learning and to enable real-time safeguards, all in a fully reproducible manner.
The Decision Service has a simple user interface and works with a variety of applications: we present two live production deployments for content recommendation that achieved click-through improvements of 25-30%, another with 18% revenue lift in the landing page, and ongoing applications in tech support and machine failure handling. The service makes real-time decisions and learns continuously and scalably, while significantly lowering technical debt.
Submitted 9 May, 2017; v1 submitted 13 June, 2016;
originally announced June 2016.
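Illustrative note on the explore and log abstractions named in the abstract above: logging the probability with which each action was chosen is what allows unbiased offline evaluation of a different policy. The sketch below uses the standard inverse propensity scoring (IPS) estimator to show why. The abstract does not state which estimator the Decision Service uses, so the function name `ips_estimate`, the tuple layout, and the toy data are all assumptions.

```python
def ips_estimate(logged, policy):
    """Inverse propensity scoring estimate of a policy's average reward.

    logged: list of (context, action, probability, reward) tuples, where
            probability is the chance the logging policy had of picking action.
    policy: function mapping a context to the action it would choose.
    """
    total = 0.0
    for context, action, probability, reward in logged:
        # Only events where the candidate policy agrees with the logged action
        # contribute; dividing by the logged probability corrects for how
        # often that action was explored.
        if policy(context) == action:
            total += reward / probability
    return total / len(logged)


if __name__ == "__main__":
    # Toy log: two contexts, two actions, uniform exploration (probability 0.5).
    log = [
        ("a", 0, 0.5, 1.0),
        ("a", 1, 0.5, 0.0),
        ("b", 1, 0.5, 1.0),
        ("b", 0, 0.5, 0.0),
    ]
    print(ips_estimate(log, lambda ctx: 0))
    print(ips_estimate(log, lambda ctx: 0 if ctx == "a" else 1))
```

If the logged probabilities are wrong or missing, this correction no longer holds, which is one way incorrect data collection turns into the technical debt the abstract describes.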
-
Extending Dublin Core Metadata to Support the Description and Discovery of Language Resources
Authors:
Steven Bird,
Gary Simons
Abstract:
As language data and associated technologies proliferate and as the language resources community expands, it is becoming increasingly difficult to locate and reuse existing resources. Are there any lexical resources for such-and-such a language? What tool works with transcripts in this particular format? What is a good format to use for linguistic data of this type? Questions like these dominate many mailing lists, since web search engines are an unreliable way to find language resources. This paper reports on a new digital infrastructure for discovering language resources being developed by the Open Language Archives Community (OLAC). At the core of OLAC is its metadata format, which is designed to facilitate description and discovery of all kinds of language resources, including data, tools, or advice. The paper describes OLAC metadata, its relationship to Dublin Core metadata, and its dissemination using the metadata harvesting protocol of the Open Archives Initiative.
Submitted 14 August, 2003;
originally announced August 2003.
-
A Grid Based Architecture for High-Performance NLP
Authors:
Baden Hughes,
Steven Bird
Abstract:
We describe the design and early implementation of an extensible, component-based software architecture for natural language engineering applications which interfaces with high performance distributed computing services. The architecture leverages existing linguistic resource description and discovery mechanisms based on metadata descriptions, combining these in a compatible fashion with other software definition abstractions. Within this architecture, application design is highly flexible, allowing disparate components to be combined to suit the overall application functionality, and formally described independently of processing concerns. An application specification language provides abstraction from the programming environment and allows ease of interface with high performance computational grids via a broker.
Submitted 4 August, 2003;
originally announced August 2003.
-
The Open Language Archives Community: An infrastructure for distributed archiving of language resources
Authors:
Gary Simons,
Steven Bird
Abstract:
New ways of documenting and describing language via electronic media coupled with new ways of distributing the results via the World-Wide Web offer a degree of access to language resources that is unparalleled in history. At the same time, the proliferation of approaches to using these new technologies is causing serious problems relating to resource discovery and resource creation. This article describes the infrastructure that the Open Language Archives Community (OLAC) has built in order to address these problems. Its technical and usage infrastructures address problems of resource discovery by constructing a single virtual library of distributed resources. Its governance infrastructure addresses problems of resource creation by providing a mechanism through which the language-resource community can express its consensus on recommended best practices.
Submitted 10 June, 2003;
originally announced June 2003.
-
Grid-Enabling Natural Language Engineering By Stealth
Authors:
Baden Hughes,
Steven Bird
Abstract:
We describe a proposal for an extensible, component-based software architecture for natural language engineering applications. Our model leverages existing linguistic resource description and discovery mechanisms based on extended Dublin Core metadata. In addition, the application design is flexible, allowing disparate components to be combined to suit the overall application functionality. An application specification language provides abstraction from the programming environment and allows ease of interface with computational grids via a broker.
Submitted 21 April, 2003;
originally announced April 2003.
-
Building an Open Language Archives Community on the OAI Foundation
Authors:
Gary Simons,
Steven Bird
Abstract:
The Open Language Archives Community (OLAC) is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources. The Dublin Core (DC) Element Set and the OAI Protocol have provided a solid foundation for the OLAC framework. However, we need more precision in community-specific aspects of resource description than is offered by DC. Furthermore, many of the institutions and individuals who might participate in OLAC do not have the technical resources to support the OAI protocol. This paper presents our solutions to these two problems. To address the first, we have developed an extensible application profile for language resource metadata. To address the second, we have implemented Vida (the virtual data provider) and Viser (the virtual service provider), which permit community members to provide data and services without having to implement the OAI protocol. These solutions are generic and could be adopted by other specialized subcommunities.
Submitted 14 February, 2003;
originally announced February 2003.
-
NLTK: The Natural Language Toolkit
Authors:
Edward Loper,
Steven Bird
Abstract:
NLTK, the Natural Language Toolkit, is a suite of open source program modules, tutorials and problem sets, providing ready-to-use computational linguistics courseware. NLTK covers symbolic and statistical natural language processing, and is interfaced to annotated corpora. Students augment and replace existing components, learn structured programming by example, and manipulate sophisticated models from the outset.
Submitted 17 May, 2002;
originally announced May 2002.
-
Querying Databases of Annotated Speech
Authors:
Steve Cassidy,
Steven Bird
Abstract:
Annotated speech corpora are databases consisting of signal data along with time-aligned symbolic `transcriptions'. Such databases are typically multidimensional, heterogeneous and dynamic. These properties present a number of tough challenges for representation and query. The temporal nature of the data adds an additional layer of complexity. This paper presents and harmonises two independent efforts to model annotated speech databases, one at Macquarie University and one at the University of Pennsylvania. Various query languages are described, along with illustrative applications to a variety of analytical problems. The research reported here forms a part of several ongoing projects to develop platform-independent open-source tools for creating, browsing, searching, querying and transforming linguistic databases, and to disseminate large linguistic databases over the internet.
Submitted 11 April, 2002;
originally announced April 2002.
-
Phonology
Authors:
Steven Bird
Abstract:
Phonology is the systematic study of the sounds used in language, their internal structure, and their composition into syllables, words and phrases. Computational phonology is the application of formal and computational techniques to the representation and processing of phonological information. This chapter will present the fundamentals of descriptive phonology along with a brief overview of computational phonology.
Submitted 11 April, 2002;
originally announced April 2002.
-
Computational Phonology
Authors:
Steven Bird
Abstract:
Phonology, as it is practiced, is deeply computational. Phonological analysis is data-intensive and the resulting models are nothing other than specialized data structures and algorithms. In the past, phonological computation - managing data and developing analyses - was done manually with pencil and paper. Increasingly, with the proliferation of affordable computers, IPA fonts and drawing software, phonologists are seeking to move their computation work online. Computational Phonology provides the theoretical and technological framework for this migration, building on methodologies and tools from computational linguistics. This piece consists of an apology for computational phonology, a history, and an overview of current research.
Submitted 10 April, 2002;
originally announced April 2002.
-
Annotation Graphs and Servers and Multi-Modal Resources: Infrastructure for Interdisciplinary Education, Research and Development
Authors:
Christopher Cieri,
Steven Bird
Abstract:
Annotation graphs and annotation servers offer infrastructure to support the analysis of human language resources in the form of time-series data such as text, audio and video. This paper outlines areas of common need among empirical linguists and computational linguists. After reviewing examples of data and tools used or under development for each of several areas, it proposes a common framework for future tool development, data annotation and resource sharing based upon annotation graphs and servers.
Submitted 10 April, 2002;
originally announced April 2002.
-
Seven Dimensions of Portability for Language Documentation and Description
Authors:
Steven Bird,
Gary Simons
Abstract:
The process of documenting and describing the world's languages is undergoing radical transformation with the rapid uptake of new digital technologies for capture, storage, annotation and dissemination. However, uncritical adoption of new tools and technologies is leading to resources that are difficult to reuse and which are less portable than the conventional printed resources they replace. We begin by reviewing current uses of software tools and digital technologies for language documentation and description. This sheds light on how digital language documentation and description are created and managed, leading to an analysis of seven portability problems under the following headings: content, format, discovery, access, citation, preservation and rights. After characterizing each problem we provide a series of value statements, and this provides the framework for a broad range of best practice recommendations.
Submitted 10 April, 2002;
originally announced April 2002.
-
An Integrated Framework for Treebanks and Multilayer Annotations
Authors:
Scott Cotton,
Steven Bird
Abstract:
Treebank formats and associated software tools are proliferating rapidly, with little consideration for interoperability. We survey a wide variety of treebank structures and operations, and show how they can be mapped onto the annotation graph model, leading to an integrated framework encompassing tree and non-tree annotations alike. This development opens up new possibilities for managing and exploiting multilayer annotations.
Submitted 3 April, 2002;
originally announced April 2002.
-
TableTrans, MultiTrans, InterTrans and TreeTrans: Diverse Tools Built on the Annotation Graph Toolkit
Authors:
Steven Bird,
Kazuaki Maeda,
Xiaoyi Ma,
Haejoong Lee,
Beth Randall,
Salim Zayat
Abstract:
Four diverse tools built on the Annotation Graph Toolkit are described. Each tool associates linguistic codes and structures with time-series data. All are based on the same software library and tool architecture. TableTrans is for observational coding, using a spreadsheet whose rows are aligned to a signal. MultiTrans is for transcribing multi-party communicative interactions recorded using multi-channel signals. InterTrans is for creating interlinear text aligned to audio. TreeTrans is for creating and manipulating syntactic trees. This work demonstrates that the development of diverse tools and re-use of software components is greatly facilitated by a common high-level application programming interface for representing the data and managing input/output, together with a common architecture for managing the interaction of multiple components.
Submitted 3 April, 2002;
originally announced April 2002.
-
Creating Annotation Tools with the Annotation Graph Toolkit
Authors:
Kazuaki Maeda,
Steven Bird,
Xiaoyi Ma,
Haejoong Lee
Abstract:
The Annotation Graph Toolkit is a collection of software supporting the development of annotation tools based on the annotation graph model. The toolkit includes application programming interfaces for manipulating annotation graph data and for importing data from other formats. There are interfaces for the scripting languages Tcl and Python, a database interface, specialized graphical user interfaces for a variety of annotation tasks, and several sample applications. This paper describes all the toolkit components for the benefit of would-be application developers.
Submitted 3 April, 2002;
originally announced April 2002.
-
Models and Tools for Collaborative Annotation
Authors:
Xiaoyi Ma,
Haejoong Lee,
Steven Bird,
Kazuaki Maeda
Abstract:
The Annotation Graph Toolkit (AGTK) is a collection of software which facilitates development of linguistic annotation tools. AGTK provides a database interface which allows applications to use a database server for persistent storage. This paper discusses various modes of collaborative annotation and how they can be supported with tools built using AGTK and its database interface. We describe the relational database schema and API, and describe a version of the TableTrans tool which supports collaborative annotation. The remainder of the paper discusses a high-level query language for annotation graphs, along with optimizations, in support of expressive and efficient access to the annotations held on a large central server. The paper demonstrates that it is straightforward to support a variety of different levels of collaborative annotation with existing AGTK-based tools, with a minimum of additional programming effort.
Submitted 3 April, 2002;
originally announced April 2002.
-
The Open Language Archives Community and Asian Language Resources
Authors:
Steven Bird,
Gary Simons,
Chu-Ren Huang
Abstract:
The Open Language Archives Community (OLAC) is a new project to build a worldwide system of federated language archives based on the Open Archives Initiative and the Dublin Core Metadata Initiative. This paper aims to disseminate the OLAC vision to the language resources community in Asia, and to show language technologists and linguists how they can document their tools and data in such a way that others can easily discover them. We describe OLAC and the OLAC Metadata Set, then discuss two key issues in the Asian context: language classification and multilingual resource classification.
Submitted 3 October, 2001;
originally announced October 2001.
-
The OLAC Metadata Set and Controlled Vocabularies
Authors:
Steven Bird,
Gary Simons
Abstract:
As language data and associated technologies proliferate and as the language resources community rapidly expands, it has become difficult to locate and reuse existing resources. Are there any lexical resources for such-and-such a language? What tool can work with transcripts in this particular format? What is a good format to use for linguistic data of this type? Questions like these dominate many mailing lists, since web search engines are an unreliable way to find language resources. This paper describes a new digital infrastructure for language resource discovery, based on the Open Archives Initiative, and called OLAC -- the Open Language Archives Community. The OLAC Metadata Set and the associated controlled vocabularies facilitate consistent description and focussed searching. We report progress on the metadata set and controlled vocabularies, describing current issues and soliciting input from the language resources community.
Submitted 21 May, 2001;
originally announced May 2001.
-
A Formal Framework for Linguistic Annotation (revised version)
Authors:
Steven Bird,
Mark Liberman
Abstract:
`Linguistic annotation' covers any descriptive or analytic notations applied to raw language data. The basic data may be in the form of time functions - audio, video and/or physiological recordings - or it may be textual. The added notations may include transcriptions of all sorts (from phonetic features to discourse structures), part-of-speech and sense tagging, syntactic analysis, `named entity' identification, co-reference annotation, and so on. While there are several ongoing efforts to provide formats and tools for such annotations and to publish annotated linguistic databases, the lack of widely accepted standards is becoming a critical problem. Proposed standards, to the extent they exist, have focused on file formats. This paper focuses instead on the logical structure of linguistic annotations. We survey a wide variety of existing annotation formats and demonstrate a common conceptual core, the annotation graph. This provides a formal framework for constructing, maintaining and searching linguistic annotations, while remaining consistent with many alternative data structures and file formats.
Submitted 26 October, 2000;
originally announced October 2000.
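As a rough illustration of the annotation graph idea described in the abstract above, the following Python sketch models nodes with optional time offsets and labeled arcs between them. The class, field and method names (`Node`, `Arc`, `AnnotationGraph`, `arcs_overlapping`) and the sample data are hypothetical, chosen only for this example; they are not taken from the paper or from any toolkit.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Node:
    """A graph node; the time offset is optional (hypothetical representation)."""
    ident: str
    time: Optional[float] = None  # seconds into the signal, if anchored


@dataclass(frozen=True)
class Arc:
    """A directed, labeled arc spanning two nodes."""
    start: Node
    end: Node
    layer: str   # e.g. "word" or "phone"
    label: str


@dataclass
class AnnotationGraph:
    arcs: list = field(default_factory=list)

    def add(self, start, end, layer, label):
        arc = Arc(start, end, layer, label)
        self.arcs.append(arc)
        return arc

    def arcs_overlapping(self, t0, t1, layer=None):
        """Return arcs whose anchored span intersects the interval [t0, t1)."""
        hits = []
        for arc in self.arcs:
            if arc.start.time is None or arc.end.time is None:
                continue  # unanchored arcs cannot be compared by time
            if layer is not None and arc.layer != layer:
                continue
            if arc.start.time < t1 and arc.end.time > t0:
                hits.append(arc)
        return hits


# Two layers over the same stretch of signal: one word arc, two phone arcs.
n0, n1, n2 = Node("n0", 0.00), Node("n1", 0.15), Node("n2", 0.32)
graph = AnnotationGraph()
graph.add(n0, n2, "word", "he")
graph.add(n0, n1, "phone", "h")
graph.add(n1, n2, "phone", "iy")

phones = [a.label for a in graph.arcs_overlapping(0.10, 0.20, layer="phone")]
print(phones)
```

Because both layers share the same nodes, hierarchy (a word spanning its phones) and overlap can be expressed without a dedicated tree structure, which is the property the abstract emphasises.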
-
Many uses, many annotations for large speech corpora: Switchboard and TDT as case studies
Authors:
David Graff,
Steven Bird
Abstract:
This paper discusses the challenges that arise when large speech corpora receive an ever-broadening range of diverse and distinct annotations. Two case studies of this process are presented: the Switchboard Corpus of telephone conversations and the TDT2 corpus of broadcast news. Switchboard has undergone two independent transcriptions and various types of additional annotation, all carried out as separate projects that were dispersed both geographically and chronologically. The TDT2 corpus has also received a variety of annotations, but all directly created or managed by a core group. In both cases, issues arise involving the propagation of repairs, consistency of references, and the ability to integrate annotations having different formats and levels of detail. We describe a general framework whereby these issues can be addressed successfully.
Submitted 13 July, 2000;
originally announced July 2000.
-
Towards a query language for annotation graphs
Authors:
Steven Bird,
Peter Buneman,
Wang-Chiew Tan
Abstract:
The multidimensional, heterogeneous, and temporal nature of speech databases raises interesting challenges for representation and query. Recently, annotation graphs have been proposed as a general-purpose representational framework for speech databases. Typical queries on annotation graphs require path expressions similar to those used in semistructured query languages. However, the underlying model is rather different from the customary graph models for semistructured data: the graph is acyclic and unrooted, and both temporal and inclusion relationships are important. We develop a query language and describe optimization techniques for an underlying relational representation.
Submitted 13 July, 2000;
originally announced July 2000.
-
ATLAS: A flexible and extensible architecture for linguistic annotation
Authors:
Steven Bird,
David Day,
John Garofolo,
John Henderson,
Christophe Laprun,
Mark Liberman
Abstract:
We describe a formal model for annotating linguistic artifacts, from which we derive an application programming interface (API) to a suite of tools for manipulating these annotations. The abstract logical model provides for a range of storage formats and promotes the reuse of tools that interact through this API. We focus first on ``Annotation Graphs,'' a graph model for annotations on linear signals (such as text and speech) indexed by intervals, for which efficient database storage and querying techniques are applicable. We note how a wide range of existing annotated corpora can be mapped to this annotation graph model. This model is then generalized to encompass a wider variety of linguistic ``signals,'' including both naturally occurring phenomena (as recorded in images, video, multi-modal interactions, etc.), as well as the derived resources that are increasingly important to the engineering of natural language processing systems (such as word lists, dictionaries, aligned bilingual corpora, etc.). We conclude with a review of the current efforts towards implementing key pieces of this architecture.
Submitted 13 July, 2000;
originally announced July 2000.
-
Annotation graphs as a framework for multidimensional linguistic data analysis
Authors:
Steven Bird,
Mark Liberman
Abstract:
In recent work we have presented a formal framework for linguistic annotation based on labeled acyclic digraphs. These `annotation graphs' offer a simple yet powerful method for representing complex annotation structures incorporating hierarchy and overlap. Here, we motivate and illustrate our approach using discourse-level annotations of text and speech data drawn from the CALLHOME, COCONUT, MUC-7, DAMSL and TRAINS annotation schemes. With the help of domain specialists, we have constructed a hybrid multi-level annotation for a fragment of the Boston University Radio Speech Corpus which includes the following levels: segment, word, breath, ToBI, Tilt, Treebank, coreference and named entity. We show how annotation graphs can represent hybrid multi-level structures which derive from a diverse set of file formats. We also show how the approach facilitates substantive comparison of multiple annotations of a single signal based on different theoretical models. The discussion shows how annotation graphs open the door to wide-ranging integration of tools, formats and corpora.
Submitted 5 July, 1999;
originally announced July 1999.
-
A Formal Framework for Linguistic Annotation
Authors:
Steven Bird,
Mark Liberman
Abstract:
`Linguistic annotation' covers any descriptive or analytic notations applied to raw language data. The basic data may be in the form of time functions -- audio, video and/or physiological recordings -- or it may be textual. The added notations may include transcriptions of all sorts (from phonetic features to discourse structures), part-of-speech and sense tagging, syntactic analysis, `named entity' identification, co-reference annotation, and so on. While there are several ongoing efforts to provide formats and tools for such annotations and to publish annotated linguistic databases, the lack of widely accepted standards is becoming a critical problem. Proposed standards, to the extent they exist, have focussed on file formats. This paper focuses instead on the logical structure of linguistic annotations. We survey a wide variety of existing annotation formats and demonstrate a common conceptual core, the annotation graph. This provides a formal framework for constructing, maintaining and searching linguistic annotations, while remaining consistent with many alternative data structures and file formats.
Submitted 2 March, 1999;
originally announced March 1999.
-
A lexical database tool for quantitative phonological research
Authors:
Steven Bird
Abstract:
A lexical database tool tailored for phonological research is described. Database fields include transcriptions, glosses and hyperlinks to speech files. Database queries are expressed using HTML forms, and these permit regular expression search on any combination of fields. Regular expressions are passed directly to a Perl CGI program, enabling the full flexibility of Perl extended regular expressions. The regular expression notation is extended to better support phonological searches, such as search for minimal pairs. Search results are presented in the form of HTML or LaTeX tables, where each cell is either a number (representing frequency) or a designated subset of the fields. Tables have up to four dimensions, with an elegant system for specifying which fragments of which fields should be used for the row/column labels. The tool offers several advantages over traditional methods of analysis: (i) it supports a quantitative method of doing phonological research; (ii) it gives universal access to the same set of informants; (iii) it enables other researchers to hear the original speech data without having to rely on published transcriptions; (iv) it makes the full power of regular expression search available, and search results are full multimedia documents; and (v) it enables the early refutation of false hypotheses, shortening the analysis-hypothesis-test loop. A life-size application to an African tone language (Dschang) is used for exemplification throughout the paper. The database contains 2200 records, each with approximately 15 fields. Running on a PC laptop with a stand-alone web server, the `Dschang HyperLexicon' has already been used extensively in phonological fieldwork and analysis in Cameroon.
Submitted 22 July, 1997;
originally announced July 1997.
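The minimal-pair search mentioned in the abstract above can be sketched with ordinary regular expressions. The example below is in Python rather than the Perl CGI setting the abstract describes, treats each character as one segment (a simplifying assumption), and uses an invented toy lexicon; the function name `minimal_pairs` is hypothetical and does not reflect the tool's extended notation.

```python
import re


def minimal_pairs(words):
    """Find word pairs that differ in exactly one character position."""
    pairs = set()
    for word in words:
        for i in range(len(word)):
            # Wildcard the i-th character and keep the rest literal.
            pattern = re.compile(
                re.escape(word[:i]) + "." + re.escape(word[i + 1:])
            )
            for other in words:
                if other != word and pattern.fullmatch(other):
                    pairs.add(tuple(sorted((word, other))))
    return sorted(pairs)


lexicon = ["pat", "bat", "pit", "spit", "bad"]  # invented data
print(minimal_pairs(lexicon))
```

A real phonological lexicon would need segment-aware tokenisation (digraphs, tone marks), which this character-level sketch deliberately ignores.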
-
Automated tone transcription
Authors:
Steven Bird
Abstract:
In this paper I report on an investigation into the problem of assigning tones to pitch contours. The proposed model is intended to serve as a tool for phonologists working on instrumentally obtained pitch data from tone languages. Motivation and exemplification for the model are provided by data taken from my fieldwork on Bamileke Dschang (Cameroon). Following recent work by Liberman and others, I provide a parametrised F_0 prediction function P which generates F_0 values from a tone sequence, and I explore the asymptotic behaviour of downstep. Next, I observe that transcribing a sequence X of pitch (i.e. F_0) values amounts to finding a tone sequence T such that P(T) ≈ X. This is a combinatorial optimisation problem, for which two non-deterministic search techniques are provided: a genetic algorithm and a simulated annealing algorithm. Finally, two implementations, one for each technique, are described and then compared using both artificial and real data for sequences of up to 20 tones. These programs can be adapted to other tone languages by adjusting the F_0 prediction function.
Submitted 25 October, 1994; v1 submitted 24 October, 1994;
originally announced October 1994.
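The search problem in the abstract above, finding a tone sequence T with P(T) ≈ X, can be illustrated with a small simulated annealing sketch in Python. The prediction function here (`predict_f0`, with its `high`, `low` and `decay` parameters) is an invented toy model with a decaying pitch register; it is not the parametrised function P from the paper, and the annealing schedule is likewise an assumption made for illustration.

```python
import math
import random

TONES = "HL"


def predict_f0(tones, high=150.0, low=110.0, decay=0.95):
    """Toy F_0 prediction: each tone's target is scaled by a decaying register.

    This is an invented stand-in for the paper's function P.
    """
    values, scale = [], 1.0
    for tone in tones:
        values.append((high if tone == "H" else low) * scale)
        scale *= decay
    return values


def cost(tones, observed):
    """Sum of squared differences between predicted and observed pitch."""
    return sum((p - x) ** 2 for p, x in zip(predict_f0(tones), observed))


def anneal(observed, steps=2000, start_temp=500.0, seed=0):
    rng = random.Random(seed)
    current = [rng.choice(TONES) for _ in observed]
    current_cost = cost(current, observed)
    best, best_cost = list(current), current_cost
    for step in range(steps):
        # Linear cooling; the small constant avoids division by zero.
        temp = start_temp * (1.0 - step / steps) + 1e-9
        candidate = list(current)
        i = rng.randrange(len(candidate))
        candidate[i] = "L" if candidate[i] == "H" else "H"  # flip one tone
        cand_cost = cost(candidate, observed)
        delta = cand_cost - current_cost
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_cost = candidate, cand_cost
            if current_cost < best_cost:
                best, best_cost = list(current), current_cost
    return "".join(best), best_cost


observed = predict_f0("HLHHL")  # synthetic data with a known answer
tones, err = anneal(observed)
print(tones, err)
```

Swapping in a different `predict_f0` is the only change needed to try another pitch model, which mirrors the abstract's remark that the programs adapt to other tone languages by adjusting the prediction function.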