Showing 1–39 of 39 results for author: Bird, S

Searching in archive cs.
  1. arXiv:2310.17750  [pdf, other]

    cs.CL

    A Framework for Automated Measurement of Responsible AI Harms in Generative AI Applications

    Authors: Ahmed Magooda, Alec Helyar, Kyle Jackson, David Sullivan, Chad Atalla, Emily Sheng, Dan Vann, Richard Edgar, Hamid Palangi, Roman Lutz, Hongliang Kong, Vincent Yun, Eslam Kamal, Federico Zarfati, Hanna Wallach, Sarah Bird, Mei Chen

    Abstract: We present a framework for the automated measurement of responsible AI (RAI) metrics for large language models (LLMs) and associated products and services. Our framework for automatically measuring harms from LLMs builds on existing technical and sociotechnical expertise and leverages the capabilities of state-of-the-art LLMs, such as GPT-4. We use this framework to run through several case studie…

    Submitted 26 October, 2023; originally announced October 2023.

    Comments: This is a living document

  2. arXiv:2304.02096  [pdf, other]

    astro-ph.CO astro-ph.GA cs.LG

    The CAMELS project: Expanding the galaxy formation model space with new ASTRID and 28-parameter TNG and SIMBA suites

    Authors: Yueying Ni, Shy Genel, Daniel Anglés-Alcázar, Francisco Villaescusa-Navarro, Yongseok Jo, Simeon Bird, Tiziana Di Matteo, Rupert Croft, Nianyi Chen, Natalí S. M. de Santi, Matthew Gebhardt, Helen Shao, Shivam Pandey, Lars Hernquist, Romeel Dave

    Abstract: We present CAMELS-ASTRID, the third suite of hydrodynamical simulations in the Cosmology and Astrophysics with MachinE Learning (CAMELS) project, along with new simulation sets that extend the model parameter space based on the previous frameworks of CAMELS-TNG and CAMELS-SIMBA, to provide broader training sets and testing grounds for machine-learning algorithms designed for cosmological studies.…

    Submitted 4 April, 2023; originally announced April 2023.

  3. arXiv:2108.06738  [pdf, other]

    cs.CY

    Socio-Technological Challenges and Opportunities: Paths Forward

    Authors: Carole-Jean Wu, Srilatha Manne, Parthasarathy Ranganathan, Sarah Bird, Shane Greenstein

    Abstract: Advancements in digital technologies have a bootstrapping effect. The past fifty years of technological innovations from the computer architecture community have brought innovations and orders-of-magnitude efficiency improvements that engender use cases that were not previously possible -- stimulating novel application domains and increasing uses and deployments at an ever-faster pace. Consequentl…

    Submitted 15 August, 2021; originally announced August 2021.

    Comments: This article is intended to capture the ISCA panel and the following discussions on the Microprocessor 50: Societal Challenges from the lens of computer architects

  4. arXiv:2106.06160  [pdf, other]

    cs.CL cs.SD eess.AS

    Spoken Term Detection Methods for Sparse Transcription in Very Low-resource Settings

    Authors: Éric Le Ferrand, Steven Bird, Laurent Besacier

    Abstract: We investigate the efficiency of two very different spoken term detection approaches for transcription when the available data is insufficient to train a robust ASR system. This work is grounded in very low-resource language documentation scenario where only few minutes of recording have been transcribed for a given language so far. Experiments on two oral languages show that a pretrained universal…

    Submitted 11 June, 2021; originally announced June 2021.

  5. arXiv:2105.01016  [pdf, other]

    astro-ph.CO cs.LG

    AI-assisted super-resolution cosmological simulations II: Halo substructures, velocities and higher order statistics

    Authors: Yueying Ni, Yin Li, Patrick Lachance, Rupert A. C. Croft, Tiziana Di Matteo, Simeon Bird, Yu Feng

    Abstract: In this work, we expand and test the capabilities of our recently developed super-resolution (SR) model to generate high-resolution (HR) realizations of the full phase-space matter distribution, including both displacement and velocity, from computationally cheap low-resolution (LR) cosmological N-body simulations. The SR model enhances the simulation resolution by generating 512 times more tracer…

    Submitted 17 September, 2021; v1 submitted 3 May, 2021; originally announced May 2021.

    Comments: 13 pages, 11 figures, published version

  6. arXiv:2011.06198  [pdf, other]

    cs.CL

    Enabling Interactive Transcription in an Indigenous Community

    Authors: Éric Le Ferrand, Steven Bird, Laurent Besacier

    Abstract: We propose a novel transcription workflow which combines spoken term detection and human-in-the-loop, together with a pilot experiment. This work is grounded in an almost zero-resource scenario where only a few terms have so far been identified, involving two endangered languages. We show that in the early stages of transcription, when the available data is insufficient to train a robust ASR syste…

    Submitted 11 November, 2020; originally announced November 2020.

    Comments: inproceedings Coling 2020

  7. arXiv:2010.06608  [pdf, other]

    astro-ph.CO cs.LG

    AI-assisted super-resolution cosmological simulations

    Authors: Yin Li, Yueying Ni, Rupert A. C. Croft, Tiziana Di Matteo, Simeon Bird, Yu Feng

    Abstract: Cosmological simulations of galaxy formation are limited by finite computational resources. We draw from the ongoing rapid advances in Artificial Intelligence (specifically Deep Learning) to address this problem. Neural networks have been developed to learn from high-resolution (HR) image data, and then make accurate super-resolution (SR) versions of different low-resolution (LR) images. We apply…

    Submitted 4 May, 2021; v1 submitted 13 October, 2020; originally announced October 2020.

    Comments: 10 pages, 8 figures, matches PNAS accepted version

    Journal ref: PNAS May 11, 2021 118 (19) e2022038118

  8. arXiv:2005.10392  [pdf, other]

    cs.CR

    Tracking Measurement Obfuscations from SourceURL

    Authors: Sarah Bird

    Abstract: Tracking scripts can use the sourceURL directive to mask their origin from developer tools and tools that use the same JS call stack and network stack information. Firefox and Chromium appear to be affected. Firefox 78 now includes a preference to disable this behavior. This short paper describes the effect when using the OpenWPM measurement platform along with details of discovery.

    Submitted 20 May, 2020; originally announced May 2020.

  9. arXiv:2005.00956  [pdf, ps, other]

    cs.CL

    Bootstrapping Techniques for Polysynthetic Morphological Analysis

    Authors: William Lane, Steven Bird

    Abstract: Polysynthetic languages have exceptionally large and sparse vocabularies, thanks to the number of morpheme slots and combinations in a word. This complexity, together with a general scarcity of written data, poses a challenge to the development of natural language technologies. To address this challenge, we offer linguistically-informed approaches for bootstrapping a neural morphological analyzer,…

    Submitted 2 May, 2020; originally announced May 2020.

  10. arXiv:2003.04463  [pdf, other]

    cs.CR

    Actions speak louder than words: Semi-supervised learning for browser fingerprinting detection

    Authors: Sarah Bird, Vikas Mishra, Steven Englehardt, Rob Willoughby, David Zeber, Walter Rudametkin, Martin Lopatka

    Abstract: As online tracking continues to grow, existing anti-tracking and fingerprinting detection techniques that require significant manual input must be augmented. Heuristic approaches to fingerprinting detection are precise but must be carefully curated. Supervised machine learning techniques proposed for detecting tracking require manually generated label-sets. Seeking to overcome these challenges, we…

    Submitted 9 March, 2020; originally announced March 2020.

  11. arXiv:1904.03257  [pdf, ps, other]

    cs.LG cs.DB cs.DC cs.SE stat.ML

    MLSys: The New Frontier of Machine Learning Systems

    Authors: Alexander Ratner, Dan Alistarh, Gustavo Alonso, David G. Andersen, Peter Bailis, Sarah Bird, Nicholas Carlini, Bryan Catanzaro, Jennifer Chayes, Eric Chung, Bill Dally, Jeff Dean, Inderjit S. Dhillon, Alexandros Dimakis, Pradeep Dubey, Charles Elkan, Grigori Fursin, Gregory R. Ganger, Lise Getoor, Phillip B. Gibbons, Garth A. Gibson, Joseph E. Gonzalez, Justin Gottschlich, Song Han, Kim Hazelwood, et al. (44 additional authors not shown)

    Abstract: Machine learning (ML) techniques are enjoying rapidly increasing adoption. However, designing and implementing the systems that support ML models in real-world deployments remains a significant obstacle, in large part due to the radically different development and deployment profile of modern ML methods, and the range of practical concerns that come with broader adoption. We propose to foster a ne…

    Submitted 1 December, 2019; v1 submitted 29 March, 2019; originally announced April 2019.

  12. arXiv:1709.04991  [pdf, other]

    cs.PL cs.AI

    Abstractions for AI-Based User Interfaces and Systems

    Authors: Alex Renda, Harrison Goldstein, Sarah Bird, Chris Quirk, Adrian Sampson

    Abstract: Novel user interfaces based on artificial intelligence, such as natural-language agents, present new categories of engineering challenges. These systems need to cope with uncertainty and ambiguity, interface with machine learning algorithms, and compose information from multiple users to make decisions. We propose to treat these challenges as language-design problems. We describe three programming…

    Submitted 14 September, 2017; originally announced September 2017.

  13. arXiv:1606.09403  [pdf, other]

    cs.CL cs.AI

    Learning Crosslingual Word Embeddings without Bilingual Corpora

    Authors: Long Duong, Hiroshi Kanayama, Tengfei Ma, Steven Bird, Trevor Cohn

    Abstract: Crosslingual word embeddings represent lexical items from different languages in the same vector space, enabling transfer of NLP tools. However, previous attempts had expensive resource requirements, difficulty incorporating monolingual data or were unable to handle polysemy. We address these drawbacks in our method which takes advantage of a high coverage dictionary in an EM style training algori…

    Submitted 30 June, 2016; originally announced June 2016.

  14. arXiv:1606.03966  [pdf, other]

    cs.LG cs.DC

    Making Contextual Decisions with Low Technical Debt

    Authors: Alekh Agarwal, Sarah Bird, Markus Cozowicz, Luong Hoang, John Langford, Stephen Lee, Jiaji Li, Dan Melamed, Gal Oshri, Oswaldo Ribas, Siddhartha Sen, Alex Slivkins

    Abstract: Applications and systems are constantly faced with decisions that require picking from a set of actions based on contextual information. Reinforcement-based learning algorithms such as contextual bandits can be very effective in these settings, but applying them in practice is fraught with technical debt, and no general system exists that supports them completely. We address this and create the fi…

    Submitted 9 May, 2017; v1 submitted 13 June, 2016; originally announced June 2016.

  15. arXiv:cs/0308022  [pdf, ps, other]

    cs.CL cs.DL

    Extending Dublin Core Metadata to Support the Description and Discovery of Language Resources

    Authors: Steven Bird, Gary Simons

    Abstract: As language data and associated technologies proliferate and as the language resources community expands, it is becoming increasingly difficult to locate and reuse existing resources. Are there any lexical resources for such-and-such a language? What tool works with transcripts in this particular format? What is a good format to use for linguistic data of this type? Questions like these dominate…

    Submitted 14 August, 2003; originally announced August 2003.

    Comments: 12 pages, 1 figure

    ACM Class: H.2.7; H.3.3; H.3.7; I.2.7; I.7.2; J.5

    Journal ref: Computing and the Humanities, 37 (4), 2003

  16. arXiv:cs/0308008  [pdf]

    cs.DC cs.CL

    A Grid Based Architecture for High-Performance NLP

    Authors: Baden Hughes, Steven Bird

    Abstract: We describe the design and early implementation of an extensible, component-based software architecture for natural language engineering applications which interfaces with high performance distributed computing services. The architecture leverages existing linguistic resource description and discovery mechanisms based on metadata descriptions, combining these in a compatible fashion with other s…

    Submitted 4 August, 2003; originally announced August 2003.

    ACM Class: J.5; D.1; C.2

  17. arXiv:cs/0306040  [pdf]

    cs.CL cs.DL

    The Open Language Archives Community: An infrastructure for distributed archiving of language resources

    Authors: Gary Simons, Steven Bird

    Abstract: New ways of documenting and describing language via electronic media coupled with new ways of distributing the results via the World-Wide Web offer a degree of access to language resources that is unparalleled in history. At the same time, the proliferation of approaches to using these new technologies is causing serious problems relating to resource discovery and resource creation. This article…

    Submitted 10 June, 2003; originally announced June 2003.

    Comments: 10 pages, 2 figures

    ACM Class: H.2.7; H.3.3; H.3.7; I.2.7; I.7.2; J.5

  18. arXiv:cs/0304028  [pdf]

    cs.DC cs.CL

    Grid-Enabling Natural Language Engineering By Stealth

    Authors: Baden Hughes, Steven Bird

    Abstract: We describe a proposal for an extensible, component-based software architecture for natural language engineering applications. Our model leverages existing linguistic resource description and discovery mechanisms based on extended Dublin Core metadata. In addition, the application design is flexible, allowing disparate components to be combined to suit the overall application functionality. An a…

    Submitted 21 April, 2003; originally announced April 2003.

    ACM Class: J.5; D.1; C.2

  19. arXiv:cs/0302021  [pdf]

    cs.CL cs.DL

    Building an Open Language Archives Community on the OAI Foundation

    Authors: Gary Simons, Steven Bird

    Abstract: The Open Language Archives Community (OLAC) is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources. The Dublin Core (DC) Element Set and the OAI Protocol have provided a solid foundation for the OLAC framework. However, we need more precision in community-specific aspects of resource description than is offered by DC. Fu…

    Submitted 14 February, 2003; originally announced February 2003.

    Comments: 12 pages

    ACM Class: H.2.7; H.3.3; H.3.7; I.2.7; I.7.2; J.5

    Journal ref: Library Hi Tech 21(2), 2003

  20. arXiv:cs/0205028  [pdf, ps, other]

    cs.CL

    NLTK: The Natural Language Toolkit

    Authors: Edward Loper, Steven Bird

    Abstract: NLTK, the Natural Language Toolkit, is a suite of open source program modules, tutorials and problem sets, providing ready-to-use computational linguistics courseware. NLTK covers symbolic and statistical natural language processing, and is interfaced to annotated corpora. Students augment and replace existing components, learn structured programming by example, and manipulate sophisticated mode…

    Submitted 17 May, 2002; originally announced May 2002.

    Comments: 8 pages, 1 figure, Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, Philadelphia, July 2002, Association for Computational Linguistics

    ACM Class: D.2.6; I.2.7; J.5; K.3.2

  21. arXiv:cs/0204026  [pdf, ps, other]

    cs.CL cs.DB

    Querying Databases of Annotated Speech

    Authors: Steve Cassidy, Steven Bird

    Abstract: Annotated speech corpora are databases consisting of signal data along with time-aligned symbolic `transcriptions'. Such databases are typically multidimensional, heterogeneous and dynamic. These properties present a number of tough challenges for representation and query. The temporal nature of the data adds an additional layer of complexity. This paper presents and harmonises two independent e…

    Submitted 11 April, 2002; originally announced April 2002.

    Comments: 9 pages, 4 figures

    ACM Class: H.2.3; H.2.4; H.5.5; I.2.7; J.5

    Journal ref: Database Technologies: Proceedings of the Eleventh Australasian Database Conference, pp. 12-20, IEEE Computer Society, 2000

  22. arXiv:cs/0204025  [pdf, ps, other]

    cs.CL

    Phonology

    Authors: Steven Bird

    Abstract: Phonology is the systematic study of the sounds used in language, their internal structure, and their composition into syllables, words and phrases. Computational phonology is the application of formal and computational techniques to the representation and processing of phonological information. This chapter will present the fundamentals of descriptive phonology along with a brief overview of co…

    Submitted 11 April, 2002; originally announced April 2002.

    Comments: 27 pages

    ACM Class: I.2.7; J.5

    Journal ref: In Ruslan Mitkov (ed) (2002). Oxford Handbook of Computational Linguistics

  23. arXiv:cs/0204023  [pdf, ps, other]

    cs.CL

    Computational Phonology

    Authors: Steven Bird

    Abstract: Phonology, as it is practiced, is deeply computational. Phonological analysis is data-intensive and the resulting models are nothing other than specialized data structures and algorithms. In the past, phonological computation - managing data and developing analyses - was done manually with pencil and paper. Increasingly, with the proliferation of affordable computers, IPA fonts and drawing softw…

    Submitted 10 April, 2002; originally announced April 2002.

    Comments: 4 pages

    ACM Class: I.2.7; J.5

    Journal ref: Oxford International Encyclopedia of Linguistics, 2nd Edition, 2002

  24. arXiv:cs/0204022  [pdf]

    cs.CL

    Annotation Graphs and Servers and Multi-Modal Resources: Infrastructure for Interdisciplinary Education, Research and Development

    Authors: Christopher Cieri, Steven Bird

    Abstract: Annotation graphs and annotation servers offer infrastructure to support the analysis of human language resources in the form of time-series data such as text, audio and video. This paper outlines areas of common need among empirical linguists and computational linguists. After reviewing examples of data and tools used or under development for each of several areas, it proposes a common framewor…

    Submitted 10 April, 2002; originally announced April 2002.

    Comments: 8 pages, 6 figures

    ACM Class: H.2.4; H.5.3; H.5.5; I.2.7

    Journal ref: Proceedings of ACL Workshop on Sharing Tools and Resources for Research and Education, Toulouse, July 2001, pp 23-30

  25. arXiv:cs/0204020  [pdf, ps, other]

    cs.CL cs.DL

    Seven Dimensions of Portability for Language Documentation and Description

    Authors: Steven Bird, Gary Simons

    Abstract: The process of documenting and describing the world's languages is undergoing radical transformation with the rapid uptake of new digital technologies for capture, storage, annotation and dissemination. However, uncritical adoption of new tools and technologies is leading to resources that are difficult to reuse and which are less portable than the conventional printed resources they replace. We…

    Submitted 10 April, 2002; originally announced April 2002.

    Comments: 8 pages

    ACM Class: H.3.7; I.2.7; J.5

    Journal ref: Proceedings of the Workshop on Portability Issues in Human Language Technologies, Third International Conference on Language Resources and Evaluation, Paris: European Language Resources Association, 2002

  26. arXiv:cs/0204007  [pdf, ps, other]

    cs.CL

    An Integrated Framework for Treebanks and Multilayer Annotations

    Authors: Scott Cotton, Steven Bird

    Abstract: Treebank formats and associated software tools are proliferating rapidly, with little consideration for interoperability. We survey a wide variety of treebank structures and operations, and show how they can be mapped onto the annotation graph model, and leading to an integrated framework encompassing tree and non-tree annotations alike. This development opens up new possibilities for managing a…

    Submitted 3 April, 2002; originally announced April 2002.

    Comments: 8 pages

    ACM Class: I.2.7

    Journal ref: Proceedings of the Third International Conference on Language Resources and Evaluation, Paris: European Language Resources Association, 2002

  27. arXiv:cs/0204006  [pdf, ps, other]

    cs.CL cs.SD

    TableTrans, MultiTrans, InterTrans and TreeTrans: Diverse Tools Built on the Annotation Graph Toolkit

    Authors: Steven Bird, Kazuaki Maeda, Xiaoyi Ma, Haejoong Lee, Beth Randall, Salim Zayat

    Abstract: Four diverse tools built on the Annotation Graph Toolkit are described. Each tool associates linguistic codes and structures with time-series data. All are based on the same software library and tool architecture. TableTrans is for observational coding, using a spreadsheet whose rows are aligned to a signal. MultiTrans is for transcribing multi-party communicative interactions recorded using mul…

    Submitted 3 April, 2002; originally announced April 2002.

    Comments: 7 pages, 7 figures

    ACM Class: D.2.13; H.5.5; I.2.7

    Journal ref: Proceedings of the Third International Conference on Language Resources and Evaluation, Paris: European Language Resources Association, 2002

  28. arXiv:cs/0204005  [pdf, ps, other]

    cs.CL cs.SD

    Creating Annotation Tools with the Annotation Graph Toolkit

    Authors: Kazuaki Maeda, Steven Bird, Xiaoyi Ma, Haejoong Lee

    Abstract: The Annotation Graph Toolkit is a collection of software supporting the development of annotation tools based on the annotation graph model. The toolkit includes application programming interfaces for manipulating annotation graph data and for importing data from other formats. There are interfaces for the scripting languages Tcl and Python, a database interface, specialized graphical user inter…

    Submitted 3 April, 2002; originally announced April 2002.

    Comments: 8 pages, 12 figures

    ACM Class: D.2.13; H.5.5; I.2.7

    Journal ref: Proceedings of the Third International Conference on Language Resources and Evaluation, Paris: European Language Resources Association, 2002

  29. arXiv:cs/0204004  [pdf, ps, other]

    cs.CL cs.SD

    Models and Tools for Collaborative Annotation

    Authors: Xiaoyi Ma, Haejoong Lee, Steven Bird, Kazuaki Maeda

    Abstract: The Annotation Graph Toolkit (AGTK) is a collection of software which facilitates development of linguistic annotation tools. AGTK provides a database interface which allows applications to use a database server for persistent storage. This paper discusses various modes of collaborative annotation and how they can be supported with tools built using AGTK and its database interface. We describe t…

    Submitted 3 April, 2002; originally announced April 2002.

    Comments: 8 pages, 6 figures

    ACM Class: H.2.4; H.5.3; H.5.5; I.2.7

    Journal ref: Proceedings of the Third International Conference on Language Resources and Evaluation, Paris: European Language Resources Association, 2002

  30. arXiv:cs/0110014  [pdf, ps, other]

    cs.CL cs.DL

    The Open Language Archives Community and Asian Language Resources

    Authors: Steven Bird, Gary Simons, Chu-Ren Huang

    Abstract: The Open Language Archives Community (OLAC) is a new project to build a worldwide system of federated language archives based on the Open Archives Initiative and the Dublin Core Metadata Initiative. This paper aims to disseminate the OLAC vision to the language resources community in Asia, and to show language technologists and linguists how they can document their tools and data in such a way t…

    Submitted 3 October, 2001; originally announced October 2001.

    Comments: 8 pages, 2 figures

    ACM Class: H.2.7; H.3.3; H.3.7; I.2.7; I.7.2; J.5

    Journal ref: Proceedings of the Workshop on Language Resources in Asia, 6th Natural Language Processing Pacific Rim Symposium (NLPRS), Tokyo, November 2001

  31. arXiv:cs/0105030  [pdf, ps, other]

    cs.CL cs.DL

    The OLAC Metadata Set and Controlled Vocabularies

    Authors: Steven Bird, Gary Simons

    Abstract: As language data and associated technologies proliferate and as the language resources community rapidly expands, it has become difficult to locate and reuse existing resources. Are there any lexical resources for such-and-such a language? What tool can work with transcripts in this particular format? What is a good format to use for linguistic data of this type? Questions like these dominate ma…

    Submitted 21 May, 2001; originally announced May 2001.

    Comments: 12 pages, 5 figures

    ACM Class: H.2.7; H.3.3; H.3.7; I.2.7; I.7.2; J.5

    Journal ref: Proceedings of the ACL/EACL Workshop on Sharing Tools and Resources for Research and Education, Toulouse, July 2001, Association for Computational Linguistics

  32. arXiv:cs/0010033  [pdf, ps, other]

    cs.CL cs.DB cs.DS

    A Formal Framework for Linguistic Annotation (revised version)

    Authors: Steven Bird, Mark Liberman

    Abstract: `Linguistic annotation' covers any descriptive or analytic notations applied to raw language data. The basic data may be in the form of time functions - audio, video and/or physiological recordings - or it may be textual. The added notations may include transcriptions of all sorts (from phonetic features to discourse structures), part-of-speech and sense tagging, syntactic analysis, `named entit…

    Submitted 26 October, 2000; originally announced October 2000.

    Comments: 29 pages, 20 figures, to appear in Speech Communication, An earlier version appeared as cs.CL/9903003

    ACM Class: A.1; E.2; H.2.1; H.3.3; H.3.7; I.2.7

  33. arXiv:cs/0007024  [pdf, ps, other]

    cs.CL

    Many uses, many annotations for large speech corpora: Switchboard and TDT as case studies

    Authors: David Graff, Steven Bird

    Abstract: This paper discusses the challenges that arise when large speech corpora receive an ever-broadening range of diverse and distinct annotations. Two case studies of this process are presented: the Switchboard Corpus of telephone conversations and the TDT2 corpus of broadcast news. Switchboard has undergone two independent transcriptions and various types of additional annotation, all carried out a…

    Submitted 13 July, 2000; originally announced July 2000.

    Comments: 7 pages, 2 figures

    ACM Class: E.2; H.2.5; I.2.7

    Journal ref: Proceedings of the Second International Conference on Language Resources and Evaluation, pp. 427-433, Paris: European Language Resources Association, 2000

  34. arXiv:cs/0007023  [pdf, ps, other]

    cs.CL cs.DB

    Towards a query language for annotation graphs

    Authors: Steven Bird, Peter Buneman, Wang-Chiew Tan

    Abstract: The multidimensional, heterogeneous, and temporal nature of speech databases raises interesting challenges for representation and query. Recently, annotation graphs have been proposed as a general-purpose representational framework for speech databases. Typical queries on annotation graphs require path expressions similar to those used in semistructured query languages. However, the underlying m…

    Submitted 13 July, 2000; originally announced July 2000.

    Comments: 8 pages, 10 figures

    ACM Class: E.1; E.2; H.2.1; H.2.3; H.2.8; H.3.1; H.3.3; I.2.7

    Journal ref: Proceedings of the Second International Conference on Language Resources and Evaluation, pp. 807-814, Paris: European Language Resources Association, 2000

  35. arXiv:cs/0007022  [pdf, ps, other]

    cs.CL

    ATLAS: A flexible and extensible architecture for linguistic annotation

    Authors: Steven Bird, David Day, John Garofolo, John Henderson, Christophe Laprun, Mark Liberman

    Abstract: We describe a formal model for annotating linguistic artifacts, from which we derive an application programming interface (API) to a suite of tools for manipulating these annotations. The abstract logical model provides for a range of storage formats and promotes the reuse of tools that interact through this API. We focus first on ``Annotation Graphs,'' a graph model for annotations on linear si…

    Submitted 13 July, 2000; originally announced July 2000.

    Comments: 8 pages, 9 figures

    ACM Class: E.2; H.2.1; H.3.3; H.3.4; H.3.7; I.2.7

    Journal ref: Proceedings of the Second International Conference on Language Resources and Evaluation, pp. 1699-1706, Paris: European Language Resources Association, 2000

  36. arXiv:cs/9907003  [pdf, ps, other]

    cs.CL

    Annotation graphs as a framework for multidimensional linguistic data analysis

    Authors: Steven Bird, Mark Liberman

    Abstract: In recent work we have presented a formal framework for linguistic annotation based on labeled acyclic digraphs. These `annotation graphs' offer a simple yet powerful method for representing complex annotation structures incorporating hierarchy and overlap. Here, we motivate and illustrate our approach using discourse-level annotations of text and speech data drawn from the CALLHOME, COCONUT, MU…

    Submitted 5 July, 1999; originally announced July 1999.

    Comments: 10 pages, 10 figures, Towards Standards and Tools for Discourse Tagging, Proceedings of the Workshop. pp. 1-10. Association for Computational Linguistics

    ACM Class: A.1; E.2; H.2.1; H.3.3; H.3.7; I.2.7

  37. arXiv:cs/9903003  [pdf, ps, other]

    cs.CL

    A Formal Framework for Linguistic Annotation

    Authors: Steven Bird, Mark Liberman

    Abstract: `Linguistic annotation' covers any descriptive or analytic notations applied to raw language data. The basic data may be in the form of time functions -- audio, video and/or physiological recordings -- or it may be textual. The added notations may include transcriptions of all sorts (from phonetic features to discourse structures), part-of-speech and sense tagging, syntactic analysis, `named ent…

    Submitted 2 March, 1999; originally announced March 1999.

    Comments: 49 pages

    Report number: Tech Report MS-CIS-99-01, Dept of Computer and Information Science

    ACM Class: A.1; E.2; H.2.1; H.3.3; H.3.7; I.2.7

  38. A lexical database tool for quantitative phonological research

    Authors: Steven Bird

    Abstract: A lexical database tool tailored for phonological research is described. Database fields include transcriptions, glosses and hyperlinks to speech files. Database queries are expressed using HTML forms, and these permit regular expression search on any combination of fields. Regular expressions are passed directly to a Perl CGI program, enabling the full flexibility of Perl extended regular expre…

    Submitted 22 July, 1997; originally announced July 1997.

    Comments: 7 pages, uses ipamacs.sty

    Journal ref: Proceedings of the Third Meeting of the ACL Special Interest Group in Computational Phonology, pp. 33-39, Madrid, July 1997. ACL

  39. Automated tone transcription

    Authors: Steven Bird

    Abstract: In this paper I report on an investigation into the problem of assigning tones to pitch contours. The proposed model is intended to serve as a tool for phonologists working on instrumentally obtained pitch data from tone languages. Motivation and exemplification for the model is provided by data taken from my fieldwork on Bamileke Dschang (Cameroon). Following recent work by Liberman and others,…

    Submitted 25 October, 1994; v1 submitted 24 October, 1994; originally announced October 1994.

    Comments: 12 pages, 4 postscript figures, uses examples.sty, newapa.sty, latex-acl.sty, ipamacs.sty

    Journal ref: Proceedings of the First Meeting of the ACL Special