Search | arXiv e-print repository

The Knesset Corpus: An Annotated Corpus of Hebrew Parliamentary Proceedings

Authors: Gili Goldin, Nick Howell, Noam Ordan, Ella Rabinovich, Shuly Wintner

Abstract: We present the Knesset Corpus, a corpus of Hebrew parliamentary proceedings containing over 30 million sentences (over 384 million tokens) from all the (plenary and committee) protocols held in the Israeli parliament between 1998 and 2022. Sentences are annotated with morpho-syntactic information and are associated with detailed meta-information reflecting demographic and political properties of t… ▽ More We present the Knesset Corpus, a corpus of Hebrew parliamentary proceedings containing over 30 million sentences (over 384 million tokens) from all the (plenary and committee) protocols held in the Israeli parliament between 1998 and 2022. Sentences are annotated with morpho-syntactic information and are associated with detailed meta-information reflecting demographic and political properties of the speakers, based on a large database of parliament members and factions that we compiled. We discuss the structure and composition of the corpus and the various processing steps we applied to it. To demonstrate the utility of this novel dataset we present two use cases. We show that the corpus can be used to examine historical developments in the style of political discussions by showing a reduction in lexical richness in the proceedings over time. We also investigate some differences between the styles of men and women speakers. These use cases exemplify the potential of the corpus to shed light on important trends in the Israeli society, supporting research in linguistics, political science, communication, law, etc. △ Less

Submitted 28 May, 2024; originally announced May 2024.

Comments: 28 pages, 7 figures

MSC Class: 68T50 ACM Class: I.2.7

arXiv:2404.01527 [pdf, other]

PlayFutures: Imagining Civic Futures with AI and Puppets

Authors: Supratim Pait, Sumita Sharma, Ashley Frith, Michael Nitsche, Noura Howell

Abstract: Children are the builders of the future and crucial to how the technologies around us develop. They are not voters but are participants in how the public spaces in a city are used. Through a workshop designed around kids of age 9-12, we investigate if novel technologies like artificial intelligence can be integrated in existing ways of play and performance to 1) re-imagine the future of civic spac… ▽ More Children are the builders of the future and crucial to how the technologies around us develop. They are not voters but are participants in how the public spaces in a city are used. Through a workshop designed around kids of age 9-12, we investigate if novel technologies like artificial intelligence can be integrated in existing ways of play and performance to 1) re-imagine the future of civic spaces, 2) reflect on these novel technologies in the process and 3) build ways of civic engagement through play. We do this using a blend AI image generation and Puppet making to ultimately build future scenarios, perform debate and discussion around the futures and reflect on AI, its role and potential in their process. We present our findings of how AI helped envision these futures, aid performances, and report some initial reflections from children about the technology. △ Less

Submitted 1 April, 2024; originally announced April 2024.

Comments: This is a position paper presented at the "CHI 2024 Workshop on Child-centred AI Design, May 11, 2024, Honolulu, HI, USA."

arXiv:2310.06764 [pdf, other]

OmniLingo: Listening- and speaking-based language learning

Authors: Francis M. Tyers, Nicholas Howell

Abstract: In this demo paper we present OmniLingo, an architecture for distributing data for listening- and speaking-based language learning applications and a demonstration client built using the architecture. The architecture is based on the Interplanetary Filesystem (IPFS) and puts at the forefront user sovereignty over data. In this demo paper we present OmniLingo, an architecture for distributing data for listening- and speaking-based language learning applications and a demonstration client built using the architecture. The architecture is based on the Interplanetary Filesystem (IPFS) and puts at the forefront user sovereignty over data. △ Less

Submitted 10 October, 2023; originally announced October 2023.

arXiv:2210.07873 [pdf, other]

A Second Wave of UD Hebrew Treebanking and Cross-Domain Parsing

Authors: Amir Zeldes, Nick Howell, Noam Ordan, Yifat Ben Moshe

Abstract: Foundational Hebrew NLP tasks such as segmentation, tagging and parsing, have relied to date on various versions of the Hebrew Treebank (HTB, Sima'an et al. 2001). However, the data in HTB, a single-source newswire corpus, is now over 30 years old, and does not cover many aspects of contemporary Hebrew on the web. This paper presents a new, freely available UD treebank of Hebrew stratified from a… ▽ More Foundational Hebrew NLP tasks such as segmentation, tagging and parsing, have relied to date on various versions of the Hebrew Treebank (HTB, Sima'an et al. 2001). However, the data in HTB, a single-source newswire corpus, is now over 30 years old, and does not cover many aspects of contemporary Hebrew on the web. This paper presents a new, freely available UD treebank of Hebrew stratified from a range of topics selected from Hebrew Wikipedia. In addition to introducing the corpus and evaluating the quality of its annotations, we deploy automatic validation tools based on grew (Guillaume, 2021), and conduct the first cross domain parsing experiments in Hebrew. We obtain new state-of-the-art (SOTA) results on UD NLP tasks, using a combination of the latest language modelling and some incremental improvements to existing transformer based approaches. We also release a new version of the UD HTB matching annotation scheme updates from our new corpus. △ Less

Submitted 18 October, 2022; v1 submitted 14 October, 2022; originally announced October 2022.

Comments: Proceedings of EMNLP 2022

Showing 1–4 of 4 results for author: Howell, N