-
The Knesset Corpus: An Annotated Corpus of Hebrew Parliamentary Proceedings
Authors:
Gili Goldin,
Nick Howell,
Noam Ordan,
Ella Rabinovich,
Shuly Wintner
Abstract:
We present the Knesset Corpus, a corpus of Hebrew parliamentary proceedings containing over 30 million sentences (over 384 million tokens) from all the (plenary and committee) protocols held in the Israeli parliament between 1998 and 2022. Sentences are annotated with morpho-syntactic information and are associated with detailed meta-information reflecting demographic and political properties of t…
▽ More
We present the Knesset Corpus, a corpus of Hebrew parliamentary proceedings containing over 30 million sentences (over 384 million tokens) from all the (plenary and committee) protocols held in the Israeli parliament between 1998 and 2022. Sentences are annotated with morpho-syntactic information and are associated with detailed meta-information reflecting demographic and political properties of the speakers, based on a large database of parliament members and factions that we compiled. We discuss the structure and composition of the corpus and the various processing steps we applied to it. To demonstrate the utility of this novel dataset we present two use cases. We show that the corpus can be used to examine historical developments in the style of political discussions by showing a reduction in lexical richness in the proceedings over time. We also investigate some differences between the styles of men and women speakers. These use cases exemplify the potential of the corpus to shed light on important trends in the Israeli society, supporting research in linguistics, political science, communication, law, etc.
△ Less
Submitted 28 May, 2024;
originally announced May 2024.
-
PlayFutures: Imagining Civic Futures with AI and Puppets
Authors:
Supratim Pait,
Sumita Sharma,
Ashley Frith,
Michael Nitsche,
Noura Howell
Abstract:
Children are the builders of the future and crucial to how the technologies around us develop. They are not voters but are participants in how the public spaces in a city are used. Through a workshop designed around kids of age 9-12, we investigate if novel technologies like artificial intelligence can be integrated in existing ways of play and performance to 1) re-imagine the future of civic spac…
▽ More
Children are the builders of the future and crucial to how the technologies around us develop. They are not voters but are participants in how the public spaces in a city are used. Through a workshop designed around kids of age 9-12, we investigate if novel technologies like artificial intelligence can be integrated in existing ways of play and performance to 1) re-imagine the future of civic spaces, 2) reflect on these novel technologies in the process and 3) build ways of civic engagement through play. We do this using a blend AI image generation and Puppet making to ultimately build future scenarios, perform debate and discussion around the futures and reflect on AI, its role and potential in their process. We present our findings of how AI helped envision these futures, aid performances, and report some initial reflections from children about the technology.
△ Less
Submitted 1 April, 2024;
originally announced April 2024.
-
OmniLingo: Listening- and speaking-based language learning
Authors:
Francis M. Tyers,
Nicholas Howell
Abstract:
In this demo paper we present OmniLingo, an architecture for distributing data for listening- and speaking-based language learning applications and a demonstration client built using the architecture. The architecture is based on the Interplanetary Filesystem (IPFS) and puts at the forefront user sovereignty over data.
In this demo paper we present OmniLingo, an architecture for distributing data for listening- and speaking-based language learning applications and a demonstration client built using the architecture. The architecture is based on the Interplanetary Filesystem (IPFS) and puts at the forefront user sovereignty over data.
△ Less
Submitted 10 October, 2023;
originally announced October 2023.
-
A Second Wave of UD Hebrew Treebanking and Cross-Domain Parsing
Authors:
Amir Zeldes,
Nick Howell,
Noam Ordan,
Yifat Ben Moshe
Abstract:
Foundational Hebrew NLP tasks such as segmentation, tagging and parsing, have relied to date on various versions of the Hebrew Treebank (HTB, Sima'an et al. 2001). However, the data in HTB, a single-source newswire corpus, is now over 30 years old, and does not cover many aspects of contemporary Hebrew on the web. This paper presents a new, freely available UD treebank of Hebrew stratified from a…
▽ More
Foundational Hebrew NLP tasks such as segmentation, tagging and parsing, have relied to date on various versions of the Hebrew Treebank (HTB, Sima'an et al. 2001). However, the data in HTB, a single-source newswire corpus, is now over 30 years old, and does not cover many aspects of contemporary Hebrew on the web. This paper presents a new, freely available UD treebank of Hebrew stratified from a range of topics selected from Hebrew Wikipedia. In addition to introducing the corpus and evaluating the quality of its annotations, we deploy automatic validation tools based on grew (Guillaume, 2021), and conduct the first cross domain parsing experiments in Hebrew. We obtain new state-of-the-art (SOTA) results on UD NLP tasks, using a combination of the latest language modelling and some incremental improvements to existing transformer based approaches. We also release a new version of the UD HTB matching annotation scheme updates from our new corpus.
△ Less
Submitted 18 October, 2022; v1 submitted 14 October, 2022;
originally announced October 2022.