Showing 1–7 of 7 results for author: Pethe, C

  1. arXiv:2311.04020  [pdf, other]

    cs.CL

    Analyzing Film Adaptation through Narrative Alignment

    Authors: Tanzir Pial, Shahreen Salim, Charuta Pethe, Allen Kim, Steven Skiena

    Abstract: Novels are often adapted into feature films, but the differences between the two media usually require dropping sections of the source text from the movie script. Here we study this screen adaptation process by constructing narrative alignments using the Smith-Waterman local alignment algorithm coupled with SBERT embedding distance to quantify text similarity between scenes and book units. We use…

    Submitted 7 November, 2023; originally announced November 2023.

    Comments: 20 pages, 5 figures, 10 tables

  2. arXiv:2311.03614  [pdf, other]

    cs.CL

    STONYBOOK: A System and Resource for Large-Scale Analysis of Novels

    Authors: Charuta Pethe, Allen Kim, Rajesh Prabhakar, Tanzir Pial, Steven Skiena

    Abstract: Books have historically been the primary mechanism through which narratives are transmitted. We have developed a collection of resources for the large-scale analysis of novels, including: (1) an open source end-to-end NLP analysis pipeline for the annotation of novels into a standard XML format, (2) a collection of 49,207 distinct cleaned and annotated novels, and (3) a database with an associated…

    Submitted 6 November, 2023; originally announced November 2023.

    Comments: 8 pages, 12 figures

  3. arXiv:2310.06930  [pdf, other]

    cs.SD cs.LG eess.AS

    Prosody Analysis of Audiobooks

    Authors: Charuta Pethe, Yunting Yin, Steven Skiena

    Abstract: Recent advances in text-to-speech have made it possible to generate natural-sounding audio from text. However, audiobook narrations involve dramatic vocalizations and intonations by the reader, with greater reliance on emotions, dialogues, and descriptions in the narrative. Using our dataset of 93 aligned book-audiobook pairs, we present improved models for prosody prediction properties (pitch, vo…

    Submitted 10 October, 2023; originally announced October 2023.

  4. arXiv:2110.11934  [pdf, other]

    cs.CL

    Cleaning Dirty Books: Post-OCR Processing for Previously Scanned Texts

    Authors: Allen Kim, Charuta Pethe, Naoya Inoue, Steve Skiena

    Abstract: Substantial amounts of work are required to clean large collections of digitized books for NLP analysis, both because of the presence of errors in the scanned text and the presence of duplicate volumes in the corpora. In this paper, we consider the issue of deduplication in the presence of optical character recognition (OCR) errors. We present methods to handle these errors, evaluated on a collect…

    Submitted 22 October, 2021; originally announced October 2021.

    Comments: Accepted for Findings of EMNLP 2021

  5. arXiv:2011.04163  [pdf, other]

    cs.CL

    Chapter Captor: Text Segmentation in Novels

    Authors: Charuta Pethe, Allen Kim, Steven Skiena

    Abstract: Books are typically segmented into chapters and sections, representing coherent subnarratives and topics. We investigate the task of predicting chapter boundaries, as a proxy for the general task of segmenting long texts. We build a Project Gutenberg chapter segmentation data set of 9,126 English novels, using a hybrid approach combining neural inference and rule matching to recognize chapter titl…

    Submitted 8 November, 2020; originally announced November 2020.

    Comments: 11 pages, 10 figures, Accepted at EMNLP 2020 as a long paper

  6. arXiv:2011.04124  [pdf, other]

    cs.CL

    What time is it? Temporal Analysis of Novels

    Authors: Allen Kim, Charuta Pethe, Steven Skiena

    Abstract: Recognizing the flow of time in a story is a crucial aspect of understanding it. Prior work related to time has primarily focused on identifying temporal expressions or relative sequencing of events, but here we propose computationally annotating each line of a book with wall clock times, even in the absence of explicit time-descriptive phrases. To do so, we construct a data set of hourly time phr…

    Submitted 8 November, 2020; originally announced November 2020.

    Comments: EMNLP 2020

  7. arXiv:1909.04002  [pdf, other]

    cs.CL

    The Trumpiest Trump? Identifying a Subject's Most Characteristic Tweets

    Authors: Charuta Pethe, Steven Skiena

    Abstract: The sequence of documents produced by any given author varies in style and content, but some documents are more typical or representative of the source than others. We quantify the extent to which a given short text is characteristic of a specific person, using a dataset of tweets from fifteen celebrities. Such analysis is useful for generating excerpts of high-volume Twitter profiles, and underst…

    Submitted 9 September, 2019; originally announced September 2019.

    Comments: 11 pages, 4 figures. Accepted at EMNLP-IJCNLP 2019 as a long paper