Showing 1–12 of 12 results for author: Tanaka-Ishii, K

  1. arXiv:2405.10974  [pdf, other]

    cs.IR cs.AI cs.CL cs.LG

    Bottleneck-Minimal Indexing for Generative Document Retrieval

    Authors: Xin Du, Lixin Xiu, Kumiko Tanaka-Ishii

    Abstract: We apply an information-theoretic perspective to reconsider generative document retrieval (GDR), in which a document $x \in X$ is indexed by $t \in T$, and a neural autoregressive model is trained to map queries $Q$ to $T$. GDR can be considered to involve information transmission from documents $X$ to queries $Q$, with the requirement to transmit more bits via the indexes $T$. By applying Shannon…

    Submitted 20 May, 2024; v1 submitted 12 May, 2024; originally announced May 2024.

    Comments: Accepted for ICML 2024

  2. arXiv:2405.06321  [pdf, other]

    cs.CL cond-mat.stat-mech cs.AI

    Correlation Dimension of Natural Language in a Statistical Manifold

    Authors: Xin Du, Kumiko Tanaka-Ishii

    Abstract: The correlation dimension of natural language is measured by applying the Grassberger-Procaccia algorithm to high-dimensional sequences produced by a large-scale language model. This method, previously studied only in a Euclidean space, is reformulated in a statistical manifold via the Fisher-Rao distance. Language exhibits a multifractal, with global self-similarity and a universal dimension arou…

    Submitted 15 May, 2024; v1 submitted 10 May, 2024; originally announced May 2024.

    Comments: Published at Physical Review Research

    Journal ref: Physical Review Research, 6(2), L022028 (2024)

  3. Co-Training Realized Volatility Prediction Model with Neural Distributional Transformation

    Authors: Xin Du, Kai Moriyama, Kumiko Tanaka-Ishii

    Abstract: This paper shows a novel machine learning model for realized volatility (RV) prediction using a normalizing flow, an invertible neural network. Since RV is known to be skewed and have a fat tail, previous methods transform RV into values that follow a latent distribution with an explicit shape and then apply a prediction model. However, knowing that shape is non-trivial, and the transformation res…

    Submitted 22 October, 2023; originally announced October 2023.

    Comments: Accepted at ICAIF'23

  4. arXiv:2307.02697  [pdf, other]

    cs.CL physics.data-an

    Strahler Number of Natural Language Sentences in Comparison with Random Trees

    Authors: Kumiko Tanaka-Ishii, Akira Tanaka

    Abstract: The Strahler number was originally proposed to characterize the complexity of river bifurcation and has found various applications. This article proposes computation of the Strahler number's upper and lower limits for natural language sentence tree structures. Through empirical measurements across grammatically annotated data, the Strahler number of natural language sentences is shown to be almost…

    Submitted 6 December, 2023; v1 submitted 5 July, 2023; originally announced July 2023.

    Comments: 34 pages, 12 figures, 11 tables

    Journal ref: Journal of Statistical Mechanics, 2023

  5. A Comparison of Two Fluctuation Analyses for Natural Language Clustering Phenomena: Taylor and Ebeling & Neiman Methods

    Authors: Kumiko Tanaka-Ishii, Shuntaro Takahashi

    Abstract: This article considers the fluctuation analysis methods of Taylor and Ebeling & Neiman. While both have been applied to various phenomena in the statistical mechanics domain, their similarities and differences have not been clarified. After considering their analytical aspects, this article presents a large-scale application of these methods to text. It is found that both methods can distinguish r…

    Submitted 14 September, 2020; originally announced September 2020.

    Journal ref: Fractals, in 2021, No.2. https://www.worldscientific.com/toc/fractals/0/ja

  6. Extraction of Templates from Phrases Using Sequence Binary Decision Diagrams

    Authors: Daiki Hirano, Kumiko Tanaka-Ishii, Andrew Finch

    Abstract: The extraction of templates such as "regard X as Y" from a set of related phrases requires the identification of their internal structures. This paper presents an unsupervised approach for extracting templates on-the-fly from only tagged text by using a novel relaxed variant of the Sequence Binary Decision Diagram (SeqBDD). A SeqBDD can compress a set of sequences into a graphical structure equi…

    Submitted 28 January, 2020; originally announced January 2020.

    Journal ref: Natural Language Engineering, 2018

  7. arXiv:1906.09379  [pdf, other]

    cs.CL

    Evaluating Computational Language Models with Scaling Properties of Natural Language

    Authors: Shuntaro Takahashi, Kumiko Tanaka-Ishii

    Abstract: In this article, we evaluate computational models of natural language with respect to the universal statistical behaviors of natural language. Statistical mechanical analyses have revealed that natural language text is characterized by scaling properties, which quantify the global structure in the vocabulary population and the long memory of a text. We study whether five scaling properties (given…

    Submitted 21 June, 2019; originally announced June 2019.

    Comments: 32 pages, accepted by Computational Linguistics

  8. Word Familiarity and Frequency

    Authors: Kumiko Tanaka-Ishii, Hiroshi Terada

    Abstract: Word frequency is assumed to correlate with word familiarity, but the strength of this correlation has not been thoroughly investigated. In this paper, we report on our analysis of the correlation between a word familiarity rating list obtained through a psycholinguistic experiment and the log-frequency obtained from various corpora of different kinds and sizes (up to the terabyte scale) for Engli…

    Submitted 9 June, 2018; originally announced June 2018.

    Comments: 17 pages, 8 figures, Published in Studia Linguistica in 2011. Available also from Wiley Online Library

  9. arXiv:1804.08881  [pdf, other]

    cs.CL

    Assessing Language Models with Scaling Properties

    Authors: Shuntaro Takahashi, Kumiko Tanaka-Ishii

    Abstract: Language models have primarily been evaluated with perplexity. While perplexity quantifies the most comprehensible prediction performance, it does not provide qualitative information on the success or failure of models. Another approach for evaluating language models is thus proposed, using the scaling properties of natural language. Five such tests are considered, with the first two accounting fo…

    Submitted 24 April, 2018; originally announced April 2018.

    Comments: 14 pages, 16 figures

  10. arXiv:1804.07893  [pdf, other]

    cs.CL

    Taylor's law for Human Linguistic Sequences

    Authors: Tatsuru Kobayashi, Kumiko Tanaka-Ishii

    Abstract: Taylor's law describes the fluctuation characteristics underlying a system in which the variance of an event within a time span grows by a power law with respect to the mean. Although Taylor's law has been applied in many natural and social systems, its application for language has been scarce. This article describes a new quantification of Taylor's law in natural language and reports an analysis…

    Submitted 7 June, 2018; v1 submitted 21 April, 2018; originally announced April 2018.

    Comments: 11 pages, 16 figures, Accepted as ACL 2018 long paper

  11. arXiv:1712.03645  [pdf, other]

    cs.CL physics.soc-ph

    Long-Range Correlation Underlying Childhood Language and Generative Models

    Authors: Kumiko Tanaka-Ishii

    Abstract: Long-range correlation, a property of time series exhibiting long-term memory, is mainly studied in the statistical physics domain and has been reported to exist in natural language. Using a state-of-the-art method for such analysis, long-range correlation is first shown to occur in long CHILDES data sets. To understand why, Bayesian generative models of language, originally proposed in the cognit…

    Submitted 10 December, 2017; originally announced December 2017.

  12. Do Neural Nets Learn Statistical Laws behind Natural Language?

    Authors: Shuntaro Takahashi, Kumiko Tanaka-Ishii

    Abstract: The performance of deep learning in natural language processing has been spectacular, but the reasons for this success remain unclear because of the inherent complexity of deep learning. This paper provides empirical evidence of its effectiveness and of a limitation of neural networks for language engineering. Precisely, we demonstrate that a neural language model based on long short-term memory (…

    Submitted 28 November, 2017; v1 submitted 16 July, 2017; originally announced July 2017.

    Comments: 21 pages, 11 figures