-
A Japanese-Chinese Parallel Corpus Using Crowdsourcing for Web Mining
Authors:
Masaaki Nagata,
Makoto Morishita,
Katsuki Chousa,
Norihito Yasuda
Abstract:
Using crowdsourcing, we collected more than 10,000 URL pairs (parallel top page pairs) of bilingual websites that contain parallel documents and created a Japanese-Chinese parallel corpus of 4.6M sentence pairs from these websites. We used a Japanese-Chinese bilingual dictionary of 160K word pairs for document and sentence alignment. We then used 1.2M high-quality Japanese-Chinese sentence pairs to train a parallel corpus filter based on statistical language models and word translation probabilities. We compared the translation accuracy of the model trained on these 4.6M sentence pairs with that of the model trained on Japanese-Chinese sentence pairs from CCMatrix (12.4M), a parallel corpus from global web mining. Although our corpus is only about one-third the size of CCMatrix, we found that the accuracy of the two models was comparable and confirmed that it is feasible to use crowdsourcing for web mining of parallel data.
Submitted 14 May, 2024;
originally announced May 2024.
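The abstract above mentions using a bilingual dictionary for sentence alignment. As a rough illustration only, and not the authors' method, one simple dictionary-based signal is the fraction of source tokens that have at least one dictionary translation present in the candidate target sentence. The function name `dictionary_overlap`, the pre-tokenized inputs, and the toy dictionary are all assumptions made for this sketch.

```python
def dictionary_overlap(src_tokens, tgt_tokens, dictionary):
    """Fraction of source tokens with a dictionary translation in the target.

    dictionary maps a source word to a set of target words (hypothetical format).
    """
    if not src_tokens:
        return 0.0
    tgt = set(tgt_tokens)
    # A source token counts as a hit if any of its translations occurs in the target.
    hits = sum(1 for tok in src_tokens if dictionary.get(tok, set()) & tgt)
    return hits / len(src_tokens)


# Toy Japanese -> Chinese dictionary, for illustration only.
toy_dictionary = {"犬": {"狗"}, "水": {"水"}, "飲む": {"喝"}}
score = dictionary_overlap(["犬", "水", "本"], ["狗", "喝", "水"], toy_dictionary)
```

In practice such a score would be only one feature among several; the abstract describes a separate filter based on statistical language models and word translation probabilities.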
-
Optimal Construction of N-bit-delay Almost Instantaneous Fixed-to-Variable-Length Codes
Authors:
Ryosuke Sugiura,
Masaaki Nishino,
Norihito Yasuda,
Yutaka Kamamoto,
Takehiro Moriya
Abstract:
This paper presents an optimal construction of $N$-bit-delay almost instantaneous fixed-to-variable-length (AIFV) codes, the general form of binary codes we can make when finite bits of decoding delay are allowed. The presented method enables us to optimize lossless codes among a broader class of codes compared to the conventional FV and AIFV codes. The paper first discusses the problem of code construction, which contains some essential partial problems, and defines three classes of optimality to clarify how far we can solve the problems. The properties of the optimal codes are analyzed theoretically, showing the sufficient conditions for achieving the optimum. Then, we propose an algorithm for constructing $N$-bit-delay AIFV codes for given stationary memoryless sources. The optimality of the constructed codes is discussed both theoretically and empirically. The constructed codes showed shorter expected code lengths when $N\ge 3$ than the conventional AIFV-$m$ and extended Huffman codes. Moreover, in the random-number simulation, they achieved higher compression efficiency than the 32-bit-precision range codes under reasonable conditions.
Submitted 5 November, 2023;
originally announced November 2023.
-
International Competition on Graph Counting Algorithms 2023
Authors:
Takeru Inoue,
Norihito Yasuda,
Hidetomo Nabeshima,
Masaaki Nishino,
Shuhei Denzumi,
Shin-ichi Minato
Abstract:
This paper reports on the details of the International Competition on Graph Counting Algorithms (ICGCA) held in 2023. The graph counting problem is to count the subgraphs satisfying specified constraints on a given graph. The problem is #P-complete, a computationally hard class. Since many essential systems in modern society, e.g., infrastructure networks, are often represented as graphs, graph counting algorithms are a key technology to efficiently scan all the subgraphs representing the feasible states of the system. In the ICGCA, contestants were asked to count the paths on a graph under a length constraint. The benchmark set included 150 challenging instances, emphasizing graphs resembling infrastructure networks. Eleven solvers were submitted and ranked by the number of benchmarks correctly solved within a time limit. The winning solver, TLDC, was designed based on three fundamental approaches: backtracking search, dynamic programming, and model counting or #SAT (a counting version of Boolean satisfiability). Detailed analyses show that each approach has its own strengths, and one approach is unlikely to dominate the others. The codes and papers of the participating solvers are available at https://afsa.jp/icgca/.
Submitted 13 September, 2023;
originally announced September 2023.
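The ICGCA task described above, counting paths under a length constraint, can be illustrated with plain backtracking search, one of the three approaches the abstract names. This is a minimal sketch, not the TLDC solver: it assumes simple paths (no repeated vertices) on an undirected graph between a fixed pair of terminals, and the function name `count_simple_paths` is hypothetical. Its running time is exponential in the worst case, which is consistent with the problem being #P-complete.

```python
from collections import defaultdict


def count_simple_paths(edges, source, target, max_len):
    """Count simple source-target paths that use at most max_len edges."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)  # undirected graph (an assumption of this sketch)

    visited = {source}

    def dfs(node, remaining):
        if node == target:
            return 1  # a simple path ends once it reaches the target
        if remaining == 0:
            return 0  # length budget exhausted before reaching the target
        total = 0
        for nxt in adj[node]:
            if nxt not in visited:
                visited.add(nxt)
                total += dfs(nxt, remaining - 1)
                visited.remove(nxt)  # backtrack
        return total

    return dfs(source, max_len)


# A 4-cycle 0-1-2-3-0 with the extra chord 0-2.
square = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
n_paths = count_simple_paths(square, 0, 3, 3)
```

Here the paths from 0 to 3 are 0-3, 0-2-3, and 0-1-2-3, so tightening the length bound to 2 drops the last one.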
-
Generalization Analysis on Learning with a Concurrent Verifier
Authors:
Masaaki Nishino,
Kengo Nakamura,
Norihito Yasuda
Abstract:
Machine learning technologies have been used in a wide range of practical systems. In practical situations, it is natural to expect the input-output pairs of a machine learning model to satisfy some requirements. However, it is difficult to obtain a model that satisfies requirements by just learning from examples. A simple solution is to add a module that checks whether the input-output pairs meet the requirements and then modifies the model's outputs. Such a module, which we call a concurrent verifier (CV), can give a certification, although how the generalizability of the machine learning model changes using a CV is unclear. This paper gives a generalization analysis of learning with a CV. We analyze how the learnability of a machine learning model changes with a CV and show a condition where we can obtain a guaranteed hypothesis using a verifier only at inference time. We also show that typical error bounds based on Rademacher complexity will be no larger than those of the original model when using a CV in multi-class classification and structured prediction settings.
Submitted 11 October, 2022;
originally announced October 2022.
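The idea of a module that checks input-output pairs and modifies the model's output can be sketched for multi-class classification as follows. This is one plausible reading, not the paper's definition: the sketch assumes the model exposes a score per label and that the verifier falls back to the highest-scoring label that passes the check. The names `predict_with_verifier` and `is_valid` are hypothetical.

```python
def predict_with_verifier(scores, x, is_valid):
    """Return the highest-scoring label y for which is_valid(x, y) holds.

    scores maps each candidate label to the model's score for input x.
    Returns None if no label satisfies the requirement.
    """
    for y in sorted(scores, key=scores.get, reverse=True):
        if is_valid(x, y):
            return y
    return None


# Toy requirement: label "a" is never allowed for this input.
model_scores = {"a": 0.6, "b": 0.3, "c": 0.1}
label = predict_with_verifier(model_scores, x="input", is_valid=lambda x, y: y != "a")
```

Applied only at inference time, this wrapper guarantees that any returned label satisfies the requirement, whatever the underlying model was trained to do.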
-
Solving Rep-tile by Computers: Performance of Solvers and Analyses of Solutions
Authors:
Mutsunori Banbara,
Kenji Hashimoto,
Takashi Horiyama,
Shin-ichi Minato,
Kakeru Nakamura,
Masaaki Nishino,
Masahiko Sakai,
Ryuhei Uehara,
Yushi Uno,
Norihito Yasuda
Abstract:
A rep-tile is a polygon that can be dissected into smaller copies (of the same size) of the original polygon. A polyomino is a polygon that is formed by joining one or more unit squares edge to edge. These two notions were first introduced and investigated by Solomon W. Golomb in the 1950s and popularized by Martin Gardner in the 1960s. Since then, dozens of studies have been made in communities of recreational mathematics and puzzles. In this study, we first focus on the specific rep-tiles that have been investigated in these communities. Since the notion of rep-tiles is so simple that it can be formulated mathematically in a natural way, we can apply a representative puzzle solver, a MIP solver, and SAT-based solvers to the same rep-tile problem. Comparing their performance, we conclude that the puzzle solver is the weakest while the SAT-based solvers are the strongest in the context of simple puzzle solving. We then turn to analyses of the specific rep-tiles. Using some properties of the rep-tile patterns found by a solver, we can complete analyses of specific rep-tiles up to certain sizes. That is, up to certain sizes, we can determine the existence of solutions, count the solutions, or enumerate all the solutions for each size. In the last case, we find new series of solutions for the rep-tiles that had not previously been found in these communities.
Submitted 7 October, 2021;
originally announced October 2021.
-
Single-epoch supernova classification with deep convolutional neural networks
Authors:
Akisato Kimura,
Ichiro Takahashi,
Masaomi Tanaka,
Naoki Yasuda,
Naonori Ueda,
Naoki Yoshida
Abstract:
Type Ia supernovae (SNeIa) play a significant role in exploring the history of the expansion of the Universe, since they are the best-known standard candles with which we can accurately measure the distance to the objects. Finding large samples of SNeIa and investigating their detailed characteristics have become an important issue in cosmology and astronomy. Existing methods rely on a photometric approach that first measures the luminance of supernova candidates precisely and then fits the results to a parametric function of temporal changes in luminance. However, this approach inevitably requires multi-epoch observations and complex luminance measurements. In this work, we present a novel method for classifying SNeIa simply from single-epoch observation images without any complex measurements, by effectively integrating state-of-the-art computer vision methodology into the standard photometric approach. Our method first builds a convolutional neural network for estimating the luminance of supernovae from telescope images, and then constructs another neural network for the classification, where the estimated luminance and observation dates are used as features for classification. Both neural networks are integrated into a single deep neural network to classify SNeIa directly from observation images. Experimental results show the effectiveness of the proposed method and reveal classification performance comparable to existing photometric methods with multi-epoch observations.
Submitted 30 November, 2017;
originally announced November 2017.