Showing 1–2 of 2 results for author: Tolochinsky, E

Search v0.5.6 released 2020-02-24

arXiv:1805.07697 [pdf]

cs.CL

The UN Parallel Corpus Annotated for Translation Direction

Authors: Elad Tolochinsky, Ohad Mosafi, Ella Rabinovich, Shuly Wintner

Abstract: This work distinguishes between translated and original text in the UN protocol corpus. By modeling the problem as a classification problem, we can achieve up to 95% classification accuracy. We begin by deriving a parallel corpus for different language pairs annotated for translation direction, and then classify the data by using various feature extraction methods. We compare the different methods as well as the ability to distinguish between translated and original texts in the different languages. The annotated corpus is publicly available.

Submitted 19 May, 2018; originally announced May 2018.

arXiv:1802.07382 [pdf, other]

cs.LG cs.DS

Generic Coreset for Scalable Learning of Monotonic Kernels: Logistic Regression, Sigmoid and more

Authors: Elad Tolochinsky, Ibrahim Jubran, Dan Feldman

Abstract: A coreset (or core-set) is a small weighted \emph{subset} $Q$ of an input set $P$ with respect to a given \emph{monotonic} function $f:\mathbb{R}\to\mathbb{R}$ that \emph{provably} approximates its fitting loss $\sum_{p\in P}f(p\cdot x)$ for \emph{any} given $x\in\mathbb{R}^d$. Using $Q$ we can obtain an approximation of $x^*$ that minimizes this loss, by running \emph{existing} optimization algorithms on $Q$. In this work we provide: (i) A lower bound which proves that there are sets with no coresets smaller than $n=|P|$ for general monotonic loss functions. (ii) A proof that, under a natural assumption that holds e.g. for logistic regression and the sigmoid activation functions, a small coreset exists for \emph{any} input $P$. (iii) A generic coreset construction algorithm that computes such a small coreset $Q$ in $O(nd+n\log n)$ time, and (iv) Experimental results which demonstrate that our coresets are effective and are much smaller in practice than predicted in theory.

Submitted 23 December, 2021; v1 submitted 20 February, 2018; originally announced February 2018.
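
Note: the fitting loss $\sum_{p\in P}f(p\cdot x)$ in the abstract above, and the idea of approximating it with a small weighted subset, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names are hypothetical, $f$ is taken to be the logistic loss $\log(1+e^{z})$, the data is synthetic, and the subset is drawn by plain uniform sampling. Uniform sampling is not the paper's construction and carries no provable guarantee; it only shows what comparing the loss on $P$ with the weighted loss on $Q$ looks like for one fixed query $x$.

```python
import math
import random


def logistic_loss(z):
    """f(z) = log(1 + exp(z)), a monotonic function of z, computed stably."""
    if z > 0:
        return z + math.log1p(math.exp(-z))
    return math.log1p(math.exp(z))


def dot(p, x):
    """Inner product p . x of two equal-length vectors."""
    return sum(pi * xi for pi, xi in zip(p, x))


def fitting_loss(points, weights, x, f=logistic_loss):
    """Weighted fitting loss: sum over points of w(p) * f(p . x)."""
    return sum(w * f(dot(p, x)) for p, w in zip(points, weights))


def uniform_weighted_subset(P, m, rng):
    """Illustrative stand-in, NOT the paper's algorithm: sample m points
    uniformly without replacement and give each the weight n / m."""
    Q = rng.sample(P, m)
    return Q, [len(P) / m] * m


rng = random.Random(0)
d, n, m = 3, 5000, 500

# Synthetic input set P of n points in R^d (assumed data).
P = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
Q, w = uniform_weighted_subset(P, m, rng)

# One arbitrary query x; a true coreset must be accurate for every x.
x = [0.5, -1.0, 0.25]
full = fitting_loss(P, [1.0] * n, x)
approx = fitting_loss(Q, w, x)
rel_err = abs(full - approx) / full
print(f"loss on P: {full:.2f}, weighted loss on Q: {approx:.2f}, "
      f"relative error: {rel_err:.4f}")
```

The sketch checks a single $x$ only. The coreset property described in the abstract is stronger: the approximation must hold for any given $x$, which is why the lower bound in (i) and the assumption in (ii) matter.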
