Showing 1–4 of 4 results for author: El-Kishky, A

Searching in archive stat.
  1. arXiv:2002.00761  [pdf, other]

    cs.CL cs.IR cs.LG stat.ML

    Massively Multilingual Document Alignment with Cross-lingual Sentence-Mover's Distance

    Authors: Ahmed El-Kishky, Francisco Guzmán

    Abstract: Document alignment aims to identify pairs of documents in two distinct languages that are of comparable content or translations of each other. Such aligned data can be used for a variety of NLP tasks from training cross-lingual representations to mining parallel data for machine translation. In this paper we develop an unsupervised scoring function that leverages cross-lingual sentence embeddings…

    Submitted 11 October, 2020; v1 submitted 31 January, 2020; originally announced February 2020.

    Comments: In Proceedings of AACL-IJCNLP, 2020

  2. arXiv:1911.06407  [pdf, other]

    cs.LG cs.IR stat.ML

    Mining News Events from Comparable News Corpora: A Multi-Attribute Proximity Network Modeling Approach

    Authors: Hyungsul Kim, Ahmed El-Kishky, Xiang Ren, Jiawei Han

    Abstract: We present ProxiModel, a novel event mining framework for extracting high-quality structured event knowledge from large, redundant, and noisy news data sources. The proposed model differentiates itself from other approaches by modeling both the event correlation within each individual document as well as across the corpus. To facilitate this, we introduce the concept of a proximity-network, a nove…

    Submitted 14 November, 2019; originally announced November 2019.

  3. arXiv:1911.06154  [pdf, other]

    cs.CL cs.LG stat.ML

    CCAligned: A Massive Collection of Cross-Lingual Web-Document Pairs

    Authors: Ahmed El-Kishky, Vishrav Chaudhary, Francisco Guzmán, Philipp Koehn

    Abstract: Cross-lingual document alignment aims to identify pairs of documents in two distinct languages that are of comparable content or translations of each other. In this paper, we exploit the signals embedded in URLs to label web documents at scale with an average precision of 94.5% across different language pairs. We mine sixty-eight snapshots of the Common Crawl corpus and identify web document pairs…

    Submitted 11 October, 2020; v1 submitted 9 November, 2019; originally announced November 2019.

    Comments: EMNLP 2020

  4. arXiv:1908.07832  [pdf, other]

    cs.CL cs.LG stat.ML

    Parsimonious Morpheme Segmentation with an Application to Enriching Word Embeddings

    Authors: Ahmed El-Kishky, Frank Xu, Aston Zhang, Jiawei Han

    Abstract: Traditionally, many text-mining tasks treat individual word-tokens as the finest meaningful semantic granularity. However, in many languages and specialized corpora, words are composed by concatenating semantically meaningful subword structures. Word-level analysis cannot leverage the semantic information present in such subword structures. With regard to word embedding techniques, this leads to n…

    Submitted 13 November, 2019; v1 submitted 17 August, 2019; originally announced August 2019.