Showing 1–23 of 23 results for author: El-Kishky, A

  1. arXiv:2310.16303

    cs.CL cs.IR

    URL-BERT: Training Webpage Representations via Social Media Engagements

    Authors: Ayesha Qamar, Chetan Verma, Ahmed El-Kishky, Sumit Binnani, Sneha Mehta, Taylor Berg-Kirkpatrick

    Abstract: Understanding and representing webpages is crucial to online social networks where users may share and engage with URLs. Common language model (LM) encoders such as BERT can be used to understand and represent the textual content of webpages. However, these representations may not model thematic information of web domains and URLs or accurately capture their appeal to social media users. In this w…

    Submitted 24 October, 2023; originally announced October 2023.

  2. arXiv:2210.16586

    cs.CL cs.AI cs.LG cs.SI

    NTULM: Enriching Social Media Text Representations with Non-Textual Units

    Authors: Jinning Li, Shubhanshu Mishra, Ahmed El-Kishky, Sneha Mehta, Vivek Kulkarni

    Abstract: On social media, additional context is often present in the form of annotations and meta-data such as the post's author, mentions, hashtags, and hyperlinks. We refer to these annotations as Non-Textual Units (NTUs). We posit that NTUs provide social context beyond their textual semantics and leveraging these units can enrich social media text representations. In this work we construct an NTU-centr…

    Submitted 29 October, 2022; originally announced October 2022.

    Comments: 14 pages, 5 figures, Accepted to the Proceedings of the Eighth Workshop on Noisy User-generated Text (W-NUT 2022). URL: https://aclanthology.org/2022.wnut-1.7/

    MSC Class: 68T50; 68T07 ACM Class: I.2.7

  3. arXiv:2210.16271

    cs.SI cs.IR

    MiCRO: Multi-interest Candidate Retrieval Online

    Authors: Frank Portman, Stephen Ragain, Ahmed El-Kishky

    Abstract: Providing personalized recommendations in an environment where items exhibit ephemerality and temporal relevancy (e.g. in social media) presents a few unique challenges: (1) inductively understanding ephemeral appeal for items in a setting where new items are created frequently, (2) adapting to trends within engagement patterns where items may undergo temporal shifts in relevance, (3) accurately m…

    Submitted 28 October, 2022; originally announced October 2022.

    Comments: Preprint

  4. arXiv:2209.07562

    cs.CL

    TwHIN-BERT: A Socially-Enriched Pre-trained Language Model for Multilingual Tweet Representations at Twitter

    Authors: Xinyang Zhang, Yury Malkov, Omar Florez, Serim Park, Brian McWilliams, Jiawei Han, Ahmed El-Kishky

    Abstract: Pre-trained language models (PLMs) are fundamental for natural language processing applications. Most existing PLMs are not tailored to the noisy user-generated text on social media, and the pre-training does not factor in the valuable social engagement logs available in a social network. We present TwHIN-BERT, a multilingual language model productionized at Twitter, trained on in-domain data from…

    Submitted 26 August, 2023; v1 submitted 15 September, 2022; originally announced September 2022.

  5. arXiv:2209.05706

    cs.CL

    Non-Parametric Temporal Adaptation for Social Media Topic Classification

    Authors: Fatemehsadat Mireshghallah, Nikolai Vogler, Junxian He, Omar Florez, Ahmed El-Kishky, Taylor Berg-Kirkpatrick

    Abstract: User-generated social media data is constantly changing as new trends influence online discussion and personal information is deleted due to privacy concerns. However, most current NLP models are static and rely on fixed training data, which means they are unable to adapt to temporal change -- both test distribution shift and deleted training data -- without frequent, costly re-training. In this p…

    Submitted 15 May, 2023; v1 submitted 12 September, 2022; originally announced September 2022.

  6. kNN-Embed: Locally Smoothed Embedding Mixtures For Multi-interest Candidate Retrieval

    Authors: Ahmed El-Kishky, Thomas Markovich, Kenny Leung, Frank Portman, Aria Haghighi, Ying Xiao

    Abstract: Candidate retrieval is the first stage in recommendation systems, where a light-weight system is used to retrieve potentially relevant items for an input user. These candidate items are then ranked and pruned in later stages of recommender systems using a more complex ranking model. As the top of the recommendation funnel, it is important to retrieve a high-recall candidate set to feed into downst…

    Submitted 5 August, 2023; v1 submitted 12 May, 2022; originally announced May 2022.

    Comments: Pacific-Asia Conference on Knowledge Discovery and Data Mining. Cham: Springer Nature Switzerland, 2023 (PAKDD 2023)

  7. TwHIN: Embedding the Twitter Heterogeneous Information Network for Personalized Recommendation

    Authors: Ahmed El-Kishky, Thomas Markovich, Serim Park, Chetan Verma, Baekjin Kim, Ramy Eskander, Yury Malkov, Frank Portman, Sofía Samaniego, Ying Xiao, Aria Haghighi

    Abstract: Social networks, such as Twitter, form a heterogeneous information network (HIN) where nodes represent domain entities (e.g., user, content, advertiser, etc.) and edges represent one of many entity interactions (e.g., a user re-sharing content or "following" another). Interactions from multiple relation types can encode valuable information about social network entities not fully captured by a sing…

    Submitted 5 September, 2022; v1 submitted 10 February, 2022; originally announced February 2022.

    Journal ref: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2022)

  8. arXiv:2201.11675

    cs.SI cs.AI

    Learning Stance Embeddings from Signed Social Graphs

    Authors: John Pougué-Biyong, Akshay Gupta, Aria Haghighi, Ahmed El-Kishky

    Abstract: A key challenge in social network analysis is understanding the position, or stance, of people in the graph on a large set of topics. While past work has modeled (dis)agreement in social networks using signed graphs, these approaches have not modeled agreement patterns across a range of correlated topics. For instance, disagreement on one topic may make disagreement (or agreement) more likely for r…

    Submitted 14 December, 2022; v1 submitted 27 January, 2022; originally announced January 2022.

    Comments: WSDM 2023

    Journal ref: International Conference on Web Search and Data Mining (WSDM) 2023

  9. arXiv:2109.08627

    cs.CL

    Classification-based Quality Estimation: Small and Efficient Models for Real-world Applications

    Authors: Shuo Sun, Ahmed El-Kishky, Vishrav Chaudhary, James Cross, Francisco Guzmán, Lucia Specia

    Abstract: Sentence-level Quality estimation (QE) of machine translation is traditionally formulated as a regression task, and the performance of QE models is typically measured by Pearson correlation with human labels. Recent QE models have achieved previously-unseen levels of correlation with human judgments, but they rely on large multilingual contextualized language models that are computationally expens…

    Submitted 17 September, 2021; originally announced September 2021.

    Comments: EMNLP 2021

  10. arXiv:2107.08357

    cs.CL cs.CR

    As Easy as 1, 2, 3: Behavioural Testing of NMT Systems for Numerical Translation

    Authors: Jun Wang, Chang Xu, Francisco Guzman, Ahmed El-Kishky, Benjamin I. P. Rubinstein, Trevor Cohn

    Abstract: Mistranslated numbers have the potential to cause serious effects, such as financial loss or medical misinformation. In this work we develop comprehensive assessments of the robustness of neural machine translation systems to numerical text via behavioural testing. We explore a variety of numerical translation capabilities a system is expected to exhibit and design effective test examples to expos…

    Submitted 18 July, 2021; originally announced July 2021.

    Comments: Findings of ACL, to appear

  11. arXiv:2107.05243

    cs.CL cs.CR

    Putting words into the system's mouth: A targeted attack on neural machine translation using monolingual data poisoning

    Authors: Jun Wang, Chang Xu, Francisco Guzman, Ahmed El-Kishky, Yuqing Tang, Benjamin I. P. Rubinstein, Trevor Cohn

    Abstract: Neural machine translation systems are known to be vulnerable to adversarial test inputs, however, as we show in this paper, these systems are also vulnerable to training attacks. Specifically, we propose a poisoning attack in which a malicious adversary inserts a small poisoned sample of monolingual text into the training set of a system trained using back-translation. This sample is designed to…

    Submitted 12 July, 2021; originally announced July 2021.

    Comments: Findings of ACL, to appear

  12. arXiv:2105.15071

    cs.CL

    Adapting High-resource NMT Models to Translate Low-resource Related Languages without Parallel Data

    Authors: Wei-Jen Ko, Ahmed El-Kishky, Adithya Renduchintala, Vishrav Chaudhary, Naman Goyal, Francisco Guzmán, Pascale Fung, Philipp Koehn, Mona Diab

    Abstract: The scarcity of parallel data is a major obstacle for training high-quality machine translation systems for low-resource languages. Fortunately, some low-resource languages are linguistically related or similar to high-resource languages; these related languages may share many lexical or syntactic structures. In this work, we exploit this linguistic overlap to facilitate translating to and from a…

    Submitted 1 June, 2021; v1 submitted 31 May, 2021; originally announced May 2021.

    Comments: ACL 2021

  13. arXiv:2104.08597

    cs.CL

    XLEnt: Mining a Large Cross-lingual Entity Dataset with Lexical-Semantic-Phonetic Word Alignment

    Authors: Ahmed El-Kishky, Adithya Renduchintala, James Cross, Francisco Guzmán, Philipp Koehn

    Abstract: Cross-lingual named-entity lexica are an important resource to multilingual NLP tasks such as machine translation and cross-lingual wikification. While knowledge bases contain a large number of entities in high-resource languages such as English and French, corresponding entities for lower-resource languages are often missing. To address this, we propose Lexical-Semantic-Phonetic Align (LSP-Align)…

    Submitted 10 September, 2021; v1 submitted 17 April, 2021; originally announced April 2021.

  14. arXiv:2102.04020

    cs.CL

    Quality Estimation without Human-labeled Data

    Authors: Yi-Lin Tuan, Ahmed El-Kishky, Adithya Renduchintala, Vishrav Chaudhary, Francisco Guzmán, Lucia Specia

    Abstract: Quality estimation aims to measure the quality of translated content without access to a reference translation. This is crucial for machine translation systems in real-world scenarios where high-quality translation is needed. While many approaches exist for quality estimation, they are based on supervised machine learning requiring costly human labelled data. As an alternative, we propose a techni…

    Submitted 8 February, 2021; originally announced February 2021.

    Comments: Accepted by EACL2021

  15. arXiv:2010.11125

    cs.CL cs.LG

    Beyond English-Centric Multilingual Machine Translation

    Authors: Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin

    Abstract: Existing work in translation demonstrated the potential of massively multilingual machine translation by training a single model able to translate between any pair of languages. However, much of this work is English-Centric by training only on data which was translated from or to English. While this is supported by large sources of training data, it does not reflect translation needs worldwide. In…

    Submitted 21 October, 2020; originally announced October 2020.

  16. arXiv:2002.00761

    cs.CL cs.IR cs.LG stat.ML

    Massively Multilingual Document Alignment with Cross-lingual Sentence-Mover's Distance

    Authors: Ahmed El-Kishky, Francisco Guzmán

    Abstract: Document alignment aims to identify pairs of documents in two distinct languages that are of comparable content or translations of each other. Such aligned data can be used for a variety of NLP tasks from training cross-lingual representations to mining parallel data for machine translation. In this paper we develop an unsupervised scoring function that leverages cross-lingual sentence embeddings…

    Submitted 11 October, 2020; v1 submitted 31 January, 2020; originally announced February 2020.

    Comments: In Proceedings of AACL-IJCNLP, 2020

  17. arXiv:1911.06407

    cs.LG cs.IR stat.ML

    Mining News Events from Comparable News Corpora: A Multi-Attribute Proximity Network Modeling Approach

    Authors: Hyungsul Kim, Ahmed El-Kishky, Xiang Ren, Jiawei Han

    Abstract: We present ProxiModel, a novel event mining framework for extracting high-quality structured event knowledge from large, redundant, and noisy news data sources. The proposed model differentiates itself from other approaches by modeling both the event correlation within each individual document as well as across the corpus. To facilitate this, we introduce the concept of a proximity-network, a nove…

    Submitted 14 November, 2019; originally announced November 2019.

  18. arXiv:1911.06154

    cs.CL cs.LG stat.ML

    CCAligned: A Massive Collection of Cross-Lingual Web-Document Pairs

    Authors: Ahmed El-Kishky, Vishrav Chaudhary, Francisco Guzman, Philipp Koehn

    Abstract: Cross-lingual document alignment aims to identify pairs of documents in two distinct languages that are of comparable content or translations of each other. In this paper, we exploit the signals embedded in URLs to label web documents at scale with an average precision of 94.5% across different language pairs. We mine sixty-eight snapshots of the Common Crawl corpus and identify web document pairs…

    Submitted 11 October, 2020; v1 submitted 9 November, 2019; originally announced November 2019.

    Comments: EMNLP 2020

  19. arXiv:1911.00830

    cs.CV

    Leveraging Pretrained Image Classifiers for Language-Based Segmentation

    Authors: David Golub, Ahmed El-Kishky, Roberto Martín-Martín

    Abstract: Current semantic segmentation models cannot easily generalize to new object classes unseen during train time: they require additional annotated images and retraining. We propose a novel segmentation model that injects visual priors into semantic segmentation architectures, allowing them to segment out new target labels without retraining. As visual priors, we use the activations of pretrained imag…

    Submitted 10 March, 2020; v1 submitted 3 November, 2019; originally announced November 2019.

  20. arXiv:1910.06848

    cs.CL

    Facebook AI's WAT19 Myanmar-English Translation Task Submission

    Authors: Peng-Jen Chen, Jiajun Shen, Matt Le, Vishrav Chaudhary, Ahmed El-Kishky, Guillaume Wenzek, Myle Ott, Marc'Aurelio Ranzato

    Abstract: This paper describes Facebook AI's submission to the WAT 2019 Myanmar-English translation task. Our baseline systems are BPE-based transformer models. We explore methods to leverage monolingual data to improve generalization, including self-training, back-translation and their combination. We further improve results by using noisy channel re-ranking and ensembling. We demonstrate that these techni…

    Submitted 15 October, 2019; originally announced October 2019.

    Comments: The 6th Workshop on Asian Translation

  21. arXiv:1908.07832

    cs.CL cs.LG stat.ML

    Parsimonious Morpheme Segmentation with an Application to Enriching Word Embeddings

    Authors: Ahmed El-Kishky, Frank Xu, Aston Zhang, Jiawei Han

    Abstract: Traditionally, many text-mining tasks treat individual word-tokens as the finest meaningful semantic granularity. However, in many languages and specialized corpora, words are composed by concatenating semantically meaningful subword structures. Word-level analysis cannot leverage the semantic information present in such subword structures. With regard to word embedding techniques, this leads to n…

    Submitted 13 November, 2019; v1 submitted 17 August, 2019; originally announced August 2019.

  22. arXiv:1804.09931

    cs.CL

    Integrating Local Context and Global Cohesiveness for Open Information Extraction

    Authors: Qi Zhu, Xiang Ren, Jingbo Shang, Yu Zhang, Ahmed El-Kishky, Jiawei Han

    Abstract: Extracting entities and their relations from text is an important task for understanding massive text corpora. Open information extraction (IE) systems mine relation tuples (i.e., entity arguments and a predicate string to describe their relation) from sentences. These relation tuples are not confined to a predefined schema for the relations of interests. However, current Open IE systems focus on…

    Submitted 1 December, 2018; v1 submitted 26 April, 2018; originally announced April 2018.

    Comments: 8 pages + 1 page reference. Accepted to WSDM 2019

  23. arXiv:1406.6312

    cs.CL cs.IR cs.LG

    Scalable Topical Phrase Mining from Text Corpora

    Authors: Ahmed El-Kishky, Yanglei Song, Chi Wang, Clare Voss, Jiawei Han

    Abstract: While most topic modeling algorithms model text corpora with unigrams, human interpretation often relies on inherent grouping of terms into phrases. As such, we consider the problem of discovering topical phrases of mixed lengths. Existing work either performs post processing to the inference results of unigram-based topic models, or utilizes complex n-gram-discovery topic models. These methods ge…

    Submitted 18 November, 2014; v1 submitted 24 June, 2014; originally announced June 2014.

    Journal ref: Proceedings of the VLDB Endowment, Vol. 8(3), pp. 305 - 316, 2014