Showing 1–41 of 41 results for author: Callan, J

  1. arXiv:2404.04163  [pdf, other]

    cs.IR cs.CL

    Dwell in the Beginning: How Language Models Embed Long Documents for Dense Retrieval

    Authors: João Coelho, Bruno Martins, João Magalhães, Jamie Callan, Chenyan Xiong

    Abstract: This study investigates the existence of positional biases in Transformer-based models for text representation learning, particularly in the context of web document retrieval. We build on previous research that demonstrated loss of information in the middle of input sequences for causal language models, extending it to the domain of representation learning. We examine positional biases at various…

    Submitted 5 April, 2024; originally announced April 2024.

  2. arXiv:2402.04357  [pdf, other]

    cs.IR

    Building Retrieval Systems for the ClueWeb22-B Corpus

    Authors: Harshit Mehrotra, Jamie Callan, Zhen Fan

    Abstract: The ClueWeb22 dataset containing nearly 10 billion documents was released in 2022 to support academic and industry research. The goal of this project was to build retrieval baselines for the English section of the "super head" part (category B) of this dataset. These baselines can then be used by the research community to compare their systems and also to generate data to train/evaluate new retrie…

    Submitted 6 February, 2024; originally announced February 2024.

  3. arXiv:2311.10516  [pdf, other]

    cs.SE

    User-Centric Deployment of Automated Program Repair at Bloomberg

    Authors: David Williams, James Callan, Serkan Kirbas, Sergey Mechtaev, Justyna Petke, Thomas Prideaux-Ghee, Federica Sarro

    Abstract: Automated program repair (APR) tools have unlocked the potential for the rapid rectification of codebase issues. However, to encourage wider adoption of program repair in practice, it is necessary to address the usability concerns related to generating irrelevant or out-of-context patches. When software engineers are presented with patches they deem uninteresting or unhelpful, they are burdened wi…

    Submitted 17 November, 2023; originally announced November 2023.

  4. arXiv:2310.19813  [pdf, ps, other]

    cs.SE cs.AI cs.LG cs.NE

    Enhancing Genetic Improvement Mutations Using Large Language Models

    Authors: Alexander E. I. Brownlee, James Callan, Karine Even-Mendoza, Alina Geiger, Carol Hanna, Justyna Petke, Federica Sarro, Dominik Sobania

    Abstract: Large language models (LLMs) have been successfully applied to software engineering tasks, including program repair. However, their application in search-based techniques such as Genetic Improvement (GI) is still largely unexplored. In this paper, we evaluate the use of LLMs as mutation operators for GI to improve the search process. We expand the Gin Java GI toolkit to call OpenAI's API to genera…

    Submitted 18 October, 2023; originally announced October 2023.

    Comments: Accepted for publication at the Symposium on Search-Based Software Engineering (SSBSE) 2023

    Journal ref: Arcaini, P., Yue, T., Fredericks, E.M. (eds) Search-Based Software Engineering. SSBSE 2023. Lecture Notes in Computer Science, vol 14415. Springer, Cham

  5. arXiv:2308.11387  [pdf, other]

    cs.SE

    Multi-Objective Improvement of Android Applications

    Authors: James Callan, Justyna Petke

    Abstract: Non-functional properties, such as runtime or memory use, are important to mobile app users and developers, as they affect user experience. Previous work on automated improvement of non-functional properties in mobile apps failed to address the inherent trade-offs between such properties. We propose a practical approach and the first open-source tool, GIDroid (2023), for multi-objective automated…

    Submitted 22 August, 2023; originally announced August 2023.

    Comments: 32 pages, 8 Figures

  6. arXiv:2305.06983  [pdf, other]

    cs.CL cs.LG

    Active Retrieval Augmented Generation

    Authors: Zhengbao Jiang, Frank F. Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, Graham Neubig

    Abstract: Despite the remarkable ability of large language models (LMs) to comprehend and generate language, they have a tendency to hallucinate and create factually inaccurate output. Augmenting LMs by retrieving information from external knowledge resources is one promising solution. Most existing retrieval augmented LMs employ a retrieve-and-generate setup that only retrieves information once based on th…

    Submitted 21 October, 2023; v1 submitted 11 May, 2023; originally announced May 2023.

    Comments: EMNLP 2023

  7. arXiv:2212.10496  [pdf, other]

    cs.IR cs.CL

    Precise Zero-Shot Dense Retrieval without Relevance Labels

    Authors: Luyu Gao, Xueguang Ma, Jimmy Lin, Jamie Callan

    Abstract: While dense retrieval has been shown effective and efficient across tasks and languages, it remains difficult to create effective fully zero-shot dense retrieval systems when no relevance label is available. In this paper, we recognize the difficulty of zero-shot learning and encoding relevance. Instead, we propose to pivot through Hypothetical Document Embeddings (HyDE). Given a query, HyDE first…

    Submitted 20 December, 2022; originally announced December 2022.

  8. arXiv:2212.02027  [pdf, other]

    cs.CL cs.LG

    Retrieval as Attention: End-to-end Learning of Retrieval and Reading within a Single Transformer

    Authors: Zhengbao Jiang, Luyu Gao, Jun Araki, Haibo Ding, Zhiruo Wang, Jamie Callan, Graham Neubig

    Abstract: Systems for knowledge-intensive tasks such as open-domain question answering (QA) usually consist of two stages: efficient retrieval of relevant documents from a large corpus and detailed reading of the selected documents to generate answers. Retrievers and readers are usually modeled separately, which necessitates a cumbersome implementation and is hard to train and adapt in an end-to-end fashion…

    Submitted 4 December, 2022; originally announced December 2022.

    Comments: EMNLP 2022

  9. arXiv:2211.15848  [pdf, other]

    cs.IR cs.AI cs.CL

    ClueWeb22: 10 Billion Web Documents with Visual and Semantic Information

    Authors: Arnold Overwijk, Chenyan Xiong, Xiao Liu, Cameron VandenBerg, Jamie Callan

    Abstract: ClueWeb22, the newest iteration of the ClueWeb line of datasets, provides 10 billion web pages affiliated with rich information. Its design was influenced by the need for a high quality, large scale web corpus to support a range of academic and industry research, for example, in information systems, retrieval-augmented AI systems, and model pretraining. Compared with earlier ClueWeb corpora, the C…

    Submitted 1 December, 2022; v1 submitted 28 November, 2022; originally announced November 2022.

  10. arXiv:2211.10435  [pdf, other]

    cs.CL cs.AI

    PAL: Program-aided Language Models

    Authors: Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, Graham Neubig

    Abstract: Large language models (LLMs) have recently demonstrated an impressive ability to perform arithmetic and symbolic reasoning tasks, when provided with a few examples at test time ("few-shot prompting"). Much of this success can be attributed to prompting methods such as "chain-of-thought", which employ LLMs for both understanding the problem description by decomposing it into steps, as well as solv…

    Submitted 27 January, 2023; v1 submitted 18 November, 2022; originally announced November 2022.

    Comments: The first three authors contributed equally. Our code and data are publicly available at http://reasonwithpal.com/

  11. Long Document Re-ranking with Modular Re-ranker

    Authors: Luyu Gao, Jamie Callan

    Abstract: Long document re-ranking has been a challenging problem for neural re-rankers based on deep language models like BERT. Early work breaks the documents into short passage-like chunks. These chunks are independently mapped to scalar scores or latent vectors, which are then pooled into a final relevance score. These encode-and-pool methods however inevitably introduce an information bottleneck: the l…

    Submitted 6 June, 2022; v1 submitted 9 May, 2022; originally announced May 2022.

    Comments: SIGIR 22

  12. arXiv:2203.05765  [pdf, other]

    cs.IR cs.CL

    Tevatron: An Efficient and Flexible Toolkit for Dense Retrieval

    Authors: Luyu Gao, Xueguang Ma, Jimmy Lin, Jamie Callan

    Abstract: Recent rapid advancements in deep pre-trained language models and the introductions of large datasets have powered research in embedding-based dense retrieval. While several good research papers have emerged, many of them come with their own software stacks. These stacks are typically optimized for some particular research goals instead of efficiency or code structure. In this paper, we present Te…

    Submitted 11 March, 2022; originally announced March 2022.

  13. arXiv:2108.13454  [pdf, other]

    cs.IR cs.AI

    Improving Query Representations for Dense Retrieval with Pseudo Relevance Feedback

    Authors: HongChien Yu, Chenyan Xiong, Jamie Callan

    Abstract: Dense retrieval systems conduct first-stage retrieval using embedded representations and simple similarity metrics to match a query to documents. Its effectiveness depends on encoded embeddings to capture the semantics of queries and documents, a challenging task due to the shortness and ambiguity of search queries. This paper proposes ANCE-PRF, a new query encoder that uses pseudo relevance feedb…

    Submitted 30 August, 2021; originally announced August 2021.

    Comments: Accepted at CIKM 2021

  14. arXiv:2108.05540  [pdf, other]

    cs.IR cs.CL

    Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval

    Authors: Luyu Gao, Jamie Callan

    Abstract: Recent research demonstrates the effectiveness of using fine-tuned language models (LM) for dense retrieval. However, dense retrievers are hard to train, typically requiring heavily engineered fine-tuning pipelines to realize their full potential. In this paper, we identify and address two underlying problems of dense retrievers: i) fragility to training data noise and ii) requiring large batches…

    Submitted 12 August, 2021; originally announced August 2021.

  15. arXiv:2104.08253  [pdf, other]

    cs.CL cs.IR

    Condenser: a Pre-training Architecture for Dense Retrieval

    Authors: Luyu Gao, Jamie Callan

    Abstract: Pre-trained Transformer language models (LM) have become go-to text representation encoders. Prior research fine-tunes deep LMs to encode text sequences such as sentences and passages into single dense vector representations for efficient text comparison and retrieval. However, dense encoders require a lot of data and sophisticated techniques to effectively train and suffer in low data situations.…

    Submitted 20 September, 2021; v1 submitted 16 April, 2021; originally announced April 2021.

    Comments: EMNLP 2021

  16. arXiv:2104.07186  [pdf, other]

    cs.IR cs.CL

    COIL: Revisit Exact Lexical Match in Information Retrieval with Contextualized Inverted List

    Authors: Luyu Gao, Zhuyun Dai, Jamie Callan

    Abstract: Classical information retrieval systems such as BM25 rely on exact lexical match and carry out search efficiently with inverted list index. Recent neural IR models shift towards soft semantic matching all query document terms, but they lose the computation efficiency of exact match systems. This paper presents COIL, a contextualized exact match retrieval architecture that brings semantic lexical…

    Submitted 14 April, 2021; originally announced April 2021.

    Comments: NAACL 2021

  17. arXiv:2101.08751  [pdf, other]

    cs.IR

    Rethink Training of BERT Rerankers in Multi-Stage Retrieval Pipeline

    Authors: Luyu Gao, Zhuyun Dai, Jamie Callan

    Abstract: Pre-trained deep language models (LM) have advanced the state-of-the-art of text retrieval. Rerankers fine-tuned from deep LM estimate candidate relevance based on rich contextualized matching signals. Meanwhile, deep LMs can also be leveraged to improve search index, building retrievers with better recall. One would expect a straightforward combination of both in a pipeline to have additive perf…

    Submitted 21 January, 2021; originally announced January 2021.

    Comments: ECIR 2021

  18. arXiv:2101.08705  [pdf, ps, other]

    cs.IR

    Assessing the Benefits of Model Ensembles in Neural Re-Ranking for Passage Retrieval

    Authors: Luís Borges, Bruno Martins, Jamie Callan

    Abstract: Our work aimed at experimentally assessing the benefits of model ensembling within the context of neural methods for passage reranking. Starting from relatively standard neural models, we use a previous technique named Fast Geometric Ensembling to generate multiple model instances from particular training schedules, then focusing our attention on different types of approaches for combining the resu…

    Submitted 21 January, 2021; originally announced January 2021.

    Comments: ECIR 2021 short paper

  19. arXiv:2101.07918  [pdf, other]

    cs.IR cs.AI cs.LG

    PGT: Pseudo Relevance Feedback Using a Graph-Based Transformer

    Authors: HongChien Yu, Zhuyun Dai, Jamie Callan

    Abstract: Most research on pseudo relevance feedback (PRF) has been done in vector space and probabilistic retrieval models. This paper shows that Transformer-based rerankers can also benefit from the extra context that PRF provides. It presents PGT, a graph-based Transformer that sparsifies attention between graph nodes to enable PRF while avoiding the high computational complexity of most Transformer arch…

    Submitted 19 January, 2021; originally announced January 2021.

    Comments: Accepted at ECIR 2021 (short paper track)

  20. arXiv:2101.06983  [pdf, other]

    cs.LG cs.CL cs.IR

    Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup

    Authors: Luyu Gao, Yunyi Zhang, Jiawei Han, Jamie Callan

    Abstract: Contrastive learning has been applied successfully to learn vector representations of text. Previous research demonstrated that learning high-quality representations benefits from batch-wise contrastive loss with a large number of negatives. In practice, the technique of in-batch negative is used, where for each example in a batch, other batch examples' positives will be taken as its negatives, av…

    Submitted 14 June, 2021; v1 submitted 18 January, 2021; originally announced January 2021.

    Comments: RepL4NLP 2021

  21. Generating Categories for Sets of Entities

    Authors: Shuo Zhang, Krisztian Balog, Jamie Callan

    Abstract: Category systems are central components of knowledge bases, as they provide a hierarchical grouping of semantically related concepts and entities. They are a unique and valuable resource that is utilized in a broad range of information access tasks. To aid knowledge editors in the manual process of expanding a category system, this paper presents a method of generating categories for sets of entit…

    Submitted 19 August, 2020; originally announced August 2020.

    Comments: Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM '20)

  22. arXiv:2008.07688  [pdf, other]

    cs.LG cs.AI cs.IR stat.ML

    Ranking Clarification Questions via Natural Language Inference

    Authors: Vaibhav Kumar, Vikas Raunak, Jamie Callan

    Abstract: Given a natural language query, teaching machines to ask clarifying questions is of immense utility in practical natural language processing systems. Such interactions could help in filling information gaps for better machine comprehension of the query. For the task of ranking clarification questions, we hypothesize that determining whether a clarification question pertains to a missing entry in a…

    Submitted 17 August, 2020; originally announced August 2020.

    Comments: Accepted at CIKM 2020

  23. Understanding BERT Rankers Under Distillation

    Authors: Luyu Gao, Zhuyun Dai, Jamie Callan

    Abstract: Deep language models such as BERT pre-trained on a large corpus have given a huge performance boost to the state-of-the-art information retrieval ranking systems. Knowledge embedded in such models allows them to pick up complex matching signals between passages and queries. However, the high computation cost during inference limits their deployment in real-world search scenarios. In this paper, we s…

    Submitted 21 July, 2020; originally announced July 2020.

  24. Summarizing and Exploring Tabular Data in Conversational Search

    Authors: Shuo Zhang, Zhuyun Dai, Krisztian Balog, Jamie Callan

    Abstract: Tabular data provide answers to a significant portion of search queries. However, reciting an entire result table is impractical in conversational search systems. We propose to generate natural language summaries as answers to describe the complex information contained in a table. Through crowdsourcing experiments, we build a new conversation-oriented, open-domain table summarization dataset. It i…

    Submitted 10 July, 2020; v1 submitted 23 May, 2020; originally announced May 2020.

    Comments: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2020), 2020

  25. arXiv:2004.13969  [pdf, other]

    cs.IR

    Complementing Lexical Retrieval with Semantic Residual Embedding

    Authors: Luyu Gao, Zhuyun Dai, Tongfei Chen, Zhen Fan, Benjamin Van Durme, Jamie Callan

    Abstract: This paper presents CLEAR, a retrieval model that seeks to complement classical lexical exact-match models such as BM25 with semantic matching signals from a neural embedding matching model. CLEAR explicitly trains the neural embedding to encode language structures and semantics that lexical retrieval fails to capture with a novel residual-based embedding learning method. Empirical evaluations dem…

    Submitted 29 March, 2021; v1 submitted 29 April, 2020; originally announced April 2020.

    Comments: ECIR 2021

  26. arXiv:2004.13313  [pdf, other]

    cs.IR

    Modularized Transfomer-based Ranking Framework

    Authors: Luyu Gao, Zhuyun Dai, Jamie Callan

    Abstract: Recent innovations in Transformer-based ranking models have advanced the state-of-the-art in information retrieval. However, these Transformers are computationally expensive, and their opaque hidden states make it hard to understand the ranking process. In this work, we modularize the Transformer ranker into separate modules for text representation and interaction. We show how this design enables…

    Submitted 6 October, 2020; v1 submitted 28 April, 2020; originally announced April 2020.

  27. arXiv:2003.13624  [pdf, other]

    cs.IR cs.CL cs.LG

    TREC CAsT 2019: The Conversational Assistance Track Overview

    Authors: Jeffrey Dalton, Chenyan Xiong, Jamie Callan

    Abstract: The Conversational Assistance Track (CAsT) is a new track for TREC 2019 to facilitate Conversational Information Seeking (CIS) research and to create a large-scale reusable test collection for conversational search systems. The document corpus is 38,426,252 passages from the TREC Complex Answer Retrieval (CAR) and Microsoft MAchine Reading COmprehension (MARCO) datasets. Eighty information seeking…

    Submitted 30 March, 2020; originally announced March 2020.

  28. arXiv:1910.10687  [pdf, other]

    cs.IR

    Context-Aware Sentence/Passage Term Importance Estimation For First Stage Retrieval

    Authors: Zhuyun Dai, Jamie Callan

    Abstract: Term frequency is a common method for identifying the importance of a term in a query or document. But it is a weak signal, especially when the frequency distribution is flat, such as in long queries or short documents where the text is of sentence/passage-length. This paper proposes a Deep Contextualized Term Weighting framework that learns to map BERT's contextualized text representations to con…

    Submitted 26 November, 2019; v1 submitted 23 October, 2019; originally announced October 2019.

  29. Deeper Text Understanding for IR with Contextual Neural Language Modeling

    Authors: Zhuyun Dai, Jamie Callan

    Abstract: Neural networks provide new possibilities to automatically learn complex language patterns and query-document relations. Neural IR models have achieved promising results in learning query-document relevance patterns, but few explorations have been done on understanding the text content of a query or a document. This paper studies leveraging a recently-proposed contextual neural language model, BER…

    Submitted 22 May, 2019; originally announced May 2019.

    Comments: In proceedings of SIGIR 2019

  30. Modeling Temporal Evidence from External Collections

    Authors: Flávio Martins, João Magalhães, Jamie Callan

    Abstract: Newsworthy events are broadcast through multiple mediums and prompt the crowds to produce comments on social media. In this paper, we propose to leverage these behavioral dynamics to estimate the most relevant time periods for an event (i.e., query). Recent advances have shown how to improve the estimation of the temporal relevance of such topics. In this approach, we build on two major noveltie…

    Submitted 2 December, 2018; originally announced December 2018.

    Comments: To appear in WSDM 2019

    ACM Class: H.3.3

  31. A Vertical PRF Architecture for Microblog Search

    Authors: Flávio Martins, João Magalhães, Jamie Callan

    Abstract: In microblog retrieval, query expansion can be essential to obtain good search results due to the short size of queries and posts. Since information in microblogs is highly dynamic, an up-to-date index coupled with pseudo-relevance feedback (PRF) with an external corpus has a higher chance of retrieving more relevant documents and improving ranking. In this paper, we focus on the research question…

    Submitted 8 October, 2018; originally announced October 2018.

    Comments: To appear in ICTIR 2018

    ACM Class: H.3.3

  32. arXiv:1810.02907  [pdf]

    cs.IR

    Sifaka: Text Mining Above a Search API

    Authors: Cameron VandenBerg, Jamie Callan

    Abstract: Text mining and analytics software has become popular, but little attention has been paid to the software architectures of such systems. Often they are built from scratch using special-purpose software and data structures, which increases their cost and complexity. This demo paper describes Sifaka, a new open-source text mining application constructed above a standard search engine index using exi…

    Submitted 5 October, 2018; originally announced October 2018.

    Comments: 5 pages, 4 figures

  33. arXiv:1809.10522  [pdf, other]

    cs.IR cs.CL cs.LG

    Consistency and Variation in Kernel Neural Ranking Model

    Authors: Mary Arpita Pyreddy, Varshini Ramaseshan, Narendra Nath Joshi, Zhuyun Dai, Chenyan Xiong, Jamie Callan, Zhiyuan Liu

    Abstract: This paper studies the consistency of the kernel-based neural ranking model K-NRM, a recent state-of-the-art neural IR model, which is important for reproducible research and deployment in the industry. We find that K-NRM has low variance on relevance-based metrics across experimental trials. In spite of this low variance in overall performance, different trials produce different document rankings…

    Submitted 27 September, 2018; originally announced September 2018.

    Comments: 4 pages, 4 figures, 2 tables

    Journal ref: Mary Arpita Pyreddy et al. 2018. Consistency and Variation in Kernel Neural Ranking Model. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (SIGIR '18). ACM, New York, NY, USA, 961-964

  34. Towards Better Text Understanding and Retrieval through Kernel Entity Salience Modeling

    Authors: Chenyan Xiong, Zhengzhong Liu, Jamie Callan, Tie-Yan Liu

    Abstract: This paper presents a Kernel Entity Salience Model (KESM) that improves text understanding and retrieval by better estimating entity salience (importance) in documents. KESM represents entities by knowledge enriched distributed representations, models the interactions between entities and words by kernels, and combines the kernel scores to estimate entity salience. The whole model is learned end-t…

    Submitted 3 May, 2018; originally announced May 2018.

    Journal ref: In proceedings of SIGIR 2018

  35. arXiv:1804.02734  [pdf, other]

    cs.IR

    A Structure-Oriented Unsupervised Crawling Strategy for Social Media Sites

    Authors: Keyang Xu, Kyle Yingkai Gao, Jamie Callan

    Abstract: Existing techniques for efficiently crawling social media sites rely on URL patterns, query logs, and human supervision. This paper describes SOUrCe, a structure-oriented unsupervised crawler that uses page structures to learn how to crawl a social media site efficiently. SOUrCe consists of two stages. During its unsupervised learning phase, SOUrCe constructs a sitemap that clusters pages based on…

    Submitted 8 April, 2018; originally announced April 2018.

  36. arXiv:1707.03423  [pdf, other]

    cs.IR

    Scientific Table Search Using Keyword Queries

    Authors: Kyle Yingkai Gao, Jamie Callan

    Abstract: Tables are common and important in scientific documents, yet most text-based document search systems do not capture structures and semantics specific to tables. How to bridge different types of mismatch between keyword queries and scientific tables and what influences ranking quality needs to be carefully investigated. This paper considers the structure of tables and gives different emphasis to t…

    Submitted 11 July, 2017; originally announced July 2017.

  37. Word-Entity Duet Representations for Document Ranking

    Authors: Chenyan Xiong, Jamie Callan, Tie-Yan Liu

    Abstract: This paper presents a word-entity duet framework for utilizing knowledge bases in ad-hoc retrieval. In this work, the query and documents are modeled by word-based representations and entity-based representations. Ranking features are generated by the interactions between the two representations, incorporating information from the word space, the entity space, and the cross-space connections throu…

    Submitted 20 June, 2017; originally announced June 2017.

    Journal ref: SIGIR 2017

  38. End-to-End Neural Ad-hoc Ranking with Kernel Pooling

    Authors: Chenyan Xiong, Zhuyun Dai, Jamie Callan, Zhiyuan Liu, Russell Power

    Abstract: This paper proposes K-NRM, a kernel based neural model for document ranking. Given a query and a set of documents, K-NRM uses a translation matrix that models word-level similarities via word embeddings, a new kernel-pooling technique that uses kernels to extract multi-level soft match features, and a learning-to-rank layer that combines those features into the final ranking score. The whole model…

    Submitted 20 June, 2017; originally announced June 2017.

  39. Barbara Made the News: Mining the Behavior of Crowds for Time-Aware Learning to Rank

    Authors: Flávio Martins, João Magalhães, Jamie Callan

    Abstract: In Twitter, and other microblogging services, the generation of new content by the crowd is often biased towards immediacy: what is happening now. Prompted by the propagation of commentary and information through multiple mediums, users on the Web interact with and produce new posts about newsworthy topics and give rise to trending topics. This paper proposes to leverage the behavioral dynamics…

    Submitted 9 February, 2016; originally announced February 2016.

    Comments: To appear in WSDM 2016

    ACM Class: H.3.3

  40. arXiv:1307.0261  [pdf, other]

    cs.LG cs.CL cs.IR

    WebSets: Extracting Sets of Entities from the Web Using Unsupervised Information Extraction

    Authors: Bhavana Dalvi, William W. Cohen, Jamie Callan

    Abstract: We describe an open-domain information extraction method for extracting concept-instance pairs from an HTML corpus. Most earlier approaches to this problem rely on combining clusters of distributionally similar terms and concept-instance pairs obtained with Hearst patterns. In contrast, our method relies on a novel approach for clustering terms found in HTML tables, and then assigning concept names…

    Submitted 30 June, 2013; originally announced July 2013.

    Comments: 10 pages; International Conference on Web Search and Data Mining 2012

  41. arXiv:1307.0253  [pdf, other]

    cs.LG

    Exploratory Learning

    Authors: Bhavana Dalvi, William W. Cohen, Jamie Callan

    Abstract: In multiclass semi-supervised learning (SSL), it is sometimes the case that the number of classes present in the data is not known, and hence no labeled examples are provided for some classes. In this paper we present variants of well-known semi-supervised multiclass learning methods that are robust when the data contains an unknown number of classes. In particular, we present an "exploratory" ext…

    Submitted 30 June, 2013; originally announced July 2013.

    Comments: 16 pages; European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2013