
Ragnarök: A Reusable RAG Framework and Baselines for TREC 2024 Retrieval-Augmented Generation Track

Authors: Ronak Pradeep, Nandan Thakur, Sahel Sharifymoghaddam, Eric Zhang, Ryan Nguyen, Daniel Campos, Nick Craswell, Jimmy Lin

Abstract: Did you try out the new Bing Search? Or maybe you fiddled around with Google AI Overviews? These might sound familiar because the modern-day search stack has recently evolved to include retrieval-augmented generation (RAG) systems. They allow searching and incorporating real-time data into large language models (LLMs) to provide a well-informed, attributed, concise summary in contrast to the traditional search paradigm that relies on displaying a ranked list of documents. Therefore, given these recent advancements, it is crucial to have an arena to build, test, visualize, and systematically evaluate RAG-based search systems. With this in mind, we propose the TREC 2024 RAG Track to foster innovation in evaluating RAG systems. In our work, we lay out the steps we've made towards making this track a reality -- we describe the details of our reusable framework, Ragnarök, explain the curation of the new MS MARCO V2.1 collection choice, release the development topics for the track, and standardize the I/O definitions which assist the end user. Next, using Ragnarök, we identify and provide key industrial baselines such as OpenAI's GPT-4o or Cohere's Command R+. Further, we introduce a web-based user interface for an interactive arena allowing benchmarking pairwise RAG systems by crowdsourcing. We open-source our Ragnarök framework and baselines to achieve a unified standard for future RAG systems.

Submitted 24 June, 2024; originally announced June 2024.

arXiv:2405.07767 [pdf, other]

Synthetic Test Collections for Retrieval Evaluation

Authors: Hossein A. Rahmani, Nick Craswell, Emine Yilmaz, Bhaskar Mitra, Daniel Campos

Abstract: Test collections play a vital role in evaluation of information retrieval (IR) systems. Obtaining a diverse set of user queries for test collection construction can be challenging, and acquiring relevance judgments, which indicate the appropriateness of retrieved documents to a query, is often costly and resource-intensive. Generating synthetic datasets using Large Language Models (LLMs) has recently gained significant attention in various applications. In IR, while previous work exploited the capabilities of LLMs to generate synthetic queries or documents to augment training data and improve the performance of ranking models, using LLMs for constructing synthetic test collections is relatively unexplored. Previous studies demonstrate that LLMs have the potential to generate synthetic relevance judgments for use in the evaluation of IR systems. In this paper, we comprehensively investigate whether it is possible to use LLMs to construct fully synthetic test collections by generating not only synthetic judgments but also synthetic queries. In particular, we analyse whether it is possible to construct reliable synthetic test collections and the potential risks of bias such test collections may exhibit towards LLM-based models. Our experiments indicate that using LLMs it is possible to construct synthetic test collections that can reliably be used for retrieval evaluation.

Submitted 13 May, 2024; originally announced May 2024.

Comments: SIGIR 2024

arXiv:2405.05374 [pdf, other]

Arctic-Embed: Scalable, Efficient, and Accurate Text Embedding Models

Authors: Luke Merrick, Danmei Xu, Gaurav Nuti, Daniel Campos

Abstract: This report describes the training dataset creation and recipe behind the family of arctic-embed text embedding models (a set of five models ranging from 22 to 334 million parameters with weights open-sourced under an Apache-2 license). At the time of their release, each model achieved state-of-the-art retrieval accuracy for models of their size on the MTEB Retrieval leaderboard, with the largest model, arctic-embed-l outperforming closed source embedding models such as Cohere's embed-v3 and Open AI's text-embed-3-large. In addition to the details of our training recipe, we have provided several informative ablation studies, which we believe are the cause of our model performance.

Submitted 8 May, 2024; originally announced May 2024.

Comments: 17 pages, 11 Figures, 9 tables

arXiv:2404.13990 [pdf, other]

QCore: Data-Efficient, On-Device Continual Calibration for Quantized Models -- Extended Version

Authors: David Campos, Bin Yang, Tung Kieu, Miao Zhang, Chenjuan Guo, Christian S. Jensen

Abstract: We are witnessing an increasing availability of streaming data that may contain valuable information on the underlying processes. It is thus attractive to be able to deploy machine learning models on edge devices near sensors such that decisions can be made instantaneously, rather than first having to transmit incoming data to servers. To enable deployment on edge devices with limited storage and computational capabilities, the full-precision parameters in standard models can be quantized to use fewer bits. The resulting quantized models are then calibrated using back-propagation and full training data to ensure accuracy. This one-time calibration works for deployments in static environments. However, model deployment in dynamic edge environments call for continual calibration to adaptively adjust quantized models to fit new incoming data, which may have different distributions. The first difficulty in enabling continual calibration on the edge is that the full training data may be too large and thus not always available on edge devices. The second difficulty is that the use of back-propagation on the edge for repeated calibration is too expensive. We propose QCore to enable continual calibration on the edge. First, it compresses the full training data into a small subset to enable effective calibration of quantized models with different bit-widths. We also propose means of updating the subset when new streaming data arrives to reflect changes in the environment, while not forgetting earlier training data. Second, we propose a small bit-flipping network that works with the subset to update quantized model parameters, thus enabling efficient continual calibration without back-propagation. An experimental study, conducted with real-world data in a continual learning setting, offers insight into the properties of QCore and shows that it is capable of outperforming strong baseline methods.

Submitted 22 April, 2024; originally announced April 2024.

Comments: 15 pages. An extended version of "QCore: Data-Efficient, On-Device Continual Calibration for Quantized Models" accepted at PVLDB 2024

arXiv:2403.16859 [pdf, other]

A Semi-Lagrangian Approach for Time and Energy Path Planning Optimization in Static Flow Fields

Authors: Víctor C. da S. Campos, Armando A. Neto, Douglas G. Macharet

Abstract: Efficient path planning for autonomous mobile robots is a critical problem across numerous domains, where optimizing both time and energy consumption is paramount. This paper introduces a novel methodology that considers the dynamic influence of an environmental flow field and considers geometric constraints, including obstacles and forbidden zones, enriching the complexity of the planning problem. We formulate it as a multi-objective optimal control problem, propose a novel transformation called Harmonic Transformation, and apply a semi-Lagrangian scheme to solve it. The set of Pareto efficient solutions is obtained considering two distinct approaches: a deterministic method and an evolutionary-based one, both of which are designed to make use of the proposed Harmonic Transformation. Through an extensive analysis of these approaches, we demonstrate their efficacy in finding optimized paths.

Submitted 14 June, 2024; v1 submitted 25 March, 2024; originally announced March 2024.

Comments: 12 pages, initial paper submission; Preprint submitted to Journal of the Franklin Institute

arXiv:2311.07861 [pdf, other]

Overview of the TREC 2023 Product Product Search Track

Authors: Daniel Campos, Surya Kallumadi, Corby Rosset, Cheng Xiang Zhai, Alessandro Magnani

Abstract: This is the first year of the TREC Product search track. The focus this year was the creation of a reusable collection and evaluation of the impact of the use of metadata and multi-modal data on retrieval accuracy. This year we leverage the new product search corpus, which includes contextual metadata. Our analysis shows that in the product search domain, traditional retrieval systems are highly effective and commonly outperform general-purpose pretrained embedding models. Our analysis also evaluates the impact of using simplified and metadata-enhanced collections, finding no clear trend in the impact of the expanded collection. We also see some surprising outcomes; despite their widespread adoption and competitive performance on other tasks, we find single-stage dense retrieval runs can commonly be noncompetitive or generate low-quality results both in the zero-shot and fine-tuned domain.

Submitted 15 November, 2023; v1 submitted 13 November, 2023; originally announced November 2023.

Comments: 14 pages, 4 figures, 11 tables - TREC 2023

arXiv:2305.03431 [pdf, other]

Hearing the voice of experts: Unveiling Stack Exchange communities' knowledge of test smells

Authors: Luana Martins, Denivan Campos, Railana Santana, Joselito Mota Junior, Heitor Costa, Ivan Machado

Abstract: Refactorings are transformations to improve the code design without changing overall functionality and observable behavior. During the refactoring process of smelly test code, practitioners may struggle to identify refactoring candidates and define and apply corrective strategies. This paper reports on an empirical study aimed at understanding how test smells and test refactorings are discussed on the Stack Exchange network. Developers commonly count on Stack Exchange to pick the brains of the wise, i.e., to 'look up' how others are completing similar tasks. Therefore, in light of data from the Stack Exchange discussion topics, we could examine how developers understand and perceive test smells, the corrective actions they take to handle them, and the challenges they face when refactoring test code aiming to fix test smells. We observed that developers are interested in others' perceptions and hands-on experience handling test code issues. Besides, there is a clear indication that developers often ask whether test smells or anti-patterns are either good or bad testing practices than code-based refactoring recommendations.

Submitted 5 May, 2023; originally announced May 2023.

Comments: Preprint of the manuscript accepted for publication at CHASE 2023

arXiv:2304.03401 [pdf, other]

Noise-Robust Dense Retrieval via Contrastive Alignment Post Training

Authors: Daniel Campos, ChengXiang Zhai, Alessandro Magnani

Abstract: The success of contextual word representations and advances in neural information retrieval have made dense vector-based retrieval a standard approach for passage and document ranking. While effective and efficient, dual-encoders are brittle to variations in query distributions and noisy queries. Data augmentation can make models more robust but introduces overhead to training set generation and requires retraining and index regeneration. We present Contrastive Alignment POst Training (CAPOT), a highly efficient finetuning method that improves model robustness without requiring index regeneration, the training set optimization, or alteration. CAPOT enables robust retrieval by freezing the document encoder while the query encoder learns to align noisy queries with their unaltered root. We evaluate CAPOT noisy variants of MSMARCO, Natural Questions, and Trivia QA passage retrieval, finding CAPOT has a similar impact as data augmentation with none of its overhead.

Submitted 10 April, 2023; v1 submitted 6 April, 2023; originally announced April 2023.

Comments: 8 pages, 6 figures, 30 tables

arXiv:2304.02721 [pdf, other]

To Asymmetry and Beyond: Structured Pruning of Sequence to Sequence Models for Improved Inference Efficiency

Authors: Daniel Campos, ChengXiang Zhai

Abstract: Sequence-to-sequence language models can be used to produce abstractive summaries which are coherent, relevant, and concise. Still, model sizes can make deployment in latency-sensitive or web-scale implementations difficult. This paper studies the relationship between model size, structured pruning, inference efficiency, and summarization accuracy on widely used summarization datasets. We show that model accuracy is tied to the encoder size while inference efficiency is connected to the decoder. Using asymmetric pruning can lead to nearly 3x improvement in inference latency with ~1 point loss in Rouge-2. Moreover, we find both the average degradation and the role of asymmetry to be consistent across model sizes and variations in datasets.

Submitted 12 June, 2023; v1 submitted 5 April, 2023; originally announced April 2023.

Comments: SustaiNLP2023 @ ACL 2023, 9 pages, 6 figures, 33 tables

arXiv:2304.01016 [pdf, other]

Quick Dense Retrievers Consume KALE: Post Training Kullback Leibler Alignment of Embeddings for Asymmetrical dual encoders

Authors: Daniel Campos, Alessandro Magnani, ChengXiang Zhai

Abstract: In this paper, we consider the problem of improving the inference latency of language model-based dense retrieval systems by introducing structural compression and model size asymmetry between the context and query encoders. First, we investigate the impact of pre and post-training compression on the MSMARCO, Natural Questions, TriviaQA, SQUAD, and SCIFACT, finding that asymmetry in the dual encoders in dense retrieval can lead to improved inference efficiency. Knowing this, we introduce Kullback Leibler Alignment of Embeddings (KALE), an efficient and accurate method for increasing the inference efficiency of dense retrieval methods by pruning and aligning the query encoder after training. Specifically, KALE extends traditional Knowledge Distillation after bi-encoder training, allowing for effective query encoder compression without full retraining or index generation. Using KALE and asymmetric training, we can generate models which exceed the performance of DistilBERT despite having 3x faster inference.

Submitted 1 June, 2023; v1 submitted 31 March, 2023; originally announced April 2023.

Comments: SustaiNLP2023 @ ACL 2023, 8 pages, 4 figures, 30 tables

arXiv:2304.00114 [pdf, other]

Dense Sparse Retrieval: Using Sparse Language Models for Inference Efficient Dense Retrieval

Authors: Daniel Campos, ChengXiang Zhai

Abstract: Vector-based retrieval systems have become a common staple for academic and industrial search applications because they provide a simple and scalable way of extending the search to leverage contextual representations for documents and queries. As these vector-based systems rely on contextual language models, their usage commonly requires GPUs, which can be expensive and difficult to manage. Given recent advances in introducing sparsity into language models for improved inference efficiency, in this paper, we study how sparse language models can be used for dense retrieval to improve inference efficiency. Using the popular retrieval library Tevatron and the MSMARCO, NQ, and TriviaQA datasets, we find that sparse language models can be used as direct replacements with little to no drop in accuracy and up to 4.3x improved inference speeds

Submitted 31 March, 2023; originally announced April 2023.

arXiv:2303.17612 [pdf, other]

oBERTa: Improving Sparse Transfer Learning via improved initialization, distillation, and pruning regimes

Authors: Daniel Campos, Alexandre Marques, Mark Kurtz, ChengXiang Zhai

Abstract: In this paper, we introduce the range of oBERTa language models, an easy-to-use set of language models which allows Natural Language Processing (NLP) practitioners to obtain between 3.8 and 24.3 times faster models without expertise in model compression. Specifically, oBERTa extends existing work on pruning, knowledge distillation, and quantization and leverages frozen embeddings improves distillation and model initialization to deliver higher accuracy on a broad range of transfer tasks. In generating oBERTa, we explore how the highly optimized RoBERTa differs from the BERT for pruning during pre-training and finetuning. We find it less amenable to compression during fine-tuning. We explore the use of oBERTa on seven representative NLP tasks and find that the improved compression techniques allow a pruned oBERTa model to match the performance of BERTbase and exceed the performance of Prune OFA Large on the SQUAD V1.1 Question Answering dataset, despite being 8x and 2x, respectively faster in inference. We release our code, training regimes, and associated model for broad usage to encourage usage and experimentation

Submitted 6 June, 2023; v1 submitted 29 March, 2023; originally announced March 2023.

Comments: SustaiNLP2023 @ ACL 2023, 9 pages, 2 figures, 45 tables

arXiv:2302.12721 [pdf, other]

doi 10.1145/3589316

LightTS: Lightweight Time Series Classification with Adaptive Ensemble Distillation -- Extended Version

Authors: David Campos, Miao Zhang, Bin Yang, Tung Kieu, Chenjuan Guo, Christian S. Jensen

Abstract: Due to the sweeping digitalization of processes, increasingly vast amounts of time series data are being produced. Accurate classification of such time series facilitates decision making in multiple domains. State-of-the-art classification accuracy is often achieved by ensemble learning where results are synthesized from multiple base models. This characteristic implies that ensemble learning needs substantial computing resources, preventing their use in resource-limited environments, such as in edge devices. To extend the applicability of ensemble learning, we propose the LightTS framework that compresses large ensembles into lightweight models while ensuring competitive accuracy. First, we propose adaptive ensemble distillation that assigns adaptive weights to different base models such that their varying classification capabilities contribute purposefully to the training of the lightweight model. Second, we propose means of identifying Pareto optimal settings w.r.t. model accuracy and model size, thus enabling users with a space budget to select the most accurate lightweight model. We report on experiments using 128 real-world time series sets and different types of base models that justify key decisions in the design of LightTS and provide evidence that LightTS is able to outperform competitors.

Submitted 24 February, 2023; originally announced February 2023.

Comments: 15 pages. An extended version of "LightTS: Lightweight Time Series Classification with Adaptive Ensemble Distillation" accepted at SIGMOD 2023

Journal ref: Proceedings of the ACM on Management of Data 1, 2 (2023), 171:1-171:27

arXiv:2211.15927 [pdf, ps, other]

Compressing Cross-Lingual Multi-Task Models at Qualtrics

Authors: Daniel Campos, Daniel Perry, Samir Joshi, Yashmeet Gambhir, Wei Du, Zhengzheng Xing, Aaron Colak

Abstract: Experience management is an emerging business area where organizations focus on understanding the feedback of customers and employees in order to improve their end-to-end experiences. This results in a unique set of machine learning problems to help understand how people feel, discover issues they care about, and find which actions need to be taken on data that are different in content and distribution from traditional NLP domains. In this paper, we present a case study of building text analysis applications that perform multiple classification tasks efficiently in 12 languages in the nascent business area of experience management. In order to scale up modern ML methods on experience data, we leverage cross lingual and multi-task modeling techniques to consolidate our models into a single deployment to avoid overhead. We also make use of model compression and model distillation to reduce overall inference latency and hardware cost to the level acceptable for business needs while maintaining model prediction quality. Our findings show that multi-task modeling improves task performance for a subset of experience management tasks in both XLM-R and mBert architectures. Among the compressed architectures we explored, we found that MiniLM achieved the best compression/performance tradeoff. Our case study demonstrates a speedup of up to 15.61x with 2.60% average task degradation (or 3.29x speedup with 1.71% degradation) and estimated savings of 44% over using the original full-size model. These results demonstrate a successful scaling up of text classification for the challenging new area of ML for experience management.

Submitted 28 November, 2022; originally announced November 2022.

Comments: accepted to IAAI-23 (part of AAAI-23)

ACM Class: I.2.7

arXiv:2205.12452 [pdf, other]

Sparse*BERT: Sparse Models Generalize To New tasks and Domains

Authors: Daniel Campos, Alexandre Marques, Tuan Nguyen, Mark Kurtz, ChengXiang Zhai

Abstract: Large Language Models have become the core architecture upon which most modern natural language processing (NLP) systems build. These models can consistently deliver impressive accuracy and robustness across tasks and domains, but their high computational overhead can make inference difficult and expensive. To make using these models less costly, recent work has explored leveraging structured and unstructured pruning, quantization, and distillation to improve inference speed and decrease size. This paper studies how models pruned using Gradual Unstructured Magnitude Pruning can transfer between domains and tasks. Our experimentation shows that models that are pruned during pretraining using general domain masked language models can transfer to novel domains and tasks without extensive hyperparameter exploration or specialized approaches. We demonstrate that our general sparse model Sparse*BERT can become SparseBioBERT simply by pretraining the compressed architecture on unstructured biomedical text. Moreover, we show that SparseBioBERT can match the quality of BioBERT with only 10% of the parameters.

Submitted 5 April, 2023; v1 submitted 24 May, 2022; originally announced May 2022.

Comments: Presented at Sparsity in Neural Networks Workshop at ICML 2022, 6 pages, 2 figures, 4 tables

arXiv:2203.07259 [pdf, other]

The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models

Authors: Eldar Kurtic, Daniel Campos, Tuan Nguyen, Elias Frantar, Mark Kurtz, Benjamin Fineran, Michael Goin, Dan Alistarh

Abstract: Transformer-based language models have become a key building block for natural language processing. While these models are extremely accurate, they can be too large and computationally intensive to run on standard deployments. A variety of compression methods, including distillation, quantization, structured and unstructured pruning are known to decrease model size and increase inference speed, with low accuracy loss. In this context, this paper's contributions are two-fold. We perform an in-depth study of the accuracy-compression trade-off for unstructured weight pruning of BERT models. We introduce Optimal BERT Surgeon (oBERT), an efficient and accurate weight pruning method based on approximate second-order information, which we show to yield state-of-the-art results in both stages of language tasks: pre-training and fine-tuning. Specifically, oBERT extends existing work on unstructured second-order pruning by allowing for pruning blocks of weights, and by being applicable at the BERT scale. Second, we investigate the impact of this pruning method when compounding compression approaches to obtain highly compressed but accurate models for deployment on edge devices. These models significantly push boundaries of the current state-of-the-art sparse BERT models with respect to all metrics: model size, inference speed and task accuracy. For example, relative to the dense BERT-base, we obtain 10x model size compression (in MB) with < 1% accuracy drop, 10x CPU-inference speedup with < 2% accuracy drop, and 29x CPU-inference speedup with < 7.5% accuracy drop. Our code, fully integrated with Transformers and SparseML, is available at https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT.

Submitted 17 October, 2022; v1 submitted 14 March, 2022; originally announced March 2022.

Comments: Accepted to EMNLP 2022

arXiv:2203.02592 [pdf, other]

Sparsity-Inducing Categorical Prior Improves Robustness of the Information Bottleneck

Authors: Anirban Samaddar, Sandeep Madireddy, Prasanna Balaprakash, Tapabrata Maiti, Gustavo de los Campos, Ian Fischer

Abstract: The information bottleneck framework provides a systematic approach to learning representations that compress nuisance information in the input and extract semantically meaningful information about predictions. However, the choice of a prior distribution that fixes the dimensionality across all the data can restrict the flexibility of this approach for learning robust representations. We present a novel sparsity-inducing spike-slab categorical prior that uses sparsity as a mechanism to provide the flexibility that allows each data point to learn its own dimension distribution. In addition, it provides a mechanism for learning a joint distribution of the latent variable and the sparsity and hence can account for the complete uncertainty in the latent space. Through a series of experiments using in-distribution and out-of-distribution learning scenarios on the MNIST, CIFAR-10, and ImageNet data, we show that the proposed approach improves accuracy and robustness compared to traditional fixed-dimensional priors, as well as other sparsity induction mechanisms for latent variable models proposed in the literature.

Submitted 27 October, 2022; v1 submitted 4 March, 2022; originally announced March 2022.

arXiv:2111.11108 [pdf, other]

doi 10.14778/3494124.3494142

Unsupervised Time Series Outlier Detection with Diversity-Driven Convolutional Ensembles -- Extended Version

Authors: David Campos, Tung Kieu, Chenjuan Guo, Feiteng Huang, Kai Zheng, Bin Yang, Christian S. Jensen

Abstract: With the sweeping digitalization of societal, medical, industrial, and scientific processes, sensing technologies are being deployed that produce increasing volumes of time series data, thus fueling a plethora of new or improved applications. In this setting, outlier detection is frequently important, and while solutions based on neural networks exist, they leave room for improvement in terms of both accuracy and efficiency. With the objective of achieving such improvements, we propose a diversity-driven, convolutional ensemble. To improve accuracy, the ensemble employs multiple basic outlier detection models built on convolutional sequence-to-sequence autoencoders that can capture temporal dependencies in time series. Further, a novel diversity-driven training method maintains diversity among the basic models, with the aim of improving the ensemble's accuracy. To improve efficiency, the approach enables a high degree of parallelism during training. In addition, it is able to transfer some model parameters from one basic model to another, which reduces training time. We report on extensive experiments using real-world multivariate time series that offer insight into the design choices underlying the new approach and offer evidence that it is capable of improved accuracy and efficiency. This is an extended version of "Unsupervised Time Series Outlier Detection with Diversity-Driven Convolutional Ensembles", to appear in PVLDB 2022.

Submitted 22 November, 2021; originally announced November 2021.

Comments: 14 pages. An extended version of "Unsupervised Time Series Outlier Detection with Diversity-Driven Convolutional Ensembles", to appear in PVLDB 2022

Journal ref: Proceedings of the VLDB Endowment, 15, 3 (2022), 611-623

arXiv:2109.04202 [pdf, other]

IMG2SMI: Translating Molecular Structure Images to Simplified Molecular-input Line-entry System

Authors: Daniel Campos, Heng Ji

Abstract: Like many scientific fields, new chemistry literature has grown at a staggering pace, with thousands of papers released every month. A large portion of chemistry literature focuses on new molecules and reactions between molecules. Most vital information is conveyed through 2-D images of molecules, representing the underlying molecules or reactions described. In order to ensure reproducible and machine-readable molecule representations, text-based molecule descriptors like SMILES and SELFIES were created. These text-based molecule representations provide molecule generation but are unfortunately rarely present in published literature. In the absence of molecule descriptors, the generation of molecule descriptors from the 2-D images present in the literature is necessary to understand chemistry literature at scale. Successful methods such as Optical Structure Recognition Application (OSRA) and ChemSchematicResolver are able to extract the locations of molecule structures in chemistry papers and infer molecular descriptions and reactions. While effective, existing systems expect chemists to correct outputs, making them unsuitable for unsupervised large-scale data mining. Leveraging the task formulation of image captioning introduced by DECIMER, we introduce IMG2SMI, a model which leverages Deep Residual Networks for image feature extraction and encoder-decoder Transformer layers for molecule description generation.
Unlike previous Neural Network-based systems, IMG2SMI builds around the task of molecule description generation, which enables IMG2SMI to outperform OSRA-based systems by 163% in molecule similarity prediction as measured by the molecular MACCS Fingerprint Tanimoto Similarity. Additionally, to facilitate further research on this task, we release a new molecule prediction dataset, including 81 million molecules for molecule description generation.

Submitted 3 September, 2021; originally announced September 2021.
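
The IMG2SMI abstract above reports results in terms of Tanimoto similarity over MACCS fingerprints. As a rough illustration of that metric only (not the authors' evaluation code), the sketch below computes Tanimoto similarity between two binary fingerprints represented as sets of "on" bit indices. The function name `tanimoto`, the empty-fingerprint convention, and the example bit sets are assumptions made for this sketch; a real pipeline would derive the fingerprints from molecules with a cheminformatics toolkit.

```python
from typing import Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity of two binary fingerprints.

    Each fingerprint is given as the set of indices of its on bits.
    The score is |a AND b| / |a OR b|, ranging from 0.0 to 1.0.
    """
    if not a and not b:
        # Assumed convention for this sketch: two empty fingerprints
        # are treated as identical.
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical fingerprints: on-bit indices invented for illustration.
predicted = {1, 2, 3}
reference = {2, 3, 4}
score = tanimoto(predicted, reference)  # 2 shared bits out of 4 distinct bits
```

A score of 1.0 means the two fingerprints set exactly the same bits; it does not by itself guarantee that the two molecules are identical.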

arXiv:2108.05868 [pdf, other]

A Semi-Lagrangian Approach for the Minimal Exposure Path Problem in Wireless Sensor Networks

Authors: Armando Alves Neto, Víctor C. da Silva Campos, Douglas G. Macharet

Abstract: A critical metric of the coverage quality in Wireless Sensor Networks (WSNs) is the Minimal Exposure Path (MEP), a path through the environment that least exposes an intruder to the sensor detecting nodes. Many approaches have been proposed in the last decades to solve this optimization problem, ranging from classic (grid-based and Voronoi-based) planners to genetic meta-heuristics. However, most of them are limited to specific sensing models and obstacle-free spaces. Still, none of them guarantee an optimal solution, and the state-of-the-art is expensive in terms of run-time. Therefore, in this paper, we propose a novel method that models the MEP as an Optimal Control problem and solves it by using a Semi-Lagrangian approach. This framework is shown to converge to the optimal MEP while also incorporating different homogeneous and heterogeneous sensor models and geometric constraints (obstacles). Experiments show that our method dominates the state-of-the-art, improving the results by approximately 10% with a relatively lower execution time.

Submitted 12 August, 2021; originally announced August 2021.

arXiv:2108.02170 [pdf, other]

Curriculum learning for language modeling

Authors: Daniel Campos

Abstract: Language Models like ELMo and BERT have provided robust representations of natural language, which serve as the language understanding component for a diverse range of downstream tasks. Curriculum learning is a method that employs a structured training regime instead, which has been leveraged in computer vision and machine translation to improve model training speed and model performance. While language models have proven transformational for the natural language processing community, these models have proven expensive, energy-intensive, and challenging to train. In this work, we explore the effect of curriculum learning on language model pretraining using various linguistically motivated curricula and evaluate transfer performance on the GLUE Benchmark. Despite a broad variety of training methodologies and experiments, we do not find compelling evidence that curriculum learning methods improve language model training.

Submitted 4 August, 2021; originally announced August 2021.

arXiv:2107.13902 [pdf, other]

Developers perception on the severity of test smells: an empirical study

Authors: Denivan Campos, Larissa Rocha, Ivan Machado

Abstract: Unit testing is an essential component of the software development life-cycle. A developer could easily and quickly catch and fix software faults introduced in the source code by creating and running unit tests. Despite their importance, unit tests are subject to bad design or implementation decisions, the so-called test smells. These might decrease software systems quality from various aspects, making it harder to understand, more complex to maintain, and more prone to errors and bugs. Many studies discuss the likely effects of test smells on test code. However, there is a lack of studies that capture developers' perceptions of such issues. This study empirically analyzes how developers perceive the severity of test smells in the test code they develop. Severity refers to the degree to which a test smell may negatively impact the test code. We selected six open-source software projects from GitHub and interviewed their developers to understand whether and how the test smells affected the test code. Although most of the interviewed developers considered the test smells as having a low severity to their code, they indicated that test smells might negatively impact the project, particularly in test code maintainability and evolution. Also, detecting and removing test smells from the test code may be positive for the project.

Submitted 29 July, 2021; originally announced July 2021.

Comments: 14 pages

arXiv:2105.04021 [pdf, other]

MS MARCO: Benchmarking Ranking Models in the Large-Data Regime

Authors: Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, Jimmy Lin

Abstract: Evaluation efforts such as TREC, CLEF, NTCIR and FIRE, alongside public leaderboards such as MS MARCO, are intended to encourage research and track our progress, addressing big questions in our field. However, the goal is not simply to identify which run is "best", achieving the top score. The goal is to move the field forward by developing new robust techniques that work in many different settings and are adopted in research and practice. This paper uses the MS MARCO and TREC Deep Learning Track as our case study, comparing it to the case of TREC ad hoc ranking in the 1990s. We show how the design of the evaluation effort can encourage or discourage certain outcomes, and raise questions about internal and external validity of results. We provide some analysis of certain pitfalls, and a statement of best practices for avoiding such pitfalls. We summarize the progress of the effort so far, and describe our desired end state of "robust usefulness", along with steps that might be required to get us there.

Submitted 9 May, 2021; originally announced May 2021.

arXiv:2104.09399 [pdf, other]

TREC Deep Learning Track: Reusable Test Collections in the Large Data Regime

Authors: Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, Ellen M. Voorhees, Ian Soboroff

Abstract: The TREC Deep Learning (DL) Track studies ad hoc search in the large data regime, meaning that a large set of human-labeled training data is available. Results so far indicate that the best models with large data may be deep neural networks. This paper supports the reuse of the TREC DL test collections in three ways. First we describe the data sets in detail, documenting clearly and in one place some details that are otherwise scattered in track guidelines, overview papers and in our associated MS MARCO leaderboard pages. We intend this description to make it easy for newcomers to use the TREC DL data. Second, because there is some risk of iteration and selection bias when reusing a data set, we describe the best practices for writing a paper using TREC DL data, without overfitting. We provide some illustrative analysis. Finally we address a number of issues around the TREC DL data, including an analysis of reusability.

Submitted 19 April, 2021; originally announced April 2021.

Comments: arXiv admin note: text overlap with arXiv:2003.07820

arXiv:2102.12887 [pdf, other]

Significant Improvements over the State of the Art? A Case Study of the MS MARCO Document Ranking Leaderboard

Authors: Jimmy Lin, Daniel Campos, Nick Craswell, Bhaskar Mitra, Emine Yilmaz

Abstract: Leaderboards are a ubiquitous part of modern research in applied machine learning. By design, they sort entries into some linear order, where the top-scoring entry is recognized as the "state of the art" (SOTA). Due to the rapid progress being made in information retrieval today, particularly with neural models, the top entry in a leaderboard is replaced with some regularity. These are touted as improvements in the state of the art. Such pronouncements, however, are almost never qualified with significance testing. In the context of the MS MARCO document ranking leaderboard, we pose a specific question: How do we know if a run is significantly better than the current SOTA? We ask this question against the backdrop of recent IR debates on scale types: in particular, whether commonly used significance tests are even mathematically permissible. Recognizing these potential pitfalls in evaluation methodology, our study proposes an evaluation framework that explicitly treats certain outcomes as distinct and avoids aggregating them into a single-point metric. Empirical analysis of SOTA runs from the MS MARCO document ranking leaderboard reveals insights about how one run can be "significantly better" than another that are obscured by the current official evaluation metric (MRR@100).

Submitted 25 February, 2021; originally announced February 2021.
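
The abstract above names MRR@100 as the official leaderboard metric. For readers unfamiliar with it, here is a minimal sketch of mean reciprocal rank with a rank cutoff. It is an illustration of the standard metric, not the leaderboard's evaluation script; the function name `mrr_at_k` and the toy run and judgments are hypothetical.

```python
from typing import Dict, List, Set


def mrr_at_k(run: Dict[str, List[str]],
             qrels: Dict[str, Set[str]],
             k: int = 100) -> float:
    """Mean reciprocal rank of the first relevant document within the top k.

    run maps each query id to its ranked list of document ids;
    qrels maps each query id to the set of relevant document ids.
    A query with no relevant document in the top k contributes 0.
    """
    if not run:
        return 0.0
    total = 0.0
    for qid, ranked_docs in run.items():
        relevant = qrels.get(qid, set())
        for rank, doc_id in enumerate(ranked_docs[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(run)


# Hypothetical two-query run, invented for illustration.
toy_run = {"q1": ["d3", "d1"], "q2": ["d9", "d8", "d7"]}
toy_qrels = {"q1": {"d1"}, "q2": {"d9"}}
toy_score = mrr_at_k(toy_run, toy_qrels)  # (1/2 + 1/1) / 2
```

Because each query contributes only the reciprocal rank of its first relevant hit, two runs with quite different per-query behavior can end up with similar averages, which is the kind of information a single-point metric hides.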

arXiv:2102.07662 [pdf, other]

Overview of the TREC 2020 deep learning track

Authors: Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos

Abstract: This is the second year of the TREC Deep Learning Track, with the goal of studying ad hoc ranking in the large training data regime. We again have a document retrieval task and a passage retrieval task, each with hundreds of thousands of human-labeled training queries. We evaluate using single-shot TREC-style evaluation, to give us a picture of which ranking methods work best when large data is available, with much more comprehensive relevance labeling on the small number of test queries. This year we have further evidence that rankers with BERT-style pretraining outperform other rankers in the large data regime.

Submitted 15 February, 2021; originally announced February 2021.

Comments: arXiv admin note: substantial text overlap with arXiv:2003.07820

arXiv:2012.02530 [pdf, other]

Logic Synthesis Meets Machine Learning: Trading Exactness for Generalization

Authors: Shubham Rai, Walter Lau Neto, Yukio Miyasaka, Xinpei Zhang, Mingfei Yu, Qingyang Yi, Masahiro Fujita, Guilherme B. Manske, Matheus F. Pontes, Leomar S. da Rosa Junior, Marilton S. de Aguiar, Paulo F. Butzen, Po-Chun Chien, Yu-Shan Huang, Hoa-Ren Wang, Jie-Hong R. Jiang, Jiaqi Gu, Zheng Zhao, Zixuan Jiang, David Z. Pan, Brunno A. de Abreu, Isac de Souza Campos, Augusto Berndt, Cristina Meinhardt, Jonata T. Carvalho, Mateus Grellert, et al. (15 additional authors not shown)

Abstract: Logic synthesis is a fundamental step in hardware design whose goal is to find structural representations of Boolean functions while minimizing delay and area. If the function is completely-specified, the implementation accurately represents the function. If the function is incompletely-specified, the implementation has to be true only on the care set. While most of the algorithms in logic synthesis rely on SAT and Boolean methods to exactly implement the care set, we investigate learning in logic synthesis, attempting to trade exactness for generalization. This work is directly related to machine learning where the care set is the training set and the implementation is expected to generalize on a validation set. We present learning incompletely-specified functions based on the results of a competition conducted at IWLS 2020. The goal of the competition was to implement 100 functions given by a set of care minterms for training, while testing the implementation using a set of validation minterms sampled from the same function. We make this benchmark suite available and offer a detailed comparative analysis of the different approaches to learning.

Submitted 15 December, 2020; v1 submitted 4 December, 2020; originally announced December 2020.

Comments: In this 23 page manuscript, we explore the connection between machine learning and logic synthesis which was the main goal for International Workshop on logic synthesis. It includes approaches applied by ten teams spanning 6 countries across the world

arXiv:2008.04627 [pdf, other]

A Comparison of Humanoid Robot Simulators: A Quantitative Approach

Authors: Angel Ayala, Francisco Cruz, Diego Campos, Rodrigo Rubio, Bruno Fernandes, Richard Dazeley

Abstract: Research on humanoid robotic systems involves a considerable amount of computational resources, not only for the involved design but also for its development and subsequent implementation. For robotic systems to be implemented in real-world scenarios, in several situations, it is preferred to develop and test them under controlled environments in order to reduce the risk of errors and unexpected behavior. In this regard, a more accessible and efficient alternative is to implement the environment using robotic simulation tools. This paper presents a quantitative comparison of Gazebo, Webots, and V-REP, three simulators widely used by the research community to develop robotic systems. To compare the performance of these three simulators, elements such as CPU, memory footprint, and disk access are used to measure and compare them to each other. In order to measure the use of resources, each simulator executes a robotic scenario 20 times, composed of a NAO robot that must navigate to a goal position avoiding a specific obstacle. In general terms, our results show that Webots is the simulator with the lowest use of resources, followed by V-REP, which has advantages over Gazebo, mainly because of the CPU use.

Submitted 11 August, 2020; originally announced August 2020.

Comments: Accepted in the IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob), 2020

arXiv:2006.05324 [pdf, other]

ORCAS: 18 Million Clicked Query-Document Pairs for Analyzing Search

Authors: Nick Craswell, Daniel Campos, Bhaskar Mitra, Emine Yilmaz, Bodo Billerbeck

Abstract: Users of Web search engines reveal their information needs through queries and clicks, making click logs a useful asset for information retrieval. However, click logs have not been publicly released for academic use, because they can be too revealing of personally or commercially sensitive information. This paper describes a click data release related to the TREC Deep Learning Track document corpus. After aggregation and filtering, including a k-anonymity requirement, we find 1.4 million of the TREC DL URLs have 18 million connections to 10 million distinct queries. Our dataset of these queries and connections to TREC documents is of similar size to proprietary datasets used in previous papers on query mining and ranking. We perform some preliminary experiments using the click data to augment the TREC DL training data, offering by comparison: 28x more queries, with 49x more connections to 4.4x more URLs in the corpus. We present a description of the dataset's generation process, characteristics, use in ranking and suggest other potential uses.

Submitted 18 August, 2020; v1 submitted 9 June, 2020; originally announced June 2020.
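
The ORCAS abstract above mentions aggregation and filtering with a k-anonymity requirement but does not spell out the procedure. The sketch below shows one plausible reading, stated here as an assumption: a query is kept only if at least k distinct users issued it, and the surviving clicks are aggregated into distinct query-URL pairs. The function name `k_anonymous_pairs`, the log layout, and the toy data are all hypothetical and do not describe the actual ORCAS pipeline.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Assumed log layout for this sketch: (user_id, query, clicked_url).
Click = Tuple[str, str, str]


def k_anonymous_pairs(click_log: List[Click], k: int) -> List[Tuple[str, str]]:
    """Return distinct (query, url) pairs whose query was issued by >= k users."""
    users_per_query: Dict[str, Set[str]] = defaultdict(set)
    for user_id, query, _url in click_log:
        users_per_query[query].add(user_id)

    # Aggregate away user ids, keeping only queries that pass the threshold.
    kept = {
        (query, url)
        for _user_id, query, url in click_log
        if len(users_per_query[query]) >= k
    }
    return sorted(kept)


# Hypothetical log, invented for illustration.
toy_log = [
    ("u1", "weather seattle", "example.org/a"),
    ("u2", "weather seattle", "example.org/a"),
    ("u3", "my rare private query", "example.org/b"),
]
toy_pairs = k_anonymous_pairs(toy_log, k=2)
```

In this toy log only the query issued by two different users survives, and the released pairs carry no user identifiers.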

arXiv:2004.13486 [pdf, other]

On the Reliability of Test Collections for Evaluating Systems of Different Types

Authors: Emine Yilmaz, Nick Craswell, Bhaskar Mitra, Daniel Campos

Abstract: As deep learning based models are increasingly being used for information retrieval (IR), a major challenge is to ensure the availability of test collections for measuring their quality. Test collections are generated based on pooling results of various retrieval systems, but until recently this did not include deep learning systems. This raises a major challenge for reusable evaluation: Since deep learning based models use external resources (e.g. word embeddings) and advanced representations as opposed to traditional methods that are mainly based on lexical similarity, they may return different types of relevant documents that were not identified in the original pooling. If so, test collections constructed using traditional methods are likely to lead to biased and unfair evaluation results for deep learning (neural) systems. This paper uses simulated pooling to test the fairness and reusability of test collections, showing that pooling based on traditional systems only can lead to biased evaluation of deep learning systems.

Submitted 28 April, 2020; originally announced April 2020.

arXiv:2004.01401 [pdf, ps, other]

XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation

Authors: Yaobo Liang, Nan Duan, Yeyun Gong, Ning Wu, Fenfei Guo, Weizhen Qi, Ming Gong, Linjun Shou, Daxin Jiang, Guihong Cao, Xiaodong Fan, Ruofei Zhang, Rahul Agrawal, Edward Cui, Sining Wei, Taroon Bharti, Ying Qiao, Jiun-Hung Chen, Winnie Wu, Shuguang Liu, Fan Yang, Daniel Campos, Rangan Majumder, Ming Zhou

Abstract: In this paper, we introduce XGLUE, a new benchmark dataset that can be used to train large-scale cross-lingual pre-trained models using multilingual and bilingual corpora and evaluate their performance across a diverse set of cross-lingual tasks. Compared to GLUE (Wang et al., 2019), which is labeled in English for natural language understanding tasks only, XGLUE has two main advantages: (1) it provides 11 diversified tasks that cover both natural language understanding and generation scenarios; (2) for each task, it provides labeled data in multiple languages. We extend a recent cross-lingual pre-trained model Unicoder (Huang et al., 2019) to cover both understanding and generation tasks, which is evaluated on XGLUE as a strong baseline. We also evaluate the base versions (12-layer) of Multilingual BERT, XLM and XLM-R for comparison.

Submitted 22 May, 2020; v1 submitted 3 April, 2020; originally announced April 2020.

arXiv:2003.07820 [pdf, other]

Overview of the TREC 2019 deep learning track

Authors: Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, Ellen M. Voorhees

Abstract: The Deep Learning Track is a new track for TREC 2019, with the goal of studying ad hoc ranking in a large data regime. It is the first track with large human-labeled training sets, introducing two sets corresponding to two tasks, each with rigorous TREC-style blind evaluation and reusable test sets. The document retrieval task has a corpus of 3.2 million documents with 367 thousand training queries, for which we generate a reusable test set of 43 queries. The passage retrieval task has a corpus of 8.8 million passages with 503 thousand training queries, for which we generate a reusable test set of 43 queries. This year 15 groups submitted a total of 75 runs, using various combinations of deep learning, transfer learning and traditional IR ranking methods. Deep learning runs significantly outperformed traditional IR runs. Possible explanations for this result are that we introduced large training data and we included deep models trained on such data in our judging pools, whereas some past studies did not have such training data or pooling.

Submitted 18 March, 2020; v1 submitted 17 March, 2020; originally announced March 2020.

arXiv:1911.02671 [pdf, other]

Open Domain Web Keyphrase Extraction Beyond Language Modeling

Authors: Lee Xiong, Chuan Hu, Chenyan Xiong, Daniel Campos, Arnold Overwijk

Abstract: This paper studies keyphrase extraction in real-world scenarios where documents are from diverse domains and have variant content quality. We curate and release OpenKP, a large scale open domain keyphrase extraction dataset with nearly one hundred thousand web documents and expert keyphrase annotations. To handle the variations of domain and content quality, we develop BLING-KPE, a neural keyphrase extraction model that goes beyond language understanding using visual presentations of documents and weak supervision from search queries. Experimental results on OpenKP confirm the effectiveness of BLING-KPE and the contributions of its neural architecture, visual features, and search log weak supervision. Zero-shot evaluations on DUC-2001 demonstrate the improved generalization ability of learning from the open domain data compared to a specific domain.

Submitted 6 November, 2019; originally announced November 2019.

Journal ref: EMNLP-IJCNLP 2019

arXiv:1910.04277 [pdf]

Experiments in Inferring Social Networks of Diffusion

Authors: Daniel Campos, Zoe Konrad

Abstract: Information diffusion is a fundamental process that takes place over networks. While it is rarely realistic to observe the individual transmissions of the information diffusion process, it is typically possible to observe when individuals first publish the information. We look specifically at the previously published algorithm NETINF, which probabilistically identifies the optimal network that best explains the observed infection times. We explore how the algorithm could perform on a range of intrinsically different social and information network topologies, from news blogs and websites to Twitter to Reddit.

Submitted 9 October, 2019; originally announced October 2019.

arXiv:1709.06489 [pdf, ps, other]

Accurate Genomic Prediction Of Human Height

Authors: Louis Lello, Steven G. Avery, Laurent Tellier, Ana Vazquez, Gustavo de los Campos, Stephen D. H. Hsu

Abstract: We construct genomic predictors for heritable and extremely complex human quantitative traits (height, heel bone density, and educational attainment) using modern methods in high dimensional statistics (i.e., machine learning). Replication tests show that these predictors capture, respectively, $\sim$40, 20, and 9 percent of total variance for the three traits. For example, predicted heights correlate $\sim$0.65 with actual height; actual heights of most individuals in validation samples are within a few cm of the prediction. The variance captured for height is comparable to the estimated SNP heritability from GCTA (GREML) analysis, and seems to be close to its asymptotic value (i.e., as sample size goes to infinity), suggesting that we have captured most of the heritability for the SNPs used. Thus, our results resolve the common SNP portion of the "missing heritability" problem -- i.e., the gap between prediction R-squared and SNP heritability. The $\sim$20k activated SNPs in our height predictor reveal the genetic architecture of human height, at least for common SNPs. Our primary dataset is the UK Biobank cohort, comprised of almost 500k individual genotypes with multiple phenotypes. We also use other datasets and SNPs found in earlier GWAS for out-of-sample validation of our results.

Submitted 19 September, 2017; originally announced September 2017.

Comments: 17 pages, 10 figures

arXiv:1611.09268 [pdf, other]

MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

Authors: Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang

Abstract: We introduce a large scale MAchine Reading COmprehension dataset, which we name MS MARCO. The dataset comprises of 1,010,916 anonymized questions---sampled from Bing's search query logs---each with a human generated answer and 182,669 completely human rewritten generated answers. In addition, the dataset contains 8,841,823 passages---extracted from 3,563,535 web documents retrieved by Bing---that provide the information necessary for curating the natural language answers. A question in the MS MARCO dataset may have multiple answers or no answers at all. Using this dataset, we propose three different tasks with varying levels of difficulty: (i) predict if a question is answerable given a set of context passages, and extract and synthesize the answer as a human would (ii) generate a well-formed answer (if possible) based on the context passages that can be understood with the question and passage context, and finally (iii) rank a set of retrieved passages given a question. The size of the dataset and the fact that the questions are derived from real user search queries distinguishes MS MARCO from other well-known publicly available datasets for machine reading comprehension and question-answering. We believe that the scale and the real-world nature of this dataset makes it attractive for benchmarking machine reading comprehension and question-answering models.

Submitted 31 October, 2018; v1 submitted 28 November, 2016; originally announced November 2016.

arXiv:cs/0309038 [pdf, ps, other]

doi 10.1007/s10878-004-4835-9

A novel evolutionary formulation of the maximum independent set problem

Authors: V. C. Barbosa, L. C. D. Campos

Abstract: We introduce a novel evolutionary formulation of the problem of finding a maximum independent set of a graph. The new formulation is based on the relationship that exists between a graph's independence number and its acyclic orientations. It views such orientations as individuals and evolves them with the aid of evolutionary operators that are very heavily based on the structure of the graph and its acyclic orientations. The resulting heuristic has been tested on some of the Second DIMACS Implementation Challenge benchmark graphs, and has been found to be competitive when compared to several of the other heuristics that have also been tested on those graphs.

Submitted 22 September, 2003; originally announced September 2003.

Report number: ES-615/03 ACM Class: F.2.2; I.2.8

Journal ref: Journal of Combinatorial Optimization 8 (2004), 419-437

Showing 1–37 of 37 results for author: Campos, D