-
Ragnarök: A Reusable RAG Framework and Baselines for TREC 2024 Retrieval-Augmented Generation Track
Authors:
Ronak Pradeep,
Nandan Thakur,
Sahel Sharifymoghaddam,
Eric Zhang,
Ryan Nguyen,
Daniel Campos,
Nick Craswell,
Jimmy Lin
Abstract:
Did you try out the new Bing Search? Or maybe you fiddled around with Google AI Overviews? These might sound familiar because the modern-day search stack has recently evolved to include retrieval-augmented generation (RAG) systems. They allow searching and incorporating real-time data into large language models (LLMs) to provide a well-informed, attributed, concise summary, in contrast to the traditional search paradigm that relies on displaying a ranked list of documents. Therefore, given these recent advancements, it is crucial to have an arena to build, test, visualize, and systematically evaluate RAG-based search systems. With this in mind, we propose the TREC 2024 RAG Track to foster innovation in evaluating RAG systems. In our work, we lay out the steps we've taken towards making this track a reality -- we describe the details of our reusable framework, Ragnarök, explain the curation of the new MS MARCO V2.1 collection, release the development topics for the track, and standardize the I/O definitions which assist the end user. Next, using Ragnarök, we identify and provide key industrial baselines such as OpenAI's GPT-4o or Cohere's Command R+. Further, we introduce a web-based user interface for an interactive arena that allows pairwise benchmarking of RAG systems through crowdsourcing. We open-source our Ragnarök framework and baselines to achieve a unified standard for future RAG systems.
Submitted 24 June, 2024;
originally announced June 2024.
-
Synthetic Test Collections for Retrieval Evaluation
Authors:
Hossein A. Rahmani,
Nick Craswell,
Emine Yilmaz,
Bhaskar Mitra,
Daniel Campos
Abstract:
Test collections play a vital role in evaluation of information retrieval (IR) systems. Obtaining a diverse set of user queries for test collection construction can be challenging, and acquiring relevance judgments, which indicate the appropriateness of retrieved documents to a query, is often costly and resource-intensive. Generating synthetic datasets using Large Language Models (LLMs) has recently gained significant attention in various applications. In IR, while previous work exploited the capabilities of LLMs to generate synthetic queries or documents to augment training data and improve the performance of ranking models, using LLMs for constructing synthetic test collections is relatively unexplored. Previous studies demonstrate that LLMs have the potential to generate synthetic relevance judgments for use in the evaluation of IR systems. In this paper, we comprehensively investigate whether it is possible to use LLMs to construct fully synthetic test collections by generating not only synthetic judgments but also synthetic queries. In particular, we analyse whether it is possible to construct reliable synthetic test collections and the potential risks of bias such test collections may exhibit towards LLM-based models. Our experiments indicate that using LLMs it is possible to construct synthetic test collections that can reliably be used for retrieval evaluation.
Submitted 13 May, 2024;
originally announced May 2024.
-
Arctic-Embed: Scalable, Efficient, and Accurate Text Embedding Models
Authors:
Luke Merrick,
Danmei Xu,
Gaurav Nuti,
Daniel Campos
Abstract:
This report describes the training dataset creation and recipe behind the family of arctic-embed text embedding models (a set of five models ranging from 22 to 334 million parameters with weights open-sourced under an Apache-2 license). At the time of their release, each model achieved state-of-the-art retrieval accuracy for models of its size on the MTEB Retrieval leaderboard, with the largest model, arctic-embed-l, outperforming closed-source embedding models such as Cohere's embed-v3 and OpenAI's text-embed-3-large. In addition to the details of our training recipe, we have provided several informative ablation studies, which we believe shed light on the cause of our model performance.
Submitted 8 May, 2024;
originally announced May 2024.
-
QCore: Data-Efficient, On-Device Continual Calibration for Quantized Models -- Extended Version
Authors:
David Campos,
Bin Yang,
Tung Kieu,
Miao Zhang,
Chenjuan Guo,
Christian S. Jensen
Abstract:
We are witnessing an increasing availability of streaming data that may contain valuable information on the underlying processes. It is thus attractive to be able to deploy machine learning models on edge devices near sensors such that decisions can be made instantaneously, rather than first having to transmit incoming data to servers. To enable deployment on edge devices with limited storage and computational capabilities, the full-precision parameters in standard models can be quantized to use fewer bits. The resulting quantized models are then calibrated using back-propagation and full training data to ensure accuracy. This one-time calibration works for deployments in static environments. However, model deployment in dynamic edge environments calls for continual calibration to adaptively adjust quantized models to fit new incoming data, which may have different distributions. The first difficulty in enabling continual calibration on the edge is that the full training data may be too large and thus not always available on edge devices. The second difficulty is that the use of back-propagation on the edge for repeated calibration is too expensive. We propose QCore to enable continual calibration on the edge. First, it compresses the full training data into a small subset to enable effective calibration of quantized models with different bit-widths. We also propose means of updating the subset when new streaming data arrives to reflect changes in the environment, while not forgetting earlier training data. Second, we propose a small bit-flipping network that works with the subset to update quantized model parameters, thus enabling efficient continual calibration without back-propagation. An experimental study, conducted with real-world data in a continual learning setting, offers insight into the properties of QCore and shows that it is capable of outperforming strong baseline methods.
Submitted 22 April, 2024;
originally announced April 2024.
-
A Semi-Lagrangian Approach for Time and Energy Path Planning Optimization in Static Flow Fields
Authors:
Víctor C. da S. Campos,
Armando A. Neto,
Douglas G. Macharet
Abstract:
Efficient path planning for autonomous mobile robots is a critical problem across numerous domains, where optimizing both time and energy consumption is paramount. This paper introduces a novel methodology that considers the dynamic influence of an environmental flow field and considers geometric constraints, including obstacles and forbidden zones, enriching the complexity of the planning problem. We formulate it as a multi-objective optimal control problem, propose a novel transformation called Harmonic Transformation, and apply a semi-Lagrangian scheme to solve it. The set of Pareto efficient solutions is obtained considering two distinct approaches: a deterministic method and an evolutionary-based one, both of which are designed to make use of the proposed Harmonic Transformation. Through an extensive analysis of these approaches, we demonstrate their efficacy in finding optimized paths.
Submitted 14 June, 2024; v1 submitted 25 March, 2024;
originally announced March 2024.
-
Overview of the TREC 2023 Product Product Search Track
Authors:
Daniel Campos,
Surya Kallumadi,
Corby Rosset,
Cheng Xiang Zhai,
Alessandro Magnani
Abstract:
This is the first year of the TREC Product search track. The focus this year was the creation of a reusable collection and evaluation of the impact of the use of metadata and multi-modal data on retrieval accuracy. This year we leverage the new product search corpus, which includes contextual metadata. Our analysis shows that in the product search domain, traditional retrieval systems are highly effective and commonly outperform general-purpose pretrained embedding models. Our analysis also evaluates the impact of using simplified and metadata-enhanced collections, finding no clear trend in the impact of the expanded collection. We also see some surprising outcomes; despite their widespread adoption and competitive performance on other tasks, we find single-stage dense retrieval runs can commonly be noncompetitive or generate low-quality results both in the zero-shot and fine-tuned domain.
Submitted 15 November, 2023; v1 submitted 13 November, 2023;
originally announced November 2023.
-
Hearing the voice of experts: Unveiling Stack Exchange communities' knowledge of test smells
Authors:
Luana Martins,
Denivan Campos,
Railana Santana,
Joselito Mota Junior,
Heitor Costa,
Ivan Machado
Abstract:
Refactorings are transformations to improve the code design without changing overall functionality and observable behavior. During the refactoring process of smelly test code, practitioners may struggle to identify refactoring candidates and define and apply corrective strategies. This paper reports on an empirical study aimed at understanding how test smells and test refactorings are discussed on the Stack Exchange network. Developers commonly count on Stack Exchange to pick the brains of the wise, i.e., to 'look up' how others are completing similar tasks. Therefore, in light of data from the Stack Exchange discussion topics, we could examine how developers understand and perceive test smells, the corrective actions they take to handle them, and the challenges they face when refactoring test code to fix test smells. We observed that developers are interested in others' perceptions and hands-on experience handling test code issues. Besides, there is a clear indication that developers more often ask whether test smells or anti-patterns are good or bad testing practices than ask for code-based refactoring recommendations.
Submitted 5 May, 2023;
originally announced May 2023.
-
Noise-Robust Dense Retrieval via Contrastive Alignment Post Training
Authors:
Daniel Campos,
ChengXiang Zhai,
Alessandro Magnani
Abstract:
The success of contextual word representations and advances in neural information retrieval have made dense vector-based retrieval a standard approach for passage and document ranking. While effective and efficient, dual-encoders are brittle to variations in query distributions and noisy queries. Data augmentation can make models more robust but introduces overhead to training set generation and requires retraining and index regeneration. We present Contrastive Alignment POst Training (CAPOT), a highly efficient finetuning method that improves model robustness without requiring index regeneration, training set optimization, or alteration. CAPOT enables robust retrieval by freezing the document encoder while the query encoder learns to align noisy queries with their unaltered root. We evaluate CAPOT on noisy variants of MSMARCO, Natural Questions, and Trivia QA passage retrieval, finding that CAPOT has a similar impact as data augmentation with none of its overhead.
Submitted 10 April, 2023; v1 submitted 6 April, 2023;
originally announced April 2023.
-
To Asymmetry and Beyond: Structured Pruning of Sequence to Sequence Models for Improved Inference Efficiency
Authors:
Daniel Campos,
ChengXiang Zhai
Abstract:
Sequence-to-sequence language models can be used to produce abstractive summaries which are coherent, relevant, and concise. Still, model sizes can make deployment in latency-sensitive or web-scale implementations difficult. This paper studies the relationship between model size, structured pruning, inference efficiency, and summarization accuracy on widely used summarization datasets. We show that model accuracy is tied to the encoder size while inference efficiency is connected to the decoder. Using asymmetric pruning can lead to nearly 3x improvement in inference latency with ~1 point loss in Rouge-2. Moreover, we find both the average degradation and the role of asymmetry to be consistent across model sizes and variations in datasets.
Submitted 12 June, 2023; v1 submitted 5 April, 2023;
originally announced April 2023.
-
Quick Dense Retrievers Consume KALE: Post Training Kullback Leibler Alignment of Embeddings for Asymmetrical dual encoders
Authors:
Daniel Campos,
Alessandro Magnani,
ChengXiang Zhai
Abstract:
In this paper, we consider the problem of improving the inference latency of language model-based dense retrieval systems by introducing structural compression and model size asymmetry between the context and query encoders. First, we investigate the impact of pre- and post-training compression on the MSMARCO, Natural Questions, TriviaQA, SQUAD, and SCIFACT datasets, finding that asymmetry in the dual encoders in dense retrieval can lead to improved inference efficiency. Knowing this, we introduce Kullback Leibler Alignment of Embeddings (KALE), an efficient and accurate method for increasing the inference efficiency of dense retrieval methods by pruning and aligning the query encoder after training. Specifically, KALE extends traditional Knowledge Distillation after bi-encoder training, allowing for effective query encoder compression without full retraining or index generation. Using KALE and asymmetric training, we can generate models which exceed the performance of DistilBERT despite having 3x faster inference.
Submitted 1 June, 2023; v1 submitted 31 March, 2023;
originally announced April 2023.
-
Dense Sparse Retrieval: Using Sparse Language Models for Inference Efficient Dense Retrieval
Authors:
Daniel Campos,
ChengXiang Zhai
Abstract:
Vector-based retrieval systems have become a common staple for academic and industrial search applications because they provide a simple and scalable way of extending the search to leverage contextual representations for documents and queries. As these vector-based systems rely on contextual language models, their usage commonly requires GPUs, which can be expensive and difficult to manage. Given recent advances in introducing sparsity into language models for improved inference efficiency, in this paper, we study how sparse language models can be used for dense retrieval to improve inference efficiency. Using the popular retrieval library Tevatron and the MSMARCO, NQ, and TriviaQA datasets, we find that sparse language models can be used as direct replacements with little to no drop in accuracy and up to 4.3x improved inference speeds.
Submitted 31 March, 2023;
originally announced April 2023.
-
oBERTa: Improving Sparse Transfer Learning via improved initialization, distillation, and pruning regimes
Authors:
Daniel Campos,
Alexandre Marques,
Mark Kurtz,
ChengXiang Zhai
Abstract:
In this paper, we introduce the range of oBERTa language models, an easy-to-use set of language models which allows Natural Language Processing (NLP) practitioners to obtain between 3.8 and 24.3 times faster models without expertise in model compression. Specifically, oBERTa extends existing work on pruning, knowledge distillation, and quantization and leverages frozen embeddings, improved distillation, and model initialization to deliver higher accuracy on a broad range of transfer tasks. In generating oBERTa, we explore how the highly optimized RoBERTa differs from BERT for pruning during pre-training and finetuning. We find it less amenable to compression during fine-tuning. We explore the use of oBERTa on seven representative NLP tasks and find that the improved compression techniques allow a pruned oBERTa model to match the performance of BERTbase and exceed the performance of Prune OFA Large on the SQUAD V1.1 Question Answering dataset, despite being 8x and 2x faster in inference, respectively. We release our code, training regimes, and associated model to encourage broad usage and experimentation.
Submitted 6 June, 2023; v1 submitted 29 March, 2023;
originally announced March 2023.
-
LightTS: Lightweight Time Series Classification with Adaptive Ensemble Distillation -- Extended Version
Authors:
David Campos,
Miao Zhang,
Bin Yang,
Tung Kieu,
Chenjuan Guo,
Christian S. Jensen
Abstract:
Due to the sweeping digitalization of processes, increasingly vast amounts of time series data are being produced. Accurate classification of such time series facilitates decision making in multiple domains. State-of-the-art classification accuracy is often achieved by ensemble learning where results are synthesized from multiple base models. This characteristic implies that ensemble learning needs substantial computing resources, preventing its use in resource-limited environments, such as in edge devices. To extend the applicability of ensemble learning, we propose the LightTS framework that compresses large ensembles into lightweight models while ensuring competitive accuracy. First, we propose adaptive ensemble distillation that assigns adaptive weights to different base models such that their varying classification capabilities contribute purposefully to the training of the lightweight model. Second, we propose means of identifying Pareto optimal settings w.r.t. model accuracy and model size, thus enabling users with a space budget to select the most accurate lightweight model. We report on experiments using 128 real-world time series sets and different types of base models that justify key decisions in the design of LightTS and provide evidence that LightTS is able to outperform competitors.
Submitted 24 February, 2023;
originally announced February 2023.
-
Compressing Cross-Lingual Multi-Task Models at Qualtrics
Authors:
Daniel Campos,
Daniel Perry,
Samir Joshi,
Yashmeet Gambhir,
Wei Du,
Zhengzheng Xing,
Aaron Colak
Abstract:
Experience management is an emerging business area where organizations focus on understanding the feedback of customers and employees in order to improve their end-to-end experiences. This results in a unique set of machine learning problems to help understand how people feel, discover issues they care about, and find which actions need to be taken on data that are different in content and distribution from traditional NLP domains. In this paper, we present a case study of building text analysis applications that perform multiple classification tasks efficiently in 12 languages in the nascent business area of experience management. In order to scale up modern ML methods on experience data, we leverage cross lingual and multi-task modeling techniques to consolidate our models into a single deployment to avoid overhead. We also make use of model compression and model distillation to reduce overall inference latency and hardware cost to the level acceptable for business needs while maintaining model prediction quality. Our findings show that multi-task modeling improves task performance for a subset of experience management tasks in both XLM-R and mBert architectures. Among the compressed architectures we explored, we found that MiniLM achieved the best compression/performance tradeoff. Our case study demonstrates a speedup of up to 15.61x with 2.60% average task degradation (or 3.29x speedup with 1.71% degradation) and estimated savings of 44% over using the original full-size model. These results demonstrate a successful scaling up of text classification for the challenging new area of ML for experience management.
Submitted 28 November, 2022;
originally announced November 2022.
-
Sparse*BERT: Sparse Models Generalize To New tasks and Domains
Authors:
Daniel Campos,
Alexandre Marques,
Tuan Nguyen,
Mark Kurtz,
ChengXiang Zhai
Abstract:
Large Language Models have become the core architecture upon which most modern natural language processing (NLP) systems build. These models can consistently deliver impressive accuracy and robustness across tasks and domains, but their high computational overhead can make inference difficult and expensive. To make using these models less costly, recent work has explored leveraging structured and unstructured pruning, quantization, and distillation to improve inference speed and decrease size. This paper studies how models pruned using Gradual Unstructured Magnitude Pruning can transfer between domains and tasks. Our experimentation shows that models that are pruned during pretraining using general domain masked language models can transfer to novel domains and tasks without extensive hyperparameter exploration or specialized approaches. We demonstrate that our general sparse model Sparse*BERT can become SparseBioBERT simply by pretraining the compressed architecture on unstructured biomedical text. Moreover, we show that SparseBioBERT can match the quality of BioBERT with only 10% of the parameters.
Submitted 5 April, 2023; v1 submitted 24 May, 2022;
originally announced May 2022.
-
The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models
Authors:
Eldar Kurtic,
Daniel Campos,
Tuan Nguyen,
Elias Frantar,
Mark Kurtz,
Benjamin Fineran,
Michael Goin,
Dan Alistarh
Abstract:
Transformer-based language models have become a key building block for natural language processing. While these models are extremely accurate, they can be too large and computationally intensive to run on standard deployments. A variety of compression methods, including distillation, quantization, and structured and unstructured pruning, are known to decrease model size and increase inference speed, with low accuracy loss. In this context, this paper's contributions are two-fold. First, we perform an in-depth study of the accuracy-compression trade-off for unstructured weight pruning of BERT models. We introduce Optimal BERT Surgeon (oBERT), an efficient and accurate weight pruning method based on approximate second-order information, which we show to yield state-of-the-art results in both stages of language tasks: pre-training and fine-tuning. Specifically, oBERT extends existing work on unstructured second-order pruning by allowing for pruning blocks of weights, and by being applicable at the BERT scale. Second, we investigate the impact of this pruning method when compounding compression approaches to obtain highly compressed but accurate models for deployment on edge devices. These models significantly push the boundaries of the current state-of-the-art sparse BERT models with respect to all metrics: model size, inference speed, and task accuracy. For example, relative to the dense BERT-base, we obtain 10x model size compression (in MB) with < 1% accuracy drop, 10x CPU-inference speedup with < 2% accuracy drop, and 29x CPU-inference speedup with < 7.5% accuracy drop. Our code, fully integrated with Transformers and SparseML, is available at https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT.
Submitted 17 October, 2022; v1 submitted 14 March, 2022;
originally announced March 2022.
-
Sparsity-Inducing Categorical Prior Improves Robustness of the Information Bottleneck
Authors:
Anirban Samaddar,
Sandeep Madireddy,
Prasanna Balaprakash,
Tapabrata Maiti,
Gustavo de los Campos,
Ian Fischer
Abstract:
The information bottleneck framework provides a systematic approach to learning representations that compress nuisance information in the input and extract semantically meaningful information about predictions. However, the choice of a prior distribution that fixes the dimensionality across all the data can restrict the flexibility of this approach for learning robust representations. We present a novel sparsity-inducing spike-slab categorical prior that uses sparsity as a mechanism to provide the flexibility that allows each data point to learn its own dimension distribution. In addition, it provides a mechanism for learning a joint distribution of the latent variable and the sparsity and hence can account for the complete uncertainty in the latent space. Through a series of experiments using in-distribution and out-of-distribution learning scenarios on the MNIST, CIFAR-10, and ImageNet data, we show that the proposed approach improves accuracy and robustness compared to traditional fixed-dimensional priors, as well as other sparsity induction mechanisms for latent variable models proposed in the literature.
Submitted 27 October, 2022; v1 submitted 4 March, 2022;
originally announced March 2022.
-
Unsupervised Time Series Outlier Detection with Diversity-Driven Convolutional Ensembles -- Extended Version
Authors:
David Campos,
Tung Kieu,
Chenjuan Guo,
Feiteng Huang,
Kai Zheng,
Bin Yang,
Christian S. Jensen
Abstract:
With the swee** digitalization of societal, medical, industrial, and scientific processes, sensing technologies are being deployed that produce increasing volumes of time series data, thus fueling a plethora of new or improved applications. In this setting, outlier detection is frequently important, and while solutions based on neural networks exist, they leave room for improvement in terms of both accuracy and efficiency. With the objective of achieving such improvements, we propose a diversity-driven, convolutional ensemble. To improve accuracy, the ensemble employs multiple basic outlier detection models built on convolutional sequence-to-sequence autoencoders that can capture temporal dependencies in time series. Further, a novel diversity-driven training method maintains diversity among the basic models, with the aim of improving the ensemble's accuracy. To improve efficiency, the approach enables a high degree of parallelism during training. In addition, it is able to transfer some model parameters from one basic model to another, which reduces training time. We report on extensive experiments using real-world multivariate time series that offer insight into the design choices underlying the new approach and offer evidence that it is capable of improved accuracy and efficiency. This is an extended version of "Unsupervised Time Series Outlier Detection with Diversity-Driven Convolutional Ensembles", to appear in PVLDB 2022.
△ Less
Submitted 22 November, 2021;
originally announced November 2021.
-
IMG2SMI: Translating Molecular Structure Images to Simplified Molecular-input Line-entry System
Authors:
Daniel Campos,
Heng Ji
Abstract:
Like many scientific fields, new chemistry literature has grown at a staggering pace, with thousands of papers released every month. A large portion of chemistry literature focuses on new molecules and reactions between molecules. Most vital information is conveyed through 2-D images of molecules, representing the underlying molecules or reactions described. In order to ensure reproducible and machine-readable molecule representations, text-based molecule descriptors like SMILES and SELFIES were created. These text-based molecule representations provide molecule generation but are unfortunately rarely present in published literature. In the absence of molecule descriptors, the generation of molecule descriptors from the 2-D images present in the literature is necessary to understand chemistry literature at scale. Successful methods such as Optical Structure Recognition Application (OSRA) and ChemSchematicResolver are able to extract the locations of molecule structures in chemistry papers and infer molecular descriptions and reactions. While effective, existing systems expect chemists to correct outputs, making them unsuitable for unsupervised large-scale data mining. Leveraging the task formulation of image captioning introduced by DECIMER, we introduce IMG2SMI, a model which leverages Deep Residual Networks for image feature extraction and encoder-decoder Transformer layers for molecule description generation. Unlike previous Neural Network-based systems, IMG2SMI builds around the task of molecule description generation, which enables IMG2SMI to outperform OSRA-based systems by 163% in molecule similarity prediction as measured by the molecular MACCS Fingerprint Tanimoto Similarity. Additionally, to facilitate further research on this task, we release a new molecule prediction dataset including 81 million molecules for molecule description generation.
Submitted 3 September, 2021;
originally announced September 2021.
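The IMG2SMI abstract reports results in terms of Tanimoto similarity between MACCS fingerprints. As a rough illustration of that measure only (not the authors' evaluation code), the sketch below assumes each fingerprint is given as a collection of "on" bit indices; the function name `tanimoto` and the choice to return 0.0 for two empty fingerprints are assumptions made here for illustration.

```python
from typing import Iterable


def tanimoto(bits_a: Iterable[int], bits_b: Iterable[int]) -> float:
    """Tanimoto (Jaccard) similarity between two binary fingerprints.

    Each fingerprint is passed as the indices of its set bits.
    Returns |A & B| / |A | B|; two empty fingerprints give 0.0
    (an assumed convention, chosen to avoid division by zero).
    """
    a, b = set(bits_a), set(bits_b)
    union_size = len(a | b)
    if union_size == 0:
        return 0.0
    return len(a & b) / union_size


# Two fingerprints sharing bits 2 and 3 out of four distinct bits overall.
similarity = tanimoto([1, 2, 3], [2, 3, 4])
```

A value of 1.0 means the two fingerprints set exactly the same bits, and 0.0 means they share none.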
-
A Semi-Lagrangian Approach for the Minimal Exposure Path Problem in Wireless Sensor Networks
Authors:
Armando Alves Neto,
Víctor C. da Silva Campos,
Douglas G. Macharet
Abstract:
A critical metric of the coverage quality in Wireless Sensor Networks (WSNs) is the Minimal Exposure Path (MEP), a path through the environment that least exposes an intruder to the sensor detecting nodes. Many approaches have been proposed in the last decades to solve this optimization problem, ranging from classic (grid-based and Voronoi-based) planners to genetic meta-heuristics. However, most of them are limited to specific sensing models and obstacle-free spaces. Still, none of them guarantee an optimal solution, and the state-of-the-art is expensive in terms of run-time. Therefore, in this paper, we propose a novel method that models the MEP as an Optimal Control problem and solves it by using a Semi-Lagrangian approach. This framework is shown to converge to the optimal MEP while also incorporating different homogeneous and heterogeneous sensor models and geometric constraints (obstacles). Experiments show that our method dominates the state-of-the-art, improving the results by approximately 10% with a relatively lower execution time.
Submitted 12 August, 2021;
originally announced August 2021.
-
Curriculum learning for language modeling
Authors:
Daniel Campos
Abstract:
Language Models like ELMo and BERT have provided robust representations of natural language, which serve as the language understanding component for a diverse range of downstream tasks. Curriculum learning is a method that employs a structured training regime instead, which has been leveraged in computer vision and machine translation to improve model training speed and model performance. While language models have proven transformational for the natural language processing community, these models have proven expensive, energy-intensive, and challenging to train. In this work, we explore the effect of curriculum learning on language model pretraining using various linguistically motivated curricula and evaluate transfer performance on the GLUE Benchmark. Despite a broad variety of training methodologies and experiments, we do not find compelling evidence that curriculum learning methods improve language model training.
Submitted 4 August, 2021;
originally announced August 2021.
-
Developers perception on the severity of test smells: an empirical study
Authors:
Denivan Campos,
Larissa Rocha,
Ivan Machado
Abstract:
Unit testing is an essential component of the software development life-cycle. A developer could easily and quickly catch and fix software faults introduced in the source code by creating and running unit tests. Despite their importance, unit tests are subject to bad design or implementation decisions, the so-called test smells. These might decrease software system quality in various respects, making the code harder to understand, more complex to maintain, and more prone to errors and bugs. Many studies discuss the likely effects of test smells on test code. However, there is a lack of studies that capture developers' perceptions of such issues. This study empirically analyzes how developers perceive the severity of test smells in the test code they develop. Severity refers to the degree to which a test smell may negatively impact the test code. We selected six open-source software projects from GitHub and interviewed their developers to understand whether and how the test smells affected the test code. Although most of the interviewed developers considered the test smells as having a low severity to their code, they indicated that test smells might negatively impact the project, particularly in test code maintainability and evolution. Also, detecting and removing test smells from the test code may be positive for the project.
Submitted 29 July, 2021;
originally announced July 2021.
-
MS MARCO: Benchmarking Ranking Models in the Large-Data Regime
Authors:
Nick Craswell,
Bhaskar Mitra,
Emine Yilmaz,
Daniel Campos,
Jimmy Lin
Abstract:
Evaluation efforts such as TREC, CLEF, NTCIR and FIRE, alongside public leaderboards such as MS MARCO, are intended to encourage research and track our progress, addressing big questions in our field. However, the goal is not simply to identify which run is "best", achieving the top score. The goal is to move the field forward by developing new robust techniques that work in many different settings and are adopted in research and practice. This paper uses the MS MARCO and TREC Deep Learning Track as our case study, comparing it to the case of TREC ad hoc ranking in the 1990s. We show how the design of the evaluation effort can encourage or discourage certain outcomes, and raise questions about internal and external validity of results. We provide some analysis of certain pitfalls, and a statement of best practices for avoiding such pitfalls. We summarize the progress of the effort so far, and describe our desired end state of "robust usefulness", along with steps that might be required to get us there.
Submitted 9 May, 2021;
originally announced May 2021.
-
TREC Deep Learning Track: Reusable Test Collections in the Large Data Regime
Authors:
Nick Craswell,
Bhaskar Mitra,
Emine Yilmaz,
Daniel Campos,
Ellen M. Voorhees,
Ian Soboroff
Abstract:
The TREC Deep Learning (DL) Track studies ad hoc search in the large data regime, meaning that a large set of human-labeled training data is available. Results so far indicate that the best models with large data may be deep neural networks. This paper supports the reuse of the TREC DL test collections in three ways. First we describe the data sets in detail, documenting clearly and in one place some details that are otherwise scattered in track guidelines, overview papers and in our associated MS MARCO leaderboard pages. We intend this description to make it easy for newcomers to use the TREC DL data. Second, because there is some risk of iteration and selection bias when reusing a data set, we describe the best practices for writing a paper using TREC DL data, without overfitting. We provide some illustrative analysis. Finally we address a number of issues around the TREC DL data, including an analysis of reusability.
Submitted 19 April, 2021;
originally announced April 2021.
-
Significant Improvements over the State of the Art? A Case Study of the MS MARCO Document Ranking Leaderboard
Authors:
Jimmy Lin,
Daniel Campos,
Nick Craswell,
Bhaskar Mitra,
Emine Yilmaz
Abstract:
Leaderboards are a ubiquitous part of modern research in applied machine learning. By design, they sort entries into some linear order, where the top-scoring entry is recognized as the "state of the art" (SOTA). Due to the rapid progress being made in information retrieval today, particularly with neural models, the top entry in a leaderboard is replaced with some regularity. These are touted as improvements in the state of the art. Such pronouncements, however, are almost never qualified with significance testing. In the context of the MS MARCO document ranking leaderboard, we pose a specific question: How do we know if a run is significantly better than the current SOTA? We ask this question against the backdrop of recent IR debates on scale types: in particular, whether commonly used significance tests are even mathematically permissible. Recognizing these potential pitfalls in evaluation methodology, our study proposes an evaluation framework that explicitly treats certain outcomes as distinct and avoids aggregating them into a single-point metric. Empirical analysis of SOTA runs from the MS MARCO document ranking leaderboard reveals insights about how one run can be "significantly better" than another that are obscured by the current official evaluation metric (MRR@100).
Submitted 25 February, 2021;
originally announced February 2021.
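The abstract above refers to MRR@100, the official metric of the MS MARCO document ranking leaderboard. For readers unfamiliar with it, the following is a minimal sketch of mean reciprocal rank with a cutoff, assuming rankings and relevance judgments are held in plain dictionaries keyed by query id. The function name `mrr_at_k` and the data layout are hypothetical and are not taken from the official evaluation scripts.

```python
from typing import Dict, List, Set


def mrr_at_k(
    rankings: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int = 100,
) -> float:
    """Mean reciprocal rank with cutoff k.

    For each query, score 1/rank of the first relevant document found
    in the top k results, or 0 if none appears; then average over all
    queries in `rankings`.
    """
    if not rankings:
        return 0.0
    total = 0.0
    for query_id, ranked_docs in rankings.items():
        relevant_docs = relevant.get(query_id, set())
        for rank, doc_id in enumerate(ranked_docs[:k], start=1):
            if doc_id in relevant_docs:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings)


# Hypothetical run: q1 finds its relevant document at rank 2, q2 finds none.
run = {"q1": ["d3", "d1"], "q2": ["d9", "d8", "d7"]}
qrels = {"q1": {"d1"}, "q2": {"d5"}}
score = mrr_at_k(run, qrels, k=100)
```

Because each query contributes a single reciprocal-rank value that is then averaged, two runs with quite different per-query behavior can end up with similar scores, which is the kind of aggregation the abstract questions.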
-
Overview of the TREC 2020 deep learning track
Authors:
Nick Craswell,
Bhaskar Mitra,
Emine Yilmaz,
Daniel Campos
Abstract:
This is the second year of the TREC Deep Learning Track, with the goal of studying ad hoc ranking in the large training data regime. We again have a document retrieval task and a passage retrieval task, each with hundreds of thousands of human-labeled training queries. We evaluate using single-shot TREC-style evaluation, to give us a picture of which ranking methods work best when large data is available, with much more comprehensive relevance labeling on the small number of test queries. This year we have further evidence that rankers with BERT-style pretraining outperform other rankers in the large data regime.
Submitted 15 February, 2021;
originally announced February 2021.
-
Logic Synthesis Meets Machine Learning: Trading Exactness for Generalization
Authors:
Shubham Rai,
Walter Lau Neto,
Yukio Miyasaka,
Xinpei Zhang,
Mingfei Yu,
Qingyang Yi,
Masahiro Fujita,
Guilherme B. Manske,
Matheus F. Pontes,
Leomar S. da Rosa Junior,
Marilton S. de Aguiar,
Paulo F. Butzen,
Po-Chun Chien,
Yu-Shan Huang,
Hoa-Ren Wang,
Jie-Hong R. Jiang,
Jiaqi Gu,
Zheng Zhao,
Zixuan Jiang,
David Z. Pan,
Brunno A. de Abreu,
Isac de Souza Campos,
Augusto Berndt,
Cristina Meinhardt,
Jonata T. Carvalho,
Mateus Grellert
, et al. (15 additional authors not shown)
Abstract:
Logic synthesis is a fundamental step in hardware design whose goal is to find structural representations of Boolean functions while minimizing delay and area. If the function is completely-specified, the implementation accurately represents the function. If the function is incompletely-specified, the implementation has to be true only on the care set. While most of the algorithms in logic synthesis rely on SAT and Boolean methods to exactly implement the care set, we investigate learning in logic synthesis, attempting to trade exactness for generalization. This work is directly related to machine learning where the care set is the training set and the implementation is expected to generalize on a validation set. We present learning incompletely-specified functions based on the results of a competition conducted at IWLS 2020. The goal of the competition was to implement 100 functions given by a set of care minterms for training, while testing the implementation using a set of validation minterms sampled from the same function. We make this benchmark suite available and offer a detailed comparative analysis of the different approaches to learning.
Submitted 15 December, 2020; v1 submitted 4 December, 2020;
originally announced December 2020.
-
A Comparison of Humanoid Robot Simulators: A Quantitative Approach
Authors:
Angel Ayala,
Francisco Cruz,
Diego Campos,
Rodrigo Rubio,
Bruno Fernandes,
Richard Dazeley
Abstract:
Research on humanoid robotic systems involves a considerable amount of computational resources, not only for the involved design but also for its development and subsequent implementation. For robotic systems to be implemented in real-world scenarios, in several situations, it is preferred to develop and test them under controlled environments in order to reduce the risk of errors and unexpected behavior. In this regard, a more accessible and efficient alternative is to implement the environment using robotic simulation tools. This paper presents a quantitative comparison of Gazebo, Webots, and V-REP, three simulators widely used by the research community to develop robotic systems. To compare the performance of these three simulators, elements such as CPU, memory footprint, and disk access are used to measure and compare them to each other. In order to measure the use of resources, each simulator executes a robotic scenario 20 times, composed of a NAO robot that must navigate to a goal position avoiding a specific obstacle. In general terms, our results show that Webots is the simulator with the lowest use of resources, followed by V-REP, which has advantages over Gazebo, mainly because of the CPU use.
Submitted 11 August, 2020;
originally announced August 2020.
-
ORCAS: 18 Million Clicked Query-Document Pairs for Analyzing Search
Authors:
Nick Craswell,
Daniel Campos,
Bhaskar Mitra,
Emine Yilmaz,
Bodo Billerbeck
Abstract:
Users of Web search engines reveal their information needs through queries and clicks, making click logs a useful asset for information retrieval. However, click logs have not been publicly released for academic use, because they can be too revealing of personally or commercially sensitive information. This paper describes a click data release related to the TREC Deep Learning Track document corpus. After aggregation and filtering, including a k-anonymity requirement, we find 1.4 million of the TREC DL URLs have 18 million connections to 10 million distinct queries. Our dataset of these queries and connections to TREC documents is of similar size to proprietary datasets used in previous papers on query mining and ranking. We perform some preliminary experiments using the click data to augment the TREC DL training data, offering by comparison: 28x more queries, with 49x more connections to 4.4x more URLs in the corpus. We present a description of the dataset's generation process, characteristics, use in ranking and suggest other potential uses.
Submitted 18 August, 2020; v1 submitted 9 June, 2020;
originally announced June 2020.
-
On the Reliability of Test Collections for Evaluating Systems of Different Types
Authors:
Emine Yilmaz,
Nick Craswell,
Bhaskar Mitra,
Daniel Campos
Abstract:
As deep learning based models are increasingly being used for information retrieval (IR), a major challenge is to ensure the availability of test collections for measuring their quality. Test collections are generated based on pooling results of various retrieval systems, but until recently this did not include deep learning systems. This raises a major challenge for reusable evaluation: Since deep learning based models use external resources (e.g. word embeddings) and advanced representations as opposed to traditional methods that are mainly based on lexical similarity, they may return different types of relevant documents that were not identified in the original pooling. If so, test collections constructed using traditional methods are likely to lead to biased and unfair evaluation results for deep learning (neural) systems. This paper uses simulated pooling to test the fairness and reusability of test collections, showing that pooling based on traditional systems only can lead to biased evaluation of deep learning systems.
Submitted 28 April, 2020;
originally announced April 2020.
-
XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation
Authors:
Yaobo Liang,
Nan Duan,
Yeyun Gong,
Ning Wu,
Fenfei Guo,
Weizhen Qi,
Ming Gong,
Linjun Shou,
Daxin Jiang,
Guihong Cao,
Xiaodong Fan,
Ruofei Zhang,
Rahul Agrawal,
Edward Cui,
Sining Wei,
Taroon Bharti,
Ying Qiao,
Jiun-Hung Chen,
Winnie Wu,
Shuguang Liu,
Fan Yang,
Daniel Campos,
Rangan Majumder,
Ming Zhou
Abstract:
In this paper, we introduce XGLUE, a new benchmark dataset that can be used to train large-scale cross-lingual pre-trained models using multilingual and bilingual corpora and evaluate their performance across a diverse set of cross-lingual tasks. Compared to GLUE (Wang et al., 2019), which is labeled in English for natural language understanding tasks only, XGLUE has two main advantages: (1) it provides 11 diversified tasks that cover both natural language understanding and generation scenarios; (2) for each task, it provides labeled data in multiple languages. We extend a recent cross-lingual pre-trained model Unicoder (Huang et al., 2019) to cover both understanding and generation tasks, which is evaluated on XGLUE as a strong baseline. We also evaluate the base versions (12-layer) of Multilingual BERT, XLM and XLM-R for comparison.
Submitted 22 May, 2020; v1 submitted 3 April, 2020;
originally announced April 2020.
-
Overview of the TREC 2019 deep learning track
Authors:
Nick Craswell,
Bhaskar Mitra,
Emine Yilmaz,
Daniel Campos,
Ellen M. Voorhees
Abstract:
The Deep Learning Track is a new track for TREC 2019, with the goal of studying ad hoc ranking in a large data regime. It is the first track with large human-labeled training sets, introducing two sets corresponding to two tasks, each with rigorous TREC-style blind evaluation and reusable test sets. The document retrieval task has a corpus of 3.2 million documents with 367 thousand training queries, for which we generate a reusable test set of 43 queries. The passage retrieval task has a corpus of 8.8 million passages with 503 thousand training queries, for which we generate a reusable test set of 43 queries. This year 15 groups submitted a total of 75 runs, using various combinations of deep learning, transfer learning and traditional IR ranking methods. Deep learning runs significantly outperformed traditional IR runs. Possible explanations for this result are that we introduced large training data and we included deep models trained on such data in our judging pools, whereas some past studies did not have such training data or pooling.
Submitted 18 March, 2020; v1 submitted 17 March, 2020;
originally announced March 2020.
-
Open Domain Web Keyphrase Extraction Beyond Language Modeling
Authors:
Lee Xiong,
Chuan Hu,
Chenyan Xiong,
Daniel Campos,
Arnold Overwijk
Abstract:
This paper studies keyphrase extraction in real-world scenarios where documents are from diverse domains and have variant content quality. We curate and release OpenKP, a large scale open domain keyphrase extraction dataset with near one hundred thousand web documents and expert keyphrase annotations. To handle the variations of domain and content quality, we develop BLING-KPE, a neural keyphrase extraction model that goes beyond language understanding using visual presentations of documents and weak supervision from search queries. Experimental results on OpenKP confirm the effectiveness of BLING-KPE and the contributions of its neural architecture, visual features, and search log weak supervision. Zero-shot evaluations on DUC-2001 demonstrate the improved generalization ability of learning from the open domain data compared to a specific domain.
Submitted 6 November, 2019;
originally announced November 2019.
-
Experiments in Inferring Social Networks of Diffusion
Authors:
Daniel Campos,
Zoe Konrad
Abstract:
Information diffusion is a fundamental process that takes place over networks. While it is rarely realistic to observe the individual transmissions of the information diffusion process, it is typically possible to observe when individuals first publish the information. We look specifically at the previously published algorithm NETINF, which probabilistically identifies the optimal network that best explains the observed infection times. We explore how the algorithm could perform on a range of intrinsically different social and information network topologies, from news blogs and websites to Twitter to Reddit.
Submitted 9 October, 2019;
originally announced October 2019.
-
Accurate Genomic Prediction Of Human Height
Authors:
Louis Lello,
Steven G. Avery,
Laurent Tellier,
Ana Vazquez,
Gustavo de los Campos,
Stephen D. H. Hsu
Abstract:
We construct genomic predictors for heritable and extremely complex human quantitative traits (height, heel bone density, and educational attainment) using modern methods in high dimensional statistics (i.e., machine learning). Replication tests show that these predictors capture, respectively, $\sim$40, 20, and 9 percent of total variance for the three traits. For example, predicted heights correlate $\sim$0.65 with actual height; actual heights of most individuals in validation samples are within a few cm of the prediction. The variance captured for height is comparable to the estimated SNP heritability from GCTA (GREML) analysis, and seems to be close to its asymptotic value (i.e., as sample size goes to infinity), suggesting that we have captured most of the heritability for the SNPs used. Thus, our results resolve the common SNP portion of the "missing heritability" problem -- i.e., the gap between prediction R-squared and SNP heritability. The $\sim$20k activated SNPs in our height predictor reveal the genetic architecture of human height, at least for common SNPs. Our primary dataset is the UK Biobank cohort, comprised of almost 500k individual genotypes with multiple phenotypes. We also use other datasets and SNPs found in earlier GWAS for out-of-sample validation of our results.
Submitted 19 September, 2017;
originally announced September 2017.
-
MS MARCO: A Human Generated MAchine Reading COmprehension Dataset
Authors:
Payal Bajaj,
Daniel Campos,
Nick Craswell,
Li Deng,
Jianfeng Gao,
Xiaodong Liu,
Rangan Majumder,
Andrew McNamara,
Bhaskar Mitra,
Tri Nguyen,
Mir Rosenberg,
Xia Song,
Alina Stoica,
Saurabh Tiwary,
Tong Wang
Abstract:
We introduce a large scale MAchine Reading COmprehension dataset, which we name MS MARCO. The dataset comprises 1,010,916 anonymized questions---sampled from Bing's search query logs---each with a human generated answer and 182,669 completely human rewritten generated answers. In addition, the dataset contains 8,841,823 passages---extracted from 3,563,535 web documents retrieved by Bing---that provide the information necessary for curating the natural language answers. A question in the MS MARCO dataset may have multiple answers or no answers at all. Using this dataset, we propose three different tasks with varying levels of difficulty: (i) predict if a question is answerable given a set of context passages, and extract and synthesize the answer as a human would (ii) generate a well-formed answer (if possible) based on the context passages that can be understood with the question and passage context, and finally (iii) rank a set of retrieved passages given a question. The size of the dataset and the fact that the questions are derived from real user search queries distinguish MS MARCO from other well-known publicly available datasets for machine reading comprehension and question-answering. We believe that the scale and the real-world nature of this dataset make it attractive for benchmarking machine reading comprehension and question-answering models.
Submitted 31 October, 2018; v1 submitted 28 November, 2016;
originally announced November 2016.
-
A novel evolutionary formulation of the maximum independent set problem
Authors:
V. C. Barbosa,
L. C. D. Campos
Abstract:
We introduce a novel evolutionary formulation of the problem of finding a maximum independent set of a graph. The new formulation is based on the relationship that exists between a graph's independence number and its acyclic orientations. It views such orientations as individuals and evolves them with the aid of evolutionary operators that are very heavily based on the structure of the graph and its acyclic orientations. The resulting heuristic has been tested on some of the Second DIMACS Implementation Challenge benchmark graphs, and has been found to be competitive when compared to several of the other heuristics that have also been tested on those graphs.
Submitted 22 September, 2003;
originally announced September 2003.