Showing 1–37 of 37 results for author: Campos, D

Searching in archive cs.
  1. arXiv:2406.16828  [pdf, other]

    cs.IR cs.AI cs.CL

    Ragnarök: A Reusable RAG Framework and Baselines for TREC 2024 Retrieval-Augmented Generation Track

    Authors: Ronak Pradeep, Nandan Thakur, Sahel Sharifymoghaddam, Eric Zhang, Ryan Nguyen, Daniel Campos, Nick Craswell, Jimmy Lin

    Abstract: Did you try out the new Bing Search? Or maybe you fiddled around with Google AI Overviews? These might sound familiar because the modern-day search stack has recently evolved to include retrieval-augmented generation (RAG) systems. They allow searching and incorporating real-time data into large language models (LLMs) to provide a well-informed, attributed, concise summary in contrast to the tradi…

    Submitted 24 June, 2024; originally announced June 2024.

  2. arXiv:2405.07767  [pdf, other]

    cs.IR cs.AI

    Synthetic Test Collections for Retrieval Evaluation

    Authors: Hossein A. Rahmani, Nick Craswell, Emine Yilmaz, Bhaskar Mitra, Daniel Campos

    Abstract: Test collections play a vital role in evaluation of information retrieval (IR) systems. Obtaining a diverse set of user queries for test collection construction can be challenging, and acquiring relevance judgments, which indicate the appropriateness of retrieved documents to a query, is often costly and resource-intensive. Generating synthetic datasets using Large Language Models (LLMs) has recen…

    Submitted 13 May, 2024; originally announced May 2024.

    Comments: SIGIR 2024

  3. arXiv:2405.05374  [pdf, other]

    cs.CL cs.AI cs.IR

    Arctic-Embed: Scalable, Efficient, and Accurate Text Embedding Models

    Authors: Luke Merrick, Danmei Xu, Gaurav Nuti, Daniel Campos

    Abstract: This report describes the training dataset creation and recipe behind the family of arctic-embed text embedding models (a set of five models ranging from 22 to 334 million parameters with weights open-sourced under an Apache-2 license). At the time of their release, each model achieved state-of-the-art retrieval accuracy for models of their size on the MTEB Retrieval leaderboard, with the…

    Submitted 8 May, 2024; originally announced May 2024.

    Comments: 17 pages, 11 Figures, 9 tables

  4. arXiv:2404.13990  [pdf, other]

    cs.LG cs.DB

    QCore: Data-Efficient, On-Device Continual Calibration for Quantized Models -- Extended Version

    Authors: David Campos, Bin Yang, Tung Kieu, Miao Zhang, Chenjuan Guo, Christian S. Jensen

    Abstract: We are witnessing an increasing availability of streaming data that may contain valuable information on the underlying processes. It is thus attractive to be able to deploy machine learning models on edge devices near sensors such that decisions can be made instantaneously, rather than first having to transmit incoming data to servers. To enable deployment on edge devices with limited storage and…

    Submitted 22 April, 2024; originally announced April 2024.

    Comments: 15 pages. An extended version of "QCore: Data-Efficient, On-Device Continual Calibration for Quantized Models" accepted at PVLDB 2024

  5. arXiv:2403.16859  [pdf, other]

    cs.RO eess.SY

    A Semi-Lagrangian Approach for Time and Energy Path Planning Optimization in Static Flow Fields

    Authors: Víctor C. da S. Campos, Armando A. Neto, Douglas G. Macharet

    Abstract: Efficient path planning for autonomous mobile robots is a critical problem across numerous domains, where optimizing both time and energy consumption is paramount. This paper introduces a novel methodology that considers the dynamic influence of an environmental flow field and considers geometric constraints, including obstacles and forbidden zones, enriching the complexity of the planning problem…

    Submitted 14 June, 2024; v1 submitted 25 March, 2024; originally announced March 2024.

    Comments: 12 pages, initial paper submission; Preprint submitted to Journal of the Franklin Institute

  6. arXiv:2311.07861  [pdf, other]

    cs.IR cs.AI

    Overview of the TREC 2023 Product Product Search Track

    Authors: Daniel Campos, Surya Kallumadi, Corby Rosset, Cheng Xiang Zhai, Alessandro Magnani

    Abstract: This is the first year of the TREC Product search track. The focus this year was the creation of a reusable collection and evaluation of the impact of the use of metadata and multi-modal data on retrieval accuracy. This year we leverage the new product search corpus, which includes contextual metadata. Our analysis shows that in the product search domain, traditional retrieval systems are highly e…

    Submitted 15 November, 2023; v1 submitted 13 November, 2023; originally announced November 2023.

    Comments: 14 pages, 4 figures, 11 tables - TREC 2023

  7. arXiv:2305.03431  [pdf, other]

    cs.SE

    Hearing the voice of experts: Unveiling Stack Exchange communities' knowledge of test smells

    Authors: Luana Martins, Denivan Campos, Railana Santana, Joselito Mota Junior, Heitor Costa, Ivan Machado

    Abstract: Refactorings are transformations to improve the code design without changing overall functionality and observable behavior. During the refactoring process of smelly test code, practitioners may struggle to identify refactoring candidates and define and apply corrective strategies. This paper reports on an empirical study aimed at understanding how test smells and test refactorings are discussed on…

    Submitted 5 May, 2023; originally announced May 2023.

    Comments: Preprint of the manuscript accepted for publication at CHASE 2023

  8. arXiv:2304.03401  [pdf, other]

    cs.IR cs.AI cs.CL

    Noise-Robust Dense Retrieval via Contrastive Alignment Post Training

    Authors: Daniel Campos, ChengXiang Zhai, Alessandro Magnani

    Abstract: The success of contextual word representations and advances in neural information retrieval have made dense vector-based retrieval a standard approach for passage and document ranking. While effective and efficient, dual-encoders are brittle to variations in query distributions and noisy queries. Data augmentation can make models more robust but introduces overhead to training set generation and r…

    Submitted 10 April, 2023; v1 submitted 6 April, 2023; originally announced April 2023.

    Comments: 8 pages, 6 figures, 30 tables

  9. arXiv:2304.02721  [pdf, other]

    cs.CL cs.AI

    To Asymmetry and Beyond: Structured Pruning of Sequence to Sequence Models for Improved Inference Efficiency

    Authors: Daniel Campos, ChengXiang Zhai

    Abstract: Sequence-to-sequence language models can be used to produce abstractive summaries which are coherent, relevant, and concise. Still, model sizes can make deployment in latency-sensitive or web-scale implementations difficult. This paper studies the relationship between model size, structured pruning, inference efficiency, and summarization accuracy on widely used summarization datasets. We show tha…

    Submitted 12 June, 2023; v1 submitted 5 April, 2023; originally announced April 2023.

    Comments: SustaiNLP2023 @ ACL 2023, 9 pages, 6 figures, 33 tables

  10. arXiv:2304.01016  [pdf, other]

    cs.CL cs.AI cs.IR

    Quick Dense Retrievers Consume KALE: Post Training Kullback Leibler Alignment of Embeddings for Asymmetrical dual encoders

    Authors: Daniel Campos, Alessandro Magnani, ChengXiang Zhai

    Abstract: In this paper, we consider the problem of improving the inference latency of language model-based dense retrieval systems by introducing structural compression and model size asymmetry between the context and query encoders. First, we investigate the impact of pre and post-training compression on the MSMARCO, Natural Questions, TriviaQA, SQUAD, and SCIFACT, finding that asymmetry in the dual encod…

    Submitted 1 June, 2023; v1 submitted 31 March, 2023; originally announced April 2023.

    Comments: SustaiNLP2023 @ ACL 2023, 8 pages, 4 figures, 30 tables

  11. arXiv:2304.00114  [pdf, other]

    cs.IR cs.AI cs.CL

    Dense Sparse Retrieval: Using Sparse Language Models for Inference Efficient Dense Retrieval

    Authors: Daniel Campos, ChengXiang Zhai

    Abstract: Vector-based retrieval systems have become a common staple for academic and industrial search applications because they provide a simple and scalable way of extending the search to leverage contextual representations for documents and queries. As these vector-based systems rely on contextual language models, their usage commonly requires GPUs, which can be expensive and difficult to manage. Given…

    Submitted 31 March, 2023; originally announced April 2023.

  12. arXiv:2303.17612  [pdf, other]

    cs.CL cs.AI cs.LG

    oBERTa: Improving Sparse Transfer Learning via improved initialization, distillation, and pruning regimes

    Authors: Daniel Campos, Alexandre Marques, Mark Kurtz, ChengXiang Zhai

    Abstract: In this paper, we introduce the range of oBERTa language models, an easy-to-use set of language models which allows Natural Language Processing (NLP) practitioners to obtain between 3.8 and 24.3 times faster models without expertise in model compression. Specifically, oBERTa extends existing work on pruning, knowledge distillation, and quantization and leverages frozen embeddings improves distilla…

    Submitted 6 June, 2023; v1 submitted 29 March, 2023; originally announced March 2023.

    Comments: SustaiNLP2023 @ ACL 2023, 9 pages, 2 figures, 45 tables

  13. arXiv:2302.12721  [pdf, other]

    cs.LG cs.DB

    LightTS: Lightweight Time Series Classification with Adaptive Ensemble Distillation -- Extended Version

    Authors: David Campos, Miao Zhang, Bin Yang, Tung Kieu, Chenjuan Guo, Christian S. Jensen

    Abstract: Due to the sweeping digitalization of processes, increasingly vast amounts of time series data are being produced. Accurate classification of such time series facilitates decision making in multiple domains. State-of-the-art classification accuracy is often achieved by ensemble learning where results are synthesized from multiple base models. This characteristic implies that ensemble learning need…

    Submitted 24 February, 2023; originally announced February 2023.

    Comments: 15 pages. An extended version of "LightTS: Lightweight Time Series Classification with Adaptive Ensemble Distillation" accepted at SIGMOD 2023

    Journal ref: Proceedings of the ACM on Management of Data 1, 2 (2023), 171:1-171:27

  14. arXiv:2211.15927  [pdf, ps, other]

    cs.CL cs.LG

    Compressing Cross-Lingual Multi-Task Models at Qualtrics

    Authors: Daniel Campos, Daniel Perry, Samir Joshi, Yashmeet Gambhir, Wei Du, Zhengzheng Xing, Aaron Colak

    Abstract: Experience management is an emerging business area where organizations focus on understanding the feedback of customers and employees in order to improve their end-to-end experiences. This results in a unique set of machine learning problems to help understand how people feel, discover issues they care about, and find which actions need to be taken on data that are different in content and distrib…

    Submitted 28 November, 2022; originally announced November 2022.

    Comments: accepted to IAAI-23 (part of AAAI-23)

    ACM Class: I.2.7

  15. arXiv:2205.12452  [pdf, other]

    cs.CL cs.AI

    Sparse*BERT: Sparse Models Generalize To New tasks and Domains

    Authors: Daniel Campos, Alexandre Marques, Tuan Nguyen, Mark Kurtz, ChengXiang Zhai

    Abstract: Large Language Models have become the core architecture upon which most modern natural language processing (NLP) systems build. These models can consistently deliver impressive accuracy and robustness across tasks and domains, but their high computational overhead can make inference difficult and expensive. To make using these models less costly, recent work has explored leveraging structured and…

    Submitted 5 April, 2023; v1 submitted 24 May, 2022; originally announced May 2022.

    Comments: Presented at Sparsity in Neural Networks Workshop at ICML 2022, 6 pages, 2 figures, 4 tables

  16. arXiv:2203.07259  [pdf, other]

    cs.CL cs.LG

    The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models

    Authors: Eldar Kurtic, Daniel Campos, Tuan Nguyen, Elias Frantar, Mark Kurtz, Benjamin Fineran, Michael Goin, Dan Alistarh

    Abstract: Transformer-based language models have become a key building block for natural language processing. While these models are extremely accurate, they can be too large and computationally intensive to run on standard deployments. A variety of compression methods, including distillation, quantization, structured and unstructured pruning are known to decrease model size and increase inference speed, wi…

    Submitted 17 October, 2022; v1 submitted 14 March, 2022; originally announced March 2022.

    Comments: Accepted to EMNLP 2022

  17. arXiv:2203.02592  [pdf, other]

    stat.ML cs.LG stat.ME

    Sparsity-Inducing Categorical Prior Improves Robustness of the Information Bottleneck

    Authors: Anirban Samaddar, Sandeep Madireddy, Prasanna Balaprakash, Tapabrata Maiti, Gustavo de los Campos, Ian Fischer

    Abstract: The information bottleneck framework provides a systematic approach to learning representations that compress nuisance information in the input and extract semantically meaningful information about predictions. However, the choice of a prior distribution that fixes the dimensionality across all the data can restrict the flexibility of this approach for learning robust representations. We present a…

    Submitted 27 October, 2022; v1 submitted 4 March, 2022; originally announced March 2022.

  18. Unsupervised Time Series Outlier Detection with Diversity-Driven Convolutional Ensembles -- Extended Version

    Authors: David Campos, Tung Kieu, Chenjuan Guo, Feiteng Huang, Kai Zheng, Bin Yang, Christian S. Jensen

    Abstract: With the sweeping digitalization of societal, medical, industrial, and scientific processes, sensing technologies are being deployed that produce increasing volumes of time series data, thus fueling a plethora of new or improved applications. In this setting, outlier detection is frequently important, and while solutions based on neural networks exist, they leave room for improvement in terms of b…

    Submitted 22 November, 2021; originally announced November 2021.

    Comments: 14 pages. An extended version of "Unsupervised Time Series Outlier Detection with Diversity-Driven Convolutional Ensembles", to appear in PVLDB 2022

    Journal ref: Proceedings of the VLDB Endowment, 15, 3 (2022), 611-623

  19. arXiv:2109.04202  [pdf, other]

    q-bio.QM cs.CV cs.LG eess.IV

    IMG2SMI: Translating Molecular Structure Images to Simplified Molecular-input Line-entry System

    Authors: Daniel Campos, Heng Ji

    Abstract: Like many scientific fields, new chemistry literature has grown at a staggering pace, with thousands of papers released every month. A large portion of chemistry literature focuses on new molecules and reactions between molecules. Most vital information is conveyed through 2-D images of molecules, representing the underlying molecules or reactions described. In order to ensure reproducible and mac…

    Submitted 3 September, 2021; originally announced September 2021.

  20. arXiv:2108.05868  [pdf, other]

    cs.RO eess.SY

    A Semi-Lagrangian Approach for the Minimal Exposure Path Problem in Wireless Sensor Networks

    Authors: Armando Alves Neto, Víctor C. da Silva Campos, Douglas G. Macharet

    Abstract: A critical metric of the coverage quality in Wireless Sensor Networks (WSNs) is the Minimal Exposure Path (MEP), a path through the environment that least exposes an intruder to the sensor detecting nodes. Many approaches have been proposed in the last decades to solve this optimization problem, ranging from classic (grid-based and Voronoi-based) planners to genetic meta-heuristics. However, most…

    Submitted 12 August, 2021; originally announced August 2021.

  21. arXiv:2108.02170  [pdf, other]

    cs.CL cs.AI

    Curriculum learning for language modeling

    Authors: Daniel Campos

    Abstract: Language Models like ELMo and BERT have provided robust representations of natural language, which serve as the language understanding component for a diverse range of downstream tasks. Curriculum learning is a method that employs a structured training regime instead, which has been leveraged in computer vision and machine translation to improve model training speed and model performance. While lan…

    Submitted 4 August, 2021; originally announced August 2021.

  22. arXiv:2107.13902  [pdf, other]

    cs.SE

    Developers perception on the severity of test smells: an empirical study

    Authors: Denivan Campos, Larissa Rocha, Ivan Machado

    Abstract: Unit testing is an essential component of the software development life-cycle. A developer could easily and quickly catch and fix software faults introduced in the source code by creating and running unit tests. Despite their importance, unit tests are subject to bad design or implementation decisions, the so-called test smells. These might decrease software systems quality from various aspects, m…

    Submitted 29 July, 2021; originally announced July 2021.

    Comments: 14 pages

  23. arXiv:2105.04021  [pdf, other]

    cs.IR cs.AI cs.LG

    MS MARCO: Benchmarking Ranking Models in the Large-Data Regime

    Authors: Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, Jimmy Lin

    Abstract: Evaluation efforts such as TREC, CLEF, NTCIR and FIRE, alongside public leaderboard such as MS MARCO, are intended to encourage research and track our progress, addressing big questions in our field. However, the goal is not simply to identify which run is "best", achieving the top score. The goal is to move the field forward by developing new robust techniques, that work in many different setting…

    Submitted 9 May, 2021; originally announced May 2021.

  24. arXiv:2104.09399  [pdf, other]

    cs.IR cs.AI cs.LG

    TREC Deep Learning Track: Reusable Test Collections in the Large Data Regime

    Authors: Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, Ellen M. Voorhees, Ian Soboroff

    Abstract: The TREC Deep Learning (DL) Track studies ad hoc search in the large data regime, meaning that a large set of human-labeled training data is available. Results so far indicate that the best models with large data may be deep neural networks. This paper supports the reuse of the TREC DL test collections in three ways. First we describe the data sets in detail, documenting clearly and in one place s…

    Submitted 19 April, 2021; originally announced April 2021.

    Comments: arXiv admin note: text overlap with arXiv:2003.07820

  25. arXiv:2102.12887  [pdf, other]

    cs.IR

    Significant Improvements over the State of the Art? A Case Study of the MS MARCO Document Ranking Leaderboard

    Authors: Jimmy Lin, Daniel Campos, Nick Craswell, Bhaskar Mitra, Emine Yilmaz

    Abstract: Leaderboards are a ubiquitous part of modern research in applied machine learning. By design, they sort entries into some linear order, where the top-scoring entry is recognized as the "state of the art" (SOTA). Due to the rapid progress being made in information retrieval today, particularly with neural models, the top entry in a leaderboard is replaced with some regularity. These are touted as i…

    Submitted 25 February, 2021; originally announced February 2021.

  26. arXiv:2102.07662  [pdf, other]

    cs.IR cs.AI cs.CL cs.LG

    Overview of the TREC 2020 deep learning track

    Authors: Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos

    Abstract: This is the second year of the TREC Deep Learning Track, with the goal of studying ad hoc ranking in the large training data regime. We again have a document retrieval task and a passage retrieval task, each with hundreds of thousands of human-labeled training queries. We evaluate using single-shot TREC-style evaluation, to give us a picture of which ranking methods work best when large data is av…

    Submitted 15 February, 2021; originally announced February 2021.

    Comments: arXiv admin note: substantial text overlap with arXiv:2003.07820

  27. arXiv:2012.02530  [pdf, other]

    cs.LG

    Logic Synthesis Meets Machine Learning: Trading Exactness for Generalization

    Authors: Shubham Rai, Walter Lau Neto, Yukio Miyasaka, Xinpei Zhang, Mingfei Yu, Qingyang Yi, Masahiro Fujita, Guilherme B. Manske, Matheus F. Pontes, Leomar S. da Rosa Junior, Marilton S. de Aguiar, Paulo F. Butzen, Po-Chun Chien, Yu-Shan Huang, Hoa-Ren Wang, Jie-Hong R. Jiang, Jiaqi Gu, Zheng Zhao, Zixuan Jiang, David Z. Pan, Brunno A. de Abreu, Isac de Souza Campos, Augusto Berndt, Cristina Meinhardt, Jonata T. Carvalho, Mateus Grellert, et al. (15 additional authors not shown)

    Abstract: Logic synthesis is a fundamental step in hardware design whose goal is to find structural representations of Boolean functions while minimizing delay and area. If the function is completely-specified, the implementation accurately represents the function. If the function is incompletely-specified, the implementation has to be true only on the care set. While most of the algorithms in logic synthes…

    Submitted 15 December, 2020; v1 submitted 4 December, 2020; originally announced December 2020.

    Comments: In this 23 page manuscript, we explore the connection between machine learning and logic synthesis which was the main goal for International Workshop on logic synthesis. It includes approaches applied by ten teams spanning 6 countries across the world

  28. arXiv:2008.04627  [pdf, other]

    cs.RO

    A Comparison of Humanoid Robot Simulators: A Quantitative Approach

    Authors: Angel Ayala, Francisco Cruz, Diego Campos, Rodrigo Rubio, Bruno Fernandes, Richard Dazeley

    Abstract: Research on humanoid robotic systems involves a considerable amount of computational resources, not only for the involved design but also for its development and subsequent implementation. For robotic systems to be implemented in real-world scenarios, in several situations, it is preferred to develop and test them under controlled environments in order to reduce the risk of errors and unexpected b…

    Submitted 11 August, 2020; originally announced August 2020.

    Comments: Accepted in the IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob), 2020

  29. arXiv:2006.05324  [pdf, other]

    cs.IR cs.LG

    ORCAS: 18 Million Clicked Query-Document Pairs for Analyzing Search

    Authors: Nick Craswell, Daniel Campos, Bhaskar Mitra, Emine Yilmaz, Bodo Billerbeck

    Abstract: Users of Web search engines reveal their information needs through queries and clicks, making click logs a useful asset for information retrieval. However, click logs have not been publicly released for academic use, because they can be too revealing of personally or commercially sensitive information. This paper describes a click data release related to the TREC Deep Learning Track document corpu…

    Submitted 18 August, 2020; v1 submitted 9 June, 2020; originally announced June 2020.

  30. arXiv:2004.13486  [pdf, other]

    cs.IR cs.CL cs.LG

    On the Reliability of Test Collections for Evaluating Systems of Different Types

    Authors: Emine Yilmaz, Nick Craswell, Bhaskar Mitra, Daniel Campos

    Abstract: As deep learning based models are increasingly being used for information retrieval (IR), a major challenge is to ensure the availability of test collections for measuring their quality. Test collections are generated based on pooling results of various retrieval systems, but until recently this did not include deep learning systems. This raises a major challenge for reusable evaluation: Since dee…

    Submitted 28 April, 2020; originally announced April 2020.

  31. arXiv:2004.01401  [pdf, ps, other]

    cs.CL

    XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation

    Authors: Yaobo Liang, Nan Duan, Yeyun Gong, Ning Wu, Fenfei Guo, Weizhen Qi, Ming Gong, Linjun Shou, Daxin Jiang, Guihong Cao, Xiaodong Fan, Ruofei Zhang, Rahul Agrawal, Edward Cui, Sining Wei, Taroon Bharti, Ying Qiao, Jiun-Hung Chen, Winnie Wu, Shuguang Liu, Fan Yang, Daniel Campos, Rangan Majumder, Ming Zhou

    Abstract: In this paper, we introduce XGLUE, a new benchmark dataset that can be used to train large-scale cross-lingual pre-trained models using multilingual and bilingual corpora and evaluate their performance across a diverse set of cross-lingual tasks. Comparing to GLUE (Wang et al., 2019), which is labeled in English for natural language understanding tasks only, XGLUE has two main advantages: (1) it pr…

    Submitted 22 May, 2020; v1 submitted 3 April, 2020; originally announced April 2020.

  32. arXiv:2003.07820  [pdf, other]

    cs.IR cs.CL cs.LG

    Overview of the TREC 2019 deep learning track

    Authors: Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, Ellen M. Voorhees

    Abstract: The Deep Learning Track is a new track for TREC 2019, with the goal of studying ad hoc ranking in a large data regime. It is the first track with large human-labeled training sets, introducing two sets corresponding to two tasks, each with rigorous TREC-style blind evaluation and reusable test sets. The document retrieval task has a corpus of 3.2 million documents with 367 thousand training querie…

    Submitted 18 March, 2020; v1 submitted 17 March, 2020; originally announced March 2020.

  33. arXiv:1911.02671  [pdf, other]

    cs.CL cs.IR

    Open Domain Web Keyphrase Extraction Beyond Language Modeling

    Authors: Lee Xiong, Chuan Hu, Chenyan Xiong, Daniel Campos, Arnold Overwijk

    Abstract: This paper studies keyphrase extraction in real-world scenarios where documents are from diverse domains and have variant content quality. We curate and release OpenKP, a large scale open domain keyphrase extraction dataset with near one hundred thousand web documents and expert keyphrase annotations. To handle the variations of domain and content quality, we develop BLING-KPE, a neural keyphrase…

    Submitted 6 November, 2019; originally announced November 2019.

    Journal ref: EMNLP-IJCNLP 2019

  34. arXiv:1910.04277  [pdf]

    cs.SI cs.IT

    Experiments in Inferring Social Networks of Diffusion

    Authors: Daniel Campos, Zoe Konrad

    Abstract: Information diffusion is a fundamental process that takes place over networks. While it is rarely realistic to observe the individual transmissions of the information diffusion process, it is typically possible to observe when individuals first publish the information. We look specifically at previously published algorithm NETINF that probabilistically identifies the optimal network that best expl…

    Submitted 9 October, 2019; originally announced October 2019.

  35. arXiv:1709.06489  [pdf, ps, other]

    q-bio.GN cs.LG q-bio.QM stat.ML

    Accurate Genomic Prediction Of Human Height

    Authors: Louis Lello, Steven G. Avery, Laurent Tellier, Ana Vazquez, Gustavo de los Campos, Stephen D. H. Hsu

    Abstract: We construct genomic predictors for heritable and extremely complex human quantitative traits (height, heel bone density, and educational attainment) using modern methods in high dimensional statistics (i.e., machine learning). Replication tests show that these predictors capture, respectively, ~40, 20, and 9 percent of total variance for the three traits. For example, predicted heights corre…

    Submitted 19 September, 2017; originally announced September 2017.

    Comments: 17 pages, 10 figures

  36. arXiv:1611.09268  [pdf, other]

    cs.CL cs.IR

    MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

    Authors: Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang

    Abstract: We introduce a large scale MAchine Reading COmprehension dataset, which we name MS MARCO. The dataset comprises 1,010,916 anonymized questions (sampled from Bing's search query logs), each with a human generated answer and 182,669 completely human rewritten generated answers. In addition, the dataset contains 8,841,823 passages (extracted from 3,563,535 web documents retrieved by Bing) that…

    Submitted 31 October, 2018; v1 submitted 28 November, 2016; originally announced November 2016.

  37. A novel evolutionary formulation of the maximum independent set problem

    Authors: V. C. Barbosa, L. C. D. Campos

    Abstract: We introduce a novel evolutionary formulation of the problem of finding a maximum independent set of a graph. The new formulation is based on the relationship that exists between a graph's independence number and its acyclic orientations. It views such orientations as individuals and evolves them with the aid of evolutionary operators that are very heavily based on the structure of the graph and…

    Submitted 22 September, 2003; originally announced September 2003.

    Report number: ES-615/03

    ACM Class: F.2.2; I.2.8

    Journal ref: Journal of Combinatorial Optimization 8 (2004), 419-437