Showing 1–15 of 15 results for author: Hughes, S

Searching in archive cs.
  1. arXiv:2404.03103  [pdf, other]

    cs.RO eess.SY

    Multi-Robot Planning for Filming Groups of Moving Actors Leveraging Submodularity and Pixel Density

    Authors: Skyler Hughes, Rebecca Martin, Micah Corah, Sebastian Scherer

    Abstract: Observing and filming a group of moving actors with a team of aerial robots is a challenging problem that combines elements of multi-robot coordination, coverage, and view planning. A single camera may observe multiple actors at once, and the robot team may observe individual actors from multiple views. As actors move about, groups may split, merge, and reform, and robots filming these actors shou…

    Submitted 3 April, 2024; originally announced April 2024.

    Comments: 10 pages, 5 figures, submitted to CDC 2024

  2. arXiv:2402.19173  [pdf, other]

    cs.SE cs.AI

    StarCoder 2 and The Stack v2: The Next Generation

    Authors: Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, et al. (41 additional authors not shown)

    Abstract: The BigCode project, an open-scientific collaboration focused on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder2. In partnership with Software Heritage (SWH), we build The Stack v2 on top of the digital commons of their source code archive. Alongside the SWH repositories spanning 619 programming languages, we carefully select other high-quality data…

    Submitted 29 February, 2024; originally announced February 2024.

  3. arXiv:2312.03872  [pdf, other]

    cs.CY cs.AI cs.CL cs.LG cs.PL

    The BigCode Project Governance Card

    Authors: BigCode collaboration, Sean Hughes, Harm de Vries, Jennifer Robinson, Carlos Muñoz Ferrandis, Loubna Ben Allal, Leandro von Werra, Jennifer Ding, Sebastien Paquet, Yacine Jernite

    Abstract: This document serves as an overview of the different mechanisms and areas of governance in the BigCode project. It aims to support transparency by providing relevant information about choices that were made during the project to the broader public, and to serve as an example of intentional governance of an open research project that future endeavors can leverage to shape their own approach. The fi…

    Submitted 6 December, 2023; originally announced December 2023.

    Comments: 12 pages, related papers arXiv:2305.06161 and arXiv:2301.03988 and arXiv:2211.15533v1, learn more at https://www.bigcode-project.org/

  4. arXiv:2309.14323  [pdf]

    cs.IR

    Cluster Language Model for Improved E-Commerce Retrieval and Ranking: Leveraging Query Similarity and Fine-Tuning for Personalized Results

    Authors: Duleep Rathgamage Don, Ying Xie, Le Yu, Simon Hughes, Yun Zhu

    Abstract: This paper proposes a novel method to improve the accuracy of product search in e-commerce by utilizing a cluster language model. The method aims to address the limitations of the bi-encoder architecture while maintaining a minimal additional training burden. The approach involves labeling top products for each query, generating semantically similar query clusters using the K-Means clustering algo…

    Submitted 25 September, 2023; originally announced September 2023.

    Comments: Accepted at The 6th Workshop on e-Commerce and NLP (ECNLP 6), KDD'23, Long Beach, CA

    ACM Class: I.2.7

  5. arXiv:2309.13061  [pdf, other]

    cs.CL cs.CY

    Applying BioBERT to Extract Germline Gene-Disease Associations for Building a Knowledge Graph from the Biomedical Literature

    Authors: Armando D. Diaz Gonzalez, Kevin S. Hughes, Songhui Yue, Sean T. Hayes

    Abstract: Published biomedical information has increased and continues to increase rapidly. The recent advancements in Natural Language Processing (NLP) have generated considerable interest in automating the extraction, normalization, and representation of biomedical knowledge about entities such as genes and diseases. Our study analyzes germline abstracts in the construction of knowledge graphs of the immens…

    Submitted 22 April, 2024; v1 submitted 11 September, 2023; originally announced September 2023.

    Comments: 10 pages

    Journal ref: The 7th International Conference on Information System and Data Mining (ICISDM2023-ACM), Atlanta, USA, May 2023

  6. arXiv:2305.06161  [pdf, other]

    cs.CL cs.AI cs.PL cs.SE

    StarCoder: may the source be with you!

    Authors: Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Logesh Kumar Umapathi, Jian Zhu, et al. (42 additional authors not shown)

    Abstract: The BigCode community, an open-scientific collaboration working on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder and StarCoderBase: 15.5B parameter models with 8K context length, infilling capabilities and fast large-batch inference enabled by multi-query attention. StarCoderBase is trained on 1 trillion tokens sourced from The Stack, a large colle…

    Submitted 13 December, 2023; v1 submitted 9 May, 2023; originally announced May 2023.

  7. arXiv:2301.03988  [pdf, other]

    cs.SE cs.AI cs.LG

    SantaCoder: don't reach for the stars!

    Authors: Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, Logesh Kumar Umapathi, Carolyn Jane Anderson, Yangtian Zi, Joel Lamy Poirier, Hailey Schoelkopf, Sergey Troshin, Dmitry Abulkhanov, Manuel Romero, Michael Lappert, Francesco De Toni, Bernardo García del Río, Qian Liu, Shamik Bose, Urvashi Bhattacharyya, Terry Yue Zhuo, et al. (16 additional authors not shown)

    Abstract: The BigCode project is an open-scientific collaboration working on the responsible development of large language models for code. This tech report describes the progress of the collaboration until December 2022, outlining the current state of the Personally Identifiable Information (PII) redaction pipeline, the experiments conducted to de-risk the model architecture, and the experiments investigat…

    Submitted 24 February, 2023; v1 submitted 9 January, 2023; originally announced January 2023.

  8. arXiv:2211.15533  [pdf, other]

    cs.CL cs.AI

    The Stack: 3 TB of permissively licensed source code

    Authors: Denis Kocetkov, Raymond Li, Loubna Ben Allal, Jia Li, Chenghao Mou, Carlos Muñoz Ferrandis, Yacine Jernite, Margaret Mitchell, Sean Hughes, Thomas Wolf, Dzmitry Bahdanau, Leandro von Werra, Harm de Vries

    Abstract: Large Language Models (LLMs) play an ever-increasing role in the field of Artificial Intelligence (AI)--not only for natural language processing but also for code understanding and generation. To stimulate open and responsible research on LLMs for code, we introduce The Stack, a 3.1 TB dataset consisting of permissively licensed source code in 30 programming languages. We describe how we collect t…

    Submitted 20 November, 2022; originally announced November 2022.

  9. De-Biased Modelling of Search Click Behavior with Reinforcement Learning

    Authors: Jianghong Zhou, Sayyed M. Zahiri, Simon Hughes, Khalifeh Al Jadda, Surya Kallumadi, Eugene Agichtein

    Abstract: Users' clicks on Web search results are one of the key signals for evaluating and improving web search quality and have been widely used as part of current state-of-the-art Learning-To-Rank (LTR) models. With a large volume of search logs available for major search engines, effective models of searcher click behavior have emerged to evaluate and train LTR models. However, when modeling the users' c…

    Submitted 20 May, 2021; originally announced May 2021.

    Comments: SIGIR 2021 Short Paper

  10. arXiv:2105.00867  [pdf]

    cs.IR

    Online Product Feature Recommendations with Interpretable Machine Learning

    Authors: Mingming Guo, Nian Yan, Xiquan Cui, Simon Hughes, Khalifeh Al Jadda

    Abstract: Product feature recommendations are critical for online customers to purchase the right products based on the right features. For a customer, selecting the product that has the best trade-off between price and functionality is a time-consuming step in an online shopping experience, and customers can be overwhelmed by the available choices. However, determining the set of product features that most…

    Submitted 28 April, 2021; originally announced May 2021.

  11. arXiv:2104.11384  [pdf, other]

    cs.IR cs.CL cs.LG

    APRF-Net: Attentive Pseudo-Relevance Feedback Network for Query Categorization

    Authors: Ali Ahmadvand, Sayyed M. Zahiri, Simon Hughes, Khalifa Al Jadda, Surya Kallumadi, Eugene Agichtein

    Abstract: Query categorization is an essential part of query intent understanding in e-commerce search. A common query categorization task is to select the relevant fine-grained product categories in a product taxonomy. For frequent queries, rich customer behavior (e.g., click-through data) can be used to infer the relevant product categories. However, for more rare queries, which cover a large volume of se…

    Submitted 10 May, 2021; v1 submitted 22 April, 2021; originally announced April 2021.

  12. arXiv:2005.08146  [pdf, other]

    cs.CL cs.IR

    Semi-Automating Knowledge Base Construction for Cancer Genetics

    Authors: Somin Wadhwa, Kanhua Yin, Kevin S. Hughes, Byron C. Wallace

    Abstract: In this work, we consider the exponentially growing subarea of genetics in cancer. The need to synthesize and centralize this evidence for dissemination has motivated a team of physicians to manually construct and maintain a knowledge base that distills key results reported in the literature. This is a laborious process that entails reading through full-text articles to understand the study design…

    Submitted 25 May, 2020; v1 submitted 16 May, 2020; originally announced May 2020.

    Comments: In proceedings of the Conference on Automated Knowledge Base Construction (AKBC), 2020

  13. arXiv:1904.12617  [pdf]

    cs.IR cs.LG

    Using Machine Learning and Natural Language Processing to Review and Classify the Medical Literature on Cancer Susceptibility Genes

    Authors: Yujia Bao, Zhengyi Deng, Yan Wang, Heeyoon Kim, Victor Diego Armengol, Francisco Acevedo, Nofal Ouardaoui, Cathy Wang, Giovanni Parmigiani, Regina Barzilay, Danielle Braun, Kevin S Hughes

    Abstract: PURPOSE: The medical literature relevant to germline genetics is growing exponentially. Clinicians need tools monitoring and prioritizing the literature to understand the clinical implications of the pathogenic genetic variants. We developed and evaluated two machine learning models to classify abstracts as relevant to the penetrance (risk of cancer for germline mutation carriers) or prevalence of…

    Submitted 24 April, 2019; originally announced April 2019.

  14. arXiv:1504.01169  [pdf, other]

    stat.ML cs.LG

    Efficient Dictionary Learning via Very Sparse Random Projections

    Authors: Farhad Pourkamali-Anaraki, Stephen Becker, Shannon M. Hughes

    Abstract: Performing signal processing tasks on compressive measurements of data has received great attention in recent years. In this paper, we extend previous work on compressive dictionary learning by showing that more general random projections may be used, including sparse ones. More precisely, we examine compressive K-means clustering as a special case of compressive dictionary learning and give theor…

    Submitted 5 April, 2015; originally announced April 2015.

    Comments: 5 pages, 2 figures, accepted in Sampling Theory and Applications (SampTA) 2015

  15. arXiv:1205.5589  [pdf, ps, other]

    cs.IT

    Technical report: Two observations on probability distribution symmetries for randomly-projected data

    Authors: Hanchao Qi, Shannon M. Hughes

    Abstract: In this technical report, we will make two observations concerning symmetries of the probability distribution resulting from projection of a piece of p-dimensional data onto a random m-dimensional subspace of $\mathbb{R}^p$, where m < p. In particular, we shall observe that such distributions are unchanged by reflection across the original data vector and by rotation about the original data vector.

    Submitted 24 May, 2012; originally announced May 2012.