MS MARCO Web Search: a Large-scale Information-rich Web Dataset with Millions of Real Click Labels
Authors:
Qi Chen,
Xiubo Geng,
Corby Rosset,
Carolyn Buractaon,
Jingwen Lu,
Tao Shen,
Kun Zhou,
Chenyan Xiong,
Yeyun Gong,
Paul Bennett,
Nick Craswell,
Xing Xie,
Fan Yang,
Bryan Tower,
Nikhil Rao,
Anlei Dong,
Wenqi Jiang,
Zheng Liu,
Mingqin Li,
Chuanjie Liu,
Zengzhong Li,
Rangan Majumder,
Jennifer Neville,
Andy Oakley,
Knut Magne Risvik, et al. (6 additional authors not shown)
Abstract:
Recent breakthroughs in large models have highlighted the critical significance of data scale, labels and modalities. In this paper, we introduce MS MARCO Web Search, the first large-scale information-rich web dataset, featuring millions of real clicked query-document labels. This dataset closely mimics real-world web document and query distributions, provides rich information for various kinds of downstream tasks and encourages research in various areas, such as generic end-to-end neural indexer models, generic embedding models, and next-generation information access systems with large language models. MS MARCO Web Search offers a retrieval benchmark with three web retrieval challenge tasks that demand innovations in both machine learning and information retrieval system research domains. As the first dataset that meets large, real and rich data requirements, MS MARCO Web Search paves the way for future advancements in AI and system research. The MS MARCO Web Search dataset is available at: https://github.com/microsoft/MS-MARCO-Web-Search.
Submitted 13 May, 2024;
originally announced May 2024.
Workgroup Mapping: Visual Analysis of Collaboration Culture
Authors:
Darren Edge,
Jonathan Larson,
Nikolay Trandev,
Neha Parikh Shah,
Carolyn Buractaon,
Nicholas Caurvina,
Nathan Evans,
Christopher M. White
Abstract:
The digital transformation of work presents new opportunities to understand how informal workgroups organize around the dynamic needs of organizations, potentially in contrast to the formal, static, and idealized hierarchies depicted by org charts. We present a design study that spans multiple enabling capabilities for the visual mapping and analysis of organizational workgroups, including metrics for quantifying two dimensions of collaboration culture: the fluidity of collaborative relationships (measured using network machine learning) and the freedom with which workgroups form across organizational boundaries. These capabilities come together to create a turnkey pipeline that combines the analysis of a target organization, the generation of data graphics and statistics, and their integration in a template-based presentation that enables narrative visualization of results. Our metrics and visuals have supported hundreds of presentations to executives of major US-based and multinational organizations, while our engineering practices have created an ensemble of standalone tools with broad relevance to visualization and visual analytics. We present our work as an example of applied visual analytics research, describing the design iterations that allowed us to move from experimentation to production, as well as the perspectives of the research team and the customer-facing team at each stage in this process.
Submitted 19 March, 2021; v1 submitted 1 May, 2020;
originally announced May 2020.