Search | arXiv e-print repository

arXiv:2405.20508 [pdf, other]

MyWeekInSight: Designing and Evaluating the Use of Visualization in Self-Management of Chronic Pain by Youth

Authors: Unma Desai, Haley Foladare, Katelynn E. Boerner, Tim F. Oberlander, Tamara Munzner, Karon E. MacLean

Abstract: A teenager's experience of chronic pain reverberates through multiple interacting aspects of their lives. To self-manage their symptoms, they need to understand how factors such as their sleep, social interactions, emotions and pain intersect; supporting this capability must underlie an effective personalized healthcare solution. While adult use of personal informatics for self-management of various health factors has been studied, solutions intended for adults are rarely workable for teens, who face this complex and confusing situation with unique perspectives, skills and contexts. In this design study, we explore a means of facilitating self-reflection by youth living with chronic pain, through visualization of their personal health data. In collaboration with pediatric chronic pain clinicians and a health-tech industry partner, we designed and deployed MyWeekInSight, a visualization-based self-reflection tool for youth with chronic pain. We discuss our staged design approach with this intersectionally vulnerable population, in which we balanced reliance on proxy users and data with feedback from youth viewing their own data. We report on extensive formative and in-situ evaluation, including a three-week clinical deployment, and present a framework of challenges and barriers faced in clinical deployment with mitigations that can aid fellow researchers. Our reflections on the design process yield principles, surprises, and open questions.

Submitted 30 May, 2024; originally announced May 2024.

arXiv:2405.05229 [pdf, other]

myAURA: Personalized health library for epilepsy management via knowledge graph sparsification and visualization

Authors: Rion Brattig Correia, Jordan C. Rozum, Leonard Cross, Jack Felag, Michael Gallant, Ziqi Guo, Bruce W. Herr II, Aehong Min, Deborah Stungis Rocha, Xuan Wang, Katy Börner, Wendy Miller, Luis M. Rocha

Abstract: Objective: We report the development of the patient-centered myAURA application and suite of methods designed to aid epilepsy patients, caregivers, and researchers in making decisions about care and self-management. Materials and Methods: myAURA rests on the federation of an unprecedented collection of heterogeneous data resources relevant to epilepsy, such as biomedical databases, social media, and electronic health records. A generalizable, open-source methodology was developed to compute a multi-layer knowledge graph linking all this heterogeneous data via the terms of a human-centered biomedical dictionary. Results: The power of the approach is first exemplified in the study of the drug-drug interaction phenomenon. Furthermore, we employ a novel network sparsification methodology using the metric backbone of weighted graphs, which reveals the most important edges for inference, recommendation, and visualization, such as pharmacology factors patients discuss on social media. The network sparsification approach also allows us to extract focused digital cohorts from social media whose discourse is more relevant to epilepsy or other biomedical problems. Finally, we present our patient-centered design and pilot-testing of myAURA, including its user interface, based on focus groups and other stakeholder input.
Discussion: The ability to search and explore myAURA's heterogeneous data sources via a sparsified multi-layer knowledge graph, as well as the combination of those layers in a single map, are useful features for integrating relevant information for epilepsy. Conclusion: Our stakeholder-driven, scalable approach to integrating traditional and non-traditional data sources enables biomedical discovery and data-powered patient self-management in epilepsy, and is generalizable to other chronic conditions.

Submitted 10 May, 2024; v1 submitted 8 May, 2024; originally announced May 2024.
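
Illustrative sketch (not from the paper): the abstract above mentions sparsifying a weighted graph via its metric backbone. Under the common definition, assumed here, an edge belongs to the metric backbone when its distance weight equals the shortest-path distance between its endpoints, so no indirect route is shorter. The function names and the toy graph below are hypothetical, and the myAURA implementation may differ.

```python
import heapq


def shortest_distances(adj, source):
    """Dijkstra: shortest-path distance from source to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def metric_backbone(edges):
    """Return the edges whose weight equals the shortest-path distance.

    edges: dict mapping (u, v) -> positive distance weight (undirected).
    Assumed definition: an edge is kept if no indirect path is shorter.
    """
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    backbone = set()
    for u in adj:
        dist = shortest_distances(adj, u)
        for v, w in adj[u].items():
            # dist[v] <= w always holds; equality means the edge is metric.
            if w <= dist[v] + 1e-12:
                backbone.add(tuple(sorted((u, v))))
    return backbone


# Hypothetical toy graph: A-C (3.0) is longer than the detour A-B-C (2.0),
# so it is dropped from the backbone.
toy_edges = {("A", "B"): 1.0, ("B", "C"): 1.0, ("A", "C"): 3.0}
toy_backbone = metric_backbone(toy_edges)
```

In the toy graph only A-B and B-C survive, which is the sense in which the backbone keeps the edges that matter for shortest-path inference.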

arXiv:2305.09925 [pdf, other]

A Scalable Method for Readable Tree Layouts

Authors: Kathryn Gray, Mingwei Li, Reyan Ahmed, Md. Khaledur Rahman, Ariful Azad, Stephen Kobourov, Katy Börner

Abstract: Large tree structures are ubiquitous and real-world relational datasets often have information associated with nodes (e.g., labels or other attributes) and edges (e.g., weights or distances) that need to be communicated to the viewers. Yet, scalable, easy-to-read tree layouts are difficult to achieve. We consider tree layouts to be readable if they meet some basic requirements: node labels should not overlap, edges should not cross, edge lengths should be preserved, and the output should be compact. There are many algorithms for drawing trees, although very few take node labels or edge lengths into account, and none optimizes all requirements above. With this in mind, we propose a new scalable method for readable tree layouts. The algorithm guarantees that the layout has no edge crossings and no label overlaps, and optimizes one of the remaining aspects: desired edge lengths and compactness. We evaluate the performance of the new algorithm by comparison with related earlier approaches using several real-world datasets, ranging from a few thousand nodes to hundreds of thousands of nodes. Tree layout algorithms can be used to visualize large general graphs, by extracting a hierarchy of progressively larger trees. We illustrate this functionality by presenting several map-like visualizations generated by the new tree layout algorithm.

Submitted 16 May, 2023; originally announced May 2023.

arXiv:2112.02159 [pdf]

doi 10.3389/frvir.2021.727344

Optimizing Performance and Satisfaction in Matching and Movement Tasks in Virtual Reality with Interventions Using the Data Visualization Literacy Framework

Authors: Andreas Bueckle, Kilian Buehling, Patrick C. Shih, Katy Borner

Abstract: Virtual reality (VR) has seen increased use for training and instruction. Designers can enable VR users to gain insights into their own performance by visualizing telemetry data from their actions in VR. Our ability to detect patterns and trends visually suggests the use of data visualization as a tool for users to identify strategies for improved performance. Typical tasks in VR training scenarios are manipulation of 3D objects (e.g., for learning how to maintain a jet engine) and navigation (e.g., to learn the geography of a building or landscape before traveling on-site). In this paper, we present the results of the RUI VR (84 subjects) and Luddy VR studies (68 subjects), where participants were divided into experiment and control cohorts. All subjects performed a series of tasks: 44 cube-matching tasks in RUI VR and 48 navigation tasks through a virtual building in Luddy VR (all divided into two sets). All Luddy VR subjects used VR gear; RUI VR subjects were divided across three setups: 2D Desktop (with laptop and mouse), VR Tabletop (in VR, sitting at a table), and VR Standup (in VR, standing). In an intervention called "Reflective phase," the experiment cohorts were presented with data visualizations, designed with the Data Visualization Literacy Framework (DVL-FW), of the data they generated during the first set of tasks before continuing to the second part of the study.
For Luddy VR, we found that experiment users had significantly faster completion times in their second trial (p = 0.014) while scoring higher in a mid-questionnaire about the virtual building (p = 0.009). For RUI VR, we found no significant differences for completion time and accuracy between the two cohorts in the VR setups; however, 2D Desktop subjects in the experiment cohort had significantly higher rotation accuracy as well as satisfaction (p(rotation) = 0.031, p(satisfaction) = 0.040).

Submitted 3 December, 2021; originally announced December 2021.

arXiv:2104.14281 [pdf]

doi 10.1177/20552076221089092

Leveraging Online Shopping Behaviors as a Proxy for Personal Lifestyle Choices: New Insights into Chronic Disease Prevention Literacy

Authors: Yongzhen Wang, Xiaozhong Liu, Katy Börner, Jun Lin, Yingnan Ju, Changlong Sun, Luo Si

Abstract: Objective: Ubiquitous internet access is reshaping the way we live, but it is accompanied by unprecedented challenges in preventing chronic diseases that are usually planted by long exposure to unhealthy lifestyles. This paper proposes leveraging online shopping behaviors as a proxy for personal lifestyle choices to improve chronic disease prevention literacy, targeted for times when e-commerce user experience has been assimilated into most people's everyday lives. Methods: Longitudinal query logs and purchase records from 15 million online shoppers were accessed, constructing a broad spectrum of lifestyle features covering various product categories and buyer personas. Using the lifestyle-related information preceding online shoppers' first purchases of specific prescription drugs, we could determine associations between their past lifestyle choices and whether they suffered from a particular chronic disease. Results: Novel lifestyle risk factors were discovered in two exemplars (depression and type 2 diabetes), most of which showed reasonable consistency with existing healthcare knowledge. Further, such empirical findings could be adopted to locate online shoppers at higher risk of these chronic diseases with decent accuracy [i.e., area under the receiver operating characteristic curve (AUC) = 0.68 for depression and AUC = 0.70 for type 2 diabetes], closely matching the performance of screening surveys benchmarked against medical diagnosis.
Conclusions: Mining online shopping behaviors can point medical experts to a series of lifestyle issues associated with chronic diseases that are less explored to date. Hopefully, unobtrusive chronic disease surveillance via e-commerce sites can grant consenting individuals a privilege to be connected more readily with the medical profession and sophistication.

Submitted 9 March, 2022; v1 submitted 29 April, 2021; originally announced April 2021.

Comments: 58 pages with appendices, 5 figures, 17 tables
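
Illustrative sketch (not from the paper): the abstract above reports accuracy as the area under the ROC curve (AUC). One standard way to read that number is as the probability that a randomly chosen positive case receives a higher risk score than a randomly chosen negative case, with ties counted as one half. The function name and the toy labels and scores below are hypothetical and unrelated to the paper's data.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: iterable of 0/1 ground-truth values.
    scores: iterable of predicted risk scores, same length as labels.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 3 positives, 3 negatives, one tied pair.
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.4, 0.5, 0.2, 0.7, 0.4]
toy_auc = roc_auc(toy_labels, toy_scores)  # 7.5 correctly ranked pairs out of 9
```

The pairwise loop is quadratic and only suits small examples; a rank-based formulation would be the usual choice for millions of shoppers.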

arXiv:2102.12030 [pdf]

doi 10.1371/journal.pone.0258103

3D Virtual Reality vs. 2D Desktop Registration User Interface Comparison

Authors: Andreas Bueckle, Kilian Buehling, Patrick C. Shih, Katy Borner

Abstract: Working with organs and extracted tissue blocks is an essential task in surgery and anatomy environments. To prepare specimens from human donors for analysis, wet-bench workers must dissect human tissue and collect metadata for downstream analysis, including information about the spatial origin of tissue. The Registration User Interface (RUI) was developed to allow stakeholders in the Human Biomolecular Atlas Program (HuBMAP) to register tissue blocks, i.e., to record the size, position, and orientation of human tissue data with regard to reference organs. In this paper, we compare three setups for registering one 3D tissue block object to another 3D reference organ (target) object. The first setup is a 2D Desktop implementation featuring a traditional screen, mouse, and keyboard interface. The remaining setups are both virtual reality (VR) versions of the RUI: VR Tabletop, where users sit at a physical desk which is replicated in virtual space; VR Standup, where users stand upright while performing their tasks. We then ran a user study for these three setups involving 42 human subjects completing 14 increasingly difficult and then 30 identical tasks in sequence and reporting position accuracy, rotation accuracy, completion time, and satisfaction.
While VR Tabletop and VR Standup users are about three times as fast and about a third more accurate in terms of rotation than 2D Desktop users (for the sequence of 30 identical tasks), there are no significant differences between the three setups for position accuracy when normalized by the height of the virtual kidney across setups.

Submitted 8 November, 2021; v1 submitted 23 February, 2021; originally announced February 2021.

Comments: 45 pages/9 figures in main text (incl. references); 6 pages with 2 tables in Supporting Information

Journal ref: PLOS ONE 16(10): e0258103 (2021)

arXiv:2008.06610 [pdf]

Key principles for workforce upskilling via online learning: a learning analytics study of a professional course in additive manufacturing

Authors: Kylie Peppler, Joey Huang, Michael C. Richey, Michael Ginda, Katy Börner, Haden Quinlan, A. John Hart

Abstract: Effective adoption of online platforms for teaching, learning, and skill development is essential to both academic institutions and workplaces. Adoption of online learning has been abruptly accelerated by the COVID-19 pandemic, drawing attention to research on pedagogy and practice for effective online instruction. Online learning requires a multitude of skills and resources spanning from learning management platforms to interactive assessment tools, combined with multimedia content, presenting challenges to instructors and organizations. This study focuses on ways that learning sciences and visual learning analytics can be used to design, and to improve, online workforce training in advanced manufacturing. Scholars and industry experts, educational researchers, and specialists in data analysis and visualization collaborated to study the performance of a cohort of 900 professionals enrolled in an online training course focused on additive manufacturing. The course was offered through MITxPro, a professional learning organization of MIT Open Learning, which hosts courses on a dedicated instance of the edX platform. This study combines learning objective analysis and visual learning analytics to examine the relationships among learning trajectories, engagement, and performance.
The results demonstrate how visual learning analytics was used for targeted course modification, and interpretation of learner engagement and performance, such as by more direct mapping of assessments to learning objectives, and to expected and actual time needed to complete each segment of the course. The study also emphasizes broader strategies for course designers and instructors to align course assignments, learning objectives, and assessment measures with learner needs and interests, and argues for a synchronized data infrastructure to facilitate effective just-in-time learning and continuous improvement of online courses.

Submitted 14 August, 2020; originally announced August 2020.

Comments: 36 pages, 5 figures

arXiv:2007.14474 [pdf]

Construction and Usage of a Human Body Common Coordinate Framework Comprising Clinical, Semantic, and Spatial Ontologies

Authors: Katy Börner, Ellen M. Quardokus, Bruce W. Herr II, Leonard E. Cross, Elizabeth G. Record, Yingnan Ju, Andreas D. Bueckle, James P. Sluka, Jonathan C. Silverstein, Kristen M. Browne, Sanjay Jain, Clive H. Wasserfall, Marda L. Jorgensen, Jeffrey M. Spraggins, Nathan H. Patterson, Mark A. Musen, Griffin M. Weber

Abstract: The National Institutes of Health's (NIH) Human Biomolecular Atlas Program (HuBMAP) aims to create a comprehensive high-resolution atlas of all the cells in the healthy human body. Multiple laboratories across the United States are collecting tissue specimens from different organs of donors who vary in sex, age, and body size. Integrating and harmonizing the data derived from these samples and 'mapping' them into a common three-dimensional (3D) space is a major challenge. The key to making this possible is a 'Common Coordinate Framework' (CCF), which provides a semantically annotated, 3D reference system for the entire body. The CCF enables contributors to HuBMAP to 'register' specimens and datasets within a common spatial reference system, and it supports a standardized way to query and 'explore' data in a spatially and semantically explicit manner. [...] This paper describes the construction and usage of a CCF for the human body and its reference implementation in HuBMAP. The CCF consists of (1) a CCF Clinical Ontology, which provides metadata about the specimen and donor (the 'who'); (2) a CCF Semantic Ontology, which describes 'what' part of the body a sample came from and details anatomical structures, cell types, and biomarkers (ASCT+B); and (3) a CCF Spatial Ontology, which indicates 'where' a tissue sample is located in a 3D coordinate system. An initial version of all three CCF ontologies has been implemented for the first HuBMAP Portal release.
It was successfully used by Tissue Mapping Centers to semantically annotate and spatially register 48 kidney and spleen tissue blocks. The blocks can be queried and explored in their clinical, semantic, and spatial context via the CCF user interface in the HuBMAP Portal.

Submitted 28 July, 2020; originally announced July 2020.

Comments: 24 pages with SI, 6 figures, 5 tables

arXiv:2006.13864 [pdf]

doi 10.1002/pra2.324

Community-Based Data Integration of Course and Job Data in Support of Personalized Career-Education Recommendations

Authors: Guoqing Zhu, Naga Anjaneyulu Kopalle, Yongzhen Wang, Xiaozhong Liu, Kemi Jona, Katy Börner

Abstract: How does your education impact your professional career? Ideally, the courses you take help you identify, get hired for, and perform the job you always wanted. However, not all courses provide skills that transfer to existing and future jobs; skill terms used in course descriptions might be different from those listed in job advertisements; and there might exist a considerable skill gap between what is taught in courses and what is needed for a job. In this study, we propose a novel method to integrate extensive course description and job advertisement data by leveraging heterogeneous data integration and community detection. The innovative heterogeneous graph approach along with identified skill communities enables cross-domain information recommendation, e.g., given an educational profile, job recommendations can be provided together with suggestions on education opportunities for re- and upskilling in support of lifelong learning.

Submitted 24 June, 2020; originally announced June 2020.

Comments: 6 pages, 1 figure, 2 tables

arXiv:2006.02366 [pdf, other]

doi 10.1371/journal.pone.0242984

Mapping the co-evolution of artificial intelligence, robotics, and the internet of things over 20 years (1998-2017)

Authors: Katy Börner, Olga Scrivner, Leonard E. Cross, Michael Gallant, Shutian Ma, Adam S. Martin, Elizabeth Record, Haici Yang, Jonathan M. Dilger

Abstract: Understanding the emergence, co-evolution, and convergence of science and technology (S&T) areas offers competitive intelligence for researchers, managers, policy makers, and others. The resulting data-driven decision support helps set proper research and development (R&D) priorities; develop future S&T investment strategies; monitor key authors, organizations, or countries; perform effective research program assessment; and implement cutting-edge education/training efforts. This paper presents new funding, publication, and scholarly network metrics and visualizations that were validated via expert surveys. The metrics and visualizations exemplify the emergence and convergence of three areas of strategic interest: artificial intelligence (AI), robotics, and internet of things (IoT) over the last 20 years (1998-2017). For 32,716 publications and 4,497 NSF awards, we identify their conceptual space (using the UCSD map of science), geospatial network, and co-evolution landscape. The findings demonstrate how the transition of knowledge (through cross-discipline publications and citations) and the emergence of new concepts (through term bursting) create a tangible potential for interdisciplinary research and new disciplines.

Submitted 3 June, 2020; originally announced June 2020.

Comments: 10 figures

arXiv:2002.03060 [pdf, other]

doi 10.2478/dim-2021-0006

Chinese E-Romance: Analyzing and Visualizing 7.92 Million Alibaba Valentine's Day Purchases

Authors: Yongzhen Wang, Xiaozhong Liu, Yingnan Ju, Katy Börner, Jun Lin, Changlong Sun, Luo Si

Abstract: The days that precede Valentine's Day are characterized by extensive gift shopping activities all across the globe. In China, where much shopping takes place online, there has been an explosive growth in e-commerce sales during Valentine's Day in recent years. This exploratory study investigates the extent to which each product category and each shopper group can exhibit romantic love within China's e-market throughout the 2 weeks leading up to Valentine's Day 2019. Massive data from Alibaba, the biggest e-commerce retailer worldwide, are utilized to formulate an innovative romance index (RI) to quantitatively measure e-romantic values for products and shoppers. On this basis, millions of shoppers, along with their millions of products purchased around Valentine's Day, are analyzed as a case study to demonstrate their love consumption and romantic gift-giving. The results of the analysis are then illustrated to help understand Chinese e-romance based on the perspectives of different product categories and shopper groups. This empirical information visualization also contributes to improving the segmentation, targeting, and positioning of China's e-market for Valentine's Day.

Submitted 20 August, 2021; v1 submitted 7 February, 2020; originally announced February 2020.

Comments: 14 pages, 3 figures, 3 tables

arXiv:1906.05996 [pdf, other]

Multi-level tree based approach for interactive graph visualization with semantic zoom

Authors: Felice De Luca, Iqbal Hossain, Kathryn Gray, Stephen Kobourov, Katy Börner

Abstract: Human subject studies show that map-like visualizations are as good or better than standard node-link representations of graphs, in terms of task performance, memorization and recall of the underlying data, and engagement [SSKB14, SSKB15]. With this in mind, we propose the Zoomable Multi-Level Tree (ZMLT) algorithm for multi-level tree-based, map-like visualization of large graphs. We propose seven desirable properties that such visualization should maintain and an algorithm that accomplishes them. (1) The abstract trees represent the underlying graph appropriately at different levels of detail; (2) The embedded trees represent the underlying graph appropriately at different levels of detail; (3) At every level of detail we show real vertices and real paths from the underlying graph; (4) If any node or edge appears in a given level, then it also appears in all deeper levels; (5) All nodes at the current level and higher levels are labeled and there are no label overlaps; (6) There are no edge crossings on any level; (7) The drawing area is proportional to the total area of the labels. This algorithm is implemented and we have a functional prototype for the interactive interface in a web browser.

Submitted 9 December, 2019; v1 submitted 13 June, 2019; originally announced June 2019.

arXiv:1702.01090 [pdf, other]

doi 10.1371/journal.pone.0184188

Multi-level computational methods for interdisciplinary research in the HathiTrust Digital Library

Authors: Jaimie Murdock, Colin Allen, Katy Börner, Robert Light, Simon McAlister, Andrew Ravenscroft, Robert Rose, Doori Rose, Jun Otsuka, David Bourget, John Lawrence, Chris Reed

Abstract: We show how faceted search using a combination of traditional classification systems and mixed-membership topic models can go beyond keyword search to inform resource discovery, hypothesis formulation, and argument extraction for interdisciplinary research. Our test domain is the history and philosophy of scientific work on animal mind and cognition. The methods can be generalized to other research areas and ultimately support a system for semi-automatic identification of argument structures. We provide a case study for the application of the methods to the problem of identifying and extracting arguments about anthropomorphism during a critical period in the development of comparative psychology. We show how a combination of classification systems and mixed-membership models trained over large digital libraries can inform resource discovery in this domain. Through a novel approach of "drill-down" topic modeling (simultaneously reducing both the size of the corpus and the unit of analysis), we are able to reduce a large collection of full-text volumes to a much smaller set of pages within six focal volumes containing arguments of interest to historians and philosophers of comparative psychology. The volumes identified in this way did not appear among the first ten results of the keyword search in the HathiTrust digital library and the pages bear the kind of "close reading" needed to generate original interpretations that is at the heart of scholarly work in the humanities.
Zooming back out, we provide a way to place the books onto a map of science originally constructed from very different data and for different purposes. The multilevel approach advances understanding of the intellectual and societal contexts in which writings are interpreted.

Submitted 7 June, 2017; v1 submitted 3 February, 2017; originally announced February 2017.

Comments: revised, 29 pages, 3 figures

arXiv:1605.05797 [pdf, other]

doi 10.1371/journal.pone.0159161

Analysis of Network Clustering Algorithms and Cluster Quality Metrics at Scale

Authors: Scott Emmons, Stephen Kobourov, Mike Gallant, Katy Börner

Abstract: Notions of community quality underlie network clustering. While studies surrounding network clustering are increasingly common, a precise understanding of the relationship between different cluster quality metrics is lacking. In this paper, we examine the relationship between stand-alone cluster quality metrics and information recovery metrics through a rigorous analysis of four widely used network clustering algorithms: Louvain, Infomap, label propagation, and smart local moving. We consider the stand-alone quality metrics of modularity, conductance, and coverage, and we consider the information recovery metrics of adjusted Rand score, normalized mutual information, and a variant of normalized mutual information used in previous work. Our study includes both synthetic graphs and empirical data sets of sizes varying from 1,000 to 1,000,000 nodes. We find significant differences among the results of the different cluster quality metrics. For example, clustering algorithms can return a value of 0.4 out of 1 on modularity but score 0 out of 1 on information recovery. We find conductance, though imperfect, to be the stand-alone quality metric that best indicates performance on information recovery metrics. Our study shows that the variant of normalized mutual information used in previous work cannot be assumed to differ only slightly from traditional normalized mutual information. Smart local moving is the best performing algorithm in our study, but discrepancies between cluster evaluation metrics prevent us from declaring it absolutely superior.
Louvain performed better than Infomap in nearly all the tests in our study, contradicting the results of previous work in which Infomap was superior to Louvain. We find that although label propagation performs poorly when clusters are less clearly defined, it scales efficiently and accurately to large graphs with well-defined clusters.

Submitted 3 August, 2016; v1 submitted 18 May, 2016; originally announced May 2016.

Journal ref: PLoS ONE 11(7): e0159161 (2016)
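
For readers unfamiliar with the stand-alone metrics named in this abstract, the following is a minimal sketch of the standard Newman-Girvan modularity for an undirected, unweighted graph. It is an illustration only, not the authors' evaluation code; the function name `modularity` and the input layout (an edge list plus a node-to-label dictionary) are assumptions made here.

```python
def modularity(edges, community):
    """Newman-Girvan modularity of a partition.

    edges: list of undirected (u, v) pairs, each edge listed once (assumed layout).
    community: dict mapping every node to its community label (assumed layout).
    """
    m = len(edges)
    internal = {}  # number of edges with both endpoints inside community c
    degree = {}    # sum of node degrees inside community c
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    # Q = sum over communities of (internal edge share) - (degree share)^2
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree.items())


# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "a" for node in range(6)}
q_split = modularity(edges, two_groups)
q_single = modularity(edges, one_group)
```

Splitting the graph into its two triangles gives a positive score, while placing every node in one community gives zero. A high modularity alone says nothing about whether a known ground-truth partition was recovered, which is the gap the abstract reports.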

arXiv:1503.03287 [pdf]

Modelling the Structure and Dynamics of Science Using Books

Authors: Michael Ginda, Andrea Scharnhorst, Katy Borner

Abstract: Scientific research is a major driving force in a knowledge-based economy. Income, health and wellbeing depend on scientific progress. The better we understand the inner workings of the scientific enterprise, the better we can prompt, manage, steer, and utilize scientific progress. Diverse indicators and approaches exist to evaluate and monitor research activities, from calculating the reputation of a researcher, institution, or country to analyzing and visualizing global brain circulation. However, there are very few predictive models of science that are used by key decision makers in academia, industry, or government interested in improving the quality and impact of scholarly efforts. We present a novel 'bibliographic bibliometric' analysis which we apply to a large collection of books relevant for the modelling of science. We explain the data collection together with the results of the data analyses and visualizations. In the final section we discuss how the analysis of books that describe different modelling approaches can inform the design of new models of science.

Submitted 11 March, 2015; originally announced March 2015.

Comments: data and large scale maps http://cns.iu.edu/2015-ModSci.html, Ginda, Michael, Andrea Scharnhorst, and Katy Börner. "Modelling Science". In Theories of Informetrics: A Festschrift in Honor of Blaise Cronin, edited by Sugimoto, Cassidy. Munich: De Gruyter Saur

arXiv:1404.1911 [pdf, other]

Node, Node-Link, and Node-Link-Group Diagrams: An Evaluation

Authors: Bahador Saket, Paolo Simonetto, Stephen Kobourov, Katy Borner

Abstract: Effectively showing the relationships between objects in a dataset is one of the main tasks in information visualization. Typically there is a well-defined notion of distance between pairs of objects, and traditional approaches such as principal component analysis or multi-dimensional scaling are used to place the objects as points in 2D space, so that similar objects are close to each other. In another typical setting, the dataset is visualized as a network graph, where related nodes are connected by links. More recently, datasets are also visualized as maps, where in addition to nodes and links, there is an explicit representation of groups and clusters. We consider these three techniques, characterized by a progressive increase of the amount of encoded information: node diagrams, node-link diagrams and node-link-group diagrams. We assess these three types of diagrams with a controlled experiment that covers nine different tasks falling broadly in three categories: node-based tasks, network-based tasks and group-based tasks. Our findings indicate that adding links, or links and group representations, does not negatively impact performance (time and accuracy) of node-based tasks. Similarly, adding group representations does not negatively impact the performance of network-based tasks. Node-link-group diagrams outperform the others on group-based tasks. These conclusions contradict results in other studies, in similar but subtly different settings. Taken together, however, such results can have significant implications for the design of standard and domain-specific visualization tools.

Submitted 7 April, 2014; originally announced April 2014.

arXiv:1304.1067 [pdf, other]

Collective allocation of science funding: from funding agencies to scientific agency

Authors: Johan Bollen, David Crandall, Damion Junk, Ying Ding, Katy Boerner

Abstract: Public agencies like the U.S. National Science Foundation (NSF) and the National Institutes of Health (NIH) award tens of billions of dollars in annual science funding. How can this money be distributed as efficiently as possible to best promote scientific innovation and productivity? The present system relies primarily on peer review of project proposals. In 2010 alone, NSF convened more than 15,000 scientists to review 55,542 proposals. Although considered the scientific gold standard, peer review requires significant overhead costs, and may be subject to biases, inconsistencies, and oversights. We investigate a class of funding models in which all participants receive an equal portion of yearly funding, but are then required to anonymously donate a fraction of their funding to peers. The funding thus flows from one participant to the next, each acting as if he or she were a funding agency themselves. Here we show through a simulation conducted over large-scale citation data (37M articles, 770M citations) that such a distributed system for science may yield funding patterns similar to existing NIH and NSF distributions, but may do so at much lower overhead while exhibiting a range of other desirable features. Self-correcting mechanisms in scientific peer evaluation can yield an efficient and fair distribution of funding. The proposed model can be applied in many situations in which top-down or bottom-up allocation of public resources is either impractical or undesirable, e.g. public investments, distribution chains, and shared resource management.

Submitted 3 April, 2013; originally announced April 2013.

Comments: main paper: 7 pages, excl. references + supplemental materials (9), 4 figures
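
The funding mechanism described in this abstract (an equal yearly base amount, followed by a mandatory donation of a fixed fraction to peers) can be sketched as a small iteration. This is a toy illustration, not the authors' simulation: the function name `simulate_funding`, the `weights` matrix that says how each participant splits a donation, and the rule that each year's donation is a fraction of the previous year's income are all assumptions made here. The paper derives donation patterns from citation data instead.

```python
def simulate_funding(weights, base=100.0, fraction=0.5, years=1):
    """Toy model of collective funding allocation.

    weights[i][j]: share of participant i's donation given to peer j
                   (hypothetical input; each row sums to 1, weights[i][i] == 0).
    base:          equal amount every participant receives each year.
    fraction:      share of last year's income that must be donated.
    Returns each participant's income (base + donations) in the final year.
    """
    n = len(weights)
    income = [base] * n  # year 0: everyone holds only the base amount
    for _ in range(years):
        donated = [fraction * x for x in income]
        received = [sum(donated[i] * weights[i][j] for i in range(n))
                    for j in range(n)]
        income = [base + r for r in received]
    return income


# Participants 1 and 2 give everything to 0; participant 0 splits evenly.
weights = [
    [0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
]
after_one_year = simulate_funding(weights, years=1)
symmetric = simulate_funding([[0.0, 1.0], [1.0, 0.0]], years=2)
```

In this toy setup the participant favoured by both peers ends the first year with more funding than the others, while two participants who simply fund each other stay equal.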

arXiv:1301.5177 [pdf]

"Seed+Expand": A validated methodology for creating high quality publication oeuvres of individual researchers

Authors: Linda Reijnhoudt, Rodrigo Costas, Ed Noyons, Katy Boerner, Andrea Scharnhorst

Abstract: The study of science at the individual micro-level frequently requires the disambiguation of author names. The creation of authors' publication oeuvres involves matching the list of unique author names to names used in publication databases. Despite recent progress in the development of unique author identifiers, e.g., ORCID, VIVO, or DAI, author disambiguation remains a key problem when it comes to large-scale bibliometric analysis using data from multiple databases. This study introduces and validates a new methodology called seed+expand for semi-automatic bibliographic data collection for a given set of individual authors. Specifically, we identify the oeuvre of a set of Dutch full professors during the period 1980-2011. In particular, we combine author records from the National Research Information System (NARCIS) with publication records from the Web of Science. Starting with an initial list of 8,378 names, we identify "seed publications" for each author using five different approaches. Subsequently, we "expand" the set of publications using three different approaches. The different approaches are compared and resulting oeuvres are evaluated on precision and recall using a "gold standard" dataset of authors for which verified publications in the period 2001-2010 are available.

Submitted 22 April, 2013; v1 submitted 22 January, 2013; originally announced January 2013.

Comments: Paper accepted for the ISSI 2013, small changes in the text due to referee comments, one figure added (Fig 3)

ACM Class: H.3.3; H.3.7; J.1
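
The precision and recall evaluation mentioned in this abstract can be illustrated with a short sketch. It assumes each oeuvre is a set of publication identifiers; the function name `precision_recall` and the sample identifiers are hypothetical and do not come from the paper.

```python
def precision_recall(retrieved, verified):
    """Compare a retrieved oeuvre against a verified ("gold standard") one.

    retrieved: publication IDs attributed to the author by the method.
    verified:  publication IDs confirmed to belong to the author.
    """
    retrieved, verified = set(retrieved), set(verified)
    true_positives = len(retrieved & verified)
    # Precision: share of retrieved publications that are correct.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    # Recall: share of verified publications that were found.
    recall = true_positives / len(verified) if verified else 0.0
    return precision, recall


# Hypothetical IDs: 2 of the 4 retrieved items appear in the 3 verified ones.
precision, recall = precision_recall(
    retrieved={"p1", "p2", "p3", "p4"},
    verified={"p2", "p3", "p5"},
)
```

A method that retrieves more candidate publications tends to raise recall at the cost of precision, which is why the two are reported together.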

arXiv:1210.1480 [pdf, other]

doi 10.1140/epjst/e2012-01692-1

Theoretical And Technological Building Blocks For An Innovation Accelerator

Authors: Frank van Harmelen, George Kampis, Katy Borner, Peter van den Besselaar, Erik Schultes, Carole Goble, Paul Groth, Barend Mons, Stuart Anderson, Stefan Decker, Conor Hayes, Thierry Buecheler, Dirk Helbing

Abstract: The scientific system that we use today was devised centuries ago and is inadequate for our current ICT-based society: the peer review system encourages conservatism, journal publications are monolithic and slow, data is often not available to other scientists, and the independent validation of results is limited. Building on the Innovation Accelerator paper by Helbing and Balietti (2011) this paper takes the initial global vision and reviews the theoretical and technological building blocks that can be used for implementing an innovation (in first place: science) accelerator platform driven by re-imagining the science system. The envisioned platform would rest on four pillars: (i) Redesign the incentive scheme to reduce behavior such as conservatism, herding and hyping; (ii) Advance scientific publications by breaking up the monolithic paper unit and introducing other building blocks such as data, tools, experiment workflows, resources; (iii) Use machine readable semantics for publications, debate structures, provenance etc. in order to include the computer as a partner in the scientific process, and (iv) Build an online platform for collaboration, including a network of trust and reputation among the different types of stakeholders in the scientific system: scientists, educators, funding agencies, policy makers, students and industrial innovators among others. Any such improvements to the scientific system must support the entire scientific process (unlike current tools that chop up the scientific process into disconnected pieces), must facilitate and encourage collaboration and interdisciplinarity (again unlike current tools), must facilitate the inclusion of intelligent computing in the scientific process, must facilitate not only the core scientific process, but also accommodate other stakeholders such as science policy makers, industrial innovators, and the general public.

Submitted 4 October, 2012; originally announced October 2012.

arXiv:0903.3562 [pdf]

Visual Conceptualizations and Models of Science

Authors: Katy Boerner, Andrea Scharnhorst

Abstract: This Journal of Informetrics special issue aims to improve our understanding of the structure and dynamics of science by reviewing and advancing existing conceptualizations and models of scholarly activity. Several of these conceptualizations and models have visual manifestations supporting the combination and comparison of theories and approaches developed in different disciplines of science. Subsequently, we discuss challenges towards a theoretically grounded and practically useful science of science and provide a brief chronological review of relevant work. Then, we exemplarily present three conceptualizations of science that attempt to provide frameworks for the comparison and combination of existing approaches, theories, laws, and measurements. Finally, we discuss the contributions of and interlinkages among the eight papers included in this issue. Each paper makes a unique contribution towards conceptualizations and models of science and roots this contribution in a review and comparison with existing work.

Submitted 20 March, 2009; originally announced March 2009.

Comments: Guest Editor's Introduction to the 2009 Journal of Informetrics Special Issue on the Science of Science

arXiv:cs/0512085 [pdf]

Analyzing and Visualizing the Semantic Coverage of Wikipedia and Its Authors

Authors: Todd Holloway, Miran Bozicevic, Katy Börner

Abstract: This paper presents a novel analysis and visualization of English Wikipedia data. Our specific interest is the analysis of basic statistics, the identification of the semantic structure and age of the categories in this free online encyclopedia, and the content coverage of its highly productive authors. The paper starts with an introduction of Wikipedia and a review of related work. We then introduce a suite of measures and approaches to analyze and map the semantic structure of Wikipedia. The results show that co-occurrences of categories within individual articles have a power-law distribution, and when mapped reveal the nicely clustered semantic structure of Wikipedia. The results also reveal the content coverage of the article's authors, although the roles these authors play are as varied as the authors themselves. We conclude with a discussion of major results and planned future work.

Submitted 21 December, 2005; originally announced December 2005.

arXiv:cs/0402029 [pdf]

doi 10.1073/pnas.0307626100

Mapping Topics and Topic Bursts in PNAS

Authors: Ketan Mane, Katy Börner

Abstract: Scientific research is highly dynamic. New areas of science continually evolve; others gain or lose importance, merge or split. Due to the steady increase in the number of scientific publications it is hard to keep an overview of the structure and dynamic development of one's own field of science, much less all scientific domains. However, knowledge of hot topics, emergent research frontiers, or change of focus in certain areas is a critical component of resource allocation decisions in research labs, governmental institutions, and corporations. This paper demonstrates the utilization of Kleinberg's burst detection algorithm, co-word occurrence analysis, and graph layout techniques to generate maps that support the identification of major research topics and trends. The approach was applied to analyze and map the complete set of papers published in the Proceedings of the National Academy of Sciences (PNAS) in the years 1982-2001. Six domain experts examined and commented on the resulting maps in an attempt to reconstruct the evolution of major research areas covered by PNAS.

Submitted 13 February, 2004; originally announced February 2004.

ACM Class: H.3.3; H.1.2

Showing 1–22 of 22 results for author: Börner, K