Showing 1–50 of 104 results for author: Nelson, L

  1. arXiv:2406.05933  [pdf, other]

    cs.CR

    A Relevance Model for Threat-Centric Ranking of Cybersecurity Vulnerabilities

    Authors: Corren McCoy, Ross Gore, Michael L. Nelson, Michele C. Weigle

    Abstract: The relentless process of tracking and remediating vulnerabilities is a top concern for cybersecurity professionals. The key challenge is trying to identify a remediation scheme specific to in-house, organizational objectives. Without a strategy, the result is a patchwork of fixes applied to a tide of vulnerabilities, any one of which could be the point of failure in an otherwise formidable defens…

    Submitted 9 June, 2024; originally announced June 2024.

    Comments: 24 pages, 8 figures, 14 tables

    ACM Class: K.6.5

  2. arXiv:2406.02268  [pdf, other]

    cs.LG

    Analyzing the Benefits of Prototypes for Semi-Supervised Category Learning

    Authors: Liyi Zhang, Logan Nelson, Thomas L. Griffiths

    Abstract: Categories can be represented at different levels of abstraction, from prototypes focused on the most typical members to remembering all observed exemplars of the category. These representations have been explored in the context of supervised learning, where stimuli are presented with known category labels. We examine the benefits of prototype-based representations in a less-studied domain: semi-s…

    Submitted 4 June, 2024; originally announced June 2024.

    Comments: 7 pages, 3 figures

    ACM Class: I.2; I.5

  3. arXiv:2401.04887  [pdf, other]

    cs.DL

    Cited But Not Archived: Analyzing the Status of Code References in Scholarly Articles

    Authors: Emily Escamilla, Martin Klein, Talya Cooper, Vicky Rampin, Michele C. Weigle, Michael L. Nelson

    Abstract: One in five arXiv articles published in 2021 contained a URI to a Git Hosting Platform (GHP), which demonstrates the growing prevalence of GHP URIs in scholarly publications. However, GHP URIs are vulnerable to the same reference rot that plagues the Web at large. The disappearance of software hosting platforms, like Gitorious and Google Code, and the source code they contain threatens research re…

    Submitted 9 January, 2024; originally announced January 2024.

  4. arXiv:2312.03330  [pdf, other]

    cs.CL cs.CY cs.LG

    Measuring Misogyny in Natural Language Generation: Preliminary Results from a Case Study on two Reddit Communities

    Authors: Aaron J. Snoswell, Lucinda Nelson, Hao Xue, Flora D. Salim, Nicolas Suzor, Jean Burgess

    Abstract: Generic `toxicity' classifiers continue to be used for evaluating the potential for harm in natural language generation, despite mounting evidence of their shortcomings. We consider the challenge of measuring misogyny in natural language generation, and argue that generic `toxicity' classifiers are inadequate for this task. We use data from two well-characterised `Incel' communities on Reddit that…

    Submitted 6 December, 2023; originally announced December 2023.

    Comments: This extended abstract was presented at the Generation, Evaluation and Metrics workshop at Empirical Methods in Natural Language Processing in 2023 (GEM@EMNLP 2023) in Singapore

  5. Automatic Digitization and Orientation of Scanned Mesh Data for Floor Plan and 3D Model Generation

    Authors: Ritesh Sharma, Eric Bier, Lester Nelson, Mahabir Bhandari, Niraj Kunwar

    Abstract: This paper describes a novel approach for generating accurate floor plans and 3D models of building interiors using scanned mesh data. Unlike previous methods, which begin with a high resolution point cloud from a laser range-finder, our approach begins with triangle mesh data, as from a Microsoft HoloLens. It generates two types of floor plans, a "pen-and-ink" style that preserves details and a d…

    Submitted 14 November, 2023; originally announced November 2023.

    Comments: Computer Graphics International 2023

  6. arXiv:2309.10919  [pdf, other]

    cs.RO

    Empirical Study of Ground Proximity Effects for Small-scale Electroaerodynamic Thrusters

    Authors: Grant Nations, C. Luke Nelson, Daniel S. Drew

    Abstract: Electroaerodynamic (EAD) propulsion, where thrust is produced by collisions between electrostatically-accelerated ions and neutral air, is a potentially transformative method for indoor flight owing to its silent and solid-state nature. Like rotors, EAD thrusters exhibit changes in performance based on proximity to surfaces. Unlike rotors, they have no fragile and quickly spinning parts that have…

    Submitted 19 September, 2023; originally announced September 2023.

  7. arXiv:2308.09840  [pdf, other]

    cs.RO

    High Aspect Ratio Multi-stage Ducted Electroaerodynamic Thrusters for Micro Air Vehicle Propulsion

    Authors: C. Luke Nelson, Daniel S. Drew

    Abstract: Electroaerodynamic propulsion, where force is produced through collisions between electrostatically accelerated ions and neutral air molecules, is an attractive alternative to propeller- and flapping wing-based methods for micro air vehicle (MAV) flight due to its silent and solid-state nature. One major barrier to adoption is its limited thrust efficiency at useful disk loading levels. Ducted act…

    Submitted 18 August, 2023; originally announced August 2023.

  8. arXiv:2307.14469  [pdf, other]

    cs.DL

    It's Not Just GitHub: Identifying Data and Software Sources Included in Publications

    Authors: Emily Escamilla, Lamia Salsabil, Martin Klein, Jian Wu, Michele C. Weigle, Michael L. Nelson

    Abstract: Paper publications are no longer the only form of research product. Due to recent initiatives by publication venues and funding institutions, open access datasets and software products are increasingly considered research products and URIs to these products are growing more prevalent in scholarly publications. However, as with all URIs, resources found on the live Web are not permanent. Archivists…

    Submitted 26 July, 2023; originally announced July 2023.

    Comments: 13 pages, 7 figures, pre-print of publication for Theory and Practice of Digital Libraries 2023

  9. arXiv:2306.08236  [pdf, other]

    cs.IR

    Extracting Information from Twitter Screenshots

    Authors: Tarannum Zaki, Michael L. Nelson, Michele C. Weigle

    Abstract: Screenshots are prevalent on social media as a common approach for information sharing. Users rarely verify before sharing a screenshot whether the post it contains is fake or real. Information sharing through fake screenshots can be highly responsible for misinformation and disinformation spread on social media. Our ultimate goal is to develop a tool that could take a screenshot of a tweet and pr…

    Submitted 14 June, 2023; originally announced June 2023.

  10. arXiv:2305.01071  [pdf, other]

    cs.DL

    Right HTML, Wrong JSON: Challenges in Replaying Archived Webpages Built with Client-Side Rendering

    Authors: Michele C. Weigle, Michael L. Nelson, Sawood Alam, Mark Graham

    Abstract: Many web sites are transitioning how they construct their pages. The conventional model is where the content is embedded server-side in the HTML and returned to the client in an HTTP response. Increasingly, sites are moving to a model where the initial HTTP response contains only an HTML skeleton plus JavaScript that makes API calls to a variety of servers for the content (typically in JSON format…

    Submitted 1 May, 2023; originally announced May 2023.

    Comments: 20 pages, preprint version of paper accepted at the 2023 ACM/IEEE Joint Conference on Digital Libraries (JCDL)

  11. arXiv:2305.00546  [pdf, other]

    cs.IR cs.DL

    Making Changes in Webpages Discoverable: A Change-Text Search Interface for Web Archives

    Authors: Lesley Frew, Michael L. Nelson, Michele C. Weigle

    Abstract: Webpages change over time, and web archives hold copies of historical versions of webpages. Users of web archives, such as journalists, want to find and view changes on webpages over time. However, the current search interfaces for web archives do not support this task. For the web archives that include a full-text search feature, multiple versions of the same webpage that match the search query a…

    Submitted 30 April, 2023; originally announced May 2023.

    Comments: In Proceedings of JCDL 2023; 20 pages, 11 figures, 2 tables

    ACM Class: H.3.3; H.3.7

  12. arXiv:2212.05322  [pdf]

    cs.SI cs.CR

    Twitter DM Videos Are Accessible to Unauthenticated Users

    Authors: Michael L. Nelson

    Abstract: Videos shared in Twitter Direct Messages (DMs) have opaque URLs based on hashes of their content, but are otherwise available to unauthenticated HTTP users. These DM video URLs are thus hard to guess, but if they were somehow discovered, they are available to any user, including users without Twitter credentials (i.e., twitter.com specific HTTP Cookie or Authorization request headers). This includ…

    Submitted 22 December, 2022; v1 submitted 10 December, 2022; originally announced December 2022.

    Comments: 22 pages, 7 figures, v2 adds "available this way since 2016" and "http/https" discussion

    ACM Class: H.3.5

  13. arXiv:2212.00760  [pdf, other]

    cs.NI cs.DL

    Caching HTTP 404 Responses Eliminates Unnecessary Archival Replay Requests

    Authors: Kritika Garg, Himarsha R. Jayanetti, Sawood Alam, Michele C. Weigle, Michael L. Nelson

    Abstract: Upon replay, JavaScript on archived web pages can generate recurring HTTP requests that lead to unnecessary traffic to the web archive. In one example, an archived page averaged more than 1000 requests per minute. These requests are not visible to the user, so if a user leaves such an archived page open in a browser tab, they would be unaware that their browser is continuing to generate traffic to…

    Submitted 1 December, 2022; originally announced December 2022.

  14. arXiv:2211.09681  [pdf, other]

    cs.IR

    Did They Really Tweet That? Querying Fact-Checking Sites and Politwoops to Determine Tweet Misattribution

    Authors: Caleb Bradford, Michael L. Nelson

    Abstract: Screenshots of social media posts have become common place on social media sites. While screenshots definitely serve a purpose, their ubiquity enables the spread of fabricated screenshots of posts that were never actually made, thereby proliferating misattribution disinformation. With the motivation of detecting this type of disinformation, we researched developing methods of querying the Web for…

    Submitted 17 November, 2022; originally announced November 2022.

    Comments: 20 pages

  15. arXiv:2211.02188  [pdf, other]

    cs.DL

    Web Archiving as Entertainment

    Authors: Travis Reid, Michael L. Nelson, Michele C. Weigle

    Abstract: We want to make web archiving entertaining so that it can be enjoyed like a spectator sport. To this end, we have been working on a proof of concept that involves gamification of the web archiving process and integrating video games and web archiving. Our vision for this proof of concept involves a web archiving live stream and a gaming live stream. We are creating web archiving live streams that…

    Submitted 3 November, 2022; originally announced November 2022.

    Comments: This is an extended version of a paper from ICADL 2022. 20 pages and 10 figures

  16. arXiv:2209.08649  [pdf, other]

    cs.DL

    Creating Structure in Web Archives With Collections: Different Concepts From Web Archivists

    Authors: Himarsha R. Jayanetti, Shawn M. Jones, Martin Klein, Alex Osbourne, Paul Koerbin, Michael L. Nelson, Michele C. Weigle

    Abstract: As web archives' holdings grow, archivists subdivide them into collections so they are easier to understand and manage. In this work, we review the collection structures of eight web archive platforms: Archive-It, Conifer, the Croatian Web Archive (HAW), the Internet Archive's user account web archives, Library of Congress (LC), PANDORA, Trove, and the UK Web Archive (UKWA). We note a plethora o…

    Submitted 18 September, 2022; originally announced September 2022.

    Comments: 5 figures, 16 pages, accepted for publication at TPDL 2022

  17. Robots Still Outnumber Humans in Web Archives, But Less Than Before

    Authors: Himarsha R. Jayanetti, Kritika Garg, Sawood Alam, Michael L. Nelson, Michele C. Weigle

    Abstract: To identify robots and humans and analyze their respective access patterns, we used the Internet Archive's (IA) Wayback Machine access logs from 2012 and 2019, as well as Arquivo.pt's (Portuguese Web Archive) access logs from 2019. We identified user sessions in the access logs and classified those sessions as human or robot based on their browsing behavior. To better understand how users navigate…

    Submitted 26 August, 2022; originally announced August 2022.

  18. arXiv:2208.04895  [pdf, other]

    cs.DL

    The Rise of GitHub in Scholarly Publications

    Authors: Emily Escamilla, Martin Klein, Talya Cooper, Vicky Rampin, Michele C. Weigle, Michael L. Nelson

    Abstract: The definition of scholarly content has expanded to include the data and source code that contribute to a publication. While major archiving efforts to preserve conventional scholarly content, typically in PDFs (e.g., LOCKSS, CLOCKSS, Portico), are underway, no analogous effort has yet emerged to preserve the data and code referenced in those PDFs, particularly the scholarly code hosted online on…

    Submitted 9 August, 2022; originally announced August 2022.

    Comments: 4 figures, 15 pages, accepted for publication at TPDL 2022

  19. arXiv:2108.12092  [pdf, other]

    cs.DL

    Replaying Archived Twitter: When your bird is broken, will it bring you down?

    Authors: Kritika Garg, Himarsha R. Jayanetti, Sawood Alam, Michele C. Weigle, Michael L. Nelson

    Abstract: Historians and researchers trust web archives to preserve social media content that no longer exists on the live web. However, what we see on the live web and how it is replayed in the archive are not always the same. In this paper, we document and analyze the problems in archiving Twitter ever since Twitter forced the use of its new UI in June 2020. Most web archives were unable to archive the ne…

    Submitted 26 August, 2021; originally announced August 2021.

  20. arXiv:2108.05939  [pdf, other]

    cs.DL

    Where Did the Web Archive Go?

    Authors: Mohamed Aturban, Michael L. Nelson, Michele C. Weigle

    Abstract: To perform a longitudinal investigation of web archives and detecting variations and changes replaying individual archived pages, or mementos, we created a sample of 16,627 mementos from 17 public web archives. Over the course of our 14-month study (November, 2017 - January, 2019), we found that four web archives changed their base URIs and did not leave a machine-readable method of locating their…

    Submitted 12 August, 2021; originally announced August 2021.

    Comments: 18 pages

  21. arXiv:2108.03311  [pdf, other]

    cs.DL cs.IR

    Profiling Web Archival Voids for Memento Routing

    Authors: Sawood Alam, Michele C. Weigle, Michael L. Nelson

    Abstract: Prior work on web archive profiling were focused on Archival Holdings to describe what is present in an archive. This work defines and explores Archival Voids to establish a means to represent portions of URI spaces that are not present in a web archive. Archival Holdings and Archival Voids profiles can work independently or as complements to each other to maximize the Accuracy of Memento Aggregat…

    Submitted 6 August, 2021; originally announced August 2021.

    Comments: Accepted in JCDL 2021 (10 pages, 7 figures, 7 tables)

  22. arXiv:2107.02680  [pdf, other]

    cs.DL

    Garbage, Glitter, or Gold: Assigning Multi-dimensional Quality Scores to Social Media Seeds for Web Archive Collections

    Authors: Alexander C. Nwala, Michele C. Weigle, Michael L. Nelson

    Abstract: From popular uprisings to pandemics, the Web is an essential source consulted by scientists and historians for reconstructing and studying past events. Unfortunately, the Web is plagued by reference rot which causes important Web resources to disappear. Web archive collections help reduce the costly effects of reference rot by saving Web resources that chronicle important stories/events before the…

    Submitted 6 July, 2021; originally announced July 2021.

    Comments: This is an extended version of the ACM/IEEE Joint Conference on Digital Libraries (JCDL2021) paper

  23. arXiv:2104.14041  [pdf, other]

    cs.DL

    What Did It Look Like: A service for creating website timelapses using the Memento framework

    Authors: Dhruv Patel, Alexander C. Nwala, Michael L. Nelson, Michele C. Weigle

    Abstract: Popular web pages are archived frequently, which makes it difficult to visualize the progression of the site through the years at web archives. The What Did It Look Like (WDILL) Twitter bot shows web page transitions by creating a timelapse of a given website using one archived copy from each calendar year. Originally implemented in 2015, we recently added new features to WDILL, such as date range…

    Submitted 28 April, 2021; originally announced April 2021.

    Comments: 11 pages

  24. It's All About The Cards: Sharing on Social Media Probably Encouraged HTML Metadata Growth

    Authors: Shawn M. Jones, Valentina Neblitt-Jones, Michele C. Weigle, Martin Klein, Michael L. Nelson

    Abstract: In a perfect world, all articles consistently contain sufficient metadata to describe the resource. We know this is not the reality, so we are motivated to investigate the evolution of the metadata that is present when authors and publishers supply their own. Because applying metadata takes time, we recognize that each news article author has a limited metadata budget with which to spend their tim…

    Submitted 8 April, 2021; originally announced April 2021.

    Comments: 10 pages, 10 figures, 3 tables

  25. Automatically Selecting Striking Images for Social Cards

    Authors: Shawn M. Jones, Michele C. Weigle, Martin Klein, Michael L. Nelson

    Abstract: To allow previewing a web page, social media platforms have developed social cards: visualizations consisting of vital information about the underlying resource. At a minimum, social cards often include features such as the web resource's title, text summary, striking image, and domain name. News and scholarly articles on the web are frequently subject to social card creation when being shared on…

    Submitted 8 March, 2021; originally announced March 2021.

    Comments: 10 pages, 5 figures, 10 tables

  26. Mixture Model Framework for Traumatic Brain Injury Prognosis Using Heterogeneous Clinical and Outcome Data

    Authors: Alan D. Kaplan, Qi Cheng, K. Aditya Mohan, Lindsay D. Nelson, Sonia Jain, Harvey Levin, Abel Torres-Espin, Austin Chou, J. Russell Huie, Adam R. Ferguson, Michael McCrea, Joseph Giacino, Shivshankar Sundaram, Amy J. Markowitz, Geoffrey T. Manley

    Abstract: Prognoses of Traumatic Brain Injury (TBI) outcomes are neither easily nor accurately determined from clinical indicators. This is due in part to the heterogeneity of damage inflicted to the brain, ultimately resulting in diverse and complex outcomes. Using a data-driven approach on many distinct data elements may be necessary to describe this large set of outcomes and thereby robustly depict the n…

    Submitted 20 July, 2021; v1 submitted 22 December, 2020; originally announced December 2020.

    Comments: 12 pages, 5 figures

  27. Modeling Updates of Scholarly Webpages Using Archived Data

    Authors: Yasith Jayawardana, Alexander C. Nwala, Gavindya Jayawardena, Jian Wu, Sampath Jayarathna, Michael L. Nelson, C. Lee Giles

    Abstract: The vastness of the web imposes a prohibitive cost on building large-scale search engines with limited resources. Crawl frontiers thus need to be optimized to improve the coverage and freshness of crawled content. In this paper, we propose an approach for modeling the dynamics of change in the web using archived copies of webpages. To evaluate its utility, we conduct a preliminary study on the sch…

    Submitted 6 December, 2020; originally announced December 2020.

    Comments: 12 pages, 2 appendix pages, 18 figures, to be published in Proceedings of IEEE Big Data 2020 - 5th Computational Archival Science (CAS) Workshop

  28. arXiv:2008.11680  [pdf, other]

    cs.DL

    A 25 Year Retrospective on D-Lib Magazine

    Authors: Michael L. Nelson, Herbert Van de Sompel

    Abstract: In July, 1995 the first issue of D-Lib Magazine was published as an on-line, HTML-only, open access magazine, serving as the focal point for the then emerging digital library research community. In 2017 it ceased publication, in part due to the maturity of the community it served as well as the increasing availability of and competition from eprints, institutional repositories, conferences, social…

    Submitted 27 August, 2020; v1 submitted 26 August, 2020; originally announced August 2020.

    Comments: 44 pages, 29 figures. Minor fixes

    ACM Class: H.3.7

  29. arXiv:2008.00139  [pdf, other]

    cs.DL cs.HC cs.IR

    SHARI -- An Integration of Tools to Visualize the Story of the Day

    Authors: Shawn M. Jones, Alexander C. Nwala, Martin Klein, Michele C. Weigle, Michael L. Nelson

    Abstract: Tools such as Google News and Flipboard exist to convey daily news, but what about the past? In this paper, we describe how to combine several existing tools with web archive holdings to perform news analysis and visualization of the "biggest story" for a given date. StoryGraph clusters news articles together to identify a common news story. Hypercane leverages ArchiveNow to store URLs produced by…

    Submitted 31 July, 2020; originally announced August 2020.

    Comments: 19 pages, 16 figures, 1 Table

    ACM Class: H.3.7; H.3.6; H.3.4

    Journal ref: Presented at the Web Archiving and Digital Libraries 2020 Workshop

  30. arXiv:2008.00137  [pdf, other]

    cs.DL cs.HC cs.IR

    MementoEmbed and Raintale for Web Archive Storytelling

    Authors: Shawn M. Jones, Martin Klein, Michele C. Weigle, Michael L. Nelson

    Abstract: For traditional library collections, archivists can select a representative sample from a collection and display it in a featured physical or digital library space. Web archive collections may consist of thousands of archived pages, or mementos. How should an archivist display this sample to drive visitors to their collection? Search engines and social media platforms often represent web pages as…

    Submitted 31 July, 2020; originally announced August 2020.

    Comments: 54 pages, 5 tables, 46 figures

    ACM Class: H.3.7; H.3.6; H.3.4

    Journal ref: Presented at the Web Archiving and Digital Libraries 2020 Workshop

  31. arXiv:2006.02487  [pdf, other]

    cs.DL

    Visualizing Webpage Changes Over Time

    Authors: Abigail Mabe, Dhruv Patel, Maheedhar Gunnam, Surbhi Shankar, Mat Kelly, Sawood Alam, Michael L. Nelson, Michele C. Weigle

    Abstract: We report on the development of TMVis, a web service to provide visualizations of how individual webpages have changed over time. We leverage past research on summarizing collections of webpages with thumbnail-sized screenshots and on choosing a small number of representative past archived webpages from a large collection. We offer four visualizations: image grid, image slider, timeline, and anima…

    Submitted 3 June, 2020; originally announced June 2020.

    Comments: 13 pages

  32. arXiv:2003.09989  [pdf, other]

    cs.IR cs.CL cs.SI

    365 Dots in 2019: Quantifying Attention of News Sources

    Authors: Alexander C. Nwala, Michele C. Weigle, Michael L. Nelson

    Abstract: We investigate the overlap of topics of online news articles from a variety of sources. To do this, we provide a platform for studying the news by measuring this overlap and scoring news stories according to the degree of attention in near-real time. This can enable multiple studies, including identifying topics that receive the most attention from news organizations and identifying slow news days…

    Submitted 22 March, 2020; originally announced March 2020.

    Comments: This is an extended version of the paper accepted at Computation + Journalism Symposium 2020, which has been postponed because of COVID-19

  33. arXiv:1908.02819  [pdf, other]

    cs.DL cs.IR

    Making Recommendations from Web Archives for "Lost" Web Pages

    Authors: Lulwah M. Alkwai, Michael L. Nelson, Michele C. Weigle

    Abstract: When a user requests a web page from a web archive, the user will typically either get an HTTP 200 if the page is available, or an HTTP 404 if the web page has not been archived. This is because web archives are typically accessed by URI lookup, and the response is binary: the archive either has the page or it does not, and the user will not know of other archived web pages that exist and are pote…

    Submitted 7 August, 2019; originally announced August 2019.

    Comments: 12 pages

  34. arXiv:1906.07141  [pdf, other]

    cs.DL

    Impact of HTTP Cookie Violations in Web Archives

    Authors: Sawood Alam, Plinio Vargas, Michele C. Weigle, Michael L. Nelson

    Abstract: Certain HTTP Cookies on certain sites can be a source of content bias in archival crawls. Accommodating Cookies at crawl time, but not utilizing them at replay time may cause cookie violations, resulting in defaced composite mementos that never existed on the live web. To address these issues, we propose that crawlers store Cookies with short expiration time and archival replay systems account for…

    Submitted 17 June, 2019; originally announced June 2019.

    Comments: Presented at WADL 2019 (http://fox.cs.vt.edu/wadl2019.html). Slides: https://www.slideshare.net/ibnesayeed/impact-of-http-cookie-violations-in-web-archives

  35. arXiv:1906.07104  [pdf, other]

    cs.NI cs.CR cs.CY cs.DL

    Supporting Web Archiving via Web Packaging

    Authors: Sawood Alam, Michele C. Weigle, Michael L. Nelson, Martin Klein, Herbert Van de Sompel

    Abstract: We describe challenges related to web archiving, replaying archived web resources, and verifying their authenticity. We show that Web Packaging has significant potential to help address these challenges and identify areas in which changes are needed in order to fully realize that potential.

    Submitted 17 June, 2019; originally announced June 2019.

    Comments: This is a position paper accepted at the ESCAPE Workshop 2019. https://www.iab.org/activities/workshops/escape-workshop/

  36. arXiv:1905.12607  [pdf, other]

    cs.DL

    MementoMap Framework for Flexible and Adaptive Web Archive Profiling

    Authors: Sawood Alam, Michele C. Weigle, Michael L. Nelson, Fernando Melo, Daniel Bicho, Daniel Gomes

    Abstract: In this work we propose MementoMap, a flexible and adaptive framework to efficiently summarize holdings of a web archive. We described a simple, yet extensible, file format suitable for MementoMap. We used the complete index of the Arquivo.pt comprising 5B mementos (archived web pages/files) to understand the nature and shape of its holdings. We generated MementoMaps with varying amount of detail…

    Submitted 29 May, 2019; originally announced May 2019.

    Comments: In Proceedings of JCDL 2019; 13 pages, 9 tables, 13 figures, 3 code samples, and 1 equation

  37. arXiv:1905.12565  [pdf, other]

    cs.DL

    Archive Assisted Archival Fixity Verification Framework

    Authors: Mohamed Aturban, Sawood Alam, Michael L. Nelson, Michele C. Weigle

    Abstract: The number of public and private web archives has increased, and we implicitly trust content delivered by these archives. Fixity is checked to ensure an archived resource has remained unaltered since the time it was captured. Some web archives do not allow users to access fixity information and, more importantly, even if fixity information is available, it is provided by the same archive from whic…

    Submitted 29 May, 2019; originally announced May 2019.

    Comments: 16 pages

  38. arXiv:1905.12220  [pdf, other]

    cs.DL cs.IR

    Using Micro-collections in Social Media to Generate Seeds for Web Archive Collections

    Authors: Alexander C. Nwala, Michele C. Weigle, Michael L. Nelson

    Abstract: In a Web plagued by disappearing resources, Web archive collections provide a valuable means of preserving Web resources important to the study of past events ranging from elections to disease outbreaks. These archived collections start with seed URIs (Uniform Resource Identifiers) hand-selected by curators. Curators produce high quality seeds by removing non-relevant URIs and adding URIs from cre…

    Submitted 29 May, 2019; originally announced May 2019.

    Comments: This is an extended version of the ACM/IEEE Joint Conference on Digital Libraries (JCDL 2019) full paper. Some figures have been enlarged, and appendices of additional figures included

  39. arXiv:1905.11342  [pdf, other]

    cs.DL cs.HC cs.SI

    Social Cards Probably Provide For Better Understanding Of Web Archive Collections

    Authors: Shawn M. Jones, Michele C. Weigle, Michael L. Nelson

    Abstract: Used by a variety of researchers, web archive collections have become invaluable sources of evidence. If a researcher is presented with a web archive collection that they did not create, how do they know what is inside so that they can use it for their own research? Search engine results and social media links are represented as surrogates, small easily digestible summaries of the underlying page.…

    Submitted 29 May, 2019; v1 submitted 27 May, 2019; originally announced May 2019.

    Comments: 58 pages, 53 figures

    ACM Class: H.3.7; H.3.6; H.3.5; H.5.2

  40. arXiv:1905.03836  [pdf, other]

    cs.DL

    Collecting 16K archived web pages from 17 public web archives

    Authors: Mohamed Aturban, Michael L. Nelson, Michele C. Weigle, Martin Klein, Herbert Van de Sompel

    Abstract: We document the creation of a data set of 16,627 archived web pages, or mementos, of 3,698 unique live web URIs (Uniform Resource Identifiers) from 17 public web archives. We used four different methods to collect the dataset. First, we used the Los Alamos National Laboratory (LANL) Memento Aggregator to collect mementos of an initial set of URIs obtained from four sources: (a) the Moz Top 500, (b…

    Submitted 9 May, 2019; originally announced May 2019.

    Comments: 21 pages

  41. arXiv:1806.09082  [pdf, other]

    cs.DL cs.IR cs.SI

    Measuring News Similarity Across Ten U.S. News Sites

    Authors: Grant C. Atkins, Alexander Nwala, Michele C. Weigle, Michael L. Nelson

    Abstract: News websites make editorial decisions about what stories to include on their website homepages and what stories to emphasize (e.g., large font size for main story). The emphasized stories on a news website are often highly similar to many other news websites (e.g, a terrorist event story). The selective emphasis of a top news story and the similarity of news across different news organizations ar…

    Submitted 1 July, 2018; v1 submitted 24 June, 2018; originally announced June 2018.

    Comments: This is an extended version of the paper to appear in the proceedings of the 15th International Conference on Digital Preservation (iPres 2018)

  42. The Many Shapes of Archive-It

    Authors: Shawn M. Jones, Alexander Nwala, Michele C. Weigle, Michael L. Nelson

    Abstract: Web archives, a key area of digital preservation, meet the needs of journalists, social scientists, historians, and government organizations. The use cases for these groups often require that they guide the archiving process themselves, selecting their own original resources, or seeds, and creating their own web archive collections. We focus on the collections within Archive-It, a subscription ser…

    Submitted 18 June, 2018; originally announced June 2018.

    Comments: 10 pages, 12 figures, to appear in the proceedings of the 15th International Conference on Digital Preservation (iPres 2018)

    ACM Class: H.3.7; H.3.1

  43. The Off-Topic Memento Toolkit

    Authors: Shawn M. Jones, Michele C. Weigle, Michael L. Nelson

    Abstract: Web archive collections are created with a particular purpose in mind. A curator selects seeds, or original resources, which are then captured by an archiving system and stored as archived web pages, or mementos. The systems that build web archive collections are often configured to revisit the same original resource multiple times. This is incredibly useful for understanding an unfolding news sto…

    Submitted 17 September, 2018; v1 submitted 18 June, 2018; originally announced June 2018.

    Comments: 10 pages, 14 figures, to appear in the proceedings of the 15th International Conference on Digital Preservation (iPres 2018)

    ACM Class: H.3.7; H.3.6; H.3.4

  44. A Framework for Aggregating Private and Public Web Archives

    Authors: Mat Kelly, Michael L. Nelson, Michele C. Weigle

    Abstract: Personal and private Web archives are proliferating due to the increase in the tools to create them and the realization that Internet Archive and other public Web archives are unable to capture personalized (e.g., Facebook) and private (e.g., banking) Web pages. We introduce a framework to mitigate issues of aggregation in private, personal, and public Web archives without compromising potential s…

    Submitted 3 June, 2018; originally announced June 2018.

    Comments: Preprint version of the ACM/IEEE Joint Conference on Digital Libraries (JCDL 2018) full paper, accessible at the DOI

  45. Scraping SERPs for Archival Seeds: It Matters When You Start

    Authors: Alexander C. Nwala, Michele C. Weigle, Michael L. Nelson

    Abstract: Event-based collections are often started with a web search, but the search results you find on Day 1 may not be the same as those you find on Day 7. In this paper, we consider collections that originate from extracting URIs (Uniform Resource Identifiers) from Search Engine Result Pages (SERPs). Specifically, we seek to provide insight about the retrievability of URIs of news stories found on Goog…

    Submitted 25 May, 2018; originally announced May 2018.

    Comments: This is an extended version of the ACM/IEEE Joint Conference on Digital Libraries (JCDL 2018) full paper: https://doi.org/10.1145/3197026.3197056. Some of the figure numbers have changed

  46. arXiv:1712.03140  [pdf, other

    cs.DL

    Difficulties of Timestamping Archived Web Pages

    Authors: Mohamed Aturban, Michael L. Nelson, Michele C. Weigle

    Abstract: We show that state-of-the-art services for creating trusted timestamps in blockchain-based networks do not adequately allow for timestamping of web pages. They accept data by value (e.g., images and text), but not by reference (e.g., URIs of web pages). Also, we discuss difficulties in repeatedly generating the same cryptographic hash value of an archived web page. We then introduce several requir…

    Submitted 8 December, 2017; originally announced December 2017.

    Comments: 27 pages

  47. arXiv:1708.05790  [pdf, other

    cs.DL cs.SI

    University Twitter Engagement: Using Twitter Followers to Rank Universities

    Authors: Corren G. McCoy, Michael L. Nelson, Michele C. Weigle

    Abstract: We examine and rank a set of 264 U.S. universities extracted from the National Collegiate Athletic Association (NCAA) Division I membership and global lists published in U.S. News, Times Higher Education, Academic Ranking of World Universities, and Money Magazine. Our University Twitter Engagement (UTE) rank is based on the friend and extended follower network of primary and affiliated secondary T…

    Submitted 18 August, 2017; originally announced August 2017.

    Comments: 14 pages, 4 figures

  48. arXiv:1705.06218  [pdf, other

    cs.DL

    Stories From the Past Web

    Authors: Yasmin AlNoamany, Michele C. Weigle, Michael L. Nelson

    Abstract: Archiving Web pages into themed collections is a method for ensuring these resources are available for posterity. Services such as Archive-It exist to allow institutions to develop, curate, and preserve collections of Web resources. Understanding the contents and boundaries of these archived collections is a challenge for most people, resulting in the paradox of the larger the collection, the har…

    Submitted 17 May, 2017; originally announced May 2017.

  49. Impact of URI Canonicalization on Memento Count

    Authors: Mat Kelly, Lulwah M. Alkwai, Michael L. Nelson, Michele C. Weigle, Herbert Van de Sompel

    Abstract: Quantifying the captures of a URI over time is useful for researchers to identify the extent to which a Web page has been archived. Memento TimeMaps provide a format to list mementos (URI-Ms) for captures along with brief metadata, like Memento-Datetime, for each URI-M. However, when some URI-Ms are dereferenced, they simply provide a redirect to a different URI-M (instead of a unique representati…

    Submitted 9 March, 2017; originally announced March 2017.

    Comments: 43 pages, 8 figures

  50. arXiv:1605.06154  [pdf, other

    cs.DL

    Web Infrastructure to Support e-Journal Preservation (and More)

    Authors: Herbert Van de Sompel, David S. H. Rosenthal, Michael L. Nelson

    Abstract: E-journal preservation systems have to ingest millions of articles each year. Ingest, especially of the "long tail" of journals from small publishers, is the largest element of their cost. Cost is the major reason that archives contain less than half the content they should. Automation is essential to minimize these costs. This paper examines the potential for automation beyond the status quo base…

    Submitted 19 May, 2016; originally announced May 2016.

    Comments: 23 pages, 5 figures

    ACM Class: H.3.7