Introducing A Dark Web Archival Framework

Authors: Justin F. Brunelle, Ryan Farley, Grant Atkins, Trevor Bostic, Marites Hendrix, Zak Zebrowski

Abstract: We present a framework for web-scale archiving of the dark web. While commonly associated with illicit and illegal activity, the dark web provides a way to privately access web information. This is a valuable and socially beneficial tool to global citizens, such as those wishing to access information while under oppressive political regimes that work to limit information availability. However, little institutional archiving is performed on the dark web (limited to the Archive.is dark web presence, a page-at-a-time archiver). We use surface web tools, techniques, and procedures (TTPs) and adapt them for archiving the dark web. We demonstrate the viability of our framework in a proof-of-concept and narrowly scoped prototype, implemented with the following lightly adapted open source tools: the Brozzler crawler for capture, WARC file for storage, and pywb for replay. Using these tools, we demonstrate the viability of modified surface web archiving TTPs for archiving the dark web.

Submitted 8 July, 2021; originally announced July 2021.

arXiv:1908.02804 [pdf, ps, other]

Exploring the Intersections of Web Science and Accessibility

Authors: Trevor Bostic, Jeff Stanley, John Higgins, Rachael L. Bradley-Montgomery, Justin F. Brunelle, Daniel Chudnov

Abstract: The web is the prominent way information is exchanged in the 21st century. However, ensuring web-based information is accessible is complicated, particularly with web applications that rely on JavaScript and other technologies to deliver and build representations; representations are often the HTML, images, or other code a server delivers for a web resource. Static representations are becoming rarer and assessing the accessibility of web-based information to ensure it is available to all users is increasingly difficult given the dynamic nature of representations. In this work, we survey three ongoing research threads that can inform web accessibility solutions: assessing web accessibility, modeling web user activity, and web application crawling. Current web accessibility research is continually focused on increasing the percentage of automatically testable standards, but still relies heavily upon manual testing for complex interactive applications. Alongside web accessibility research, there are mechanisms developed by researchers that replicate user interactions with web pages based on usage patterns. Crawling web applications is a broad research domain; exposing content in web applications is difficult because of incompatibilities in web crawlers and the technologies used to create the applications. We describe research on crawling the deep web by exercising user forms.
We close with a thought exercise regarding the convergence of these three threads and the future of automated, web-based accessibility evaluation and assurance through a use case in web archiving. These research efforts provide insight into how users interact with websites, how to automate and simulate user interactions, how to record the results of user interactions, and how to analyze, evaluate, and map resulting website content to determine its relative accessibility.

Submitted 7 August, 2019; originally announced August 2019.

Comments: 10 pages, Latex

arXiv:1601.05142 [pdf, other]

Adapting the Hypercube Model to Archive Deferred Representations and Their Descendants

Authors: Justin F. Brunelle, Michele C. Weigle, Michael L. Nelson

Abstract: The web is today's primary publication medium, making web archiving an important activity for historical and analytical purposes. Web pages are increasingly interactive, resulting in pages that are increasingly difficult to archive. Client-side technologies (e.g., JavaScript) enable interactions that can potentially change the client-side state of a representation. We refer to representations that load embedded resources via JavaScript as deferred representations. It is difficult to archive all of the resources in deferred representations and the result is archives with web pages that are either incomplete or that erroneously load embedded resources from the live web. We propose a method of discovering and crawling deferred representations and their descendants (representation states that are only reachable through client-side events). We adapt the Dincturk et al. Hypercube model to construct a model for archiving descendants, and we measure the number of descendants and requisite embedded resources discovered in a proof-of-concept crawl. Our approach identified an average of 38.5 descendants per seed URI crawled, 70.9% of which are reached through an onclick event. This approach also added 15.6 times more embedded resources than Heritrix to the crawl frontier, but at a rate that was 38.9 times slower than simply using Heritrix. We show that our dataset has two levels of descendants. We conclude with proposed crawl policies and an analysis of the storage requirements for archiving descendants.

Submitted 19 January, 2016; originally announced January 2016.

arXiv:1508.02315 [pdf, other]

Archiving Deferred Representations Using a Two-Tiered Crawling Approach

Authors: Justin F. Brunelle, Michele C. Weigle, Michael L. Nelson

Abstract: Web resources are increasingly interactive, resulting in resources that are increasingly difficult to archive. The archival difficulty is based on the use of client-side technologies (e.g., JavaScript) to change the client-side state of a representation after it has initially loaded. We refer to these representations as deferred representations. We can better archive deferred representations using tools like headless browsing clients. We use 10,000 seed Universal Resource Identifiers (URIs) to explore the impact of including PhantomJS -- a headless browsing tool -- into the crawling process by comparing the performance of wget (the baseline), PhantomJS, and Heritrix. Heritrix crawled 2.065 URIs per second, 12.15 times faster than PhantomJS and 2.4 times faster than wget. However, PhantomJS discovered 531,484 URIs, 1.75 times more than Heritrix and 4.11 times more than wget. To take advantage of the performance benefits of Heritrix and the URI discovery of PhantomJS, we recommend a tiered crawling strategy in which a classifier predicts whether a representation will be deferred or not, and only resources with deferred representations are crawled with PhantomJS while resources without deferred representations are crawled with Heritrix. We show that this approach is 5.2 times faster than using only PhantomJS and creates a frontier (set of URIs to be crawled) 1.8 times larger than using only Heritrix.

Submitted 10 August, 2015; originally announced August 2015.

Comments: To appear at iPRES2015 11 pages

ACM Class: H.3.7
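
The tiered strategy described in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `route_seeds`, `estimated_seconds`, and the `predicts_deferred` callback are hypothetical names, the abstract does not say how the classifier works, and the serial cost model is an assumption. Only the two crawl rates are taken from the abstract (Heritrix at 2.065 URIs per second, PhantomJS 12.15 times slower).

```python
from typing import Callable, Iterable, List, Tuple

# Rates from the abstract: Heritrix crawled 2.065 URIs/s,
# which was 12.15 times faster than PhantomJS.
HERITRIX_URIS_PER_SEC = 2.065
PHANTOMJS_URIS_PER_SEC = HERITRIX_URIS_PER_SEC / 12.15


def route_seeds(
    uris: Iterable[str],
    predicts_deferred: Callable[[str], bool],
) -> Tuple[List[str], List[str]]:
    """Split URIs into (headless_queue, conventional_queue).

    predicts_deferred is a stand-in for the classifier; how it decides
    is not specified in the abstract, so it is left to the caller.
    """
    headless: List[str] = []
    conventional: List[str] = []
    for uri in uris:
        if predicts_deferred(uri):
            headless.append(uri)      # would go to the headless browser tier
        else:
            conventional.append(uri)  # would go to the conventional crawler tier
    return headless, conventional


def estimated_seconds(n_headless: int, n_conventional: int) -> float:
    """Rough serial crawl-time estimate (assumption: the tiers do not overlap)."""
    return (n_headless / PHANTOMJS_URIS_PER_SEC
            + n_conventional / HERITRIX_URIS_PER_SEC)
```

Under this simple model, the fewer URIs the classifier sends to the headless tier, the closer the total crawl time gets to the conventional crawler's rate, at the cost of missing URIs that only a headless client would discover.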

arXiv:1307.8067 [pdf, other]

doi 10.1007/978-3-642-40501-3_5

On the Change in Archivability of Websites Over Time

Authors: Mat Kelly, Justin F. Brunelle, Michele C. Weigle, Michael L. Nelson

Abstract: As web technologies evolve, web archivists work to keep up so that our digital history is preserved. Recent advances in web technologies have introduced client-side executed scripts that load data without a referential identifier or that require user interaction (e.g., content loading when the page has scrolled). These advances have made automating methods for capturing web pages more difficult. Because of the evolving schemes of publishing web pages along with the progressive capability of web preservation tools, the archivability of pages on the web has varied over time. In this paper we show that the archivability of a web page can be deduced from the type of page being archived, which aligns with that page's accessibility in respect to dynamic content. We show concrete examples of when these technologies were introduced by referencing mementos of pages that have persisted through a long evolution of available technologies. Identifying these reasons for the inability of these web pages to be archived in the past in respect to accessibility serves as a guide for ensuring that content that has longevity is published using good practice methods that make it available for preservation.

Submitted 30 July, 2013; originally announced July 2013.

Comments: 12 pages, 8 figures, Theory and Practice of Digital Libraries (TPDL) 2013, Valletta, Malta

arXiv:1307.5685 [pdf, other]

An Evaluation of Caching Policies for Memento TimeMaps

Authors: Justin F. Brunelle, Michael L. Nelson

Abstract: As defined by the Memento Framework, TimeMaps are machine-readable lists of time-specific copies -- called "mementos" -- of an archived original resource. In theory, as an archive acquires additional mementos over time, a TimeMap should be monotonically increasing. However, there are reasons why the number of mementos in a TimeMap would decrease, for example: archival redaction of some or all of the mementos, archival restructuring, and transient errors on the part of one or more archives. We study TimeMaps for 4,000 original resources over a three month period, note their change patterns, and develop a caching algorithm for TimeMaps suitable for a reverse proxy in front of a Memento aggregator. We show that TimeMap cardinality is constant or monotonically increasing for 80.2% of all TimeMap downloads observed in the observation period. The goal of the caching algorithm is to exploit the ideally monotonically increasing nature of TimeMaps and not cache responses with fewer mementos than the already cached TimeMap. This new caching algorithm uses conditional cache replacement and a Time To Live (TTL) value to ensure the user has access to the most complete TimeMap available. Based on our empirical data, a TTL of 15 days will minimize the number of mementos missed by users, and minimize the load on archives contributing to TimeMaps.

Submitted 22 July, 2013; originally announced July 2013.

Comments: JCDL2013
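
The conditional cache replacement described in the abstract above might look something like the sketch below. It is an illustration under stated assumptions rather than the paper's algorithm: the `TimeMapCache` class, the `fetch` callback, and the choice to restart the TTL clock when a smaller TimeMap is rejected are all hypothetical. Only the 15-day TTL and the rule "do not replace a cached TimeMap with one that has fewer mementos" come from the abstract.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

TTL_SECONDS = 15 * 24 * 60 * 60  # 15-day TTL suggested in the abstract


@dataclass
class CachedTimeMap:
    mementos: List[str]   # memento URIs listed in the TimeMap
    fetched_at: float     # timestamp (seconds) of the last upstream check


class TimeMapCache:
    """TTL cache that never replaces a TimeMap with a smaller one."""

    def __init__(self, fetch: Callable[[str], List[str]],
                 ttl: float = TTL_SECONDS) -> None:
        self._fetch = fetch   # hypothetical call to the Memento aggregator
        self._ttl = ttl
        self._entries: Dict[str, CachedTimeMap] = {}

    def get(self, uri: str, now: float) -> List[str]:
        entry = self._entries.get(uri)
        if entry is not None and now - entry.fetched_at < self._ttl:
            return entry.mementos  # still fresh: no load on the archives

        fresh = self._fetch(uri)
        if entry is None or len(fresh) >= len(entry.mementos):
            # Same size or larger: accept the new TimeMap.
            self._entries[uri] = CachedTimeMap(fresh, now)
        else:
            # Fewer mementos (e.g. a transient archive error): keep the
            # cached copy. Restarting the TTL here is an assumption.
            entry.fetched_at = now
        return self._entries[uri].mementos
```

Passing `now` explicitly keeps the sketch easy to test; a deployment in front of an aggregator would more likely read the clock itself.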

arXiv:1209.1811 [pdf, other]

Evaluating the SiteStory Transactional Web Archive With the ApacheBench Tool

Authors: Justin F. Brunelle, Michael L. Nelson

Abstract: Conventional Web archives are created by periodically crawling a web site and archiving the responses from the Web server. Although easy to implement and commonly deployed, this form of archiving typically misses updates and may not be suitable for all preservation scenarios, for example a site that is required (perhaps for records compliance) to keep a copy of all pages it has served. In contrast, transactional archives work in conjunction with a Web server to record all pages that have been served. Los Alamos National Laboratory has developed SiteStory, an open-source transactional archive solution written in Java that runs on Apache Web servers, provides a Memento-compatible access interface, and offers WARC file export features. We used the ApacheBench utility on a pre-release version of SiteStory to measure response time and content delivery time in different environments and on different machines. The performance tests were designed to determine the feasibility of SiteStory as a production-level solution for high fidelity automatic Web archiving. We found that SiteStory does not significantly affect content server performance when it is performing transactional archiving. Content server performance slows from 0.076 seconds to 0.086 seconds per Web page access when the content server is under load, and from 0.15 seconds to 0.21 seconds when the resource has many embedded and changing resources.

Submitted 5 October, 2012; v1 submitted 9 September, 2012; originally announced September 2012.

Comments: 13 pages, Technical Report

arXiv:1009.2208 [pdf]

Gamed-based iSTART Practice: From MiBoard to Self-Explanation Showdown

Authors: Justin F. Brunelle, G. Tanner Jackson, Kyle Dempsey, Chutima Boonthum, Irwin B. Levinstein, Danielle S. McNamara

Abstract: MiBoard (Multiplayer Interactive Board Game) is an online, turn-based board game that was developed to assess the integration of game characteristics (point rewards, game-like interaction, and peer feedback) and how that might affect student engagement and learning efficacy. This online board game was designed to fit within the Extended Practice module of iSTART (Interactive Strategy Training for Active Reading and Thinking). Unfortunately, preliminary research shows that MiBoard actually reduces engagement and does not benefit the quality of student self-explanations when compared to the original Extended Practice module. Consequently, the MiBoard framework has been revamped to create Self-Explanation Showdown, a faster-paced, less analytically oriented game that adds competition to the creation of self-explanations. Students are evaluated on the quality of their self-explanations using the same assessment algorithms from the iSTART Extended Practice module (this includes both word-based and LSA-based assessments). The technical issues involved in development of MiBoard and Self-Explanation Showdown are described. The lessons learned from the MiBoard experience are also discussed in this paper.