-
Introducing A Dark Web Archival Framework
Authors:
Justin F. Brunelle,
Ryan Farley,
Grant Atkins,
Trevor Bostic,
Marites Hendrix,
Zak Zebrowski
Abstract:
We present a framework for web-scale archiving of the dark web. While commonly associated with illicit and illegal activity, the dark web provides a way to privately access web information. This is a valuable and socially beneficial tool to global citizens, such as those wishing to access information while under oppressive political regimes that work to limit information availability. However, little institutional archiving is performed on the dark web (limited to the Archive.is dark web presence, a page-at-a-time archiver). We use surface web tools, techniques, and procedures (TTPs) and adapt them for archiving the dark web. We demonstrate the viability of our framework in a proof-of-concept and narrowly scoped prototype, implemented with the following lightly adapted open source tools: the Brozzler crawler for capture, WARC file for storage, and pywb for replay. Using these tools, we demonstrate the viability of modified surface web archiving TTPs for archiving the dark web.
Submitted 8 July, 2021;
originally announced July 2021.
-
Exploring the Intersections of Web Science and Accessibility
Authors:
Trevor Bostic,
Jeff Stanley,
John Higgins,
Rachael L. Bradley-Montgomery,
Justin F. Brunelle,
Daniel Chudnov
Abstract:
The web is the prominent way information is exchanged in the 21st century. However, ensuring web-based information is accessible is complicated, particularly with web applications that rely on JavaScript and other technologies to deliver and build representations; representations are often the HTML, images, or other code a server delivers for a web resource. Static representations are becoming rarer and assessing the accessibility of web-based information to ensure it is available to all users is increasingly difficult given the dynamic nature of representations.
In this work, we survey three ongoing research threads that can inform web accessibility solutions: assessing web accessibility, modeling web user activity, and web application crawling. Current web accessibility research is continually focused on increasing the percentage of automatically testable standards, but still relies heavily upon manual testing for complex interactive applications. Alongside web accessibility research, there are mechanisms developed by researchers that replicate user interactions with web pages based on usage patterns. Crawling web applications is a broad research domain; exposing content in web applications is difficult because of incompatibilities between web crawlers and the technologies used to create the applications. We describe research on crawling the deep web by exercising user forms. We close with a thought exercise regarding the convergence of these three threads and the future of automated, web-based accessibility evaluation and assurance through a use case in web archiving. These research efforts provide insight into how users interact with websites, how to automate and simulate user interactions, how to record the results of user interactions, and how to analyze, evaluate, and map resulting website content to determine its relative accessibility.
Submitted 7 August, 2019;
originally announced August 2019.
-
Adapting the Hypercube Model to Archive Deferred Representations and Their Descendants
Authors:
Justin F. Brunelle,
Michele C. Weigle,
Michael L. Nelson
Abstract:
The web is today's primary publication medium, making web archiving an important activity for historical and analytical purposes. Web pages are increasingly interactive, resulting in pages that are increasingly difficult to archive. Client-side technologies (e.g., JavaScript) enable interactions that can potentially change the client-side state of a representation. We refer to representations that load embedded resources via JavaScript as deferred representations. It is difficult to archive all of the resources in deferred representations and the result is archives with web pages that are either incomplete or that erroneously load embedded resources from the live web.
We propose a method of discovering and crawling deferred representations and their descendants (representation states that are only reachable through client-side events). We adapt the Dincturk et al. Hypercube model to construct a model for archiving descendants, and we measure the number of descendants and requisite embedded resources discovered in a proof-of-concept crawl. Our approach identified an average of 38.5 descendants per seed URI crawled, 70.9% of which are reached through an onclick event. This approach also added 15.6 times more embedded resources than Heritrix to the crawl frontier, but at a rate that was 38.9 times slower than simply using Heritrix. We show that our dataset has two levels of descendants. We conclude with proposed crawl policies and an analysis of the storage requirements for archiving descendants.
Submitted 19 January, 2016;
originally announced January 2016.
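The notion of descendants at bounded depth can be illustrated with a small breadth-first exploration of client-side states. This is a minimal sketch and not the Hypercube model the abstract adapts: the `successors` callable, the state labels, and the `max_depth` parameter are all hypothetical, and a real crawler would obtain successors by firing events (such as onclick) in a browser.

```python
from collections import deque
from typing import Callable, Dict, Hashable, Iterable, Tuple

# A transition is (event name, resulting state). Both are illustrative labels.
Transition = Tuple[str, Hashable]


def explore_descendants(
    root: Hashable,
    successors: Callable[[Hashable], Iterable[Transition]],
    max_depth: int = 2,
) -> Dict[Hashable, int]:
    """Return every state reachable from root within max_depth events,
    mapped to the level at which it was first reached."""
    depth_of: Dict[Hashable, int] = {root: 0}
    queue = deque([root])
    while queue:
        state = queue.popleft()
        depth = depth_of[state]
        if depth >= max_depth:
            continue  # do not fire events on states at the depth limit
        for _event, next_state in successors(state):
            if next_state not in depth_of:  # skip states already seen
                depth_of[next_state] = depth + 1
                queue.append(next_state)
    return depth_of


# Toy transition table standing in for a browser firing client-side events.
TOY_GRAPH = {
    "root": [("onclick", "a"), ("onmouseover", "b")],
    "a": [("onclick", "c"), ("onclick", "root")],
    "c": [("onclick", "d")],
}


def toy_successors(state: Hashable) -> Iterable[Transition]:
    return TOY_GRAPH.get(state, [])
```

Bounding the depth reflects the abstract's observation that the dataset has two levels of descendants; the limit is a parameter here precisely because other datasets may differ.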
-
Archiving Deferred Representations Using a Two-Tiered Crawling Approach
Authors:
Justin F. Brunelle,
Michele C. Weigle,
Michael L. Nelson
Abstract:
Web resources are increasingly interactive, resulting in resources that are increasingly difficult to archive. The archival difficulty is based on the use of client-side technologies (e.g., JavaScript) to change the client-side state of a representation after it has initially loaded. We refer to these representations as deferred representations. We can better archive deferred representations using tools like headless browsing clients. We use 10,000 seed Universal Resource Identifiers (URIs) to explore the impact of including PhantomJS -- a headless browsing tool -- into the crawling process by comparing the performance of wget (the baseline), PhantomJS, and Heritrix. Heritrix crawled 2.065 URIs per second, 12.15 times faster than PhantomJS and 2.4 times faster than wget. However, PhantomJS discovered 531,484 URIs, 1.75 times more than Heritrix and 4.11 times more than wget. To take advantage of the performance benefits of Heritrix and the URI discovery of PhantomJS, we recommend a tiered crawling strategy in which a classifier predicts whether a representation will be deferred or not, and only resources with deferred representations are crawled with PhantomJS while resources without deferred representations are crawled with Heritrix. We show that this approach is 5.2 times faster than using only PhantomJS and creates a frontier (set of URIs to be crawled) 1.8 times larger than using only Heritrix.
Submitted 10 August, 2015;
originally announced August 2015.
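The tiered strategy described in this abstract can be sketched as a routing step plus a rough time estimate. This is an illustration under stated assumptions, not the authors' implementation: the classifier is passed in as a hypothetical `is_deferred` callable, the PhantomJS rate is derived from the reported 12.15 ratio instead of being quoted, and the time model simply assumes each crawler runs at its reported average rate.

```python
from typing import Callable, Iterable, List, Tuple

# Heritrix crawl rate quoted in the abstract (URIs per second).
HERITRIX_RATE = 2.065
# Derived, not quoted: the abstract reports Heritrix as 12.15 times faster.
PHANTOMJS_RATE = HERITRIX_RATE / 12.15


def route_seeds(
    uris: Iterable[str],
    is_deferred: Callable[[str], bool],
) -> Tuple[List[str], List[str]]:
    """Split URIs into (headless queue, conventional queue).

    is_deferred is a hypothetical classifier predicting whether a URI
    will produce a deferred representation.
    """
    headless: List[str] = []
    conventional: List[str] = []
    for uri in uris:
        (headless if is_deferred(uri) else conventional).append(uri)
    return headless, conventional


def estimated_crawl_seconds(n_uris: int, deferred_fraction: float) -> float:
    """Simplified estimate: deferred URIs go to the headless crawler and the
    rest to the conventional crawler, each at its average rate."""
    if not 0.0 <= deferred_fraction <= 1.0:
        raise ValueError("deferred_fraction must be between 0 and 1")
    deferred = n_uris * deferred_fraction
    return deferred / PHANTOMJS_RATE + (n_uris - deferred) / HERITRIX_RATE
```

The estimate ignores classification cost and misclassified URIs, so it should be read as a way to reason about the trade-off and not as a reproduction of the reported 5.2 times speedup.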
-
On the Change in Archivability of Websites Over Time
Authors:
Mat Kelly,
Justin F. Brunelle,
Michele C. Weigle,
Michael L. Nelson
Abstract:
As web technologies evolve, web archivists work to keep up so that our digital history is preserved. Recent advances in web technologies have introduced client-side executed scripts that load data without a referential identifier or that require user interaction (e.g., content loading when the page has scrolled). These advances have made automating methods for capturing web pages more difficult. Because of the evolving schemes of publishing web pages along with the progressive capability of web preservation tools, the archivability of pages on the web has varied over time. In this paper we show that the archivability of a web page can be deduced from the type of page being archived, which aligns with that page's accessibility with respect to dynamic content. We show concrete examples of when these technologies were introduced by referencing mementos of pages that have persisted through a long evolution of available technologies. Identifying these reasons for the inability of these web pages to be archived in the past with respect to accessibility serves as a guide for ensuring that content that has longevity is published using good practice methods that make it available for preservation.
Submitted 30 July, 2013;
originally announced July 2013.
-
An Evaluation of Caching Policies for Memento TimeMaps
Authors:
Justin F. Brunelle,
Michael L. Nelson
Abstract:
As defined by the Memento Framework, TimeMaps are machine-readable lists of time-specific copies -- called "mementos" -- of an archived original resource. In theory, as an archive acquires additional mementos over time, a TimeMap should be monotonically increasing. However, there are reasons why the number of mementos in a TimeMap would decrease, for example: archival redaction of some or all of the mementos, archival restructuring, and transient errors on the part of one or more archives. We study TimeMaps for 4,000 original resources over a three-month period, note their change patterns, and develop a caching algorithm for TimeMaps suitable for a reverse proxy in front of a Memento aggregator. We show that TimeMap cardinality is constant or monotonically increasing for 80.2% of all TimeMap downloads observed in the observation period. The goal of the caching algorithm is to exploit the ideally monotonically increasing nature of TimeMaps and not cache responses with fewer mementos than the already cached TimeMap. This new caching algorithm uses conditional cache replacement and a Time To Live (TTL) value to ensure the user has access to the most complete TimeMap available. Based on our empirical data, a TTL of 15 days will minimize the number of mementos missed by users, and minimize the load on archives contributing to TimeMaps.
Submitted 22 July, 2013;
originally announced July 2013.
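The caching policy summarized in this abstract (conditional replacement combined with a TTL) can be sketched as follows. This is a minimal illustration, not the authors' code: the `fetch` callable, the class and field names, and the injectable `clock` are hypothetical, and restarting the TTL timer after a smaller response is rejected is an assumption made here so that the archives are not queried again on every request.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List

# 15-day TTL reported in the abstract, expressed in seconds.
TTL_SECONDS = 15 * 24 * 60 * 60


@dataclass
class CachedTimeMap:
    mementos: List[str]  # memento URIs listed in the TimeMap
    stored_at: float     # time the entry was last validated


class TimeMapCache:
    def __init__(
        self,
        fetch: Callable[[str], List[str]],
        ttl: float = TTL_SECONDS,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._fetch = fetch  # hypothetical call to the Memento aggregator
        self._ttl = ttl
        self._clock = clock
        self._entries: Dict[str, CachedTimeMap] = {}

    def get(self, uri: str) -> List[str]:
        now = self._clock()
        entry = self._entries.get(uri)
        if entry is not None and now - entry.stored_at < self._ttl:
            return entry.mementos  # still fresh: serve from the cache
        fresh = self._fetch(uri)
        if entry is None or len(fresh) >= len(entry.mementos):
            # Conditional replacement: accept only responses that are at
            # least as complete as the cached TimeMap.
            self._entries[uri] = CachedTimeMap(fresh, now)
            return fresh
        # Smaller response: keep the larger cached copy and restart the
        # timer (an assumption of this sketch).
        entry.stored_at = now
        return entry.mementos
```

Comparing only cardinality follows the abstract's framing of TimeMaps as ideally monotonically increasing; a production cache might also compare the memento URIs themselves.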
-
Evaluating the SiteStory Transactional Web Archive With the ApacheBench Tool
Authors:
Justin F. Brunelle,
Michael L. Nelson
Abstract:
Conventional Web archives are created by periodically crawling a web site and archiving the responses from the Web server. Although easy to implement and commonly deployed, this form of archiving typically misses updates and may not be suitable for all preservation scenarios, for example a site that is required (perhaps for records compliance) to keep a copy of all pages it has served. In contrast, transactional archives work in conjunction with a Web server to record all pages that have been served. Los Alamos National Laboratory has developed SiteStory, an open-source transactional archive solution written in Java that runs on Apache Web servers, provides a Memento compatible access interface, and offers WARC file export features. We used the ApacheBench utility on a pre-release version of SiteStory to measure response time and content delivery time in different environments and on different machines. The performance tests were designed to determine the feasibility of SiteStory as a production-level solution for high fidelity automatic Web archiving. We found that SiteStory does not significantly affect content server performance when it is performing transactional archiving. Content server performance slows from 0.076 seconds to 0.086 seconds per Web page access when the content server is under load, and from 0.15 seconds to 0.21 seconds when the resource has many embedded and changing resources.
Submitted 5 October, 2012; v1 submitted 9 September, 2012;
originally announced September 2012.
-
Gamed-based iSTART Practice: From MiBoard to Self-Explanation Showdown
Authors:
Justin F. Brunelle,
G. Tanner Jackson,
Kyle Dempsey,
Chutima Boonthum,
Irwin B. Levinstein,
Danielle S. McNamara
Abstract:
MiBoard (Multiplayer Interactive Board Game) is an online, turn-based board game that was developed to assess the integration of game characteristics (point rewards, game-like interaction, and peer feedback) and how that might affect student engagement and learning efficacy. This online board game was designed to fit within the Extended Practice module of iSTART (Interactive Strategy Training for Active Reading and Thinking). Unfortunately, preliminary research shows that MiBoard actually reduces engagement and does not benefit the quality of student self-explanations when compared to the original Extended Practice module. Consequently, the MiBoard framework has been revamped to create Self-Explanation Showdown, a faster-paced, less analytically oriented game that adds competition to the creation of self-explanations. Students are evaluated on the quality of their self-explanations using the same assessment algorithms from the iSTART Extended Practice module (this includes both word-based and LSA-based assessments). The technical issues involved in the development of MiBoard and Self-Explanation Showdown are described. The lessons learned from the MiBoard experience are also discussed in this paper.
Submitted 11 September, 2010;
originally announced September 2010.
-
MiBoard: A Digital Game from a Physical World
Authors:
Kyle B Dempsey,
G. Tanner Jackson,
Justin F. Brunelle,
Michael Rowe,
Danielle S. McNamara
Abstract:
Increasing user engagement is a constant challenge for Intelligent Tutoring Systems researchers. A current trend in the ITS field is to increase engagement of proven learning systems by integrating them within games, or adding in game-like components. Incorporating proven learning methods within a game-based environment is expected to add to the overall experience without detracting from the original goals; however, the current study demonstrates two important issues with regard to ITS design. First, effective designs from the physical world do not always translate into the digital world. Second, games do not necessarily improve engagement, and in some cases, they may have the opposite effect. The current study discusses the development and a brief assessment of MiBoard, a multiplayer collaborative online board game designed to closely emulate a previously developed physical board game, iSTART: The Board Game.
Submitted 11 September, 2010;
originally announced September 2010.
-
MiBoard: Multiplayer Interactive Board Game
Authors:
Kyle B. Dempsey,
Justin F. Brunelle,
G. Tanner Jackson,
Chutima Boonthum,
Irwin B. Levinstein,
Danielle S. McNamara
Abstract:
Serious games have recently emerged as an avenue for curriculum delivery. Serious games incorporate motivation and entertainment while providing pointed curriculum for the user. This paper presents a serious game, called MiBoard, currently being developed from the iSTART Intelligent Tutoring System. MiBoard incorporates a multiplayer interaction that iSTART was previously unable to provide. This multiplayer interaction produces a wide variation across game trials, while also increasing the repeat playability for users. This paper presents a demonstration of the MiBoard system and the expectations for its application.
Submitted 11 September, 2010;
originally announced September 2010.
-
MiBoard: iSTART Metacognitive Training through Gaming
Authors:
Justin F. Brunelle,
Kyle B. Dempsey,
G. Tanner Jackson,
Chutima Boonthum,
Irwin B. Levinstein,
Danielle S. McNamara
Abstract:
MiBoard (Multiplayer Interactive Board Game) is an online, turn-based board game, which is a supplement of the iSTART (Interactive Strategy Training for Active Reading and Thinking) application. MiBoard is developed to test the hypothesis that integrating game characteristics (point rewards, game-like interaction, and peer feedback) into the iSTART trainer will significantly improve its effectiveness on students' learning. It was shown by M. Rowe that a physical board game did in fact enhance students' performance. MiBoard is a computer-based version of Rowe's board game that eliminates constraints on locality while retaining the crucial practice components that were the game's objective. MiBoard gives incentives for participation and provides a more enjoyable and social practice environment compared to the online individual practice component of the original trainer.
Submitted 11 September, 2010;
originally announced September 2010.
-
MiBoard: Metacognitive Training Through Gaming in iSTART
Authors:
Justin F. Brunelle,
Irwin B. Levinstein,
Chutima Boonthum
Abstract:
MiBoard (Multiplayer Interactive Board Game) is an online, turn-based board game, which is a supplement of the iSTART (Interactive Strategy Training for Active Reading and Thinking) application. MiBoard is developed to test the hypothesis that integrating game characteristics (point rewards, game-like interaction, and peer feedback) into the iSTART trainer will significantly improve its effectiveness on students' learning. It was shown by M. Rowe that a physical board game did in fact enhance students' performance. MiBoard is a computer-based version of Rowe's board game that eliminates constraints on locality while retaining the crucial practice components that were the game's objective. MiBoard gives incentives for participation and provides a more enjoyable and social practice environment compared to the online individual practice component of the original trainer.
Submitted 11 September, 2010;
originally announced September 2010.