CRIU -- Checkpoint Restore in Userspace for computational simulations and scientific applications

Authors: Fabio Andrijauskas, Igor Sfiligoi, Diego Davila, Aashay Arora, Jonathan Guiang, Brian Bockelman, Greg Thain, Frank Wurthwein

Abstract: Creating new materials, discovering new drugs, and simulating systems are essential processes for research and innovation and require substantial computational power. While many applications can be split into many smaller independent tasks, some cannot and may take hours or weeks to run to completion. To better manage those longer-running jobs, it would be desirable to stop them at any arbitrary point in time and later continue their computation on another compute resource; this is usually referred to as checkpointing. While some applications can manage checkpointing programmatically, it would be preferable if the batch scheduling system could do that independently. This paper evaluates the feasibility of using CRIU (Checkpoint Restore in Userspace), an open-source tool for GNU/Linux environments, emphasizing the OSG's OSPool HTCondor setup. CRIU allows checkpointing the process state into a disk image and can deal with both open files and established network connections seamlessly. Furthermore, it can checkpoint traditional Linux processes and containerized workloads. The functionality seems adequate for many scenarios supported in the OSPool. However, some limitations prevent it from being usable in all circumstances.

Submitted 7 February, 2024; originally announced February 2024.

Comments: 26TH INTERNATIONAL CONFERENCE ON COMPUTING IN HIGH ENERGY & NUCLEAR PHYSICS - 2023

arXiv:2312.12589 [pdf, other]

400Gbps benchmark of XRootD HTTP-TPC

Authors: Aashay Arora, Jonathan Guiang, Diego Davila, Frank Würthwein, Justas Balcas, Harvey Newman

Abstract: Due to the increased demand of network traffic expected during the HL-LHC era, the T2 sites in the USA will be required to have 400Gbps of available bandwidth to their storage solution. With the above in mind we are pursuing a scale test of XRootD software when used to perform Third Party Copy transfers using the HTTP protocol. Our main objective is to understand the possible limitations in the software stack to achieve the target transfer rate; to that end we have set up a testbed of multiple XRootD servers in both UCSD and Caltech which are connected through a dedicated link capable of 400 Gbps end-to-end. Building upon our experience deploying containerized XRootD servers, we use Kubernetes to easily deploy and test different configurations of our testbed. In this work, we will present our experience doing these tests and the lessons learned.

Submitted 19 December, 2023; originally announced December 2023.

Comments: 8 pages, 4 figures, submitted to CHEP'23

arXiv:2307.11069 [pdf, other]

doi 10.1109/ICNC57223.2023.10074058

Effectiveness and predictability of in-network storage cache for scientific workflows

Authors: Caitlin Sim, Kesheng Wu, Alex Sim, Inder Monga, Chin Guok, Frank Wurthwein, Diego Davila, Harvey Newman, Justas Balcas

Abstract: Large scientific collaborations often have multiple scientists accessing the same set of files while doing different analyses, which creates repeated accesses to the large amounts of shared data located far away. These data accesses have long latency due to distance and occupy the limited bandwidth available over the wide-area network. To reduce the wide-area network traffic and the data access latency, regional data storage caches have been installed as a new networking service. To study the effectiveness of such a cache system in scientific applications, we examine the Southern California Petabyte Scale Cache for a high-energy physics experiment. By examining about 3TB of operational logs, we show that this cache removed 67.6% of file requests from the wide-area network and reduced the traffic volume on the wide-area network by 12.3TB (or 35.4%) on an average day. The reduction in the traffic volume (35.4%) is less than the reduction in file counts (67.6%) because the larger files are less likely to be reused. Due to this difference in data access patterns, the cache system has implemented a policy to avoid evicting smaller files when processing larger files. We also build a machine learning model to study the predictability of the cache behavior. Tests show that this model is able to accurately predict the cache accesses, cache misses, and network throughput, making the model useful for future studies on resource provisioning and planning.

Submitted 20 July, 2023; originally announced July 2023.

arXiv:2211.04656 [pdf, other]

MEVID: Multi-view Extended Videos with Identities for Video Person Re-Identification

Authors: Daniel Davila, Dawei Du, Bryon Lewis, Christopher Funk, Joseph Van Pelt, Roderick Collins, Kellie Corona, Matt Brown, Scott McCloskey, Anthony Hoogs, Brian Clipp

Abstract: In this paper, we present the Multi-view Extended Videos with Identities (MEVID) dataset for large-scale, video person re-identification (ReID) in the wild. To our knowledge, MEVID represents the most-varied video person ReID dataset, spanning an extensive indoor and outdoor environment across nine unique dates in a 73-day window, various camera viewpoints, and entity clothing changes. Specifically, we label the identities of 158 unique people wearing 598 outfits taken from 8,092 tracklets, with an average length of about 590 frames, seen in 33 camera views from the very large-scale MEVA person activities dataset. While other datasets have more unique identities, MEVID emphasizes a richer set of information about each individual, such as: 4 outfits/identity vs. 2 outfits/identity in CCVID, 33 viewpoints across 17 locations vs. 6 in 5 simulated locations for MTA, and 10 million frames vs. 3 million for LS-VID. Being based on the MEVA video dataset, we also inherit data that is intentionally demographically balanced to the continental United States. To accelerate the annotation process, we developed a semi-automatic annotation framework and GUI that combines state-of-the-art real-time models for object detection, pose estimation, person ReID, and multi-object tracking. We evaluate several state-of-the-art methods on MEVID challenge problems and comprehensively quantify their robustness in terms of changes of outfit, scale, and background location. Our quantitative analysis on the realistic, unique aspects of MEVID shows that there are significant remaining challenges in video person ReID and indicates important directions for future research.

Submitted 10 November, 2022; v1 submitted 8 November, 2022; originally announced November 2022.

Comments: This paper was accepted to WACV 2023

arXiv:2210.01503 [pdf]

Dynamic control of high-voltage actuator arrays by light-pattern projection on photoconductive switches

Authors: Vesna Bacheva, Amir Firouzeh, Edouard Leroy, Aiste Balciunaite, Diana Davila, Israel Gabay, Federico Paratore, Moran Bercovici, Herbert Shea, Govind Kaigala

Abstract: The ability to control high-voltage actuator arrays relies, to date, on expensive microelectronic processes or on individual wiring of each actuator to a single off-chip high-voltage switch. Here we present an alternative approach that uses on-chip photoconductive switches together with a light projection system to individually address high-voltage actuators. Each actuator is connected to one or more switches that are nominally OFF unless turned ON using direct light illumination. We selected hydrogenated amorphous silicon as our photoconductive material, and we provide complete characterization of its light-to-dark conductance, breakdown field, and spectral response. The resulting switches are very robust, and we provide full details of their fabrication processes. We demonstrate that the switches can be integrated in different architectures to support both AC and DC-driven actuators and provide engineering guidelines for their functional design. To demonstrate the versatility of our approach, we demonstrate the use of the photoconductive switches in two distinctly different applications: control of micrometer-sized gate electrodes for patterning flow fields in a microfluidic chamber, and control of centimeter-sized electrostatic actuators for creating mechanical deformations for haptic displays.

Submitted 4 October, 2022; originally announced October 2022.

arXiv:2209.13714 [pdf, other]

doi 10.1109/XLOOP56614.2022.00008

Managed Network Services for Exascale Data Movement Across Large Global Scientific Collaborations

Authors: Frank Würthwein, Jonathan Guiang, Aashay Arora, Diego Davila, John Graham, Dima Mishin, Thomas Hutton, Igor Sfiligoi, Harvey Newman, Justas Balcas, Tom Lehman, Xi Yang, Chin Guok

Abstract: Unique scientific instruments designed and operated by large global collaborations are expected to produce Exabyte-scale data volumes per year by 2030. These collaborations depend on globally distributed storage and compute to turn raw data into science. While all of these infrastructures have batch scheduling capabilities to share compute, Research and Education networks lack those capabilities. There is thus uncontrolled competition for bandwidth between and within collaborations. As a result, data "hogs" disk space at processing facilities for much longer than it takes to process, leading to vastly over-provisioned storage infrastructures. Integrated co-scheduling of networks as part of high-level managed workflows might reduce these storage needs by more than an order of magnitude. This paper describes such a solution, demonstrates its functionality in the context of the Large Hadron Collider (LHC) at CERN, and presents the next steps towards its use in production.

Submitted 27 September, 2022; originally announced September 2022.

Comments: Submitted to the proceedings of the XLOOP workshop held in conjunction with Supercomputing 22

arXiv:2205.05598 [pdf, other]

doi 10.1145/3526064.3534111

Studying Scientific Data Lifecycle in On-demand Distributed Storage Caches

Authors: Julian Bellavita, Alex Sim, Kesheng Wu, Inder Monga, Chin Guok, Frank Würthwein, Diego Davila

Abstract: The XRootD system is used to transfer, store, and cache large datasets from high-energy physics (HEP). In this study we focus on its capability as a distributed on-demand storage cache. Through exploring a large set of daily log files between 2020 and 2021, we seek to understand the data access patterns that might inform future cache design. Our study begins with a set of summary statistics regarding file read operations, file lifetimes, and file transfers. We observe that the number of read operations on each file remains nearly constant, while the average size of a read operation grows over time. Furthermore, files tend to have a consistent length of time during which they remain open and are in use. Based on this comprehensive study of the cache access statistics, we developed a cache simulator to explore the behavior of caches of different sizes. Within a certain size range, we find that increasing the XRootD cache size improves the cache hit rate, yielding faster overall file access. In particular, we find that increasing the cache size from 40TB to 56TB could increase the hit rate from 0.62 to 0.89, which is a significant increase in cache effectiveness for modest cost.

Submitted 11 May, 2022; originally announced May 2022.

arXiv:2205.05563 [pdf, other]

doi 10.1145/3526064.3534110

Access Trends of In-network Cache for Scientific Data

Authors: Ruize Han, Alex Sim, Kesheng Wu, Inder Monga, Chin Guok, Frank Würthwein, Diego Davila, Justas Balcas, Harvey Newman

Abstract: Scientific collaborations are increasingly relying on large volumes of data for their work and many of them employ tiered systems to replicate the data to their worldwide user communities. Each user in the community often selects a different subset of data for their analysis tasks; however, members of a research group often are working on related research topics that require similar data objects. Thus, there is a significant amount of data sharing possible. In this work, we study the access traces of a federated storage cache known as the Southern California Petabyte Scale Cache. By studying the access patterns and potential for network traffic reduction by this caching system, we aim to explore the predictability of the cache uses and the potential for more general in-network data caching. Our study shows that this distributed storage cache is able to reduce the network traffic volume by a factor of 2.35 during a part of the study period. We further show that machine learning models could predict cache utilization with an accuracy of 0.88. This demonstrates that such cache usage is predictable, which could be useful for managing complex networking resources such as in-network caching.

Submitted 11 May, 2022; originally announced May 2022.

arXiv:2203.09642 [pdf, other]

Cascade Transformers for End-to-End Person Search

Authors: Rui Yu, Dawei Du, Rodney LaLonde, Daniel Davila, Christopher Funk, Anthony Hoogs, Brian Clipp

Abstract: The goal of person search is to localize a target person from a gallery set of scene images, which is extremely challenging due to large scale variations, pose/viewpoint changes, and occlusions. In this paper, we propose the Cascade Occluded Attention Transformer (COAT) for end-to-end person search. Our three-stage cascade design focuses on detecting people in the first stage, while later stages simultaneously and progressively refine the representation for person detection and re-identification. At each stage the occluded attention transformer applies tighter intersection over union thresholds, forcing the network to learn coarse-to-fine pose/scale invariant features. Meanwhile, we calculate each detection's occluded attention to differentiate a person's tokens from other people or the background. In this way, we simulate the effect of other objects occluding a person of interest at the token-level. Through comprehensive experiments, we demonstrate the benefits of our method by achieving state-of-the-art performance on two benchmark datasets.

Submitted 17 March, 2022; originally announced March 2022.

Comments: Accepted to CVPR 2022. Code can be found at https://github.com/Kitware/COAT

arXiv:2203.08280 [pdf]

Data Transfer and Network Services management for Domain Science Workflows

Authors: Tom Lehman, Xi Yang, Chin Guok, Frank Wuerthwein, Igor Sfiligoi, John Graham, Aashay Arora, Dima Mishin, Diego Davila, Jonathan Guiang, Tom Hutton, Harvey Newman, Justas Balcas

Abstract: This paper describes a vision and work in progress to elevate network resources and data transfer management to the same level as compute and storage in the context of services access, scheduling, life cycle management, and orchestration. While domain science workflows often include active compute resource allocation and management, the data transfers and associated network resource coordination are not handled in a similar manner. As a result, data transfers can introduce a degree of uncertainty in workflow operations, and the associated lack of network information does not allow for either the workflow operations or the network use to be optimized. The net result is that domain science workflow processes are forced to view the network as an opaque infrastructure into which they inject data and hope that it emerges at the destination with an acceptable Quality of Experience. There is little ability for applications to interact with the network to exchange information, negotiate performance parameters, discover expected performance metrics, or receive status/troubleshooting information in real time. Developing mechanisms to allow an application workflow to obtain information regarding the network services, capabilities, and options, to a degree similar to what is possible for compute resources, is the primary motivation for this work. The initial focus is on the Open Science Grid (OSG)/Compact Muon Solenoid (CMS) Large Hadron Collider (LHC) workflows with Rucio/FTS/XRootD based data transfers and the interoperation with the ESnet SENSE (Software-Defined Network for End-to-end Networked Science at the Exascale) system.

Submitted 20 March, 2022; v1 submitted 15 March, 2022; originally announced March 2022.

Comments: contribution to Snowmass 2022

arXiv:2201.10366 [pdf, other]

ADAPT: An Open-Source sUAS Payload for Real-Time Disaster Prediction and Response with AI

Authors: Daniel Davila, Joseph VanPelt, Alexander Lynch, Adam Romlein, Peter Webley, Matthew S. Brown

Abstract: Small unmanned aircraft systems (sUAS) are becoming prominent components of many humanitarian assistance and disaster response (HADR) operations. Pairing sUAS with onboard artificial intelligence (AI) substantially extends their utility in covering larger areas with fewer support personnel. A variety of missions, such as search and rescue, assessing structural damage, and monitoring forest fires, floods, and chemical spills, can be supported simply by deploying the appropriate AI models. However, adoption by resource-constrained groups, such as local municipalities, regulatory agencies, and researchers, has been hampered by the lack of a cost-effective, readily-accessible baseline platform that can be adapted to their unique missions. To fill this gap, we have developed the free and open-source ADAPT multi-mission payload for deploying real-time AI and computer vision onboard a sUAS during local and beyond-line-of-sight missions. We have emphasized a modular design with low-cost, readily-available components, open-source software, and thorough documentation (https://kitware.github.io/adapt/). The system integrates an inertial navigation system, high-resolution color camera, computer, and wireless downlink to process imagery and broadcast georegistered analytics back to a ground station. Our goal is to make it easy for the HADR community to build their own copies of the ADAPT payload and leverage the thousands of hours of engineering we have devoted to developing and testing. In this paper, we detail the development and testing of the ADAPT payload. We demonstrate the example mission of real-time, in-flight ice segmentation to monitor river ice state and provide timely predictions of catastrophic flooding events. We deploy a novel active learning workflow to annotate river ice imagery, train a real-time deep neural network for ice segmentation, and demonstrate operation in the field.

Submitted 25 January, 2022; originally announced January 2022.

Comments: To be published in Workshop on Practical Deep Learning in the Wild at AAAI Conference on Artificial Intelligence 2022, 9 pages, 5 figures

arXiv:2110.13187 [pdf]

doi 10.1007/978-981-19-0898-9_20

Data intensive physics analysis in Azure cloud

Authors: Igor Sfiligoi, Frank Würthwein, Diego Davila

Abstract: The Compact Muon Solenoid (CMS) experiment at the Large Hadron Collider (LHC) is one of the largest data producers in the scientific world, with standard data products centrally produced, and then used by often competing teams within the collaboration. This work is focused on how a local institution, University of California San Diego (UCSD), partnered with the Open Science Grid (OSG) to use Azure cloud resources to augment its available computing to accelerate time to results for multiple analyses pursued by a small group of collaborators. The OSG is a federated infrastructure allowing many independent resource providers to serve many independent user communities in a transparent manner. Historically the resources would come from various research institutions, spanning small universities to large HPC centers, based on either community needs or grant allocations, so adding commercial clouds as resource providers is a natural evolution. The OSG technology allows for easy integration of cloud resources, but the data-intensive nature of CMS compute jobs required the deployment of additional data caching infrastructure to ensure high efficiency.

Submitted 25 October, 2021; originally announced October 2021.

Comments: 11 pages, 5 figures, to be published in proceedings of ICOCBI 2021

Journal ref: Lecture Notes on Data Engineering and Communications Technologies, vol 117. Springer, Singapore. 2022

arXiv:2105.00964 [pdf, other]

doi 10.1145/3452411.3464441

Analyzing scientific data sharing patterns for in-network data caching

Authors: Elizabeth Copps, Huiyi Zhang, Alex Sim, Kesheng Wu, Inder Monga, Chin Guok, Frank Würthwein, Diego Davila, Edgar Fajardo

Abstract: The volume of data moving through a network increases with new scientific experiments and simulations. Network bandwidth requirements also increase proportionally to deliver data within a certain time frame. We observe that a significant portion of the popular dataset is transferred multiple times to different users as well as to the same user for various reasons. In-network data caching for the shared data has been shown to reduce the redundant data transfers and consequently save network traffic volume. In addition, overall application performance is expected to improve with in-network caching because access to the locally cached data results in lower latency. This paper shows how much data was shared over the study period, how much network traffic volume was consequently saved, and how much the temporary in-network caching increased the scientific application performance. It also analyzes data access patterns in applications and the impacts of caching nodes on the regional data repository. From the results, we observed that the network bandwidth demand was reduced by nearly a factor of 3 over the study period.

Submitted 3 May, 2021; originally announced May 2021.

arXiv:2104.03091 [pdf]

doi 10.1021/acsnano.1c02932

Ultra-Thin Lubricant-Infused Vertical Graphene Nanoscaffolds for High-Performance Dropwise Condensation

Authors: Abinash Tripathy, Cheuk Wing Edmond Lam, Diana Davila, Matteo Donati, Athanasios Milionis, Chander Shekhar Sharma, Dimos Poulikakos

Abstract: Lubricant-infused surfaces (LIS) are highly efficient in repelling water and constitute a very promising family of materials for condensation processes occurring in a broad range of energy applications. However, the performance of LIS in such processes is limited by the inherent thermal resistance imposed by the thickness of the lubricant and supporting surface structure, as well as by the gradual depletion of the lubricant over time. Here we present a remarkable, ultra-thin (~70 nm) and conductive LIS architecture, obtained by infusing lubricant into a vertically grown graphene nanoscaffold on copper. The ultra-thin nature of the scaffold, combined with the high in-plane thermal conductivity of graphene, drastically minimizes earlier limitations, effectively doubling the heat transfer performance compared to a state-of-the-art CuO LIS surface. We show that the effect of the thermal resistance on the heat transfer performance of a LIS surface, although often overlooked, can be so detrimental that a simple nanostructured CuO surface can outperform a CuO LIS surface, despite film condensation on the former. The present vertical graphene LIS is also found to be resistant to lubricant depletion, maintaining stable dropwise condensation for at least ~7 hours with no significant change of advancing contact angle and contact angle hysteresis. The lubricant consumed by the vertical graphene LIS is 52.6% less than the existing state-of-the-art CuO LIS, also making the fabrication process more economical.

Submitted 7 April, 2021; originally announced April 2021.

Journal ref: ACS Nano 2021, 15, 9, 14305-14315

arXiv:2103.12116 [pdf, other]

doi 10.1051/epjconf/202125102001

Systematic benchmarking of HTTPS third party copy on 100Gbps links using XRootD

Authors: Edgar Fajardo, Aashay Arora, Diego Davila, Richard Gao, Frank Würthwein, Brian Bockelman

Abstract: The High Luminosity Large Hadron Collider provides a data challenge. The amount of data recorded from the experiments and transported to hundreds of sites will see a thirtyfold increase in annual data volume. This calls for a systematic approach to contrasting the performance of different Third Party Copy (TPC) transfer protocols. Two contenders, XRootD-HTTPS and GridFTP, are evaluated in their performance for transferring files from one server to another over 100Gbps interfaces. The benchmarking is done by scheduling pods on the Pacific Research Platform Kubernetes cluster to ensure reproducible and repeatable results. This opens a future pathway for network testing of any TPC transfer protocol.

Submitted 22 March, 2021; originally announced March 2021.

Comments: 7 pages, 8 figures

arXiv:2003.02319 [pdf, other]

doi 10.1051/epjconf/202024504042

Moving the California distributed CMS xcache from bare metal into containers using Kubernetes

Authors: Edgar Fajardo, Matevz Tadel, Justas Balcas, Alja Tadel, Frank Wuerthwein, Diego Davila, Jonathan Guiang, Igor Sfiligoi

Abstract: The University of California system has excellent networking between all of its campuses as well as a number of other Universities in CA, including Caltech, most of them being connected at 100 Gbps. UCSD and Caltech have thus joined their disk systems into a single logical xcache system, with worker nodes from both sites accessing data from disks at either site. This setup has been in place for a couple of years now and has been shown to work very well. Coherently managing nodes at multiple physical locations has, however, not been trivial, and we have been looking for ways to improve operations. With the Pacific Research Platform (PRP) now providing a Kubernetes resource pool spanning resources in the science DMZs of all the UC campuses, we have recently migrated the xcache services from being hosted bare-metal into containers. This paper presents our experience in both migrating to and operating in the new environment.

Submitted 4 March, 2020; originally announced March 2020.

Showing 1–16 of 16 results for author: Davila, D