Smart meter data processing: a showcase for simple and efficient textual processing

Authors: Miguel Ferreira, André Neves, Rodrigo Gorjão, Carlos Cruz, Miguel L. Pardal

Abstract: The increase in the production and collection of data from devices is an ongoing trend due to the roll-out of more cyber-physical applications. Smart meters, because of their importance in power grids, are a class of such devices whose produced data requires meticulous processing. In this paper, we use Unicage, a data processing system based on classic Unix shell scripting, that delivers excellent performance in a simple package. We use this methodology to process smart meter data in XML format, subjected to the constraints posed by a real use case. We develop a solution that parses, validates and performs a simple aggregation of 27 million XML files in less than 10 minutes. We present a study of the solution as well as the benefits of its adoption.

Submitted 27 December, 2022; originally announced December 2022.

Comments: 11 pages, 5 figures, 1 table, 9 listings. Accepted after review for the 1st Workshop on High-Performance and Reliable Big Data (HPBD 2021), which was held virtually on September 20th 2021, and was co-located with the 40th International Symposium on Reliable Distributed Systems (SRDS 2021)

arXiv:2212.13647 [pdf, other]

Does Big Data Require Complex Systems? A Performance Comparison Between Spark and Unicage Shell Scripts

Authors: Duarte M. Nascimento, Miguel Ferreira, Miguel L. Pardal

Abstract: The paradigm of big data is characterized by the need to collect and process data sets of great volume, arriving at the systems with great velocity, in a variety of formats. Spark is a widely used big data processing system that can be integrated with Hadoop to provide powerful abstractions to developers, such as distributed storage through HDFS and resource management through YARN. When all the required configurations are made, Spark can also provide quality attributes, such as scalability, fault tolerance, and security. However, all of these benefits come at the cost of complexity, with high memory requirements, and additional latency in processing. An alternative approach is to use a lean software stack, like Unicage, that delegates most control back to the developer. In this work we evaluated the performance of big data processing with Spark versus Unicage, in a cluster environment hosted in the IBM Cloud. Two sets of experiments were performed: batch processing of unstructured data sets, and query processing of structured data sets. The input data sets were of significant size, ranging from 64 GB to 8192 GB in volume. The results show that the performance of Unicage scripts is superior to Spark for search workloads like grep and select, but that the abstractions of distributed storage and resource management from the Hadoop stack enable Spark to execute workloads with inter-record dependencies, such as sort and join, with correct outputs.

Submitted 27 December, 2022; originally announced December 2022.

Comments: 10 pages, 14 figures

arXiv:2208.04741 [pdf, other]

Lisbon Hotspots: Wi-Fi access point dataset for time-bound location proofs

Authors: Rui Claro, Samih Eisa, Miguel L. Pardal

Abstract: Wi-Fi hotspots are a valuable resource for people on the go, especially tourists, as they provide a means to connect personal devices to the Internet. This extra connectivity can be helpful in many situations, e.g., to enable map and chat applications to operate outdoors when cellular connectivity is unavailable or is expensive. Retail stores and many public services have recognized that hotspots have potential to attract and retain customers, so many of them offer free and open Wi-Fi. In busy cities, with many locals and visitors, the number of hotspots is very significant. Some of these hotspots are available for long periods of time, while others are short-lived. When we have many users with devices collecting hotspot observations, they can be used to detect the location -- using the long-lived hotspots -- and to prove the time when the location was visited -- using the short-lived hotspots observed by other users at the location. In this article, we present a dataset of collected Wi-Fi data from the most important tourist locations in the city of Lisbon, Portugal, over a period of months, that was used to show the feasibility of using hotspot data for location detection and proof. The obtained data and algorithms were assessed for a specific use case: smart tourism. We also present the data model used to store the observations and the algorithms developed to detect and prove location of a user device at a specific time. The Lisbon Hotspots dataset, LXspots, is made publicly available to the scientific community so that other researchers can also make use of it to develop new and innovative mobile and Internet of Things applications.

Submitted 5 August, 2022; originally announced August 2022.

Comments: 14 pages

ACM Class: E.1

arXiv:2208.00525 [pdf, other]

doi 10.1109/ACCESS.2023.3287405

Learning to generate Reliable Broadcast Algorithms

Authors: Diogo Vaz, David R. Matos, Miguel L. Pardal, Miguel Correia

Abstract: Modern distributed systems are supported by fault-tolerant algorithms, like Reliable Broadcast and Consensus, that assure the correct operation of the system even when some of the nodes of the system fail. However, the development of distributed algorithms is a manual and complex process, resulting in scientific papers that usually present a single algorithm or variations of existing ones. To automate the process of developing such algorithms, this work presents an intelligent agent that uses Reinforcement Learning to generate correct and efficient fault-tolerant distributed algorithms. We show that our approach is able to generate correct fault-tolerant Reliable Broadcast algorithms with the same performance as others available in the literature, in only 12,000 learning episodes.

Submitted 31 July, 2022; originally announced August 2022.