-
Smart meter data processing: a showcase for simple and efficient textual processing
Authors:
Miguel Ferreira,
André Neves,
Rodrigo Gorjão,
Carlos Cruz,
Miguel L. Pardal
Abstract:
The increase in the production and collection of data from devices is an ongoing trend due to the roll-out of more cyber-physical applications. Smart meters, because of their importance in power grids, are a class of such devices whose produced data requires meticulous processing. In this paper, we use Unicage, a data processing system based on classic Unix shell scripting, that delivers excellent…
▽ More
The increase in the production and collection of data from devices is an ongoing trend due to the roll-out of more cyber-physical applications. Smart meters, because of their importance in power grids, are a class of such devices whose produced data requires meticulous processing. In this paper, we use Unicage, a data processing system based on classic Unix shell scripting, that delivers excellent performance in a simple package. We use this methodology to process smart meter data in XML format, subjected to the constraints posed by a real use case. We develop a solution that parses, validates and performs a simple aggregation of 27 million XML files in less than 10 minutes. We present a study of the solution as well as the benefits of its adoption.
△ Less
Submitted 27 December, 2022;
originally announced December 2022.
-
Does Big Data Require Complex Systems? A Performance Comparison Between Spark and Unicage Shell Scripts
Authors:
Duarte M. Nascimento,
Miguel Ferreira,
Miguel L. Pardal
Abstract:
The paradigm of big data is characterized by the need to collect and process data sets of great volume, arriving at the systems with great velocity, in a variety of formats. Spark is a widely used big data processing system that can be integrated with Hadoop to provide powerful abstractions to developers, such as distributed storage through HDFS and resource management through YARN. When all the r…
▽ More
The paradigm of big data is characterized by the need to collect and process data sets of great volume, arriving at the systems with great velocity, in a variety of formats. Spark is a widely used big data processing system that can be integrated with Hadoop to provide powerful abstractions to developers, such as distributed storage through HDFS and resource management through YARN. When all the required configurations are made, Spark can also provide quality attributes, such as scalability, fault tolerance, and security. However, all of these benefits come at the cost of complexity, with high memory requirements, and additional latency in processing. An alternative approach is to use a lean software stack, like Unicage, that delegates most control back to the developer. In this work we evaluated the performance of big data processing with Spark versus Unicage, in a cluster environment hosted in the IBM Cloud. Two sets of experiments were performed: batch processing of unstructured data sets, and query processing of structured data sets. The input data sets were of significant size, ranging from 64 GB to 8192 GB in volume. The results show that the performance of Unicage scripts is superior to Spark for search workloads like grep and select, but that the abstractions of distributed storage and resource management from the Hadoop stack enable Spark to execute workloads with inter-record dependencies, such as sort and join, with correct outputs.
△ Less
Submitted 27 December, 2022;
originally announced December 2022.
-
Lisbon Hotspots: Wi-Fi access point dataset for time-bound location proofs
Authors:
Rui Claro,
Samih Eisa,
Miguel L. Pardal
Abstract:
Wi-Fi hotspots are a valuable resource for people on the go, especially tourists, as they provide a means to connect personal devices to the Internet. This extra connectivity can be helpful in many situations, e.g., to enable map and chat applications to operate outdoors when cellular connectivity is unavailable or is expensive. Retail stores and many public services have recognized that hotspots…
▽ More
Wi-Fi hotspots are a valuable resource for people on the go, especially tourists, as they provide a means to connect personal devices to the Internet. This extra connectivity can be helpful in many situations, e.g., to enable map and chat applications to operate outdoors when cellular connectivity is unavailable or is expensive. Retail stores and many public services have recognized that hotspots have potential to attract and retain customers, so many of them offer free and open Wi-Fi. In busy cities, with many locals and visitors, the number of hotspots is very significant. Some of these hotspots are available for long periods of time, while others are short-lived. When we have many users with devices collecting hotspot observations, they can be used to detect the location -- using the long-lived hotspots -- and to prove the time when the location was visited -- using the short-lived hotspots observed by others users at the location.
In this article, we present a dataset of collected Wi-Fi data from the most important tourist locations in the city of Lisbon, Portugal, over a period of months, that was used to show the feasibility of using hotspot data for location detection and proof. The obtained data and algorithms were assessed for a specific use case: smart tourism. We also present the data model used to store the observations and the algorithms developed to detect and prove location of a user device at a specific time. The Lisbon Hotspots dataset, LXspots, is made publicly available to the scientific community so that other researchers can also make use of it to develop new and innovative mobile and Internet of Things applications.
△ Less
Submitted 5 August, 2022;
originally announced August 2022.
-
Learning to generate Reliable Broadcast Algorithms
Authors:
Diogo Vaz,
David R. Matos,
Miguel L. Pardal,
Miguel Correia
Abstract:
Modern distributed systems are supported by fault-tolerant algorithms, like Reliable Broadcast and Consensus, that assure the correct operation of the system even when some of the nodes of the system fail. However, the development of distributed algorithms is a manual and complex process, resulting in scientific papers that usually present a single algorithm or variations of existing ones. To auto…
▽ More
Modern distributed systems are supported by fault-tolerant algorithms, like Reliable Broadcast and Consensus, that assure the correct operation of the system even when some of the nodes of the system fail. However, the development of distributed algorithms is a manual and complex process, resulting in scientific papers that usually present a single algorithm or variations of existing ones. To automate the process of develo** such algorithms, this work presents an intelligent agent that uses Reinforcement Learning to generate correct and efficient fault-tolerant distributed algorithms. We show that our approach is able to generate correct fault-tolerant Reliable Broadcast algorithms with the same performance of others available in the literature, in only 12,000 learning episodes.
△ Less
Submitted 31 July, 2022;
originally announced August 2022.