Search | arXiv e-print repository

Combining Twitter and Mobile Phone Data to Observe Border-Rush: The Turkish-European Border Opening

Authors: Carlos Arcila Calderón, Bilgeçağ Aydoğdu, Tuba Bircan, Bünyamin Gündüz, Onur Önes, Albert Ali Salah, Alina Sîrbu

Abstract: Following Turkey's 2020 decision to revoke border controls, many individuals journeyed towards the Greek, Bulgarian, and Turkish borders. However, the lack of verifiable statistics on irregular migration and discrepancies between media reports and actual migration patterns require further exploration. The objective of this study is to bridge this knowledge gap by harnessing novel data sources, specifically mobile phone and Twitter data, to construct estimators of cross-border mobility and to cultivate a qualitative comprehension of the unfolding events. By employing a migration diplomacy framework, we analyse emergent mobility patterns at the border. Our findings demonstrate the potential of mobile phone data for quantitative metrics and Twitter data for qualitative understanding. We underscore the ethical implications of leveraging Big Data, particularly considering the vulnerability of the population under study. This underscores the imperative for exhaustive research into the socio-political facets of human mobility, with the aim of discerning the potentialities, limitations, and risks inherent in these data sources and their integration. This scholarly endeavour contributes to a more nuanced understanding of migration dynamics and paves the way for the formulation of regulations that preclude misuse and oppressive surveillance, thereby ensuring a more accurate representation of migration realities.

Submitted 22 May, 2024; v1 submitted 21 May, 2024; originally announced May 2024.

arXiv:2404.10175 [pdf, other]

PD-L1 Classification of Weakly-Labeled Whole Slide Images of Breast Cancer

Authors: Giacomo Cignoni, Cristian Scatena, Chiara Frascarelli, Nicola Fusco, Antonio Giuseppe Naccarato, Giuseppe Nicoló Fanelli, Alina Sîrbu

Abstract: Specific and effective breast cancer therapy relies on the accurate quantification of PD-L1 positivity in tumors, which appears in the form of brown stainings in high resolution whole slide images (WSIs). However, the retrieval and extensive labeling of PD-L1 stained WSIs is a time-consuming and challenging task for pathologists, resulting in low reproducibility, especially for borderline images. This study aims to develop and compare models able to classify PD-L1 positivity of breast cancer samples based on WSI analysis, relying only on WSI-level labels. The task consists of two phases: identifying regions of interest (ROI) and classifying tumors as PD-L1 positive or negative. For the latter, two model categories were developed, with different feature extraction methodologies. The first encodes images based on the colour distance from a base colour. The second uses a convolutional autoencoder to obtain embeddings of WSI tiles, and aggregates them into a WSI-level embedding. For both model types, features are fed into downstream ML classifiers. Two datasets from different clinical centers were used in two different training configurations: (1) training on one dataset and testing on the other; (2) combining the datasets. We also tested the performance with or without human preprocessing to remove brown artefacts. Colour distance based models achieve the best performances on testing configuration (1) with artefact removal, while autoencoder-based models are superior in the remaining cases, which are prone to greater data variability.

Submitted 15 April, 2024; originally announced April 2024.

ACM Class: J.3; I.4.9

arXiv:2204.14223 [pdf, other]

doi 10.5281/zenodo.6500885

Dataset of Multi-aspect Integrated Migration Indicators

Authors: D. Goglia, L. Pollacci, A. Sirbu

Abstract: Nowadays, new branches of research are proposing the use of non-traditional data sources for the study of migration trends in order to find an original methodology to answer open questions about the human mobility framework. In this context we present the Multi-aspect Integrated Migration Indicators (MIMI) dataset, a new dataset of migration drivers, resulting from the process of acquisition, transformation and merging of both official data about international flows and stocks and original indicators not typically used in migration studies, such as online social networks. This work describes the process of gathering, embedding and merging traditional and novel features, resulting in this new multidisciplinary dataset that we believe could significantly contribute to nowcasting and forecasting both present and future bilateral migration trends.

Submitted 2 May, 2022; v1 submitted 26 April, 2022; originally announced April 2022.

Comments: 23 pages, 19 figures, corrections on typos and references

arXiv:2204.10646 [pdf, other]

Measuring the Salad Bowl: Superdiversity on Twitter

Authors: Laura Pollacci, Alina Sirbu, Fosca Giannotti, Dino Pedreschi

Abstract: Superdiversity refers to large cultural diversity in a population due to immigration. In this paper, we introduce a superdiversity index based on the changes in the emotional content of words used by a multi-cultural community, compared to the standard language. To compute our index we use Twitter data and we develop an algorithm to extend a dictionary for lexicon-based sentiment analysis. We validate our index by comparing it with official immigration statistics available from the European Commission's Joint Research Center, through the D4I data challenge. We show that, in general, our measure correlates with immigration rates, at various geographical resolutions. Our method produces very good results across languages, being tested here both on English and Italian tweets. We argue that our index has predictive power in regions where exact data on immigration is not available, paving the way for a nowcasting model of immigration rates.

Submitted 22 April, 2022; originally announced April 2022.

arXiv:2103.03710 [pdf, other]

Characterising different communities of Twitter users: Migrants and natives

Authors: Jisu Kim, Alina Sîrbu, Fosca Giannotti, Giulio Rossetti

Abstract: Today, many users are actively using Twitter to express their opinions and to share information. Thanks to the availability of the data, researchers have studied behaviours and social networks of these users. International migration studies have also benefited from this social media platform to improve migration statistics. Although diverse types of social networks have been studied so far on Twitter, social networks of migrants and natives have not been studied before. This paper aims to fill this gap by studying characteristics and behaviours of migrants and natives on Twitter. To do so, we perform a general assessment of features including profiles and tweets, and an extensive analysis of the network. We find that migrants have more followers than friends. They have also tweeted more, even though both groups have similar account ages. More interestingly, the assortativity scores showed that users tend to connect based on nationality more than country of residence, and this is more the case for migrants than natives. Furthermore, both natives and migrants tend to connect mostly with natives.

Submitted 5 March, 2021; originally announced March 2021.

Comments: 17 pages, 12 figures

arXiv:2102.11398 [pdf, other]

Home and destination attachment: study of cultural integration on Twitter

Authors: Jisu Kim, Alina Sîrbu, Giulio Rossetti, Fosca Giannotti, Hillel Rapoport

Abstract: The cultural integration of immigrants conditions their overall socio-economic integration as well as natives' attitudes towards globalisation in general and immigration in particular. At the same time, excessive integration -- or acculturation -- can be detrimental in that it implies forfeiting one's ties to the home country and eventually translates into a loss of diversity (from the viewpoint of host countries) and of global connections (from the viewpoint of both host and home countries). Cultural integration can be described using two dimensions: the preservation of links to the home country and culture, which we call home attachment, and the creation of new links together with the adoption of cultural traits from the new residence country, which we call destination attachment. In this paper we introduce a means to quantify these two aspects based on Twitter data. We build home and destination attachment indexes and analyse their possible determinants (e.g., language proximity, distance between countries), also in relation to Hofstede's cultural dimension scores. The results stress the importance of host language proficiency to explain destination attachment, but also the link between language and home attachment. In particular, the common language between home and destination countries corresponds to increased home attachment, as does low proficiency in the host language. Common geographical borders also seem to increase both home and destination attachment.
Regarding cultural dimensions, larger differences among home and destination country in terms of Individualism, Masculinity and Uncertainty appear to correspond to larger destination attachment and lower home attachment.

Submitted 22 February, 2021; originally announced February 2021.

Comments: 19 pages, 10 figures and 4 tables

arXiv:2007.14241 [pdf, other]

doi 10.1016/j.future.2019.11.029

A Machine Learning Approach to Online Fault Classification in HPC Systems

Authors: Alessio Netti, Zeynep Kiziltan, Ozalp Babaoglu, Alina Sirbu, Andrea Bartolini, Andrea Borghesi

Abstract: As High-Performance Computing (HPC) systems strive towards the exascale goal, failure rates both at the hardware and software levels will increase significantly. Thus, detecting and classifying faults in HPC systems as they occur and initiating corrective actions before they can transform into failures becomes essential for continued operation. Central to this objective is fault injection, which is the deliberate triggering of faults in a system so as to observe their behavior in a controlled environment. In this paper, we propose a fault classification method for HPC systems based on machine learning. The novelty of our approach rests with the fact that it can be operated on streamed data in an online manner, thus opening the possibility to devise and enact control actions on the target system in real-time. We introduce a high-level, easy-to-use fault injection tool called FINJ, with a focus on the management of complex experiments. In order to train and evaluate our machine learning classifiers, we inject faults to an in-house experimental HPC system using FINJ, and generate a fault dataset which we describe extensively. Both FINJ and the dataset are publicly available to facilitate resiliency research in the HPC systems field. Experimental results demonstrate that our approach allows almost perfect classification accuracy to be reached for different fault types with low computational overhead and minimal delay.

Submitted 27 July, 2020; originally announced July 2020.

Comments: arXiv admin note: text overlap with arXiv:1807.10056, arXiv:1810.11208

Journal ref: Future Generation Computer Systems, Volume 110, September 2020, Pages 1009-1022

arXiv:1810.11208 [pdf, other]

Online Fault Classification in HPC Systems through Machine Learning

Authors: Alessio Netti, Zeynep Kiziltan, Ozalp Babaoglu, Alina Sirbu, Andrea Bartolini, Andrea Borghesi

Abstract: As High-Performance Computing (HPC) systems strive towards the exascale goal, studies suggest that they will experience excessive failure rates. For this reason, detecting and classifying faults in HPC systems as they occur and initiating corrective actions before they can transform into failures will be essential for continued operation. In this paper, we propose a fault classification method for HPC systems based on machine learning that has been designed specifically to operate with live streamed data. We cast the problem and its solution within realistic operating constraints of online use. Our results show that almost perfect classification accuracy can be reached for different fault types with low computational overhead and minimal delay. We have based our study on a local dataset, which we make publicly available, that was acquired by injecting faults to an in-house experimental HPC system.

Submitted 11 July, 2019; v1 submitted 26 October, 2018; originally announced October 2018.

Comments: Accepted for publication at the Euro-Par 2019 conference

arXiv:1807.10056 [pdf, other]

FINJ: A Fault Injection Tool for HPC Systems

Authors: Alessio Netti, Zeynep Kiziltan, Ozalp Babaoglu, Alina Sirbu, Andrea Bartolini, Andrea Borghesi

Abstract: We present FINJ, a high-level fault injection tool for High-Performance Computing (HPC) systems, with a focus on the management of complex experiments. FINJ provides support for custom workloads and allows generation of anomalous conditions through the use of fault-triggering executable programs. FINJ can also be integrated seamlessly with most other lower-level fault injection tools, allowing users to create and monitor a variety of highly-complex and diverse fault conditions in HPC systems that would be difficult to recreate in practice. FINJ is suitable for experiments involving many, potentially interacting nodes, making it a very versatile design and evaluation tool.

Submitted 1 September, 2018; v1 submitted 26 July, 2018; originally announced July 2018.

Comments: To be presented at the 11th Resilience Workshop in the 2018 Euro-Par conference

arXiv:1803.02111 [pdf, other]

doi 10.1371/journal.pone.0213246

Algorithmic bias amplifies opinion polarization: A bounded confidence model

Authors: Alina Sîrbu, Dino Pedreschi, Fosca Giannotti, János Kertész

Abstract: The flow of information reaching us via the online media platforms is optimized not by the information content or relevance but by popularity and proximity to the target. This is typically performed in order to maximise platform usage. As a side effect, this introduces an algorithmic bias that is believed to enhance polarization of the societal debate. To study this phenomenon, we modify the well-known continuous opinion dynamics model of bounded confidence in order to account for the algorithmic bias and investigate its consequences. In the simplest version of the original model the pairs of discussion participants are chosen at random and their opinions get closer to each other if they are within a fixed tolerance level. We modify the selection rule of the discussion partners: there is an enhanced probability to choose individuals whose opinions are already close to each other, thus mimicking the behavior of online media which suggest interaction with similar peers. As a result we observe: a) an increased tendency towards polarization, which emerges also in conditions where the original model would predict convergence, and b) a dramatic slowing down of the speed at which the convergence at the asymptotic state is reached, which makes the system highly unstable. Polarization is augmented by a fragmented initial population.

Submitted 6 March, 2018; originally announced March 2018.
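
The partner-selection change described in the abstract above can be illustrated with a small simulation. This is a minimal sketch, not the authors' code: the abstract only says that close opinions are chosen with enhanced probability, so the power-law weighting with exponent gamma, the small offset eps, and every function and parameter name here are assumptions made for illustration. With gamma = 0 the sketch reduces to uniform random pairing, as in the original bounded confidence model.

```python
import random


def biased_partner(i, opinions, gamma, rng, eps=1e-6):
    """Pick a partner for agent i, favouring agents with close opinions.

    Assumed weighting: (|x_i - x_j| + eps) ** (-gamma). gamma = 0 gives
    uniform choice; keep gamma moderate to avoid float overflow.
    """
    weights = [
        0.0 if j == i else (abs(opinions[i] - opinions[j]) + eps) ** (-gamma)
        for j in range(len(opinions))
    ]
    return rng.choices(range(len(opinions)), weights=weights, k=1)[0]


def step(opinions, tolerance, mu, gamma, rng):
    """One interaction: the pair moves closer only if within the tolerance."""
    i = rng.randrange(len(opinions))
    j = biased_partner(i, opinions, gamma, rng)
    oi, oj = opinions[i], opinions[j]
    if abs(oi - oj) < tolerance:
        opinions[i] = oi + mu * (oj - oi)
        opinions[j] = oj + mu * (oi - oj)


def simulate(n=100, steps=20000, tolerance=0.3, mu=0.5, gamma=1.0, seed=0):
    """Run the sketch from uniformly random initial opinions in [0, 1)."""
    rng = random.Random(seed)
    opinions = [rng.random() for _ in range(n)]
    for _ in range(steps):
        step(opinions, tolerance, mu, gamma, rng)
    return opinions


if __name__ == "__main__":
    for g in (0.0, 1.0):
        final = simulate(gamma=g)
        clusters = sorted({round(o, 1) for o in final})
        print(f"gamma={g}: opinions rounded to one decimal -> {clusters}")
```

Comparing runs with gamma = 0 and gamma > 0 for the same tolerance gives a rough feel for the effect; the quantitative results reported in the abstract come from the paper's own model and experiments, not from this sketch.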

arXiv:1802.04830 [pdf, other]

Prediction of next career moves from scientific profiles

Authors: Charlotte James, Luca Pappalardo, Alina Sirbu, Filippo Simini

Abstract: Changing institution is a scientist's key career decision, which plays an important role in education, scientific productivity, and the generation of scientific knowledge. Yet, our understanding of the factors influencing a relocation decision is very limited. In this paper we investigate how the scientific profile of a scientist determines their decision to move (i.e., change institution). To this aim, we describe a scientist's profile by three main aspects: the scientist's recent scientific career, the quality of their scientific environment and the structure of their scientific collaboration network. We then design and implement a two-stage predictive model: first, we use data mining to predict which researcher will move in the next year on the basis of their scientific profile; second, we predict which institution they will choose by using a novel social-gravity model, an adaptation of the traditional gravity model of human mobility. Experiments on a massive dataset of scientific publications show that our approach performs well in both the stages, resulting in an 85% reduction of the prediction error with respect to the state-of-the-art approaches.

Submitted 13 February, 2018; originally announced February 2018.
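
For readers unfamiliar with the traditional gravity model of human mobility mentioned in the abstract above, the sketch below shows its usual form: the attractiveness of a destination grows with its "mass" and decays with distance. It does not reproduce the paper's social-gravity adaptation, which the abstract does not specify. The institution names, the masses, the distances and the exponent beta are all hypothetical values chosen for illustration.

```python
def gravity_probabilities(origin, masses, distances, beta=2.0):
    """Destination probabilities under a plain gravity model.

    score(dest) = mass(dest) / distance(origin, dest) ** beta,
    normalised over all destinations other than the origin.
    """
    scores = {
        dest: masses[dest] / distances[(origin, dest)] ** beta
        for dest in masses
        if dest != origin
    }
    total = sum(scores.values())
    return {dest: score / total for dest, score in scores.items()}


# Hypothetical institutions: mass could be, for example, staff size.
masses = {"A": 100.0, "B": 200.0, "C": 50.0}
# Hypothetical distances from A, in arbitrary units.
distances = {("A", "B"): 10.0, ("A", "C"): 5.0}

probabilities = gravity_probabilities("A", masses, distances)
print(probabilities)  # B and C are equally likely: 200/10**2 == 50/5**2
```

In this toy setting the larger but more distant institution B and the smaller but closer institution C end up with the same probability, which is the trade-off the gravity form encodes.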

arXiv:1801.05854 [pdf, other]

doi 10.1007/s41060-017-0086-6

NDlib: a Python Library to Model and Analyze Diffusion Processes Over Complex Networks

Authors: Giulio Rossetti, Letizia Milli, Salvatore Rinzivillo, Alina Sirbu, Fosca Giannotti, Dino Pedreschi

Abstract: Nowadays the analysis of dynamics of and on networks represents a hot topic in the Social Network Analysis playground. To support students, teachers, developers and researchers in this work we introduce a novel framework, namely NDlib, an environment designed to describe diffusion simulations. NDlib is designed to be a multi-level ecosystem that can be fruitfully used by different user segments. For this reason, upon NDlib, we designed a simulation server that allows remote execution of experiments as well as an online visualization tool that abstracts its programmatic interface and makes available the simulation platform to non-technicians.

Submitted 14 December, 2017; originally announced January 2018.

MSC Class: 05C85; 60J60; 90C35

ACM Class: G.2.2; F.2.1

Journal ref: International Journal of Data Science and Analytics, 2018

arXiv:1606.04456 [pdf, other]

doi 10.1007/s10586-016-0564-y

Towards Operator-less Data Centers Through Data-Driven, Predictive, Proactive Autonomics

Authors: Alina Sîrbu, Ozalp Babaoglu

Abstract: Continued reliance on human operators for managing data centers is a major impediment for them from ever reaching extreme dimensions. Large computer systems in general, and data centers in particular, will ultimately be managed using predictive computational and executable models obtained through data-science tools, and at that point, the intervention of humans will be limited to setting high-level goals and policies rather than performing low-level operations. Data-driven autonomics, where management and control are based on holistic predictive models that are built and updated using live data, opens one possible path towards limiting the role of operators in data centers. In this paper, we present a data-science study of a public Google dataset collected in a 12K-node cluster with the goal of building and evaluating predictive models for node failures. Our results support the practicality of a data-driven approach by showing the effectiveness of predictive models based on data found in typical data center logs. We use BigQuery, the big data SQL platform from the Google Cloud suite, to process massive amounts of data and generate a rich feature set characterizing node state over time. We describe how an ensemble classifier can be built out of many Random Forest classifiers each trained on these features, to predict if nodes will fail in a future 24-hour window.
Our evaluation reveals that if we limit false positive rates to 5%, we can achieve true positive rates between 27% and 88% with precision varying between 50% and 72%. This level of performance allows us to recover a large fraction of jobs' executions (by redirecting them to other nodes when a failure of the present node is predicted) that would otherwise have been wasted due to failures. [...]

Submitted 14 June, 2016; originally announced June 2016.

Journal ref: Cluster Computing, Volume 19, Issue 2, pp 865-878, 2016
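
The evaluation quoted in the abstract above fixes a ceiling on the false positive rate and then reports the true positive rate and precision obtained under it. The sketch below shows one way such figures can be computed from classifier scores; it is an illustration of the metric definitions only, not the authors' evaluation code, and the function name, the score values and the labels are made up.

```python
def rates_at_fpr(scores, labels, max_fpr=0.05):
    """Find the lowest score threshold whose false positive rate stays within max_fpr.

    labels use 1 for "node fails" and 0 for "node is fine"; a node is
    predicted to fail when its score is >= threshold. Returns None if even
    the highest threshold exceeds max_fpr.
    """
    negatives = sum(1 for y in labels if y == 0)
    positives = len(labels) - negatives
    if negatives == 0 or positives == 0:
        raise ValueError("both classes must be present")

    best = None
    # Lowering the threshold can only add predicted failures, so the false
    # positive rate never decreases and we can stop at the first violation.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        if fp / negatives > max_fpr:
            break
        best = {
            "threshold": threshold,
            "tpr": tp / positives,
            "fpr": fp / negatives,
            "precision": tp / (tp + fp),
        }
    return best


# Made-up scores for eight nodes, three of which actually failed.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 0, 0]
print(rates_at_fpr(scores, labels, max_fpr=0.25))
```

Because failures are rare in practice, precision can be modest even at a low false positive rate, which is why the abstract reports precision alongside the true positive rate.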

arXiv:1605.09530 [pdf, other]

Predicting System-level Power for a Hybrid Supercomputer

Authors: Alina Sîrbu, Ozalp Babaoglu

Abstract: For current High Performance Computing systems to scale towards the holy grail of ExaFLOP performance, their power consumption has to be reduced by at least one order of magnitude. This goal can be achieved only through a combination of hardware and software advances. Being able to model and accurately predict the power consumption of large computational systems is necessary for software-level innovations such as proactive and power-aware scheduling, resource allocation and fault tolerance techniques. In this paper we present a 2-layer model of power consumption for a hybrid supercomputer (which held the top spot of the Green500 list in July 2013) that combines CPU, GPU and MIC technologies to achieve higher energy efficiency. Our model takes as input workload information - the number and location of resources that are used by each job at a certain time - and calculates the resulting system-level power consumption. When jobs are submitted to the system, the workload configuration can be foreseen based on the scheduler policies, and our model can then be applied to predict the ensuing system-level power consumption. Additionally, alternative workload configurations can be evaluated from a power perspective and more efficient ones can be selected. Applications of the model include not only power-aware scheduling but also prediction of anomalous behavior.

Submitted 31 May, 2016; originally announced May 2016.

Comments: 8 pages, 8 figures, HPCS 2016

arXiv:1605.06326 [pdf, other]

doi 10.1007/978-3-319-25658-0_17

Opinion dynamics: models, extensions and external effects

Authors: Alina Sîrbu, Vittorio Loreto, Vito D. P. Servedio, Francesca Tria

Abstract: Recently, social phenomena have received a lot of attention not only from social scientists, but also from physicists, mathematicians and computer scientists, in the emerging interdisciplinary field of complex system science. Opinion dynamics is one of the processes studied, since opinions are the drivers of human behaviour, and play a crucial role in many global challenges that our complex world and societies are facing: global financial crises, global pandemics, growth of cities, urbanisation and migration patterns, and last but not least important, climate change and environmental sustainability and protection. Opinion formation is a complex process affected by the interplay of different elements, including the individual predisposition, the influence of positive and negative peer interaction (social networks playing a crucial role in this respect), the information each individual is exposed to, and many others. Several models inspired from those in use in physics have been developed to encompass many of these elements, and to allow for the identification of the mechanisms involved in the opinion formation process and the understanding of their role, with the practical aim of simulating opinion formation and spreading under various conditions. These modelling schemes range from binary simple models such as the voter model, to multi-dimensional continuous approaches.
Here, we provide a review of recent methods, focusing on models employing both peer interaction and external information, and emphasising the role that less studied mechanisms, such as disagreement, have in driving the opinion dynamics. [...]

Submitted 20 May, 2016; originally announced May 2016.

Comments: 42 pages, 6 figures

Journal ref: Participatory Sensing, Opinions and Collective Awareness, 363-401, Understanding Complex Systems, Springer, 2016

arXiv:1601.05961 [pdf, other]

Power Consumption Modeling and Prediction in a Hybrid CPU-GPU-MIC Supercomputer (preliminary version)

Authors: Alina Sîrbu, Ozalp Babaoglu

Abstract: Power consumption is a major obstacle for High Performance Computing (HPC) systems in their quest towards the holy grail of ExaFLOP performance. Significant advances in power efficiency have to be made before this goal can be attained and accurate modeling is an essential step towards power efficiency by optimizing system operating parameters to match dynamic energy needs. In this paper we present a study of power consumption by jobs in Eurora, a hybrid CPU-GPU-MIC system installed at the largest Italian data center. Using data from a dedicated monitoring framework, we build a data-driven model of power consumption for each user in the system and use it to predict the power requirements of future jobs. We are able to achieve good prediction results for over 80% of the users in the system. For the remaining users, we identify possible reasons why prediction performance is not as good. Possible applications for our predictive modeling results include scheduling optimization, power-aware billing and system-scale power modeling. All the scripts used for the study have been made available on GitHub.

Submitted 31 May, 2016; v1 submitted 22 January, 2016; originally announced January 2016.

Comments: 13 pages, 4 figures, 2 tables, Euro-Par 2016

arXiv:1509.00773 [pdf, other]

doi 10.1007/s00607-015-0480-7

A Big Data Analyzer for Large Trace Logs

Authors: Alkida Balliu, Dennis Olivetti, Ozalp Babaoglu, Moreno Marzolla, Alina Sîrbu

Abstract: The current generation of Internet-based services is typically hosted on large data centers that take the form of warehouse-size structures housing tens of thousands of servers. Continued availability of a modern data center is the result of a complex orchestration among many internal and external actors including computing hardware, multiple layers of intricate software, networking and storage devices, electrical power and cooling plants. During the course of their operation, many of these components produce large amounts of data in the form of event and error logs that are essential not only for identifying and resolving problems but also for improving data center efficiency and management. Most of these activities would benefit significantly from data analytics techniques to exploit hidden statistical patterns and correlations that may be present in the data. The sheer volume of data to be analyzed makes uncovering these correlations and patterns a challenging task. This paper presents BiDAl, a prototype Java tool for log-data analysis that incorporates several Big Data technologies in order to simplify the task of extracting information from data traces produced by large clusters and server farms. BiDAl provides the user with several analysis languages (SQL, R and Hadoop MapReduce) and storage backends (HDFS and SQLite) that can be freely mixed and matched so that a custom tool for a specific task can be easily constructed. BiDAl has a modular architecture so that it can be extended with other backends and analysis languages in the future.
In this paper we present the design of BiDAl and describe our experience using it to analyze publicly-available traces from Google data clusters, with the goal of building a realistic model of a complex data center.

Submitted 2 September, 2015; originally announced September 2015.

Comments: 26 pages, 10 figures

Journal ref: Computing, 98(12), Dec 2016, pp. 1225-1249

arXiv:1505.04935 [pdf, other]

Towards Data-Driven Autonomics in Data Centers

Authors: Alina Sîrbu, Ozalp Babaoglu

Abstract: Continued reliance on human operators for managing data centers is a major impediment for them from ever reaching extreme dimensions. Large computer systems in general, and data centers in particular, will ultimately be managed using predictive computational and executable models obtained through data-science tools, and at that point, the intervention of humans will be limited to setting high-level goals and policies rather than performing low-level operations. Data-driven autonomics, where management and control are based on holistic predictive models that are built and updated using generated data, opens one possible path towards limiting the role of operators in data centers. In this paper, we present a data-science study of a public Google dataset collected in a 12K-node cluster with the goal of building and evaluating a predictive model for node failures. We use BigQuery, the big data SQL platform from the Google Cloud suite, to process massive amounts of data and generate a rich feature set characterizing machine state over time. We describe how an ensemble classifier can be built out of many Random Forest classifiers each trained on these features, to predict if machines will fail in a future 24-hour window. Our evaluation reveals that if we limit false positive rates to 5%, we can achieve true positive rates between 27% and 88% with precision varying between 50% and 72%.
We discuss the practicality of including our predictive model as the central component of a data-driven autonomic manager and operating it on-line with live data streams (rather than off-line on data logs). All of the scripts used for BigQuery and classification analyses are publicly available from the authors' website.

Submitted 6 July, 2015; v1 submitted 19 May, 2015; originally announced May 2015.

Comments: 12 pages, 6 figures

arXiv:1410.4449 [pdf, other]

A Holistic Approach to Log Data Analysis in High-Performance Computing Systems: The Case of IBM Blue Gene/Q

Authors: Alina Sîrbu, Ozalp Babaoglu

Abstract: The complexity and cost of managing high-performance computing infrastructures are on the rise. Automating management and repair through predictive models to minimize human interventions is an attempt to increase system availability and contain these costs. Building predictive models that are accurate enough to be useful in automatic management cannot be based on restricted log data from subsystems but requires a holistic approach to data analysis from disparate sources. Here we provide a detailed multi-scale characterization study based on four datasets reporting power consumption, temperature, workload, and hardware/software events for an IBM Blue Gene/Q installation. We show that the system runs a rich parallel workload, with low correlation among its components in terms of temperature and power, but higher correlation in terms of events. As expected, power and temperature correlate strongly, while events display negative correlations with load and power. Power and workload show moderate correlations, and only at the scale of components. The aim of the study is a systematic, integrated characterization of the computing infrastructure and discovery of correlation sources and levels to serve as basis for future predictive modeling efforts.

Submitted 7 September, 2015; v1 submitted 16 October, 2014; originally announced October 2014.

Comments: 12 pages, 7 Figures

arXiv:1410.1309 [pdf, other]

BiDAl: Big Data Analyzer for Cluster Traces

Authors: Alkida Balliu, Dennis Olivetti, Ozalp Babaoglu, Moreno Marzolla, Alina Sîrbu

Abstract: Modern data centers that provide Internet-scale services are stadium-size structures housing tens of thousands of heterogeneous devices (server clusters, networking equipment, power and cooling infrastructures) that must operate continuously and reliably. As part of their operation, these devices produce large amounts of data in the form of event and error logs that are essential not only for identifying problems but also for improving data center efficiency and management. These activities employ data analytics and often exploit hidden statistical patterns and correlations among different factors present in the data. Uncovering these patterns and correlations is challenging due to the sheer volume of data to be analyzed. This paper presents BiDAl, a prototype "log-data analysis framework" that incorporates various Big Data technologies to simplify the analysis of data traces from large clusters. BiDAl is written in Java with a modular and extensible architecture so that different storage backends (currently, HDFS and SQLite are supported), as well as different analysis languages (current implementation supports SQL, R and Hadoop MapReduce) can be easily selected as appropriate. We present the design of BiDAl and describe our experience using it to analyze several public traces of Google data clusters for building a simulation model capable of reproducing observed behavior.

Submitted 6 October, 2014; originally announced October 2014.

Comments: published in E. Plödereder, L. Grunske, E. Schneider, D. Ull (editors), proc. INFORMATIK 2014 Workshop on System Software Support for Big Data (BigSys 2014), September 25–26 2014, Stuttgart, Germany, Lecture Notes in Informatics (LNI) Proceedings, Series of the Gesellschaft für Informatik (GI), Volume P-232, pp. 1781–1795, ISBN 978-3-88579-626-8, ISSN 1617-5468

Journal ref: proc. INFORMATIK 2014 Workshop on System Software Support for Big Data (BigSys 2014), Lecture Notes in Informatics (LNI), Volume P-232, pp. 1781-1795, ISBN 978-3-88579-626-8, ISSN 1617-5468

arXiv:1401.4578 [pdf, other]

doi 10.1109/CGC.2013.69

XTribe: a web-based social computation platform

Authors: Saverio Caminiti, Claudio Cicali, Pietro Gravino, Vittorio Loreto, Vito D. P. Servedio, Alina Sîrbu, Francesca Tria

Abstract: In the last few years the Web has progressively acquired the status of an infrastructure for social computation that allows researchers to coordinate the cognitive abilities of human agents in on-line communities so to steer the collective user activity towards predefined goals. This general trend is also triggering the adoption of web-games as a very interesting laboratory to run experiments in the social sciences and whenever the contribution of human beings is crucially required for research purposes. Nowadays, while the number of on-line users has been steadily growing, there is still a need of systematization in the approach to the web as a laboratory. In this paper we present Experimental Tribe (XTribe in short), a novel general purpose web-based platform for web-gaming and social computation. Ready to use and already operational, XTribe aims at drastically reducing the effort required to develop and run web experiments. XTribe has been designed to speed up the implementation of those general aspects of web experiments that are independent of the specific experiment content. For example, XTribe takes care of user management by handling their registration and profiles and in case of multi-player games, it provides the necessary user grouping functionalities. XTribe also provides communication facilities to easily achieve both bidirectional and asynchronous communication.
From a practical point of view, researchers are left with the only task of designing and implementing the game interface and logic of their experiment, on which they maintain full control. Moreover, XTribe acts as a repository of different scientific experiments, thus realizing a sort of showcase that stimulates users' curiosity, enhances their participation, and helps researchers in recruiting volunteers.

Submitted 18 January, 2014; originally announced January 2014.

Comments: 11 pages, 2 figures, 1 table, 2013 Third International Conference on Cloud and Green Computing (CGC), Sept. 30 2013-Oct. 2 2013, Karlsruhe, Germany

Journal ref: IEEE Xplore, Cloud and Green Computing (CGC), 2013 Third International Conference on, 397-403 (2013)

arXiv:1302.4872 [pdf, other]

doi 10.1142/S0219525913500355

Cohesion, consensus and extreme information in opinion dynamics

Authors: Alina Sîrbu, Vittorio Loreto, Vito D. P. Servedio, Francesca Tria

Abstract: Opinion formation is an important element of social dynamics. It has been widely studied in the last years with tools from physics, mathematics and computer science. Here, a continuous model of opinion dynamics for multiple possible choices is analysed. Its main features are the inclusion of disagreement and possibility of modulating information, both from one and multiple sources. The interest is in identifying the effect of the initial cohesion of the population, the interplay between cohesion and information extremism, and the effect of using multiple sources of information that can influence the system. Final consensus, especially with external information, depends highly on these factors, as numerical simulations show. When no information is present, consensus or segregation is determined by the initial cohesion of the population. Interestingly, when only one source of information is present, consensus can be obtained, in general, only when this is extremely mild, i.e. there is not a single opinion strongly promoted, or in the special case of a large initial cohesion and low information exposure. On the contrary, when multiple information sources are allowed, consensus can emerge with an information source even when this is not extremely mild, i.e. it carries a strong message, for a large range of initial conditions.

Submitted 20 February, 2013; originally announced February 2013.

Comments: 20 pages, 11 figures

Journal ref: ADVANCES IN COMPLEX SYSTEMS online, 1350035 (2013)

arXiv:1212.0121 [pdf, other]

doi 10.1007/s10955-013-0724-x

Opinion dynamics with disagreement and modulated information

Authors: Alina Sîrbu, Vittorio Loreto, Vito D. P. Servedio, Francesca Tria

Abstract: Opinion dynamics concerns social processes through which populations or groups of individuals agree or disagree on specific issues. As such, modelling opinion dynamics represents an important research area that has been progressively acquiring relevance in many different domains. Existing approaches have mostly represented opinions through discrete binary or continuous variables by exploring a whole panoply of cases: e.g. independence, noise, external effects, multiple issues. In most of these cases the crucial ingredient is an attractive dynamics through which similar or similar enough agents get closer. Only rarely the possibility of explicit disagreement has been taken into account (i.e., the possibility for a repulsive interaction among individuals' opinions), and mostly for discrete or 1-dimensional opinions, through the introduction of additional model parameters. Here we introduce a new model of opinion formation, which focuses on the interplay between the possibility of explicit disagreement, modulated in a self-consistent way by the existing opinions' overlaps between the interacting individuals, and the effect of external information on the system. Opinions are modelled as a vector of continuous variables related to multiple possible choices for an issue. Information can be modulated to account for promoting multiple possible choices. Numerical results show that extreme information results in segregation and has a limited effect on the population, while milder messages have better success and a cohesion effect.
Additionally, the initial condition plays an important role, with the population forming one or multiple clusters based on the initial average similarity between individuals, with a transition point depending on the number of opinion choices.

Submitted 1 December, 2012; originally announced December 2012.

Journal ref: Journal of Statistical Physics, Volume 151, Issue 1-2, pp 218-237, 2013

Showing 1–23 of 23 results for author: Sîrbu, A