Showing 1–16 of 16 results for author: Katsifodimos, A

  1. arXiv:2406.09534

    cs.DB cs.LG

    FeatNavigator: Automatic Feature Augmentation on Tabular Data

    Authors: Jiaming Liang, Chuan Lei, Xiao Qin, Jiani Zhang, Asterios Katsifodimos, Christos Faloutsos, Huzefa Rangwala

    Abstract: Data-centric AI focuses on understanding and utilizing high-quality, relevant data in training machine learning (ML) models, thereby increasing the likelihood of producing accurate and useful results. Automatic feature augmentation, aiming to augment the initial base table with useful features from other tables, is critical in data preparation as it improves model performance, robustness, and gene…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: 15 pages, 41 figures

  2. arXiv:2403.13629

    cs.DC cs.DB

    CheckMate: Evaluating Checkpointing Protocols for Streaming Dataflows

    Authors: George Siachamis, Kyriakos Psarakis, Marios Fragkoulis, Arie van Deursen, Paris Carbone, Asterios Katsifodimos

    Abstract: Stream processing in the last decade has seen broad adoption in both commercial and research settings. One key element for this success is the ability of modern stream processors to handle failures while ensuring exactly-once processing guarantees. At the moment of writing, virtually all stream processors that guarantee exactly-once processing implement a variant of Apache Flink's coordinated chec…

    Submitted 20 March, 2024; originally announced March 2024.

  3. arXiv:2403.07653

    cs.DB

    OmniMatch: Effective Self-Supervised Any-Join Discovery in Tabular Data Repositories

    Authors: Christos Koutras, Jiani Zhang, Xiao Qin, Chuan Lei, Vasileios Ioannidis, Christos Faloutsos, George Karypis, Asterios Katsifodimos

    Abstract: How can we discover join relationships among columns of tabular data in a data repository? Can this be done effectively when metadata is missing? Traditional column matching works mainly rely on similarity measures based on exact value overlaps, hence missing important semantics or failing to handle noise in the data. At the same time, recent dataset discovery methods focusing on deep table repres…

    Submitted 12 March, 2024; originally announced March 2024.

  4. arXiv:2312.06893

    cs.DC cs.DB

    Styx: Transactional Stateful Functions on Streaming Dataflows

    Authors: Kyriakos Psarakis, George Siachamis, George Christodoulou, Marios Fragkoulis, Asterios Katsifodimos

    Abstract: Developing stateful cloud applications, such as high-throughput/low-latency workflows and microservices with strict consistency requirements, remains arduous for programmers. The Stateful-Functions-as-a-Service (SFaaS) paradigm aims to serve these use cases. However, existing approaches either provide serializable transactional guarantees at the level of individual functions or separate applicat…

    Submitted 4 March, 2024; v1 submitted 11 December, 2023; originally announced December 2023.

  5. Leveraging Large Language Models for Sequential Recommendation

    Authors: Jesse Harte, Wouter Zorgdrager, Panos Louridas, Asterios Katsifodimos, Dietmar Jannach, Marios Fragkoulis

    Abstract: Sequential recommendation problems have received increasing attention in research during the past few years, leading to the inception of a large variety of algorithmic approaches. In this work, we explore how large language models (LLMs), which are nowadays introducing disruptive effects in many AI-based applications, can be used to build or improve sequential recommendation approaches. Specifical…

    Submitted 17 September, 2023; originally announced September 2023.

    Comments: 9 pages

    Report number: In Seventeenth ACM Conference on Recommender Systems (RecSys '23), September 18--22, 2023, Singapore, Singapore. ACM, New York, NY, USA

  6. Accelerating Machine Learning Queries with Linear Algebra Query Processing

    Authors: Wenbo Sun, Asterios Katsifodimos, Rihan Hai

    Abstract: The rapid growth of large-scale machine learning (ML) models has led numerous commercial companies to utilize ML models for generating predictive results to help business decision-making. As two primary components in traditional predictive pipelines, data processing and model predictions often operate in separate execution environments, leading to redundant engineering and computations. Additiona…

    Submitted 24 January, 2024; v1 submitted 14 June, 2023; originally announced June 2023.

  7. arXiv:2207.09315

    cs.LG cs.DB

    Metadata Representations for Queryable ML Model Zoos

    Authors: Ziyu Li, Rihan Hai, Alessandro Bozzon, Asterios Katsifodimos

    Abstract: Machine learning (ML) practitioners and organizations are building model zoos of pre-trained models, containing metadata describing properties of the ML models and datasets that are useful for reporting, auditing, reproducibility, and interpretability purposes. The metadata is currently not standardised; its expressivity is limited; and there is no interoperable way to store and query it. Conseque…

    Submitted 19 July, 2022; originally announced July 2022.

  8. arXiv:2206.12733

    cs.DB

    SiMa: Effective and Efficient Matching Across Data Silos Using Graph Neural Networks

    Authors: Christos Koutras, Rihan Hai, Kyriakos Psarakis, Marios Fragkoulis, Asterios Katsifodimos

    Abstract: How can we leverage existing column relationships within silos to predict similar ones across silos? Can we do this efficiently and effectively? Existing matching approaches do not exploit prior knowledge, relying on prohibitively expensive similarity computations. In this paper we present the first technique for matching columns across data silos, called SiMa, which leverages Graph Neural Networ…

    Submitted 3 March, 2024; v1 submitted 25 June, 2022; originally announced June 2022.

  9. arXiv:2205.09681

    cs.DB

    Amalur: Data Integration Meets Machine Learning

    Authors: Rihan Hai, Christos Koutras, Andra Ionescu, Ziyu Li, Wenbo Sun, Jessie van Schijndel, Yan Kang, Asterios Katsifodimos

    Abstract: The data needed for machine learning (ML) model training can reside in separate sites, often termed data silos. For data-intensive ML applications, data silos pose a major challenge: the integration and transformation of data demand a lot of manual work and computational resources. With data privacy and security constraints, data often cannot leave the local sites, and a model has to be…

    Submitted 1 March, 2023; v1 submitted 19 May, 2022; originally announced May 2022.

    Comments: Accepted at ICDE2023 -- Special track (Vision)

  10. arXiv:2112.00710

    cs.DC cs.DB

    Stateful Entities: Object-oriented Cloud Applications as Distributed Dataflows

    Authors: Kyriakos Psarakis, Wouter Zorgdrager, Marios Fragkoulis, Guido Salvaneschi, Asterios Katsifodimos

    Abstract: Although the cloud has reached a state of robustness, the burden of using its resources falls on the shoulders of programmers who struggle to keep up with ever-growing cloud infrastructure services and abstractions. As a result, state management, scaling, operation, and failure management of scalable cloud applications require disproportionately more effort than developing the applications' actua…

    Submitted 3 September, 2023; v1 submitted 17 November, 2021; originally announced December 2021.

  11. arXiv:2103.10169

    cs.DC cs.DB

    Hazelcast Jet: Low-latency Stream Processing at the 99.99th Percentile

    Authors: Can Gencer, Marko Topolnik, Viliam Ďurina, Emin Demirci, Ensar B. Kahveci, Ali Gürbüz, Ondřej Lukáš, József Bartók, Grzegorz Gierlach, František Hartman, Ufuk Yılmaz, Mehmet Doğan, Mohamed Mandouh, Marios Fragkoulis, Asterios Katsifodimos

    Abstract: Jet is an open-source, high-performance, distributed stream processor built at Hazelcast during the last five years. Jet was engineered with millisecond latency on the 99.99th percentile as its primary design goal. Originally Jet's purpose was to be an execution engine that performs complex business logic on top of streams generated by Hazelcast's In-memory Data Grid (IMDG): a set of high-performa…

    Submitted 18 March, 2021; originally announced March 2021.

  12. arXiv:2010.07386

    cs.DB

    Valentine: Evaluating Matching Techniques for Dataset Discovery

    Authors: Christos Koutras, George Siachamis, Andra Ionescu, Kyriakos Psarakis, Jerry Brons, Marios Fragkoulis, Christoph Lofi, Angela Bonifati, Asterios Katsifodimos

    Abstract: Data scientists today search large data lakes to discover and integrate datasets. In order to bring together disparate data sources, dataset discovery methods rely on some form of schema matching: the process of establishing correspondences between datasets. Traditionally, schema matching has been used to find matching pairs of columns between a source and a target schema. However, the use of sche…

    Submitted 13 February, 2021; v1 submitted 14 October, 2020; originally announced October 2020.

  13. arXiv:2008.00842

    cs.DC cs.CL cs.DB cs.PF

    A Survey on the Evolution of Stream Processing Systems

    Authors: Marios Fragkoulis, Paris Carbone, Vasiliki Kalavri, Asterios Katsifodimos

    Abstract: Stream processing has been an active research field for more than 20 years, but it is now witnessing its prime time due to recent successful efforts by the research community and numerous worldwide open-source communities. This survey provides a comprehensive overview of fundamental aspects of stream processing systems and their evolution in the functional areas of out-of-order data management, st…

    Submitted 14 January, 2023; v1 submitted 3 August, 2020; originally announced August 2020.

    Comments: 30 pages, 10 figures, 6 tables

  14. Benchmarking Distributed Stream Data Processing Systems

    Authors: Jeyhun Karimov, Tilmann Rabl, Asterios Katsifodimos, Roman Samarev, Henri Heiskanen, Volker Markl

    Abstract: The need for scalable and efficient stream analysis has led to the development of many open-source streaming data processing systems (SDPSs) with highly diverging capabilities and performance characteristics. While first initiatives try to compare the systems for simple workloads, there is a clear lack of detailed analyses of the systems' performance characteristics. In this paper, we propose a fra…

    Submitted 24 June, 2019; v1 submitted 23 February, 2018; originally announced February 2018.

    Comments: Published at ICDE 2018

    MSC Class: ieee.org

    Journal ref: 2018 IEEE 34th International Conference on Data Engineering (ICDE), pp. 1507-1518, IEEE, 2018

  15. arXiv:1112.2610

    cs.DB

    The ViP2P Platform: XML Views in P2P

    Authors: Konstantinos Karanasos, Asterios Katsifodimos, Ioana Manolescu, Spyros Zoupanos

    Abstract: The growing volumes of XML data sources on the Web or produced by enterprises, organizations etc. raise many performance challenges for data management applications. In this work, we are concerned with the distributed, peer-to-peer management of large corpora of XML documents, based on distributed hash table (or DHT, in short) overlay networks. We present ViP2P (standing for Views in Peer-to-Peer)…

    Submitted 12 December, 2011; originally announced December 2011.

    Comments: RR-7812 (2011)

  16. arXiv:1008.0557

    cs.DB

    LiquidXML: Adaptive XML Content Redistribution

    Authors: Jesús Camacho-Rodríguez, Asterios Katsifodimos, Ioana Manolescu, Alexandra Roatis

    Abstract: We propose to demonstrate LiquidXML, a platform for managing large corpora of XML documents in large-scale P2P networks. All LiquidXML peers may publish XML documents to be shared with all the network peers. The challenge then is to efficiently (re-)distribute the published content in the network, possibly in overlapping, redundant fragments, to support efficient processing of queries at each peer…

    Submitted 4 August, 2010; v1 submitted 3 August, 2010; originally announced August 2010.

    Journal ref: ACM International Conference on Information and Knowledge Management, Toronto : Canada (2010)