Search | arXiv e-print repository

Facts-and-Feelings: Capturing both Objectivity and Subjectivity in Table-to-Text Generation

Authors: Tathagata Dey, Pushpak Bhattacharyya

Abstract: Table-to-text generation, a long-standing challenge in natural language generation, has remained unexplored through the lens of subjectivity. Subjectivity here encompasses the comprehension of information derived from the table that cannot be described solely by objective data. Given the absence of pre-existing datasets, we introduce the Ta2TS dataset with 3849 data instances. We perform the task of fine-tuning sequence-to-sequence models on the linearized tables and prompting on popular large language models. We analyze the results from a quantitative and qualitative perspective to ensure the capture of subjectivity and factual consistency. The analysis shows the fine-tuned LMs can perform close to the prompted LLMs. Both the models can capture the tabular data, generating texts with 85.15% BERTScore and 26.28% Meteor score. To the best of our knowledge, we provide the first-of-its-kind dataset on tables with multiple genres and subjectivity included and present the first comprehensive analysis and comparison of different LLM performances on this task.

Submitted 15 June, 2024; originally announced June 2024.

arXiv:2406.07100 [pdf, other]

D-GRIL: End-to-End Topological Learning with 2-parameter Persistence

Authors: Soham Mukherjee, Shreyas N. Samaga, Cheng Xin, Steve Oudot, Tamal K. Dey

Abstract: End-to-end topological learning using 1-parameter persistence is well-known. We show that the framework can be enhanced using 2-parameter persistence by adopting a recently introduced 2-parameter persistence based vectorization technique called GRIL. We establish a theoretical foundation of differentiating GRIL producing D-GRIL. We show that D-GRIL can be used to learn a bifiltration function on standard benchmark graph datasets. Further, we exhibit that this framework can be applied in the context of bio-activity prediction in drug discovery.

Submitted 27 June, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

arXiv:2406.02732 [pdf, other]

GEFL: Extended Filtration Learning for Graph Classification

Authors: Simon Zhang, Soham Mukherjee, Tamal K. Dey

Abstract: Extended persistence is a technique from topological data analysis to obtain global multiscale topological information from a graph. This includes information about connected components and cycles that are captured by the so-called persistence barcodes. We introduce extended persistence into a supervised learning framework for graph classification. Global topological information, in the form of a barcode with four different types of bars and their explicit cycle representatives, is combined into the model by the readout function which is computed by extended persistence. The entire model is end-to-end differentiable. We use a link-cut tree data structure and parallelism to lower the complexity of computing extended persistence, obtaining a speedup of more than 60x over the state-of-the-art for extended persistence computation. This makes extended persistence feasible for machine learning. We show that, under certain conditions, extended persistence surpasses both the WL[1] graph isomorphism test and 0-dimensional barcodes in terms of expressivity because it adds more global (topological) information. In particular, arbitrarily long cycles can be represented, which is difficult for finite receptive field message passing graph neural networks. Furthermore, we show the effectiveness of our method on real world datasets compared to many existing recent graph representation learning methods.

Submitted 4 June, 2024; originally announced June 2024.

Comments: 26 pages, 13 figures, Learning on Graphs Conference (LoG 2022)

arXiv:2403.10958 [pdf, other]

Efficient Algorithms for Complexes of Persistence Modules with Applications

Authors: Tamal K. Dey, Florian Russold, Shreyas N. Samaga

Abstract: We extend the persistence algorithm, viewed as an algorithm computing the homology of a complex of free persistence or graded modules, to complexes of modules that are not free. We replace persistence modules by their presentations and develop an efficient algorithm to compute the homology of a complex of presentations. To deal with inputs that are not given in terms of presentations, we give an efficient algorithm to compute a presentation of a morphism of persistence modules. This allows us to compute persistent (co)homology of instances giving rise to complexes of non-free modules. Our methods lead to a new efficient algorithm for computing the persistent homology of simplicial towers and they enable efficient algorithms to compute the persistent homology of cosheaves over simplicial towers and cohomology of persistent sheaves on simplicial complexes. We also show that we can compute the cohomology of persistent sheaves over arbitrary finite posets by reducing the computation to a computation over simplicial complexes.

Submitted 16 March, 2024; originally announced March 2024.

Comments: This is the full version of a paper accepted at the 40th International Symposium on Computational Geometry (SoCG 2024)

arXiv:2403.08110 [pdf, other]

Computing Generalized Ranks of Persistence Modules via Unfolding to Zigzag Modules

Authors: Tamal K. Dey, Cheng Xin

Abstract: For a $P$-indexed persistence module ${\sf M}$, the (generalized) rank of ${\sf M}$ is defined as the rank of the limit-to-colimit map for the diagram of vector spaces of ${\sf M}$ over the poset $P$. For $2$-parameter persistence modules, recently a zigzag persistence based algorithm has been proposed that takes advantage of the fact that generalized rank for $2$-parameter modules is equal to the number of full intervals in a zigzag module defined on the boundary of the poset. Analogous definition of boundary for $d$-parameter persistence modules or general $P$-indexed persistence modules does not seem plausible. To overcome this difficulty, we first unfold a given $P$-indexed module ${\sf M}$ into a zigzag module ${\sf M}_{ZZ}$ and then check how many full interval modules in a decomposition of ${\sf M}_{ZZ}$ can be folded back to remain full in a decomposition of ${\sf M}$. This number determines the generalized rank of ${\sf M}$. For special cases of degree-$d$ homology for $d$-complexes, we obtain a more efficient algorithm including a linear time algorithm for degree-$1$ homology in graphs.

Submitted 4 April, 2024; v1 submitted 12 March, 2024; originally announced March 2024.

arXiv:2402.11339 [pdf, other]

Expressive Higher-Order Link Prediction through Hypergraph Symmetry Breaking

Authors: Simon Zhang, Cheng Xin, Tamal K. Dey

Abstract: A hypergraph consists of a set of nodes along with a collection of subsets of the nodes called hyperedges. Higher-order link prediction is the task of predicting the existence of a missing hyperedge in a hypergraph. A hyperedge representation learned for higher order link prediction is fully expressive when it does not lose distinguishing power up to an isomorphism. Many existing hypergraph representation learners are bounded in expressive power by the Generalized Weisfeiler Lehman-1 (GWL-1) algorithm, a generalization of the Weisfeiler Lehman-1 algorithm. However, GWL-1 has limited expressive power. In fact, induced subhypergraphs with identical GWL-1 valued nodes are indistinguishable. Furthermore, message passing on hypergraphs can already be computationally expensive, especially on GPU memory. To address these limitations, we devise a preprocessing algorithm that can identify certain regular subhypergraphs exhibiting symmetry. Our preprocessing algorithm runs once with complexity the size of the input hypergraph. During training, we randomly replace subhypergraphs identified by the algorithm with covering hyperedges to break symmetry. We show that our method improves the expressivity of GWL-1. Our extensive experiments also demonstrate the effectiveness of our approach for higher-order link prediction on both graph and hypergraph datasets with negligible change in computation.

Submitted 17 February, 2024; originally announced February 2024.

Comments: 46 pages, 4 figures

arXiv:2309.15188 [pdf, other]

doi 10.5281/zenodo.7958513

ICML 2023 Topological Deep Learning Challenge : Design and Results

Authors: Mathilde Papillon, Mustafa Hajij, Helen Jenne, Johan Mathe, Audun Myers, Theodore Papamarkou, Tolga Birdal, Tamal Dey, Tim Doster, Tegan Emerson, Gurusankar Gopalakrishnan, Devendra Govil, Aldo Guzmán-Sáenz, Henry Kvinge, Neal Livesay, Soham Mukherjee, Shreyas N. Samaga, Karthikeyan Natesan Ramamurthy, Maneel Reddy Karri, Paul Rosen, Sophia Sanborn, Robin Walters, Jens Agerberg, Sadrodin Barikbin, Claudio Battiloro, et al. (31 additional authors not shown)

Abstract: This paper presents the computational challenge on topological deep learning that was hosted within the ICML 2023 Workshop on Topology and Geometry in Machine Learning. The competition asked participants to provide open-source implementations of topological neural networks from the literature by contributing to the Python packages TopoNetX (data processing) and TopoModelX (deep learning). The challenge attracted twenty-eight qualifying submissions in its two-month duration. This paper describes the design of the challenge and summarizes its main findings.

Submitted 18 January, 2024; v1 submitted 26 September, 2023; originally announced September 2023.

arXiv:2307.07462 [pdf, other]

Computing Zigzag Vineyard Efficiently Including Expansions and Contractions

Authors: Tamal K. Dey, Tao Hou

Abstract: Vines and vineyard connecting a stack of persistence diagrams have been introduced in the non-zigzag setting by Cohen-Steiner et al. We consider computing these vines over changing filtrations for zigzag persistence while incorporating two more operations: expansions and contractions in addition to the transpositions considered in the non-zigzag setting. Although expansions and contractions can be implemented in quadratic time in the non-zigzag case by utilizing the linear-time transpositions, it is not obvious how they can be carried out under the zigzag framework with the same complexity. While transpositions alone can be easily conducted in linear time using the recent FastZigzag algorithm, expansions and contractions pose difficulty in breaking the barrier of cubic complexity. Our main result is that the half-way constructed up-down filtration in the FastZigzag algorithm indeed can be used to achieve linear time complexity for transpositions and quadratic time complexity for expansions and contractions, matching the time complexity of all corresponding operations in the non-zigzag case.

Submitted 18 February, 2024; v1 submitted 14 July, 2023; originally announced July 2023.

Comments: Updated funding information for one co-author

arXiv:2304.06048 [pdf, other]

RELS-DQN: A Robust and Efficient Local Search Framework for Combinatorial Optimization

Authors: Yuanhang Shao, Tonmoy Dey, Nikola Vuckovic, Luke Van Popering, Alan Kuhnle

Abstract: Combinatorial optimization (CO) aims to efficiently find the best solution to NP-hard problems ranging from statistical physics to social media marketing. A wide range of CO applications can benefit from local search methods because they allow reversible action over greedy policies. Deep Q-learning (DQN) using message-passing neural networks (MPNN) has shown promise in replicating the local search behavior and obtaining comparable results to the local search algorithms. However, the over-smoothing and the information loss during the iterations of message passing limit its robustness across applications, and the large message vectors result in memory inefficiency. Our paper introduces RELS-DQN, a lightweight DQN framework that exhibits the local search behavior while providing practical scalability. Using the RELS-DQN model trained on one application, it can generalize to various applications by providing solution values higher than or equal to both the local search algorithms and the existing DQN models while remaining efficient in runtime and memory.

Submitted 11 April, 2023; originally announced April 2023.

arXiv:2304.04970 [pdf, other]

GRIL: A $2$-parameter Persistence Based Vectorization for Machine Learning

Authors: Cheng Xin, Soham Mukherjee, Shreyas N. Samaga, Tamal K. Dey

Abstract: $1$-parameter persistent homology, a cornerstone in Topological Data Analysis (TDA), studies the evolution of topological features such as connected components and cycles hidden in data. It has been applied to enhance the representation power of deep learning models, such as Graph Neural Networks (GNNs). To enrich the representations of topological features, here we propose to study $2$-parameter persistence modules induced by bi-filtration functions. In order to incorporate these representations into machine learning models, we introduce a novel vector representation called Generalized Rank Invariant Landscape (GRIL) for $2$-parameter persistence modules. We show that this vector representation is $1$-Lipschitz stable and differentiable with respect to underlying filtration functions and can be easily integrated into machine learning models to augment encoding topological features. We present an algorithm to compute the vector representation efficiently. We also test our methods on synthetic and benchmark graph datasets, and compare the results with previous vector representations of $1$-parameter and $2$-parameter persistence modules. Further, we augment GNNs with GRIL features and observe an increase in performance indicating that GRIL can capture additional features enriching GNNs. We make the complete code for the proposed method available at https://github.com/soham0209/mpml-graph.

Submitted 30 June, 2023; v1 submitted 11 April, 2023; originally announced April 2023.

arXiv:2303.08270 [pdf, other]

Meta-Diagrams for 2-Parameter Persistence

Authors: Nate Clause, Tamal K. Dey, Facundo Mémoli, Bei Wang

Abstract: We first introduce the notion of meta-rank for a 2-parameter persistence module, an invariant that captures the information behind images of morphisms between 1D slices of the module. We then define the meta-diagram of a 2-parameter persistence module to be the Möbius inversion of the meta-rank, resulting in a function that takes values from signed 1-parameter persistence modules. We show that the meta-rank and meta-diagram contain information equivalent to the rank invariant and the signed barcode. This equivalence leads to computational benefits, as we introduce an algorithm for computing the meta-rank and meta-diagram of a 2-parameter module $M$ indexed by a bifiltration of $n$ simplices in $O(n^3)$ time. This implies an improvement upon the existing algorithm for computing the signed barcode, which has $O(n^4)$ runtime. This also allows us to improve the existing upper bound on the number of rectangles in the rank decomposition of $M$ from $O(n^4)$ to $O(n^3)$. In addition, we define notions of erosion distance between meta-ranks and between meta-diagrams, and show that under these distances, meta-ranks and meta-diagrams are stable with respect to the interleaving distance. Lastly, the meta-diagram can be visualized in an intuitive fashion as a persistence diagram of diagrams, which generalizes the well-understood persistence diagram in the 1-parameter setting.

Submitted 14 March, 2023; originally announced March 2023.

Comments: 22 pages, 8 figures. Full version of the paper that is to appear in the Proceedings of the 39th International Symposium on Computational Geometry (SoCG 2023)

arXiv:2303.02549 [pdf, other]

Computing Connection Matrices via Persistence-like Reductions

Authors: Tamal K. Dey, Michał Lipiński, Marian Mrozek, Ryan Slechta

Abstract: Connection matrices are a generalization of Morse boundary operators from the classical Morse theory for gradient vector fields. Developing an efficient computational framework for connection matrices is particularly important in the context of a rapidly growing data science that requires new mathematical tools for discrete data. Toward this goal, the classical theory for connection matrices has been adapted to combinatorial frameworks that facilitate computation. We develop an efficient persistence-like algorithm to compute a connection matrix from a given combinatorial (multi) vector field on a simplicial complex. This algorithm requires a single-pass, improving upon a known algorithm that runs an implicit recursion executing two-passes at each level. Overall, the new algorithm is more simple, direct, and efficient than the state-of-the-art. Because of the algorithm's similarity to the persistence algorithm, one may take advantage of various software optimizations from topological data analysis.

Submitted 23 September, 2023; v1 submitted 4 March, 2023; originally announced March 2023.

arXiv:2302.12796 [pdf, other]

Revisiting Graph Persistence for Updates and Efficiency

Authors: Tamal K. Dey, Tao Hou, Salman Parsa

Abstract: It is well known that ordinary persistence on graphs can be computed more efficiently than the general persistence. Recently, it has been shown that zigzag persistence on graphs also exhibits similar behavior. Motivated by these results, we revisit graph persistence and propose efficient algorithms especially for local updates on filtrations, similar to what is done in ordinary persistence for computing the vineyard. We show that, for a filtration of length $m$, (i) switches (transpositions) in ordinary graph persistence can be done in $O(\log m)$ time; (ii) zigzag persistence on graphs can be computed in $O(m\log m)$ time, which improves a recent $O(m\log^4n)$ time algorithm assuming $n$, the size of the union of all graphs in the filtration, satisfies $n\in\Omega({m^\varepsilon})$ for any fixed $0<\varepsilon<1$; (iii) open-closed, closed-open, and closed-closed bars in dimension $0$ for graph zigzag persistence can be updated in $O(\log m)$ time, whereas the open-open bars in dimension $0$ and closed-closed bars in dimension $1$ can be done in $O(\sqrt{m}\,\log m)$ time.

Submitted 11 May, 2023; v1 submitted 24 February, 2023; originally announced February 2023.
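
Illustrative note on the entry above: the abstract starts from the observation that ordinary persistence on graphs is cheaper than general persistence. A minimal sketch of why, for dimension 0 only, is the standard union-find computation with the elder rule. This is not the update algorithm of the paper; the function name `zero_dim_persistence` and the input layout (a dict of vertex birth values plus a list of weighted edges) are assumptions made for this example.

```python
def zero_dim_persistence(vertex_births, edges):
    """0-dimensional persistence bars of an edge-filtered graph.

    vertex_births: dict mapping vertex -> filtration value at which it appears.
    edges: iterable of (u, v, value); each edge appears at `value`, assumed to be
           no smaller than the birth values of both endpoints.
    Returns a sorted list of (birth, death) pairs; surviving components die at inf.
    """
    parent = {v: v for v in vertex_births}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    bars = []
    for u, v, value in sorted(edges, key=lambda e: e[2]):
        root_u, root_v = find(u), find(v)
        if root_u == root_v:
            continue  # the edge closes a cycle, so no component dies
        # Elder rule: the component born later dies when the two merge.
        if vertex_births[root_u] > vertex_births[root_v]:
            root_u, root_v = root_v, root_u
        bars.append((vertex_births[root_v], value))
        parent[root_v] = root_u  # the older root stays the representative

    # Components that never merge into an older one persist forever.
    roots = {find(v) for v in vertex_births}
    bars.extend((vertex_births[r], float("inf")) for r in roots)
    return sorted(bars)


births = {"a": 0, "b": 1, "c": 2}
graph_edges = [("a", "b", 3), ("b", "c", 4), ("a", "c", 5)]
bars = zero_dim_persistence(births, graph_edges)
```

Because the older root always remains the representative, each root carries the earliest birth value of its component, which is what the elder rule needs.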

arXiv:2212.01633 [pdf, other]

Cup Product Persistence and Its Efficient Computation

Authors: Tamal K. Dey, Abhishek Rathod

Abstract: It is well-known that the cohomology ring has a richer structure than homology groups. However, until recently, the use of cohomology in persistence setting has been limited to speeding up of barcode computations. Some of the recently introduced invariants, namely, persistent cup-length, persistent cup modules and persistent Steenrod modules, to some extent, fill this gap. When added to the standard persistence barcode, they lead to invariants that are more discriminative than the standard persistence barcode. In this work, we devise an $O(d n^4)$ algorithm for computing the persistent $k$-cup modules for all $k \in \{2, \dots, d\}$, where $d$ denotes the dimension of the filtered complex, and $n$ denotes its size. Moreover, we note that since the persistent cup length can be obtained as a byproduct of our computations, this leads to a faster algorithm for computing it for $d>3$. Finally, we introduce a new stable invariant called partition modules of cup product that is more discriminative than persistent $k$-cup modules and devise an $O(c(d)n^4)$ algorithm for computing it, where $c(d)$ is subexponential in $d$.

Submitted 17 March, 2024; v1 submitted 3 December, 2022; originally announced December 2022.

Comments: To appear in Proceedings of 40th International Symposium on Computational Geometry

arXiv:2207.14358 [pdf, other]

Topological structure of complex predictions

Authors: Meng Liu, Tamal K. Dey, David F. Gleich

Abstract: Complex prediction models such as deep learning are the output from fitting machine learning, neural networks, or AI models to a set of training data. These are now standard tools in science. A key challenge with the current generation of models is that they are highly parameterized, which makes describing and interpreting the prediction strategies difficult. We use topological data analysis to transform these complex prediction models into pictures representing a topological view. The result is a map of the predictions that enables inspection. The methods scale up to large datasets across different domains and enable us to detect labeling errors in training data, understand generalization in image classification, and inspect predictions of likely pathogenic mutations in the BRCA1 gene.

Submitted 19 October, 2022; v1 submitted 28 July, 2022; originally announced July 2022.

arXiv:2207.08475 [pdf, other]

Knights and Gold Stars: A Tale of InnerSource Incentivization

Authors: Tapajit Dey, Willem Jiang, Brian Fitzgerald

Abstract: Given the success of the open source phenomenon, it is not surprising that many organizations are seeking to emulate this success by adopting open source practices internally in what is termed InnerSource. However, while open source development and InnerSource are similar in some aspects, they differ significantly on others, and thus need to be implemented and managed differently. To the best of our knowledge, there is no significant account of a successful InnerSource incentivization program. Here we describe a comprehensive InnerSource incentivization program that was implemented at Huawei. The program is based on theories of motivation, both intrinsic and extrinsic, and also includes incentives at the individual, project, and divisional level, which helps to overcome the barriers that arise when implementing InnerSource. The program has had very impressive early results, leading to significant increases in the number of InnerSource projects, contributors, departments, and lines of code contributed.

Submitted 18 July, 2022; originally announced July 2022.

arXiv:2207.01015 [pdf, other]

One-off Events? An Empirical Study of Hackathon Code Creation and Reuse

Authors: Ahmed Samir Imam Mahmoud, Tapajit Dey, Alexander Nolte, Audris Mockus, James D. Herbsleb

Abstract: Background: Hackathons have become popular events for teams to collaborate on projects and develop software prototypes. Most existing research focuses on activities during an event with limited attention to the evolution of the hackathon code. Aim: We aim to understand the evolution of code used in and created during hackathon events, with a particular focus on the code blobs, specifically, how frequently hackathon teams reuse pre-existing code, how much new code they develop, if that code gets reused afterward, and what factors affect reuse. Method: We collected information about 22,183 hackathon projects from DevPost and obtained related code blobs, authors, project characteristics, original author, code creation time, language, and size information from World of Code. We tracked the reuse of code blobs by identifying all commits containing blobs created during hackathons and identifying all projects that contain those commits. We also conducted a series of surveys in order to gain a deeper understanding of hackathon code evolution that we sent out to hackathon participants whose code was reused, whose code was not reused, and developers who reused some hackathon code. Result: 9.14% of the code blobs in hackathon repositories and 8% of the lines of code (LOC) are created during hackathons and around a third of the hackathon code gets reused in other projects by both blob count and LOC. The number of associated technologies and the number of participants in hackathons increase the reuse probability. Conclusion: The results of our study demonstrate hackathons are not always "one-off" events as common knowledge dictates and they can serve as a starting point for further studies in this area.

Submitted 3 July, 2022; originally announced July 2022.

Comments: Accepted in Empirical Software Engineering Journal. arXiv admin note: substantial text overlap with arXiv:2103.01145

arXiv:2206.09563 [pdf, other]

Scalable Distributed Algorithms for Size-Constrained Submodular Maximization in the MapReduce and Adaptive Complexity Models

Authors: Tonmoy Dey, Yixin Chen, Alan Kuhnle

Abstract: Distributed maximization of a submodular function in the MapReduce (MR) model has received much attention, culminating in two frameworks that allow a centralized algorithm to be run in the MR setting without loss of approximation, as long as the centralized algorithm satisfies a certain consistency property - which had previously only been known to be satisfied by the standard greedy and continuous greedy algorithms. A separate line of work has studied parallelizability of submodular maximization in the adaptive complexity model, where each thread may have access to the entire ground set. For the size-constrained maximization of a monotone and submodular function, we show that several sublinearly adaptive (highly parallelizable) algorithms satisfy the consistency property required to work in the MR setting, which yields practical, parallelizable and distributed algorithms. Separately, we develop the first distributed algorithm with linear query complexity for this problem. Finally, we provide a method to increase the maximum cardinality constraint for MR algorithms at the cost of additional MR rounds.

Submitted 1 April, 2024; v1 submitted 20 June, 2022; originally announced June 2022.

Comments: 35 pages, 5 figures
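
For orientation, the "standard greedy" algorithm that the abstract above names as one of the known consistent centralized algorithms can be sketched as follows. This is a minimal single-machine illustration under stated assumptions, not the distributed or low-adaptivity algorithms of the paper; the names `greedy_max`, `coverage`, and the toy data `SETS` are hypothetical.

```python
def greedy_max(ground_set, f, k):
    """Standard greedy for size-constrained maximization of a monotone
    submodular set function f: pick up to k elements, each time adding
    the element with the largest marginal gain."""
    chosen = set()
    for _ in range(k):
        base = f(chosen)
        best, best_gain = None, 0
        # Sorted iteration makes tie-breaking deterministic.
        for e in sorted(ground_set - chosen):
            gain = f(chosen | {e}) - base
            if gain > best_gain:
                best, best_gain = e, gain
        if best is None:  # no element adds value; stop early
            break
        chosen.add(best)
    return chosen


# Hypothetical toy objective: set coverage, which is monotone and submodular.
SETS = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1, 7}}


def coverage(chosen):
    covered = set()
    for name in chosen:
        covered |= SETS[name]
    return len(covered)


solution = greedy_max(set(SETS), coverage, k=2)
print(sorted(solution), coverage(solution))
```

Each of the k rounds here depends on the previous one, so the sketch needs k sequential (adaptive) rounds; that sequential dependence is what the adaptive complexity model measures.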

arXiv:2206.00606 [pdf, other]

Topological Deep Learning: Going Beyond Graph Data

Authors: Mustafa Hajij, Ghada Zamzmi, Theodore Papamarkou, Nina Miolane, Aldo Guzmán-Sáenz, Karthikeyan Natesan Ramamurthy, Tolga Birdal, Tamal K. Dey, Soham Mukherjee, Shreyas N. Samaga, Neal Livesay, Robin Walters, Paul Rosen, Michael T. Schaub

Abstract: Topological deep learning is a rapidly growing field that pertains to the development of deep learning models for data supported on topological domains such as simplicial complexes, cell complexes, and hypergraphs, which generalize many domains encountered in scientific computations. In this paper, we present a unifying deep learning framework built upon a richer data structure that includes widely adopted topological domains. Specifically, we first introduce combinatorial complexes, a novel type of topological domain. Combinatorial complexes can be seen as generalizations of graphs that maintain certain desirable properties. Similar to hypergraphs, combinatorial complexes impose no constraints on the set of relations. In addition, combinatorial complexes permit the construction of hierarchical higher-order relations, analogous to those found in simplicial and cell complexes. Thus, combinatorial complexes generalize and combine useful traits of both hypergraphs and cell complexes, which have emerged as two promising abstractions that facilitate the generalization of graph neural networks to topological spaces. Second, building upon combinatorial complexes and their rich combinatorial and algebraic structure, we develop a general class of message-passing combinatorial complex neural networks (CCNNs), focusing primarily on attention-based CCNNs. We characterize permutation and orientation equivariances of CCNNs, and discuss pooling and unpooling operations within CCNNs in detail. Third, we evaluate the performance of CCNNs on tasks related to mesh shape analysis and graph learning. Our experiments demonstrate that CCNNs have competitive performance as compared to state-of-the-art deep learning models specifically tailored to the same tasks. Our findings demonstrate the advantages of incorporating higher-order relations into deep learning models in different applications.

Submitted 19 May, 2023; v1 submitted 1 June, 2022; originally announced June 2022.

arXiv:2204.11080 [pdf, other]

Fast Computation of Zigzag Persistence

Authors: Tamal K. Dey, Tao Hou

Abstract: Zigzag persistence is a powerful extension of the standard persistence which allows deletions of simplices besides insertions. However, computing zigzag persistence usually takes considerably more time than the standard persistence. We propose an algorithm called FastZigzag which narrows this efficiency gap. Our main result is that an input simplex-wise zigzag filtration can be converted to a cell-wise non-zigzag filtration of a $Δ$-complex with the same length, where the cells are copies of the input simplices. This conversion step in FastZigzag incurs very little cost. Furthermore, the barcode of the original filtration can be easily read from the barcode of the new cell-wise filtration because the conversion embodies a series of diamond switches known in topological data analysis. This seemingly simple observation opens up vast possibilities for improving the computation of zigzag persistence because any efficient algorithm or software for standard persistence can now be applied to computing zigzag persistence. Our experiment shows that this indeed achieves a substantial performance gain over the existing state-of-the-art software.

Submitted 4 July, 2022; v1 submitted 23 April, 2022; originally announced April 2022.

Comments: arXiv admin note: substantial text overlap with arXiv:2110.06315

arXiv:2203.05727 [pdf, other]

Tracking Dynamical Features via Continuation and Persistence

Authors: Tamal K. Dey, Michał Lipiński, Marian Mrozek, Ryan Slechta

Abstract: Multivector fields and combinatorial dynamical systems have recently become a subject of interest due to their potential for use in computational methods. In this paper, we develop a method to track an isolated invariant set -- a salient feature of a combinatorial dynamical system -- across a sequence of multivector fields. This goal is attained by placing the classical notion of the "continuation" of an isolated invariant set in the combinatorial setting. In particular, we give a "Tracking Protocol" that, when given a seed isolated invariant set, finds a canonical continuation of the seed across a sequence of multivector fields. In cases where it is not possible to continue, we show how to use zigzag persistence to track homological features associated with the isolated invariant sets. This construction permits viewing continuation as a special case of persistence.

Submitted 10 March, 2022; originally announced March 2022.

Comments: Full version of SoCG 2022 paper

arXiv:2112.02352 [pdf, other]

Updating Barcodes and Representatives for Zigzag Persistence

Authors: Tamal K. Dey, Tao Hou

Abstract: Computing persistence over changing filtrations gives rise to a stack of 2D persistence diagrams where the birth-death points are connected by the so-called `vines'. We consider computing these vines over changing filtrations for zigzag persistence. We observe that eight atomic operations are sufficient for changing one zigzag filtration to another and provide update algorithms for each of them. Six of these operations that have some analogues to one or multiple transpositions in the non-zigzag case can be executed as efficiently as their non-zigzag counterparts. This approach takes advantage of a recently discovered algorithm for computing zigzag barcodes by converting a zigzag filtration to a non-zigzag one and then connecting barcodes of the two with a bijection. The remaining two atomic operations do not have a strict analogue in the non-zigzag case. For them, we propose algorithms based on explicit maintenance of representatives (homology cycles) which can be useful in their own right for applications requiring explicit updates of representatives.

Submitted 1 August, 2022; v1 submitted 4 December, 2021; originally announced December 2021.

arXiv:2111.15058 [pdf, other]

Computing Generalized Rank Invariant for 2-Parameter Persistence Modules via Zigzag Persistence and its Applications

Authors: Tamal K. Dey, Woojin Kim, Facundo Mémoli

Abstract: The notion of generalized rank invariant in the context of multiparameter persistence has become an important ingredient for defining interesting homological structures such as generalized persistence diagrams. Naturally, computing these rank invariants efficiently is a prelude to computing any of these derived structures efficiently. We show that the generalized rank over a finite interval $I$ of a $\mathbb{Z}^2$-indexed persistence module $M$ is equal to the generalized rank of the zigzag module that is induced on a certain path in $I$ tracing mostly its boundary. Hence, we can compute the generalized rank over $I$ by computing the barcode of the zigzag module obtained by restricting the bifiltration inducing $M$ to that path. If the bifiltration and $I$ have at most $t$ simplices and points respectively, this computation takes $O(t^ω)$ time where $ω\in[2,2.373)$ is the exponent of matrix multiplication. Among others, we apply this result to obtain an improved algorithm for the following problem. Given a bifiltration inducing a module $M$, determine whether $M$ is interval decomposable and, if so, compute all intervals supporting its summands.

Submitted 30 March, 2022; v1 submitted 29 November, 2021; originally announced November 2021.

Comments: Full version of the paper in the Proceedings of the 38th International Symposium on Computational Geometry (SoCG 2022). Shortened the proof of Theorem 3.12 and added new sections 4.4 and 4.5; 21 pages, 4 figures

arXiv:2111.07917 [pdf, other]

Best of Both Worlds: Practical and Theoretically Optimal Submodular Maximization in Parallel

Authors: Yixin Chen, Tonmoy Dey, Alan Kuhnle

Abstract: For the problem of maximizing a monotone, submodular function with respect to a cardinality constraint $k$ on a ground set of size $n$, we provide an algorithm that achieves the state-of-the-art in both its empirical performance and its theoretical properties, in terms of adaptive complexity, query complexity, and approximation ratio; that is, it obtains, with high probability, query complexity of $O(n)$ in expectation, adaptivity of $O(\log(n))$, and approximation ratio of nearly $1-1/e$. The main algorithm is assembled from two components which may be of independent interest. The first component of our algorithm, LINEARSEQ, is useful as a preprocessing algorithm to improve the query complexity of many algorithms. Moreover, a variant of LINEARSEQ is shown to have adaptive complexity of $O( \log (n / k) )$ which is smaller than that of any previous algorithm in the literature. The second component is a parallelizable thresholding procedure THRESHOLDSEQ for adding elements with gain above a constant threshold. Finally, we demonstrate that our main algorithm empirically outperforms, in terms of runtime, adaptive rounds, total queries, and objective values, the previous state-of-the-art algorithm FAST in a comprehensive evaluation with six submodular objective functions.

Submitted 8 February, 2022; v1 submitted 15 November, 2021; originally announced November 2021.

Comments: 32 pages, 8 figures, to be published in NeurIPS 2021

arXiv:2110.14734 [pdf, other]

Approximating 1-Wasserstein Distance between Persistence Diagrams by Graph Sparsification

Authors: Tamal K. Dey, Simon Zhang

Abstract: Persistence diagrams (PDs) play a central role in topological data analysis. This analysis requires computing distances among such diagrams such as the 1-Wasserstein distance. Accurate computation of these PD distances for large data sets that render large diagrams may not scale appropriately with the existing methods. The main source of difficulty ensues from the size of the bipartite graph on which a matching needs to be computed for determining these PD distances. We address this problem by making several algorithmic and computational observations in order to obtain an approximation. First, taking advantage of the proximity of PD points, we condense them, thereby decreasing the number of nodes in the graph for computation. The increase in point multiplicities is addressed by reducing the matching problem to a min-cost flow problem on a transshipment network. Second, we use Well Separated Pair Decomposition to sparsify the graph to a size that is linear in the number of points. Both node and arc sparsifications contribute to the approximation factor where we leverage a lower bound given by the Relaxed Word Mover's distance. Third, we eliminate bottlenecks during the sparsification procedure by introducing parallelism. Fourth, we develop an open source software called PDoptFlow based on our algorithm, exploiting parallelism by GPU and multicore. We perform extensive experiments and show that the actual empirical error is very low. We also show that we can achieve high performance at low guaranteed relative errors, improving upon the state of the art.

Submitted 27 October, 2021; originally announced October 2021.

Comments: 31 pages, 12 figures; extended version of paper published in ALENEX 2022

arXiv:2110.06315 [pdf, other]

On Association between Absolute and Relative Zigzag Persistence

Authors: Tamal K. Dey, Tao Hou

Abstract: Duality results connecting persistence modules for absolute and relative homology provide a fundamental understanding of persistence theory. In this paper, we study similar associations in the context of zigzag persistence. Our main finding is a weak duality for the so-called non-repetitive zigzag filtrations in which a simplex is never added again after being deleted. The technique used to prove the duality for non-zigzag persistence does not extend straightforwardly to our case. Accordingly, taking a different route, we prove the weak duality by converting a non-repetitive filtration to an up-down filtration by a sequence of diamond switches. We then show an application of the weak duality result which gives a near-linear algorithm for computing the $p$-th and a subset of the $(p-1)$-th persistence for a non-repetitive zigzag filtration of a simplicial $p$-manifold. Utilizing the fact that a non-repetitive filtration admits an up-down filtration as its canonical form, we further reduce the problem of computing zigzag persistence for non-repetitive filtrations to the problem of computing standard persistence for which several efficient implementations exist. Our experiment shows that this achieves a substantial performance gain. Our study also identifies repetitive filtrations as instances that fundamentally distinguish zigzag persistence from the standard persistence.

Submitted 12 October, 2021; originally announced October 2021.

arXiv:2108.07429 [pdf, other]

Rectangular Approximation and Stability of $2$-parameter Persistence Modules

Authors: Tamal K. Dey, Cheng Xin

Abstract: One of the main reasons for topological persistence being useful in data analysis is that it is backed up by a stability (isometry) property: persistence diagrams of $1$-parameter persistence modules are stable in the sense that the bottleneck distance between two diagrams equals the interleaving distance between their generating modules. However, in the multi-parameter setting this property breaks down in general. A simple special case of persistence modules called rectangle decomposable modules is known to admit a weaker stability property. Using this fact, we derive a stability-like property for $2$-parameter persistence modules. For this, first we consider interval decomposable modules and their optimal approximations with rectangle decomposable modules with respect to the bottleneck distance. We provide a polynomial-time algorithm to exactly compute this optimal approximation which, together with the polynomial-time computable bottleneck distance among interval decomposable modules, provides a lower bound on the interleaving distance. Next, we leverage this result to derive a polynomial-time computable distance for general multi-parameter persistence modules which enjoys a similar stability-like property. This distance can be viewed as a generalization of the matching distance defined in the literature.

Submitted 17 August, 2021; originally announced August 2021.

arXiv:2107.02115 [pdf, other]

Persistence of Conley-Morse Graphs in Combinatorial Dynamical Systems

Authors: Tamal K. Dey, Marian Mrozek, Ryan Slechta

Abstract: Multivector fields provide an avenue for studying continuous dynamical systems in a combinatorial framework. There are currently two approaches in the literature which use persistent homology to capture changes in combinatorial dynamical systems. The first captures changes in the Conley index, while the second captures changes in the Morse decomposition. However, such approaches have limitations. The former approach only describes how the Conley index changes across a selected isolated invariant set though the dynamics can be much more complicated than the behavior of a single isolated invariant set. Likewise, considering a Morse decomposition omits much information about the individual Morse sets. In this paper, we propose a method to summarize changes in combinatorial dynamical systems by capturing changes in the so-called Conley-Morse graphs. A Conley-Morse graph contains information about both the structure of a selected Morse decomposition and about the Conley index at each Morse set in the decomposition. Hence, our method summarizes the changing structure of a sequence of dynamical systems at a finer granularity than previous approaches.

Submitted 5 July, 2021; v1 submitted 5 July, 2021; originally announced July 2021.

arXiv:2105.00518 [pdf, ps, other]

Computing Optimal Persistent Cycles for Levelset Zigzag on Manifold-like Complexes

Authors: Tamal K. Dey, Tao Hou

Abstract: In standard persistent homology, a persistent cycle born and dying with a persistence interval (bar) associates the bar with a concrete topological representative, which provides means to effectively navigate back from the barcode to the topological space. Among the possibly many, optimal persistent cycles bring forth further information due to having guaranteed quality. However, topological features usually go through variations in the lifecycle of a bar which a single persistent cycle may not capture. Hence, for persistent homology induced from PL functions, we propose levelset persistent cycles consisting of a sequence of cycles that depict the evolution of homological features from birth to death. Our definition is based on levelset zigzag persistence which involves four types of persistence intervals as opposed to the two types in standard persistence. For each of the four types, we present a polynomial-time algorithm computing an optimal sequence of levelset persistent $p$-cycles for the so-called weak $(p+1)$-pseudomanifolds. Given that optimal cycle problems for homology are NP-hard in general, our results are useful in practice because weak pseudomanifolds do appear in applications. Our algorithms draw upon an idea of relating optimal cycles to min-cuts in a graph that we exploited earlier for standard persistent cycles. Note that levelset zigzag poses non-trivial challenges for the approach because a sequence of optimal cycles instead of a single one needs to be computed in this case.

Submitted 2 May, 2021; originally announced May 2021.

arXiv:2104.13430 [pdf, other]

Topological Filtering for 3D Microstructure Segmentation

Authors: Anand V. Patel, Tao Hou, Juan D. Beltran Rodriguez, Tamal K. Dey, Dunbar P. Birnie III

Abstract: Tomography is a widely used tool for analyzing microstructures in three dimensions (3D). The analysis, however, faces difficulty because the constituent materials produce similar grey-scale values. Sometimes, this prompts the image segmentation process to assign a pixel/voxel to the wrong phase (active material or pore). Consequently, errors are introduced in the microstructure characteristics calculation. In this work, we develop a filtering algorithm called PerSplat based on topological persistence (a technique used in topological data analysis) to improve segmentation quality. One problem faced when evaluating filtering algorithms is that real image data in general are not equipped with the `ground truth' for the microstructure characteristics. For this study, we construct synthetic images for which the ground-truth values are known. On the synthetic images, we compare the pore tortuosity and Minkowski functionals (volume and surface area) computed with our PerSplat filter and other methods such as total variation (TV) and non-local means (NL-means). Moreover, on a real 3D image, we visually compare the segmentation results provided by our filter against TV and NL-means. The experimental results indicate that PerSplat provides a significant improvement in segmentation quality.

Submitted 26 September, 2021; v1 submitted 27 April, 2021; originally announced April 2021.

arXiv:2103.10167 [pdf, other]

Tracking Hackathon Code Creation and Reuse

Authors: Ahmed Imam, Tapajit Dey

Abstract: Background: Hackathons have become popular events for teams to collaborate on projects and develop software prototypes. Most existing research focuses on activities during an event with limited attention to the evolution of the code brought to or created during a hackathon. Aim: We aim to understand the evolution of hackathon-related code, specifically, how much hackathon teams rely on pre-existing code or how much new code they develop during a hackathon. Moreover, we aim to understand if and where that code gets reused. Method: We collected information about 22,183 hackathon projects from DEVPOST -- a hackathon database -- and obtained related code (blobs), authors, and project characteristics from the World of Code. We investigated if code blobs in hackathon projects were created before, during, or after an event by identifying the original blob creation date and author, and also checked if the original author was a hackathon project member. We tracked code reuse by first identifying all commits containing blobs created during an event before determining all projects that contain those commits. Result: While only approximately 9.14% of the code blobs are created during hackathons, this amount is still significant considering the time and member constraints of such events. Approximately a third of these code blobs get reused in other projects. Conclusion: Our study demonstrates to what extent pre-existing code is used and new code is created during a hackathon and how much of it is reused elsewhere afterwards. Our findings help to better understand code reuse as a phenomenon and the role of hackathons in this context and can serve as a starting point for further studies in this area.

Submitted 18 March, 2021; originally announced March 2021.

Comments: An abridged version of arXiv:2103.01145. Required for publication pre-print distribution

arXiv:2103.09583 [pdf, other]

2D Points Curve Reconstruction Survey and Benchmark

Authors: Stefan Ohrhallinger, Jiju Peethambaran, Amal D. Parakkat, Tamal K. Dey, Ramanathan Muthuganapathy

Abstract: Curve reconstruction from unstructured points in a plane is a fundamental problem with many applications that has generated research interest for decades. Involved aspects like handling open, sharp, multiple and non-manifold outlines, run-time and provability as well as potential extension to 3D for surface reconstruction have led to many different algorithms. We survey the literature on 2D curve reconstruction and then present an open-sourced benchmark for the experimental study. Our unprecedented evaluation on a selected set of planar curve reconstruction algorithms aims to give an overview of both quantitative analysis and qualitative aspects for helping users to select the right algorithm for specific problems in the field. Our benchmark framework is available online to permit reproducing the results, and easy integration of new algorithms.

Submitted 17 March, 2021; originally announced March 2021.

Comments: 24 pages, 22 figures, 5 tables

arXiv:2103.07353 [pdf, ps, other]

Computing Zigzag Persistence on Graphs in Near-Linear Time

Authors: Tamal K. Dey, Tao Hou

Abstract: Graphs model real-world circumstances in many applications where they may constantly change to capture the dynamic behavior of the phenomena. Topological persistence which provides a set of birth and death pairs for the topological features is one instrument for analyzing such changing graph data. However, standard persistent homology defined over a growing space cannot always capture such a dynamic process unless shrinking with deletions is also allowed. Hence, zigzag persistence which incorporates both insertions and deletions of simplices is more appropriate in such a setting. Unlike standard persistence which admits nearly linear-time algorithms for graphs, such results for the zigzag version improving the general $O(m^ω)$ time complexity are not known, where $ω< 2.37286$ is the matrix multiplication exponent. In this paper, we propose algorithms for zigzag persistence on graphs which run in near-linear time. Specifically, given a filtration with $m$ additions and deletions on a graph with $n$ vertices and edges, the algorithm for $0$-dimension runs in $O(m\log^2 n+m\log m)$ time and the algorithm for 1-dimension runs in $O(m\log^4 n)$ time. The algorithm for $0$-dimension draws upon another algorithm designed originally for pairing critical points of Morse functions on $2$-manifolds. The algorithm for $1$-dimension pairs a negative edge with the earliest positive edge so that a $1$-cycle containing both edges resides in all intermediate graphs. Both algorithms achieve the claimed time complexity via dynamic graph data structures proposed by Holm et al. In the end, using Alexander duality, we extend the algorithm for $0$-dimension to compute the $(p-1)$-dimensional zigzag persistence for $\mathbb{R}^p$-embedded complexes in $O(m\log^2 n+m\log m+n\log n)$ time.

Submitted 12 March, 2021; originally announced March 2021.

Comments: The full version of the paper
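
As background for the contrast drawn in the abstract above, the following is a minimal sketch of 0-dimensional standard (insertion-only) persistence on a graph, computed with a union-find structure and the elder rule. It is not the zigzag algorithm of the paper, which also handles deletions and relies on dynamic graph data structures; the function name `zero_dim_barcode` and the input encoding (a list of 1-tuples for vertices and 2-tuples for edges, indexed by insertion order) are assumptions made for illustration.

```python
import math


def zero_dim_barcode(filtration):
    """0-dimensional standard persistence of a graph filtration.

    `filtration` is a list of simplices in insertion order: a 1-tuple
    (v,) inserts vertex v, a 2-tuple (u, v) inserts an edge whose
    endpoints were inserted earlier. Returns sorted (birth, death) pairs.
    """
    parent = {}  # union-find forest over vertices
    birth = {}   # birth index of each vertex
    bars = []

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for t, simplex in enumerate(filtration):
        if len(simplex) == 1:
            (v,) = simplex
            parent[v] = v
            birth[v] = t  # a new component is born
        else:
            u, v = simplex
            ru, rv = find(u), find(v)
            if ru == rv:
                continue  # edge closes a cycle; no change in dimension 0
            # Elder rule: the younger component dies, the older survives.
            if birth[ru] > birth[rv]:
                ru, rv = rv, ru
            bars.append((birth[rv], t))
            parent[rv] = ru  # the root is always the oldest vertex

    # Components that never merge live forever.
    roots = {find(v) for v in parent}
    bars.extend((birth[r], math.inf) for r in roots)
    return sorted(bars)


example = [("a",), ("b",), ("a", "b"), ("c",), ("b", "c"), ("a", "c")]
bars = zero_dim_barcode(example)
print(bars)
```

Because a merged component never splits again, union-find suffices here; once deletions are allowed, as in zigzag filtrations, components can split, which is why the insertion-only approach does not carry over directly.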

arXiv:2103.01145 [pdf, other]

The Secret Life of Hackathon Code

Authors: Ahmed Imam, Tapajit Dey, Alexander Nolte, Audris Mockus, James D. Herbsleb

Abstract: Background: Hackathons have become popular events for teams to collaborate on projects and develop software prototypes. Most existing research focuses on activities during an event with limited attention to the evolution of the code brought to or created during a hackathon. Aim: We aim to understand the evolution of hackathon-related code, specifically, how much hackathon teams rely on pre-existing code or how much new code they develop during a hackathon. Moreover, we aim to understand if and where that code gets reused, and what factors affect reuse. Method: We collected information about 22,183 hackathon projects from DEVPOST -- a hackathon database -- and obtained related code (blobs), authors, and project characteristics from the World of Code. We investigated if code blobs in hackathon projects were created before, during, or after an event by identifying the original blob creation date and author, and also checked if the original author was a hackathon project member. We tracked code reuse by first identifying all commits containing blobs created during an event before determining all projects that contain those commits. Result: While only approximately 9.14% of the code blobs are created during hackathons, this amount is still significant considering the time and member constraints of such events. Approximately a third of these code blobs get reused in other projects. The number of associated technologies and the number of participants in a project increase reuse probability. Conclusion: Our study demonstrates to what extent pre-existing code is used and new code is created during a hackathon and how much of it is reused elsewhere afterwards. Our findings help to better understand code reuse as a phenomenon and the role of hackathons in this context and can serve as a starting point for further studies in this area.

Submitted 18 March, 2021; v1 submitted 1 March, 2021; originally announced March 2021.

Comments: Accepted in Proceedings of the 18th International Conference on Mining Software Repositories, MSR '21

arXiv:2010.16196 [pdf, other]

World of Code: Enabling a Research Workflow for Mining and Analyzing the Universe of Open Source VCS data

Authors: Yuxing Ma, Tapajit Dey, Chris Bogart, Sadika Amreen, Marat Valiev, Adam Tutko, David Kennard, Russell Zaretzki, Audris Mockus

Abstract: Open source software (OSS) is essential for modern society and, while substantial research has been done on individual (typically central) projects, only a limited understanding of the periphery of the entire OSS ecosystem exists. For example, how are the tens of millions of projects in the periphery interconnected through technical dependencies, code sharing, or knowledge flow? To answer such questions we: a) create a very large and frequently updated collection of version control data in the entire FLOSS ecosystems named World of Code (WoC), that can completely cross-reference authors, projects, commits, blobs, dependencies, and history of the FLOSS ecosystems and b) provide capabilities to efficiently correct, augment, query, and analyze that data. Our current WoC implementation is capable of being updated on a monthly basis and contains over 18B Git objects. To evaluate its research potential and to create vignettes for its usage, we employ WoC in conducting several research tasks. In particular, we find that it is capable of supporting trend evaluation, ecosystem measurement, and the determination of package usage. We expect WoC to spur investigation into global properties of OSS development leading to increased resiliency of the entire OSS ecosystem. Our infrastructure facilitates the discovery of key technical dependencies, code flow, and social networks that provide the basis to determine the structure and evolution of the relationships that drive FLOSS activities and innovation.

Submitted 30 October, 2020; originally announced October 2020.

arXiv:2007.04816 [pdf, other]

doi 10.1145/3382494.3410685

Effect of Technical and Social Factors on Pull Request Quality for the NPM Ecosystem

Authors: Tapajit Dey, Audris Mockus

Abstract: Pull request (PR) based development, which is a norm for the social coding platforms, entails the challenge of evaluating the contributions of, often unfamiliar, developers from across the open source ecosystem and, conversely, submitting a contribution to a project with unfamiliar maintainers. Previous studies suggest that the decision of accepting or rejecting a PR may be influenced by a diverging set of technical and social factors, but often focus on relatively few projects, do not consider ecosystem-wide measures, or the possible non-monotonic relationships between the predictors and PR acceptance probability. We aim to shed light on this important decision making process by testing which measures significantly affect the probability of PR acceptance on a significant fraction of a large ecosystem, rank them by their relative importance in predicting PR acceptance, and determine the shape of the functions that map each predictor to PR acceptance. We proposed seven hypotheses regarding which technical and social factors might affect PR acceptance and created 17 measures based on them. Our dataset consisted of 470,925 PRs from 3349 popular NPM packages and 79,128 GitHub users who created those. We tested which of the measures affect PR acceptance and ranked the significant measures by their importance in a predictive model. Our predictive model had an AUC of 0.94, and 15 of the 17 measures were found to matter, including five novel ecosystem-wide measures. Measures describing the number of PRs submitted to a repository and what fraction of those get accepted, and signals about the PR review phase were most significant. We also discovered that only four predictors have a linear influence on the PR acceptance probability while others showed a more complicated response.

Submitted 20 July, 2020; v1 submitted 8 July, 2020; originally announced July 2020.

Comments: arXiv admin note: text overlap with arXiv:2003.01153. Preprint of the paper accepted in ESEM, 2020 conference

ACM Class: D.2.7

arXiv:2005.10176 [pdf, other]

Representation of Developer Expertise in Open Source Software

Authors: Tapajit Dey, Andrey Karnauch, Audris Mockus

Abstract: Background: Accurate representation of developer expertise has always been an important research problem. While a number of studies proposed novel methods of representing expertise within individual projects, these methods are difficult to apply at an ecosystem level. However, with the focus of software development shifting from monolithic to modular, a method of representing developers' expertise in the context of the entire OSS development becomes necessary when, for example, a project tries to find new maintainers and look for developers with relevant skills. Aim: We aim to address this knowledge gap by proposing and constructing the Skill Space where each API, developer, and project is represented and postulate how the topology of this space should reflect what developers know (and projects need). Method: We use the World of Code infrastructure to extract the complete set of APIs in the files changed by open source developers and, based on that data, employ Doc2Vec embeddings for vector representations of APIs, developers, and projects. We then evaluate if these embeddings reflect the postulated topology of the Skill Space by predicting what new APIs/projects developers use/join, and whether or not their pull requests get accepted. We also check how the developers' representations in the Skill Space align with their self-reported API expertise. Result: Our results suggest that the proposed embeddings in the Skill Space appear to satisfy the postulated topology and we hope that such representations may aid in the construction of signals that increase trust (and efficiency) of open source ecosystems at large and may aid investigations of other phenomena related to developer proficiency and learning.

Submitted 2 February, 2021; v1 submitted 20 May, 2020; originally announced May 2020.

Comments: Accepted in ICSE 2021 Main Technical Track

arXiv:2005.09217 [pdf, other]

doi 10.1007/s10664-020-09837-4

Do Code Review Measures Explain the Incidence of Post-Release Defects?

Authors: Andrey Krutauz, Tapajit Dey, Peter C. Rigby, Audris Mockus

Abstract: Aim: In contrast to studies of defects found during code review, we aim to clarify whether code review measures can explain the prevalence of post-release defects. Method: We replicate a study by McIntosh et al. that uses additive regression to model the relationship between defects and code reviews. To increase external validity, we apply the same methodology on a new software project. We discuss our findings with the first author of the original study, McIntosh. We then investigate how to reduce the impact of correlated predictors in the variable selection process and how to increase understanding of the inter-relationships among the predictors by employing Bayesian Network (BN) models. Context: As in the original study, we use the same measures the authors obtained for the Qt project in the original study. We mine data from version control and issue tracker of Google Chrome and operationalize measures that are close analogs to the large collection of code, process, and code review measures used in the replicated study. Results: Both the data from the original study and the Chrome data showed high instability of the influence of code review measures on defects with the results being highly sensitive to variable selection procedure. Models without code review predictors had as good or better fit than those with review predictors. Replication, however, agrees with the bulk of prior work showing that prior defects, module size, and authorship have the strongest relationship to post-release defects. The application of BN models helped explain the observed instability by demonstrating that the review-related predictors do not affect post-release defects directly and showed indirect effects. For example, changes that have no review discussion tend to be associated with files that have had many prior defects which in turn increase the number of post-release defects.

Submitted 19 May, 2020; originally announced May 2020.

arXiv:2003.08349 [pdf, other]

doi 10.1145/3379597.3387500

A Dataset and an Approach for Identity Resolution of 38 Million Author IDs extracted from 2B Git Commits

Authors: Tanner Fry, Tapajit Dey, Andrey Karnauch, Audris Mockus

Abstract: The data collected from open source projects provide means to model large software ecosystems, but often suffer from data quality issues, specifically, multiple author identification strings in code commits might actually be associated with one developer. While many methods have been proposed for addressing this problem, they are either heuristics requiring manual tweaking, or require too much calculation time to do pairwise comparisons for 38M author IDs in, for example, the World of Code collection. In this paper, we propose a method that finds all author IDs belonging to a single developer in this entire dataset, and share the list of all author IDs that were found to have aliases. To do this, we first create blocks of potentially connected author IDs and then use a machine learning model to predict which of these potentially related IDs belong to the same developer. We processed around 38 million author IDs and found around 14.8 million IDs to have an alias, which belong to 5.4 million different developers, with the median number of aliases being 2 per developer. This dataset can be used to create more accurate models of developer behaviour at the entire OSS ecosystem level and can be used to provide a service to rapidly resolve new author IDs.

Submitted 27 March, 2020; v1 submitted 18 March, 2020; originally announced March 2020.

arXiv:2003.07961 [pdf, other]

doi 10.1145/3387940.3391502

An Exploratory Study of Bot Commits

Authors: Tapajit Dey, Bogdan Vasilescu, Audris Mockus

Abstract: Background: Bots help automate many of the tasks performed by software developers and are widely used to commit code in various social coding platforms. At present, it is not clear what types of activities these bots perform and understanding it may help design better bots, and find application areas which might benefit from bot adoption. Aim: We aim to categorize the Bot Commits by the type of change (files added, deleted, or modified), find the more commonly changed file types, and identify the groups of file types that tend to get updated together. Method: 12,326,137 commits made by 461 popular bots (that made at least 1000 commits) were examined to identify the frequency and the type of files added/deleted/modified by the commits, and association rule mining was used to identify the types of files modified together. Result: The majority of the bot commits modify an existing file, a few of them add new files, while deletion of a file is very rare. Commits involving more than one type of operation are even rarer. Files containing data, configuration, and documentation are most frequently updated, while HTML is the most common type in terms of the number of files added, deleted, and modified. Files of the type "Markdown", "Ignore List", "YAML", "JSON" were the types that are updated together with other types of files most frequently. Conclusion: We observe that the majority of bot commits involve single file modifications, and bots primarily work with data, configuration, and documentation files. Whether this is a limitation of the bots and, if overcome, would lead to different kinds of bots remains an open question.

Submitted 27 March, 2020; v1 submitted 17 March, 2020; originally announced March 2020.

arXiv:2003.05579 [pdf, other]

Persistence of the Conley Index in Combinatorial Dynamical Systems

Authors: Tamal K. Dey, Marian Mrozek, Ryan Slechta

Abstract: A combinatorial framework for dynamical systems provides an avenue for connecting classical dynamics with data-oriented, algorithmic methods. Combinatorial vector fields introduced by Forman and their recent generalization to multivector fields have provided a starting point for building such a connection. In this work, we strengthen this relationship by placing the Conley index in the persistent homology setting. Conley indices are homological features associated with so-called isolated invariant sets, so a change in the Conley index is a response to perturbation in an underlying multivector field. We show how one can use zigzag persistence to summarize changes to the Conley index, and we develop techniques to capture such changes in the presence of noise. We conclude by developing an algorithm to track features in a changing multivector field.

Submitted 11 March, 2020; originally announced March 2020.

arXiv:2003.03172 [pdf, other]

doi 10.1145/3379597.3387478

Detecting and Characterizing Bots that Commit Code

Authors: Tapajit Dey, Sara Mousavi, Eduardo Ponce, Tanner Fry, Bogdan Vasilescu, Anna Filippova, Audris Mockus

Abstract: Background: Some developer activity traditionally performed manually, such as making code commits, opening, managing, or closing issues, is increasingly subject to automation in many OSS projects. Specifically, such activity is often performed by tools that react to events or run at specific times. We refer to such automation tools as bots and, in many software mining scenarios related to developer productivity or code quality, it is desirable to identify bots in order to separate their actions from actions of individuals. Aim: Find an automated way of identifying bots and code committed by these bots, and to characterize the types of bots based on their activity patterns. Method and Result: We propose BIMAN, a systematic approach to detect bots using author names, commit messages, files modified by the commit, and projects associated with the commits. For our test data, the value for AUC-ROC was 0.9. We also characterized these bots based on the time patterns of their code commits and the types of files modified, and found that they primarily work with documentation files and web pages, and these files are most prevalent in HTML and JavaScript ecosystems. We have compiled a shareable dataset containing detailed information about 461 bots we found (all of whom have more than 1000 commits) and 13,762,430 commits they created.

Submitted 27 March, 2020; v1 submitted 2 March, 2020; originally announced March 2020.

Comments: Preprint of the paper accepted in MSR, 2020 conference

arXiv:2003.01153 [pdf, other]

Which Pull Requests Get Accepted and Why? A study of popular NPM Packages

Authors: Tapajit Dey, Audris Mockus

Abstract: Background: Pull Request (PR) Integrators often face challenges in terms of multiple concurrent PRs, so the ability to gauge which of the PRs will get accepted can help them balance their workload. PR creators would benefit from knowing if certain characteristics of their PRs may increase the chances of acceptance. Aim: We modeled the probability that a PR will be accepted within a month after creation using a Random Forest model utilizing 50 predictors representing properties of the author, PR, and the project to which PR is submitted. Method: 483,988 PRs from 4218 popular NPM packages were analysed and we selected a subset of 14 predictors sufficient for a tuned Random Forest model to reach high accuracy. Result: An AUC-ROC value of 0.95 was achieved predicting PR acceptance. The model excluding PR properties that change after submission gave an AUC-ROC value of 0.89. We tested the utility of our model in practical scenarios by training it with historical data for the NPM package \textit{bootstrap} and predicting if the PRs submitted in future will be accepted. This gave us an AUC-ROC value of 0.94 with all 14 predictors, and 0.77 excluding PR properties that change after its creation. Conclusion: PR integrators can use our model for a highly accurate assessment of the quality of the open PRs and PR creators may benefit from the model by understanding which characteristics of their PRs may be undesirable from the integrators' perspective. The model can be implemented as a tool, which we plan to do as future work.

Submitted 2 March, 2020; originally announced March 2020.

arXiv:2002.09989 [pdf, other]

doi 10.1007/s10664-019-09791-w

Deriving a Usage-Independent Software Quality Metric

Authors: Tapajit Dey, Audris Mockus

Abstract: Context: The extent of post-release use of software affects the number of faults, thus biasing quality metrics and adversely affecting associated decisions. The proprietary nature of usage data limited deeper exploration of this subject in the past. Objective: To determine how software faults and software use are related and how an accurate quality measure can be designed. Method: New users, usage intensity, usage frequency, exceptions, and release date and duration measured for complex proprietary mobile applications for Android and iOS. Utilized Bayesian Network and Random Forest models to explain the interrelationships and to derive the usage independent release quality measure. Investigated the interrelationship among various code complexity measures, usage (downloads), and number of issues for 520 NPM packages and derived a usage-independent quality measure from these analyses, applied it on 4430 popular NPM packages to construct timelines for comparing the perceived quality (issues) and our derived measure of quality for these packages. Results: We found the number of new users to be the primary factor determining the number of exceptions, and found no direct link between the intensity and frequency of software usage and software faults. Release quality expressed as crashes per user was independent of other usage-related predictors, thus serving as a usage independent measure of software quality. Usage also affected quality in NPM, where downloads were strongly associated with numbers of issues, even after taking the other code complexity measures into consideration. Conclusions: We expect our result and our proposed quality measure will help gauge release quality of software more accurately and inspire further research in this area.

Submitted 23 February, 2020; originally announced February 2020.

arXiv:2001.09549 [pdf, other]

An efficient algorithm for $1$-dimensional (persistent) path homology

Authors: Tamal K. Dey, Tianqi Li, Yusu Wang

Abstract: This paper focuses on developing an efficient algorithm for analyzing a directed network (graph) from a topological viewpoint. A prevalent technique for such topological analysis involves computation of homology groups and their persistence. These concepts are well suited for spaces that are not directed. As a result, one needs a concept of homology that accommodates orientations in input space. Path-homology developed for directed graphs by Grigor'yan, Lin, Muranov and Yau has been effectively adapted for this purpose recently by Chowdhury and Mémoli. They also give an algorithm to compute this path-homology. Our main contribution in this paper is an algorithm that computes this path-homology and its persistence more efficiently for the $1$-dimensional ($H_1$) case. In developing such an algorithm, we discover various structures and their efficient computations that aid computing the $1$-dimensional path-homology. We implement our algorithm and present some preliminary experimental results.

Submitted 26 January, 2020; originally announced January 2020.

arXiv:1909.06728 [pdf, other]

doi 10.1145/3347146.3359348

Road Network Reconstruction from Satellite Images with Machine Learning Supported by Topological Methods

Authors: Tamal K. Dey, Jiayuan Wang, Yusu Wang

Abstract: Automatic extraction of road networks from satellite images is a goal that can benefit and even enable new technologies. Methods that combine machine learning (ML) and computer vision have been proposed in recent years which make the task semi-automatic by requiring the user to provide curated training samples. The process can be fully automatized if training samples can be produced algorithmically. Of course, this requires a robust algorithm that can reconstruct the road networks from satellite images reliably so that the output can be fed as training samples. In this work, we develop such a technique by infusing a persistence-guided discrete Morse based graph reconstruction algorithm into an ML framework. We elucidate our contributions in two phases. First, in a semi-automatic framework, we combine a discrete-Morse based graph reconstruction algorithm with an existing CNN framework to segment input satellite images. We show that this leads to reconstructions with better connectivity and less noise. Next, in a fully automatic framework, we leverage the power of the discrete-Morse based graph reconstruction algorithm to train a CNN from a collection of images without labelled data and use the same algorithm to produce the final output from the segmented images created by the trained CNN. We apply the discrete-Morse based graph reconstruction algorithm iteratively to improve the accuracy of the CNN. We show promising experimental results of this new framework on datasets from SpaceNet Challenge.

Submitted 15 September, 2019; originally announced September 2019.

Comments: 26 pages, 13 figures, ACM SIGSPATIAL 2019

arXiv:1907.06538 [pdf, other]

Patterns of Effort Contribution and Demand and User Classification based on Participation Patterns in NPM Ecosystem

Authors: Tapajit Dey, Yuxing Ma, Audris Mockus

Abstract: Background: Open source requires participation of volunteer and commercial developers (users) in order to deliver functional high-quality components. Developers both contribute effort in the form of patches and demand effort from the component maintainers to resolve issues reported against it. Aim: Identify and characterize patterns of effort contribution and demand throughout the open source supply chain and investigate if and how these patterns vary with developer activity; identify different groups of developers; and predict developers' company affiliation based on their participation patterns. Method: 1,376,946 issues and pull-requests created for 4433 NPM packages with over 10,000 monthly downloads and full (public) commit activity data of the 272,142 issue creators are obtained and analyzed, and dependencies on NPM packages are identified. Fuzzy c-means clustering algorithm is used to find the groups among the users based on their effort contribution and demand patterns, and Random Forest is used as the predictive modeling technique to identify their company affiliations. Result: Users contribute and demand effort primarily from packages that they depend on directly with only a tiny fraction of contributions and demand going to transitive dependencies. A significant portion of demand goes into packages outside the users' respective supply chains (constructed based on publicly visible version control data). Three and two different groups of users are observed based on the effort demand and effort contribution patterns respectively. The Random Forest model used for identifying the company affiliation of the users gives an AUC-ROC value of 0.68. Conclusion: Our results give new insights into effort demand and supply at different parts of the supply chain of the NPM ecosystem and its users and suggest the need to increase visibility further upstream.

Submitted 15 July, 2019; originally announced July 2019.

Comments: 10 pages, 5 Tables, 2 Figures, Accepted in The 15th International Conference on Predictive Models and Data Analytics in Software Engineering 2019

arXiv:1907.04889 [pdf, other]

Computing Minimal Persistent Cycles: Polynomial and Hard Cases

Authors: Tamal K. Dey, Tao Hou, Sayan Mandal

Abstract: Persistent cycles, especially the minimal ones, are useful geometric features functioning as augmentations for the intervals in a purely topological persistence diagram (also termed as barcode). In our earlier work, we showed that computing minimal 1-dimensional persistent cycles (persistent 1-cycles) for finite intervals is NP-hard while the same for infinite intervals is polynomially tractable. In this paper, we address this problem for general dimensions with $\mathbb{Z}_2$ coefficients. In addition to proving that it is NP-hard to compute minimal persistent d-cycles (d>1) for both types of intervals given arbitrary simplicial complexes, we identify two interesting cases which are polynomially tractable. These two cases assume the complex to be a certain generalization of manifolds which we term as weak pseudomanifolds. For finite intervals from the d-th persistence diagram of a weak (d+1)-pseudomanifold, we utilize the fact that persistent cycles of such intervals are null-homologous and reduce the problem to a minimal cut problem. Since the same problem for infinite intervals is NP-hard, we further assume the weak (d+1)-pseudomanifold to be embedded in $\mathbb{R}^{d+1}$ so that the complex has a natural dual graph structure and the problem reduces to a minimal cut problem. Experiments with both algorithms on scientific data indicate that the minimal persistent cycles capture various significant features of the data.

Submitted 14 February, 2020; v1 submitted 10 July, 2019; originally announced July 2019.

Comments: Content same as appeared in the proceeding of SODA20'

arXiv:1904.03766 [pdf, other]

Generalized Persistence Algorithm for Decomposing Multi-parameter Persistence Modules

Authors: Tamal K. Dey, Cheng Xin

Abstract: The classical persistence algorithm computes the unique decomposition of a persistence module implicitly given by an input simplicial filtration. Based on matrix reduction, this algorithm is a cornerstone of the emergent area of topological data analysis. Its input is a simplicial filtration defined over the integers $\mathbb{Z}$ giving rise to a $1$-parameter persistence module. It has been recognized that multiparameter version of persistence modules given by simplicial filtrations over $d$-dimensional integer grids $\mathbb{Z}^d$ is equally or perhaps more important in data science applications. However, in the multiparameter setting, one of the main challenges is that topological summaries based on algebraic structure such as decompositions and bottleneck distances cannot be as efficiently computed as in the $1$-parameter case because there is no known extension of the persistence algorithm to multiparameter persistence modules. We present an efficient algorithm to compute the unique decomposition of a finitely presented persistence module $M$ defined over the multiparameter $\mathbb{Z}^d$. The algorithm first assumes that the module is presented with a set of $N$ generators and relations that are \emph{distinctly graded}. Based on a generalized matrix reduction technique it runs in $O(N^{2ω+1})$ time where $ω<2.373$ is the exponent for matrix multiplication. This is much better than the well known algorithm called Meataxe which runs in $\tilde{O}(N^{6(d+1)})$ time on such an input.
In practice, persistence modules are usually induced by simplicial filtrations. With such an input consisting of $n$ simplices, our algorithm runs in $O(n^{(d-1)(2ω+ 1)})$ time for $d\geq 2$. For the special case of zero dimensional homology, it runs in time $O(n^{2ω+1})$.

Submitted 6 December, 2021; v1 submitted 7 April, 2019; originally announced April 2019.

arXiv:1810.04807 [pdf, other]

Persistent 1-Cycles: Definition, Computation, and Its Application

Authors: Tamal K. Dey, Tao Hou, Sayan Mandal

Abstract: Persistence diagrams, which summarize the birth and death of homological features extracted from data, are employed as stable signatures for applications in image analysis and other areas. Besides simply considering the multiset of intervals included in a persistence diagram, some applications need to find representative cycles for the intervals. In this paper, we address the problem of computing these representative cycles, termed as persistent 1-cycles, for $\text{H}_1$-persistent homology with $\mathbb{Z}_2$ coefficients. The definition of persistent cycles is based on the interval module decomposition of persistence modules, which reveals the structure of persistent homology. After showing that the computation of the optimal persistent 1-cycles is NP-hard, we propose an alternative set of meaningful persistent 1-cycles that can be computed with an efficient polynomial time algorithm. We also inspect the stability issues of the optimal persistent 1-cycles and the persistent 1-cycles computed by our algorithm with the observation that the perturbations of both cannot be properly bounded. We design a software which applies our algorithm to various datasets. Experiments on 3D point clouds, mineral structures, and images show the effectiveness of our algorithm in practice.

Submitted 15 October, 2018; v1 submitted 10 October, 2018; originally announced October 2018.

Comments: Correct the algorithm numbering issue

Showing 1–50 of 81 results for author: Dey, T