-
High-dimensional multiple imputation (HDMI) for partially observed confounders including natural language processing-derived auxiliary covariates
Authors:
Janick Weberpals,
Pamela A. Shaw,
Kueiyu Joshua Lin,
Richard Wyss,
Joseph M Plasek,
Li Zhou,
Kerry Ngan,
Thomas DeRamus,
Sudha R. Raman,
Bradley G. Hammill,
Hana Lee,
Sengwee Toh,
John G. Connolly,
Kimberly J. Dandreo,
Fang Tian,
Wei Liu,
Jie Li,
José J. Hernández-Muñoz,
Sebastian Schneeweiss,
Rishi J. Desai
Abstract:
Multiple imputation (MI) models can be improved by including auxiliary covariates (AC), but their performance in high-dimensional data is not well understood. We aimed to develop and compare high-dimensional MI (HDMI) approaches using structured and natural language processing (NLP)-derived AC in studies with partially observed confounders. We conducted a plasmode simulation study using data from…
▽ More
Multiple imputation (MI) models can be improved by including auxiliary covariates (AC), but their performance in high-dimensional data is not well understood. We aimed to develop and compare high-dimensional MI (HDMI) approaches using structured and natural language processing (NLP)-derived AC in studies with partially observed confounders. We conducted a plasmode simulation study using data from opioid vs. non-steroidal anti-inflammatory drug (NSAID) initiators (X) with observed serum creatinine labs (Z2) and time-to-acute kidney injury as outcome. We simulated 100 cohorts with a null treatment effect, including X, Z2, atrial fibrillation (U), and 13 other investigator-derived confounders (Z1) in the outcome generation. We then imposed missingness (MZ2) on 50% of Z2 measurements as a function of Z2 and U and created different HDMI candidate AC using structured and NLP-derived features. We mimicked scenarios where U was unobserved by omitting it from all AC candidate sets. Using LASSO, we data-adaptively selected HDMI covariates associated with Z2 and MZ2 for MI, and with U to include in propensity score models. The treatment effect was estimated following propensity score matching in MI datasets and we benchmarked HDMI approaches against a baseline imputation and complete case analysis with Z1 only. HDMI using claims data showed the lowest bias (0.072). Combining claims and sentence embeddings led to an improvement in the efficiency displaying the lowest root-mean-squared-error (0.173) and coverage (94%). NLP-derived AC alone did not perform better than baseline MI. HDMI approaches may decrease bias in studies with partially observed confounders where missingness depends on unobserved factors.
△ Less
Submitted 17 May, 2024;
originally announced May 2024.
-
Automating REST API Postman Test Cases Using LLM
Authors:
S Deepika Sri,
Mohammed Aadil S,
Sanjjushri Varshini R,
Raja CSP Raman,
Gopinath Rajagopal,
S Taranath Chan
Abstract:
In the contemporary landscape of technological advancements, the automation of manual processes is crucial, compelling the demand for huge datasets to effectively train and test machines. This research paper is dedicated to the exploration and implementation of an automated approach to generate test cases specifically using Large Language Models. The methodology integrates the use of Open AI to en…
▽ More
In the contemporary landscape of technological advancements, the automation of manual processes is crucial, compelling the demand for huge datasets to effectively train and test machines. This research paper is dedicated to the exploration and implementation of an automated approach to generate test cases specifically using Large Language Models. The methodology integrates the use of Open AI to enhance the efficiency and effectiveness of test case generation for training and evaluating Large Language Models. This formalized approach with LLMs simplifies the testing process, making it more efficient and comprehensive. Leveraging natural language understanding, LLMs can intelligently formulate test cases that cover a broad range of REST API properties, ensuring comprehensive testing. The model that is developed during the research is trained using manually collected postman test cases or instances for various Rest APIs. LLMs enhance the creation of Postman test cases by automating the generation of varied and intricate test scenarios. Postman test cases offer streamlined automation, collaboration, and dynamic data handling, providing a user-friendly and efficient approach to API testing compared to traditional test cases. Thus, the model developed not only conforms to current technological standards but also holds the promise of evolving into an idea of substantial importance in future technological advancements.
△ Less
Submitted 16 April, 2024;
originally announced April 2024.
-
Finding fake reviews in e-commerce platforms by using hybrid algorithms
Authors:
Mathivanan Periasamy,
Rohith Mahadevan,
Bagiya Lakshmi S,
Raja CSP Raman,
Hasan Kumar S,
Jasper Jessiman
Abstract:
Sentiment analysis, a vital component in natural language processing, plays a crucial role in understanding the underlying emotions and opinions expressed in textual data. In this paper, we propose an innovative ensemble approach for sentiment analysis for finding fake reviews that amalgamate the predictive capabilities of Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and Decision Tree…
▽ More
Sentiment analysis, a vital component in natural language processing, plays a crucial role in understanding the underlying emotions and opinions expressed in textual data. In this paper, we propose an innovative ensemble approach for sentiment analysis for finding fake reviews that amalgamate the predictive capabilities of Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and Decision Tree classifiers. Our ensemble architecture strategically combines these diverse models to capitalize on their strengths while mitigating inherent weaknesses, thereby achieving superior accuracy and robustness in fake review prediction. By combining all the models of our classifiers, the predictive performance is boosted and it also fosters adaptability to varied linguistic patterns and nuances present in real-world datasets. The metrics accounted for on fake reviews demonstrate the efficacy and competitiveness of the proposed ensemble method against traditional single-model approaches. Our findings underscore the potential of ensemble techniques in advancing the state-of-the-art in finding fake reviews using hybrid algorithms, with implications for various applications in different social media and e-platforms to find the best reviews and neglect the fake ones, eliminating puffery and bluffs.
△ Less
Submitted 9 April, 2024;
originally announced April 2024.
-
Swee** Arrangements of Non-Piercing Curves in Plane
Authors:
Suryendu Dalal,
Rahul Gangopadhyay,
Rajiv Raman,
Saurabh Ray
Abstract:
Let $Γ$ be a finite set of Jordan curves in the plane. For any curve $γ\in Γ$, we denote the bounded region enclosed by $γ$ as $\tildeγ$. We say that $Γ$ is a non-piercing family if for any two curves $α, β\in Γ$, $\tildeα \setminus \tildeβ$ is a connected region. A non-piercing family of curves generalizes a family of $2$-intersecting curves in which each pair of curves intersect in at most two p…
▽ More
Let $Γ$ be a finite set of Jordan curves in the plane. For any curve $γ\in Γ$, we denote the bounded region enclosed by $γ$ as $\tildeγ$. We say that $Γ$ is a non-piercing family if for any two curves $α, β\in Γ$, $\tildeα \setminus \tildeβ$ is a connected region. A non-piercing family of curves generalizes a family of $2$-intersecting curves in which each pair of curves intersect in at most two points. Snoeyink and Hershberger (``Swee** Arrangements of Curves'', SoCG '89) proved that if we are given a family $\mathcal{C}$ of $2$-intersecting curves and a fixed curve $C\in\mathcal{C}$, then the arrangement can be \emph{swept} by $C$, i.e., $C$ can be continuously shrunk to any point $p \in \tilde{C}$ in such a way that the we have a family of $2$-intersecting curves throughout the process. In this paper, we generalize the result of Snoeyink and Hershberger to the setting of non-piercing curves. We show that given an arrangement of non-piercing curves $Γ$, and a fixed curve $γ\in Γ$, the arrangement can be swept by $γ$ so that the arrangement remains non-piercing throughout the process. We also give a shorter and simpler proof of the result of Snoeyink and Hershberger and cite applications of their result, where our result leads to a generalization.
△ Less
Submitted 2 April, 2024; v1 submitted 25 March, 2024;
originally announced March 2024.
-
Comparative Study and Framework for Automated Summariser Evaluation: LangChain and Hybrid Algorithms
Authors:
Bagiya Lakshmi S,
Sanjjushri Varshini R,
Rohith Mahadevan,
Raja CSP Raman
Abstract:
Automated Essay Score (AES) is proven to be one of the cutting-edge technologies. Scoring techniques are used for various purposes. Reliable scores are calculated based on influential variables. Such variables can be computed by different methods based on the domain. The research is concentrated on the user's understanding of a given topic. The analysis is based on a scoring index by using Large L…
▽ More
Automated Essay Score (AES) is proven to be one of the cutting-edge technologies. Scoring techniques are used for various purposes. Reliable scores are calculated based on influential variables. Such variables can be computed by different methods based on the domain. The research is concentrated on the user's understanding of a given topic. The analysis is based on a scoring index by using Large Language Models. The user can then compare and contrast the understanding of a topic that they recently learned. The results are then contributed towards learning analytics and progression is made for enhancing the learning ability. In this research, the focus is on summarizing a PDF document and gauging a user's understanding of its content. The process involves utilizing a Langchain tool to summarize the PDF and extract the essential information. By employing this technique, the research aims to determine how well the user comprehends the summarized content.
△ Less
Submitted 4 October, 2023;
originally announced October 2023.
-
Exploring Linguistic Similarity and Zero-Shot Learning for Multilingual Translation of Dravidian Languages
Authors:
Danish Ebadulla,
Rahul Raman,
S. Natarajan,
Hridhay Kiran Shetty,
Ashish Harish Shenoy
Abstract:
Current research in zero-shot translation is plagued by several issues such as high compute requirements, increased training time and off target translations. Proposed remedies often come at the cost of additional data or compute requirements. Pivot based neural machine translation is preferred over a single-encoder model for most settings despite the increased training and evaluation time. In thi…
▽ More
Current research in zero-shot translation is plagued by several issues such as high compute requirements, increased training time and off target translations. Proposed remedies often come at the cost of additional data or compute requirements. Pivot based neural machine translation is preferred over a single-encoder model for most settings despite the increased training and evaluation time. In this work, we overcome the shortcomings of zero-shot translation by taking advantage of transliteration and linguistic similarity. We build a single encoder-decoder neural machine translation system for Dravidian-Dravidian multilingual translation and perform zero-shot translation. We compare the data vs zero-shot accuracy tradeoff and evaluate the performance of our vanilla method against the current state of the art pivot based method. We also test the theory that morphologically rich languages require large vocabularies by restricting the vocabulary using an optimal transport based technique. Our model manages to achieves scores within 3 BLEU of large-scale pivot-based models when it is trained on 50\% of the language directions.
△ Less
Submitted 10 August, 2023;
originally announced August 2023.
-
CERTainty: Detecting DNS Manipulation at Scale using TLS Certificates
Authors:
Elisa Tsai,
Deepak Kumar,
Ram Sundara Raman,
Gavin Li,
Yael Eiger,
Roya Ensafi
Abstract:
DNS manipulation is an increasingly common technique used by censors and other network adversaries to prevent users from accessing restricted Internet resources and hijack their connections. Prior work in detecting DNS manipulation relies largely on comparing DNS resolutions with trusted control results to identify inconsistencies. However, the emergence of CDNs and other cloud providers practicin…
▽ More
DNS manipulation is an increasingly common technique used by censors and other network adversaries to prevent users from accessing restricted Internet resources and hijack their connections. Prior work in detecting DNS manipulation relies largely on comparing DNS resolutions with trusted control results to identify inconsistencies. However, the emergence of CDNs and other cloud providers practicing content localization and load balancing leads to these heuristics being inaccurate, paving the need for more verifiable signals of DNS manipulation. In this paper, we develop a new technique, CERTainty, that utilizes the widely established TLS certificate ecosystem to accurately detect DNS manipulation, and obtain more information about the adversaries performing such manipulation. We find that untrusted certificates, mismatching hostnames, and blockpages are powerful proxies for detecting DNS manipulation. Our results show that previous work using consistency-based heuristics is inaccurate, allowing for 72.45% false positives in the cases detected as DNS manipulation. Further, we identify 17 commercial DNS filtering products in 52 countries, including products such as SafeDNS, SkyDNS, and Fortinet, and identify the presence of 55 ASes in 26 countries that perform ISP-level DNS manipulation. We also identify 226 new blockpage clusters that are not covered by previous research. We are integrating techniques used by CERTainty into active measurement platforms to continuously and accurately monitor DNS manipulation.
△ Less
Submitted 14 May, 2023;
originally announced May 2023.
-
On Hypergraph Supports
Authors:
Rajiv Raman,
Karamjeet Singh
Abstract:
Let $\mathcal{H}=(X,\mathcal{E})$ be a hypergraph. A support is a graph $Q$ on $X$ such that for each $E\in\mathcal{E}$, the subgraph of $Q$ induced on the elements in $E$ is connected. In this paper, we consider hypergraphs defined on a host graph. Given a graph $G=(V,E)$, with $c:V\to\{\mathbf{r},\mathbf{b}\}$, and a collection of connected subgraphs $\mathcal{H}$ of $G$, a primal support is a g…
▽ More
Let $\mathcal{H}=(X,\mathcal{E})$ be a hypergraph. A support is a graph $Q$ on $X$ such that for each $E\in\mathcal{E}$, the subgraph of $Q$ induced on the elements in $E$ is connected. In this paper, we consider hypergraphs defined on a host graph. Given a graph $G=(V,E)$, with $c:V\to\{\mathbf{r},\mathbf{b}\}$, and a collection of connected subgraphs $\mathcal{H}$ of $G$, a primal support is a graph $Q$ on $\mathbf{b}(V)$ such that for each $H\in \mathcal{H}$, the induced subgraph $Q[\mathbf{b}(H)]$ on vertices $\mathbf{b}(H)=H\cap c^{-1}(\mathbf{b})$ is connected. A \emph{dual support} is a graph $Q^*$ on $\mathcal{H}$ s.t. for each $v\in X$, the induced subgraph $Q^*[\mathcal{H}_v]$ is connected, where $\mathcal{H}_v=\{H\in\mathcal{H}: v\in H\}$. We present sufficient conditions on the host graph and hyperedges so that the resulting support comes from a restricted family.
We primarily study two classes of graphs: $(1)$ If the host graph has genus $g$ and the hypergraphs satisfy a topological condition of being \emph{cross-free}, then there is a primal and a dual support of genus at most $g$. $(2)$ If the host graph has treewidth $t$ and the hyperedges satisfy a combinatorial condition of being \emph{non-piercing}, then there exist primal and dual supports of treewidth $O(2^t)$. We show that this exponential blow-up is sometimes necessary. As an intermediate case, we also study the case when the host graph is outerplanar. Finally, we show applications of our results to packing and covering, and coloring problems on geometric hypergraphs.
△ Less
Submitted 2 February, 2024; v1 submitted 29 March, 2023;
originally announced March 2023.
-
Fashion-model pose recommendation and generation using Machine Learning
Authors:
Vijitha Kannumuru,
Santhosh Kannan S P,
Krithiga Shankar,
Joy Larnyoh,
Rohith Mahadevan,
Raja CSP Raman
Abstract:
Fashion-model pose is an important attribute in the fashion industry. Creative directors, modeling production houses, and top photographers always look for professional models able to pose. without the skill to correctly pose, their chances of landing professional modeling employment are regrettably quite little. There are occasions when models and photographers are unsure of the best pose to stri…
▽ More
Fashion-model pose is an important attribute in the fashion industry. Creative directors, modeling production houses, and top photographers always look for professional models able to pose. without the skill to correctly pose, their chances of landing professional modeling employment are regrettably quite little. There are occasions when models and photographers are unsure of the best pose to strike while taking photographs. This research concentrates on suggesting the fashion personnel a series of similar images based on the input image. The image is segmented into different parts and similar images are suggested for the user. This was achieved by calculating the color histogram of the input image and applying the same for all the images in the dataset and comparing the histograms. Synthetic images have become popular to avoid privacy concerns and to overcome the high cost of photoshoots. Hence, this paper also extends the work of generating synthetic images from the recommendation engine using styleGAN to an extent.
△ Less
Submitted 19 February, 2023;
originally announced March 2023.
-
Natural cubic splines for the analysis of Alzheimer's clinical trials
Authors:
M. C. Donohue,
O. Langford,
P. Insel,
C. H. van Dyck,
R. C. Petersen,
S. Craft,
G. Sethuraman,
R. Raman,
P. S. Aisen
Abstract:
Mixed model repeated measures (MMRM) is the most common analysis approach used in clinical trials for Alzheimer's disease and other progressive diseases measured with continuous outcomes measured over time. The model treats time as a categorical variable, which allows an unconstrained estimate of the mean for each study visit in each randomized group. Categorizing time in this way can be problemat…
▽ More
Mixed model repeated measures (MMRM) is the most common analysis approach used in clinical trials for Alzheimer's disease and other progressive diseases measured with continuous outcomes measured over time. The model treats time as a categorical variable, which allows an unconstrained estimate of the mean for each study visit in each randomized group. Categorizing time in this way can be problematic when assessments occur off-schedule, as including off-schedule visits can induce bias, and excluding them ignores valuable information and violates the intention to treat principle. This problem has been exacerbated by clinical trial visits which have been delayed due to the COVID19 pandemic. As an alternative to MMRM, we propose a constrained longitudinal data analysis with natural cubic splines that treats time as continuous and uses test version effects to model the mean over time. The spline model is shown to be superior, in practice and simulation studies, to categorical-time models like MMRM and models that assume a proportional treatment effect.
△ Less
Submitted 17 August, 2022;
originally announced August 2022.
-
Payday loans -- blessing or growth suppressor? Machine Learning Analysis
Authors:
Rohith Mahadevan,
Sam Richard,
Kishore Harshan Kumar,
Jeevitha Murugan,
Santhosh Kannan,
Saaisri,
Tarun,
Raja CSP Raman
Abstract:
The upsurge of real estate involves a variety of factors that have got influenced by many domains. Indeed, the unrecognized sector that would affect the economy for which regulatory proposals are being drafted to keep this in control is the payday loans. This research paper revolves around the impact of payday loans in the real estate market. The research paper draws a first-hand experience of obt…
▽ More
The upsurge of real estate involves a variety of factors that have got influenced by many domains. Indeed, the unrecognized sector that would affect the economy for which regulatory proposals are being drafted to keep this in control is the payday loans. This research paper revolves around the impact of payday loans in the real estate market. The research paper draws a first-hand experience of obtaining the index for the concentration of real estate in an area of reference by virtue of payday loans in Toronto, Ontario in particular, which sets out an ideology to create, evaluate and demonstrate the scenario through research analysis. The purpose of this indexing via payday loans is the basic - debt: income ratio which states that when the income of the person bound to pay the interest of payday loans increases, his debt goes down marginally which hence infers that the person invests in fixed assets like real estate which hikes up its growth.
△ Less
Submitted 30 May, 2022;
originally announced May 2022.
-
Turtle Score -- Similarity Based Developer Analyzer
Authors:
Sanjjushri Varshini,
Ponshriharini V,
Santhosh Kannan,
Snekha Suresh,
Harshavardhan Ramesh,
Rohith Mahadevan,
Raja CSP Raman
Abstract:
In day-to-day life, a highly demanding task for IT companies is to find the right candidates who fit the companies' culture. This research aims to comprehend, analyze and automatically produce convincing outcomes to find a candidate who perfectly fits right in the company. Data is examined and collected for each employee who works in the IT domain focusing on their performance measure. This is don…
▽ More
In day-to-day life, a highly demanding task for IT companies is to find the right candidates who fit the companies' culture. This research aims to comprehend, analyze and automatically produce convincing outcomes to find a candidate who perfectly fits right in the company. Data is examined and collected for each employee who works in the IT domain focusing on their performance measure. This is done based on various different categories which bring versatility and a wide view of focus. To this data, learner analysis is done using machine learning algorithms to obtain learner similarity and developer similarity in order to recruit people with identical working patterns. It's been proven that the efficiency and capability of a particular worker go higher when working with a person of a similar personality. Therefore this will serve as a useful tool for recruiters who aim to recruit people with high productivity. This is to say that the model designed will render the best outcome possible with high accuracy and an immaculate recommendation score.
△ Less
Submitted 10 May, 2022;
originally announced May 2022.
-
Analysis of Ferroelectric Negative Capacitance-Hybrid MEMS Actuator Using Energy-Displacement Landscape
Authors:
Raghuram Tattamangalam Raman,
Jeffin Shibu,
Revathy Padmanabhan,
Arvind Ajoy
Abstract:
We propose an energy-based framework to analyze the statics and dynamics of a ferroelectric negative capacitance-hybrid Microelectromechanical System (MEMS) actuator. A map** function that relates the charge on the ferroelectric to displacement of the movable electrode, is used to obtain the Hamiltonian of the hybrid actuator in terms of displacement. We then use graphical energy-displacement an…
▽ More
We propose an energy-based framework to analyze the statics and dynamics of a ferroelectric negative capacitance-hybrid Microelectromechanical System (MEMS) actuator. A map** function that relates the charge on the ferroelectric to displacement of the movable electrode, is used to obtain the Hamiltonian of the hybrid actuator in terms of displacement. We then use graphical energy-displacement and phase portrait plots to analyze static pull-in, dynamic pull-in and pull-out phenomena of the hybrid actuator. Using these, we illustrate the low-voltage operation of the hybrid actuator to static and step inputs, as compared to the standalone MEMS actuator. The results obtained are in agreement with the analytical predictions and numerical simulations. The proposed framework enables straightforward inclusion of adhesion between the contacting surfaces, modeled using van der Waals force. We show that the pull-in voltage is not affected, while the pull-out voltage is reduced due to adhesion. The proposed framework provides a physics-based tool to design and analyze negative capacitance based low-voltage MEMS actuators.
△ Less
Submitted 20 March, 2022;
originally announced April 2022.
-
Plasmoids formation during simulations of coaxial helicity injection in the National Spherical Torus Experiment
Authors:
F. Ebrahimi,
R. Raman
Abstract:
Formation of an elongated Sweet-Parker current sheet and a transition to plasmoid instability has for the first time been predicted by simulations in a large-scale toroidal fusion plasma in the absence of any pre-existing instability. Plasmoid instability is demonstrated through resistive MHD simulations of transient Coaxial Helicity Injection (CHI) experiments in the National Spherical Torus Expe…
▽ More
Formation of an elongated Sweet-Parker current sheet and a transition to plasmoid instability has for the first time been predicted by simulations in a large-scale toroidal fusion plasma in the absence of any pre-existing instability. Plasmoid instability is demonstrated through resistive MHD simulations of transient Coaxial Helicity Injection (CHI) experiments in the National Spherical Torus Experiment (NSTX).Consistent with the theory, fundamental characteristics of the plasmoid instability, including fast reconnection rate, have been observed in these realistic simulations. Motivated by the simulations, experimental camera images have been revisited and suggest the existence of reconnecting plasmoids in NSTX. Global, system-size plasmoid formation observed here should also have strong implications for astrophysical reconnection, such as rapid eruptive solar events.
△ Less
Submitted 29 March, 2022;
originally announced March 2022.
-
On the Parameterized Complexity of the Maximum Exposure Problem
Authors:
Remi Raman,
Shahin John J S,
R Subashini,
Subhasree Methirumangalath
Abstract:
We investigate the parameterized complexity of Maximum Exposure Problem (MEP). Given a range space (R, P) where R is the set of ranges containing a set P of points, and an integer k, MEP asks for k ranges which on removal results in the maximum number of exposed points. A point p is said to be exposed when p is not contained in any of the ranges in R. The problem is known to be NP-hard. In this le…
▽ More
We investigate the parameterized complexity of Maximum Exposure Problem (MEP). Given a range space (R, P) where R is the set of ranges containing a set P of points, and an integer k, MEP asks for k ranges which on removal results in the maximum number of exposed points. A point p is said to be exposed when p is not contained in any of the ranges in R. The problem is known to be NP-hard. In this letter, we give fixed-parameter tractable results of MEP with respect to different parameterizations.
△ Less
Submitted 22 March, 2022; v1 submitted 21 March, 2022;
originally announced March 2022.
-
Stack Index Prediction Using Time-Series Analysis
Authors:
Raja CSP Raman,
Rohith Mahadevan,
Divya Perumal,
Vedha Sankar,
Talha Abdur Rahman
Abstract:
The Prevalence of Community support and engagement for different domains in the tech industry has changed and evolved throughout the years. In this study, we aim to understand, analyze and predict the trends of technology in a scientific manner, having collected data on numerous topics and their growth throughout the years in the past decade. We apply machine learning models on collected data, to…
▽ More
The Prevalence of Community support and engagement for different domains in the tech industry has changed and evolved throughout the years. In this study, we aim to understand, analyze and predict the trends of technology in a scientific manner, having collected data on numerous topics and their growth throughout the years in the past decade. We apply machine learning models on collected data, to understand, analyze and forecast the trends in the advancement of different fields. We show that certain technical concepts such as python, machine learning, and Keras have an undisputed uptrend, finally concluding that the Stackindex model forecasts with high accuracy and can be a viable tool for forecasting different tech domains.
△ Less
Submitted 18 August, 2021;
originally announced August 2021.
-
Modeling of carbon pellets disruption mitigation in an NSTX-U plasma
Authors:
C. F. Clauser,
S. C. Jardin,
R. Raman,
B. C. Lyons,
N. M. Ferraro
Abstract:
Single carbon pellet disruption mitigation simulations using M3D-C1 were conducted in an NSTX-U-like plasma to support the electromagnetic pellet injection concept (EPI). A carbon ablation model has been implemented in M3D-C1 and tested with available data. 2D simulations were conducted in order to estimate the amount of carbon needed to quench the plasma, finding that the content in a $1\,$mm rad…
▽ More
Single carbon pellet disruption mitigation simulations using M3D-C1 were conducted in an NSTX-U-like plasma to support the electromagnetic pellet injection concept (EPI). A carbon ablation model has been implemented in M3D-C1 and tested with available data. 2D simulations were conducted in order to estimate the amount of carbon needed to quench the plasma, finding that the content in a $1\,$mm radius vitreous carbon pellet (~ 3.2x10E20 atoms) would be enough if it is entirely ablated. 3D simulations were performed, scanning over pellet velocity and parallel thermal conductivity, as well as different injection directions and pellet concepts (solid pellets and shell pellets). The sensitivity of the thermal quench and other related quantities to these parameters has been evaluated. A 1 mm radius solid pellet only partially ablates at velocities of 300 m/s or higher, thus being unable to fully quench the plasma. To further enhance the ablation, approximations to an array of pellets and the shell pellet concept were also explored. 3D field line stochastization plays an important role in both quenching the center of the plasma and in heat flux losses, thus lowering the amount of carbon needed to mitigate the plasma when compared to the 2D case. This study constitutes an important step forward in `predict-first' simulations for disruption mitigation in NSTX-U and other devices, such as ITER.
△ Less
Submitted 28 June, 2021;
originally announced June 2021.
-
Knowledge Value Stream Framework For Complex Product Design Decisions
Authors:
Ramakrishnan Raman,
Meenakshi D'Souza
Abstract:
Product Development value stream includes all the activities, value added and non value added, for designing and develo** a product. It is characterized by flow of knowledge that drives the decisions, and deals with how the product is conceptualized, architected and designed by the design teams. The intangible flow of knowledge is determined by knowledge value stream, which shapes how the raw co…
▽ More
Product Development value stream includes all the activities, value added and non value added, for designing and develo** a product. It is characterized by flow of knowledge that drives the decisions, and deals with how the product is conceptualized, architected and designed by the design teams. The intangible flow of knowledge is determined by knowledge value stream, which shapes how the raw concepts and ideas flow into mature knowledge, how the knowledge is socialized, internalized, and how the knowledge impels the decisions in the product development value stream. For complex products, the design teams encounter tough challenges, such as uncertainty and variability, while making design decisions. This paper proposes a framework for knowledge value stream for complex product design decisions. The framework encompasses knowledge cadence and learning cycles as its core elements and incorporates rudiments for complex product design such as uncertainty, variability and perceptions. It helps in managing uncertainty and variability and evolving the required knowledge to drive optimal decisions during the design of complex products. It advocates a phased model for framework deployment towards establishing and progressively maturing the knowledge value stream.
△ Less
Submitted 16 May, 2021;
originally announced May 2021.
-
Weighted Ancestors in Suffix Trees Revisited
Authors:
Djamal Belazzougui,
Dmitry Kosolobov,
Simon J. Puglisi,
Rajeev Raman
Abstract:
The weighted ancestor problem is a well-known generalization of the predecessor problem to trees. It is known to require $Ω(\log\log n)$ time for queries provided $O(n\mathop{\mathrm{polylog}} n)$ space is available and weights are from $[0..n]$, where $n$ is the number of tree nodes. However, when applied to suffix trees, the problem, surprisingly, admits an $O(n)$-space solution with constant qu…
▽ More
The weighted ancestor problem is a well-known generalization of the predecessor problem to trees. It is known to require $Ω(\log\log n)$ time for queries provided $O(n\mathop{\mathrm{polylog}} n)$ space is available and weights are from $[0..n]$, where $n$ is the number of tree nodes. However, when applied to suffix trees, the problem, surprisingly, admits an $O(n)$-space solution with constant query time, as was shown by Gawrychowski, Lewenstein, and Nicholson (Proc. ESA 2014). This variant of the problem can be reformulated as follows: given the suffix tree of a string $s$, we need a data structure that can locate in the tree any substring $s[p..q]$ of $s$ in $O(1)$ time (as if one descended from the root reading $s[p..q]$ along the way). Unfortunately, the data structure of Gawrychowski et al. has no efficient construction algorithm, limiting its wider usage as an algorithmic tool. In this paper we resolve this issue, describing a data structure for weighted ancestors in suffix trees with constant query time and a linear construction algorithm. Our solution is based on a novel approach using so-called irreducible LCP values.
△ Less
Submitted 11 April, 2021; v1 submitted 28 February, 2021;
originally announced March 2021.
-
Underspecification Presents Challenges for Credibility in Modern Machine Learning
Authors:
Alexander D'Amour,
Katherine Heller,
Dan Moldovan,
Ben Adlam,
Babak Alipanahi,
Alex Beutel,
Christina Chen,
Jonathan Deaton,
Jacob Eisenstein,
Matthew D. Hoffman,
Farhad Hormozdiari,
Neil Houlsby,
Shaobo Hou,
Ghassen Jerfel,
Alan Karthikesalingam,
Mario Lucic,
Yian Ma,
Cory McLean,
Diana Mincu,
Akinori Mitani,
Andrea Montanari,
Zachary Nado,
Vivek Natarajan,
Christopher Nielson,
Thomas F. Osborne
, et al. (15 additional authors not shown)
Abstract:
ML models often exhibit unexpectedly poor behavior when they are deployed in real-world domains. We identify underspecification as a key reason for these failures. An ML pipeline is underspecified when it can return many predictors with equivalently strong held-out performance in the training domain. Underspecification is common in modern ML pipelines, such as those based on deep learning. Predict…
▽ More
ML models often exhibit unexpectedly poor behavior when they are deployed in real-world domains. We identify underspecification as a key reason for these failures. An ML pipeline is underspecified when it can return many predictors with equivalently strong held-out performance in the training domain. Underspecification is common in modern ML pipelines, such as those based on deep learning. Predictors returned by underspecified pipelines are often treated as equivalent based on their training domain performance, but we show here that such predictors can behave very differently in deployment domains. This ambiguity can lead to instability and poor model behavior in practice, and is a distinct failure mode from previously identified issues arising from structural mismatch between training and deployment domains. We show that this problem appears in a wide variety of practical ML pipelines, using examples from computer vision, medical imaging, natural language processing, clinical risk prediction based on electronic health records, and medical genomics. Our results show the need to explicitly account for underspecification in modeling pipelines that are intended for real-world deployment in any domain.
△ Less
Submitted 24 November, 2020; v1 submitted 6 November, 2020;
originally announced November 2020.
-
Quantitative and Qualitative Evaluation of Explainable Deep Learning Methods for Ophthalmic Diagnosis
Authors:
Amitojdeep Singh,
J. Jothi Balaji,
Mohammed Abdul Rasheed,
Varadharajan Jayakumar,
Rajiv Raman,
Vasudevan Lakshminarayanan
Abstract:
Background: The lack of explanations for the decisions made by algorithms such as deep learning has hampered their acceptance by the clinical community despite highly accurate results on multiple problems. Recently, attribution methods have emerged for explaining deep learning models, and they have been tested on medical imaging problems. The performance of attribution methods is compared on stand…
▽ More
Background: The lack of explanations for the decisions made by algorithms such as deep learning has hampered their acceptance by the clinical community despite highly accurate results on multiple problems. Recently, attribution methods have emerged for explaining deep learning models, and they have been tested on medical imaging problems. The performance of attribution methods is compared on standard machine learning datasets and not on medical images. In this study, we perform a comparative analysis to determine the most suitable explainability method for retinal OCT diagnosis.
Methods: A commonly used deep learning model known as Inception v3 was trained to diagnose 3 retinal diseases - choroidal neovascularization (CNV), diabetic macular edema (DME), and drusen. The explanations from 13 different attribution methods were rated by a panel of 14 clinicians for clinical significance. Feedback was obtained from the clinicians regarding the current and future scope of such methods.
Results: An attribution method based on a Taylor series expansion, called Deep Taylor was rated the highest by clinicians with a median rating of 3.85/5. It was followed by two other attribution methods, Guided backpropagation and SHAP (SHapley Additive exPlanations).
Conclusion: Explanations of deep learning models can make them more transparent for clinical diagnosis. This study compared different explanations methods in the context of retinal OCT diagnosis and found that the best performing method may not be the one considered best for other deep learning tasks. Overall, there was a high degree of acceptance from the clinicians surveyed in the study.
Keywords: explainable AI, deep learning, machine learning, image processing, Optical coherence tomography, retina, Diabetic macular edema, Choroidal Neovascularization, Drusen
△ Less
Submitted 24 March, 2021; v1 submitted 26 September, 2020;
originally announced September 2020.
-
Generating a Gray code for prefix normal words in amortized polylogarithmic time per word
Authors:
Péter Burcsi,
Gabriele Fici,
Zsuzsanna Lipták,
Rajeev Raman,
Joe Sawada
Abstract:
A prefix normal word is a binary word with the property that no substring has more $1$s than the prefix of the same length. By proving that the set of prefix normal words is a bubble language, we can exhaustively list all prefix normal words of length $n$ as a combinatorial Gray code, where successive strings differ by at most two swaps or bit flips. This Gray code can be generated in…
▽ More
A prefix normal word is a binary word with the property that no substring has more $1$s than the prefix of the same length. By proving that the set of prefix normal words is a bubble language, we can exhaustively list all prefix normal words of length $n$ as a combinatorial Gray code, where successive strings differ by at most two swaps or bit flips. This Gray code can be generated in $\Oh(\log^2 n)$ amortized time per word, while the best generation algorithm hitherto has $\Oh(n)$ running time per word. We also present a membership tester for prefix normal words, as well as a novel characterization of bubble languages.
△ Less
Submitted 7 August, 2020; v1 submitted 5 March, 2020;
originally announced March 2020.
-
Analysis of Electrostatic MEMS Using Energy-Charge Landscape
Authors:
Raghuram Tattamangalam Raman,
Arvind Ajoy,
Revathy Padmanabhan
Abstract:
A common way to analyze electrostatic microelectromechanical systems (MEMS) actuators is to use their energy-displacement landscape. Here, we describe an alternative approach to analyze electrostatic MEMS actuators using their energy-charge landscape. This technique involves coordinate transformation from displacement to charge, thereby formulating the Hamiltonian of electrostatic MEMS actuators i…
▽ More
A common way to analyze electrostatic microelectromechanical systems (MEMS) actuators is to use their energy-displacement landscape. Here, we describe an alternative approach to analyze electrostatic MEMS actuators using their energy-charge landscape. This technique involves coordinate transformation from displacement to charge, thereby formulating the Hamiltonian of electrostatic MEMS actuators in terms of charge. We investigate the use of the energy-charge landscape to analyze static pull-in, dynamic pull-in, and pull-out phenomena. The voltage expressions derived using this method are identical with those derived using the conventional energy-displacement landscape. In addition, we also obtain the expressions for charge under static and dynamic pull-in conditions. This work can aid in the design and analysis of electrostatic MEMS devices. As a case study, the analysis of a feedback capacitor-MEMS actuator system is presented to illustrate the application of the energy-charge landscape.
△ Less
Submitted 23 October, 2020; v1 submitted 27 February, 2020;
originally announced February 2020.
-
Finite-Sample Analysis of Image Registration
Authors:
Ravi Kiran Raman,
Lav R. Varshney
Abstract:
We study the problem of image registration in the finite-resolution regime and characterize the error probability of algorithms as a function of properties of the transformation and the image capture noise. Specifically, we define a channel-aware Feinstein decoder to obtain upper bounds on the minimum achievable error probability under finite resolution. We specifically focus on the higher-order t…
▽ More
We study the problem of image registration in the finite-resolution regime and characterize the error probability of algorithms as a function of properties of the transformation and the image capture noise. Specifically, we define a channel-aware Feinstein decoder to obtain upper bounds on the minimum achievable error probability under finite resolution. We specifically focus on the higher-order terms and use Berry-Esseen type CLTs to obtain a stronger characterization of the achievability condition for the problem. Then, we derive a strong type-counting result to characterize the performance of the MMI decoder in terms of the maximum likelihood decoder, in a simplified setting of the problem. We then describe how this analysis, when related to the results from the channel-aware context provide stronger characterization of the finite-sample performance of universal image registration.
△ Less
Submitted 12 January, 2020;
originally announced January 2020.
-
Decision Making in Star Networks with Incorrect Beliefs
Authors:
Daewon Seo,
Ravi Kiran Raman,
Lav R. Varshney
Abstract:
Consider a Bayesian binary decision-making problem in star networks, where local agents make selfish decisions independently, and a fusion agent makes a final decision based on aggregated decisions and its own private signal. In particular, we assume all agents have private beliefs for the true prior probability, based on which they perform Bayesian decision making. We focus on the Bayes risk of t…
▽ More
Consider a Bayesian binary decision-making problem in star networks, where local agents make selfish decisions independently, and a fusion agent makes a final decision based on aggregated decisions and its own private signal. In particular, we assume all agents have private beliefs for the true prior probability, based on which they perform Bayesian decision making. We focus on the Bayes risk of the fusion agent and counterintuitively find that incorrect beliefs could achieve a smaller risk than that when agents know the true prior. It is of independent interest for sociotechnical system design that the optimal beliefs of local agents resemble human probability reweighting models from cumulative prospect theory.
We also consider asymptotic characterization of the optimal beliefs and fusion agent's risk in the number of local agents. We find that the optimal risk of the fusion agent converges to zero exponentially fast as the number of local agents grows. Furthermore, having an identical constant belief is asymptotically optimal in the sense of the risk exponent. For additive Gaussian noise, the optimal belief turns out to be a simple function of only error costs and the risk exponent can be explicitly characterized.
△ Less
Submitted 26 October, 2021; v1 submitted 27 December, 2019;
originally announced December 2019.
-
Fixed-Parameter Tractable Algorithms for Corridor Guarding Problems
Authors:
Remi Raman,
R Subashini,
Subhasree Methirumangalath
Abstract:
Given an orthogonal connected arrangement of line-segments, Minimum Corridor Guarding(MCG) problem asks for an optimal tree/closed walk such that, if a guard moves through the tree/closed walk, all the line-segments are visited by the guard. This problem is referred to as Corridor-MST/Corridor-TSP (CMST/CTSP) for the cases when the guarding walk is a tree/closed walk, respectively. The correspondi…
▽ More
Given an orthogonal connected arrangement of line-segments, Minimum Corridor Guarding(MCG) problem asks for an optimal tree/closed walk such that, if a guard moves through the tree/closed walk, all the line-segments are visited by the guard. This problem is referred to as Corridor-MST/Corridor-TSP (CMST/CTSP) for the cases when the guarding walk is a tree/closed walk, respectively. The corresponding decision version of MCG is shown to be NP-Complete[1]. The parameterized version of CMST/CTSP referred to as k-CMST/k-CTSP, asks for an optimal tree/closed walk on at most k vertices, that visits all the line-segments. Here, vertices correspond to the endpoints and intersection points of the input line-segments. We show that k-CMST/k-CTSP is fixed-parameter tractable (FPT) with respect to the parameter k. Next, we propose a variant of CTSP referred to as Minimum Link CTSP(MLC), in which the link-distance of the closed walk has to be minimized. Here, link-distance refers to the number of links or connected line-segments with same orientation in the walk. We prove that the decision version of MLC is NP-Complete, and show that the parameterized version, namely b-MLC, is FPT with respect to the parameter b, where b corresponds to the link-distance. We also consider another related problem, the Minimum Corridor Connection (MCC). Given a rectilinear polygon partitioned into rectilinear components or rooms, MCC asks for a minimum length tree along the edges of the partitions, such that every room is incident to at least one vertex of the tree. The decision version of MCC is shown to be NP-Complete[2]. We prove the fixed parameter tractability of the parameterized version of MCC, namely k-MCC with respect to the parameter k, where k is the number of rooms.
△ Less
Submitted 29 March, 2022; v1 submitted 11 February, 2019;
originally announced February 2019.
-
Beliefs in Decision-Making Cascades
Authors:
Daewon Seo,
Ravi Kiran Raman,
Joong Bum Rhim,
Vivek K Goyal,
Lav R Varshney
Abstract:
This work explores a social learning problem with agents having nonidentical noise variances and mismatched beliefs. We consider an $N$-agent binary hypothesis test in which each agent sequentially makes a decision based not only on a private observation, but also on preceding agents' decisions. In addition, the agents have their own beliefs instead of the true prior, and have nonidentical noise v…
▽ More
This work explores a social learning problem with agents having nonidentical noise variances and mismatched beliefs. We consider an $N$-agent binary hypothesis test in which each agent sequentially makes a decision based not only on a private observation, but also on preceding agents' decisions. In addition, the agents have their own beliefs instead of the true prior, and have nonidentical noise variances in the private signal. We focus on the Bayes risk of the last agent, where preceding agents are selfish.
We first derive the optimal decision rule by recursive belief update and conclude, counterintuitively, that beliefs deviating from the true prior could be optimal in this setting. The effect of nonidentical noise levels in the two-agent case is also considered and analytical properties of the optimal belief curves are given. Next, we consider a predecessor selection problem wherein the subsequent agent of a certain belief chooses a predecessor from a set of candidates with varying beliefs. We characterize the decision region for choosing such a predecessor and argue that a subsequent agent with beliefs varying from the true prior often ends up selecting a suboptimal predecessor, indicating the need for a social planner. Lastly, we discuss an augmented intelligence design problem that uses a model of human behavior from cumulative prospect theory and investigate its near-optimality and suboptimality.
△ Less
Submitted 5 August, 2019; v1 submitted 23 November, 2018;
originally announced December 2018.
-
Promoting Distributed Trust in Machine Learning and Computational Simulation via a Blockchain Network
Authors:
Nelson Kibichii Bore,
Ravi Kiran Raman,
Isaac M. Markus,
Sekou L. Remy,
Oliver Bent,
Michael Hind,
Eleftheria K. Pissadaki,
Biplav Srivastava,
Roman Vaculin,
Kush R. Varshney,
Komminist Weldemariam
Abstract:
Policy decisions are increasingly dependent on the outcomes of simulations and/or machine learning models. The ability to share and interact with these outcomes is relevant across multiple fields and is especially critical in the disease modeling community where models are often only accessible and workable to the researchers that generate them. This work presents a blockchain-enabled system that…
▽ More
Policy decisions are increasingly dependent on the outcomes of simulations and/or machine learning models. The ability to share and interact with these outcomes is relevant across multiple fields and is especially critical in the disease modeling community where models are often only accessible and workable to the researchers that generate them. This work presents a blockchain-enabled system that establishes a decentralized trust between parties involved in a modeling process. Utilizing the OpenMalaria framework, we demonstrate the ability to store, share and maintain auditable logs and records of each step in the simulation process, showing how to validate results generated by computing workers. We also show how the system monitors worker outputs to rank and identify faulty workers via comparison to nearest neighbors or historical reward spaces as a means of ensuring model quality.
△ Less
Submitted 25 October, 2018;
originally announced October 2018.
-
Deep Learning vs. Human Graders for Classifying Severity Levels of Diabetic Retinopathy in a Real-World Nationwide Screening Program
Authors:
Paisan Raumviboonsuk,
Jonathan Krause,
Peranut Chotcomwongse,
Rory Sayres,
Rajiv Raman,
Kasumi Widner,
Bilson J L Campana,
Sonia Phene,
Kornwipa Hemarat,
Mongkol Tadarati,
Sukhum Silpa-Acha,
Jirawut Limwattanayingyong,
Chetan Rao,
Oscar Kuruvilla,
Jesse Jung,
Jeffrey Tan,
Surapong Orprayoon,
Chawawat Kangwanwongpaisan,
Ramase Sukulmalpaiboon,
Chainarong Luengchaichawang,
Jitumporn Fuangkaew,
Pipat Kongsap,
Lamyong Chualinpha,
Sarawuth Saree,
Srirat Kawinpanitan
, et al. (7 additional authors not shown)
Abstract:
Deep learning algorithms have been used to detect diabetic retinopathy (DR) with specialist-level accuracy. This study aims to validate one such algorithm on a large-scale clinical population, and compare the algorithm performance with that of human graders. 25,326 gradable retinal images of patients with diabetes from the community-based, nation-wide screening program of DR in Thailand were analy…
▽ More
Deep learning algorithms have been used to detect diabetic retinopathy (DR) with specialist-level accuracy. This study aims to validate one such algorithm on a large-scale clinical population, and compare the algorithm performance with that of human graders. 25,326 gradable retinal images of patients with diabetes from the community-based, nation-wide screening program of DR in Thailand were analyzed for DR severity and referable diabetic macular edema (DME). Grades adjudicated by a panel of international retinal specialists served as the reference standard. Across different severity levels of DR for determining referable disease, deep learning significantly reduced the false negative rate (by 23%) at the cost of slightly higher false positive rates (2%). Deep learning algorithms may serve as a valuable tool for DR screening.
△ Less
Submitted 18 October, 2018;
originally announced October 2018.
-
Trusted Multi-Party Computation and Verifiable Simulations: A Scalable Blockchain Approach
Authors:
Ravi Kiran Raman,
Roman Vaculin,
Michael Hind,
Sekou L. Remy,
Eleftheria K. Pissadaki,
Nelson Kibichii Bore,
Roozbeh Daneshvar,
Biplav Srivastava,
Kush R. Varshney
Abstract:
Large-scale computational experiments, often running over weeks and over large datasets, are used extensively in fields such as epidemiology, meteorology, computational biology, and healthcare to understand phenomena, and design high-stakes policies affecting everyday health and economy. For instance, the OpenMalaria framework is a computationally-intensive simulation used by various non-governmen…
▽ More
Large-scale computational experiments, often running over weeks and over large datasets, are used extensively in fields such as epidemiology, meteorology, computational biology, and healthcare to understand phenomena, and design high-stakes policies affecting everyday health and economy. For instance, the OpenMalaria framework is a computationally-intensive simulation used by various non-governmental and governmental agencies to understand malarial disease spread and effectiveness of intervention strategies, and subsequently design healthcare policies. Given that such shared results form the basis of inferences drawn, technological solutions designed, and day-to-day policies drafted, it is essential that the computations are validated and trusted. In particular, in a multi-agent environment involving several independent computing agents, a notion of trust in results generated by peers is critical in facilitating transparency, accountability, and collaboration. Using a novel combination of distributed validation of atomic computation blocks and a blockchain-based immutable audits mechanism, this work proposes a universal framework for distributed trust in computations. In particular we address the scalaibility problem by reducing the storage and communication costs using a lossy compression scheme. This framework guarantees not only verifiability of final results, but also the validity of local computations, and its cost-benefit tradeoffs are studied using a synthetic example of training a neural network.
△ Less
Submitted 22 September, 2018;
originally announced September 2018.
-
Dynamic Distributed Storage for Scaling Blockchains
Authors:
Ravi Kiran Raman,
Lav R. Varshney
Abstract:
Blockchain uses the idea of storing transaction data in the form of a distributed ledger wherein each node in the network stores a current copy of the sequence of transactions in the form of a hash chain. This requirement of storing the entire ledger incurs a high storage cost that grows undesirably large for high transaction rates and large networks. In this work we use the ideas of secret key sh…
▽ More
Blockchain uses the idea of storing transaction data in the form of a distributed ledger wherein each node in the network stores a current copy of the sequence of transactions in the form of a hash chain. This requirement of storing the entire ledger incurs a high storage cost that grows undesirably large for high transaction rates and large networks. In this work we use the ideas of secret key sharing, private key encryption, and distributed storage to design a coding scheme such that each node stores only a part of the entire transaction thereby reducing the storage cost to a fraction of its original cost. When further using dynamic zone allocation, we show the coding scheme can also improve the integrity of the transaction data in the network over current schemes. Further, block validation (bitcoin mining) consumes a significant amount of energy as it is necessary to determine a hash value satisfying a specific set of constraints; we show that using dynamic distributed storage reduces these energy costs.
△ Less
Submitted 7 January, 2018; v1 submitted 20 November, 2017;
originally announced November 2017.
-
m-Bonsai: a Practical Compact Dynamic Trie
Authors:
Andreas Poyias,
Simon J. Puglisi,
Rajeev Raman
Abstract:
We consider the problem of implementing a space-efficient dynamic trie, with an emphasis on good practical performance. For a trie with $n$ nodes with an alphabet of size $σ$, the information-theoretic lower bound is $n \log σ+ O(n)$ bits. The Bonsai data structure is a compact trie proposed by Darragh et al. (Softw., Pract. Exper. 23(3), 1993, p. 277-291). Its disadvantages include the user havin…
▽ More
We consider the problem of implementing a space-efficient dynamic trie, with an emphasis on good practical performance. For a trie with $n$ nodes with an alphabet of size $σ$, the information-theoretic lower bound is $n \log σ+ O(n)$ bits. The Bonsai data structure is a compact trie proposed by Darragh et al. (Softw., Pract. Exper. 23(3), 1993, p. 277-291). Its disadvantages include the user having to specify an upper bound $M$ on the trie size in advance (which cannot be changed easily after initalization), a space usage of $M \log σ+ O(M \log \log M)$ (which is asymptotically non-optimal for smaller $σ$ or if $n \ll M$) and a lack of support for deletions. It supports traversal and update operations in $O(1/ε)$ expected time (based on assumptions about the behaviour of hash functions), where $ε= (M-n)/M$ and has excellent speed performance in practice. We propose an alternative, m-Bonsai, that addresses the above problems, obtaining a trie that uses $(1+β) n (\log σ+ O(1))$ bits in expectation, and supports traversal and update operations in $O(1/β)$ expected time and $O(1/β^2)$ amortized expected time, for any user-specified parameter $β> 0$ (again based on assumptions about the behaviour of hash functions). We give an implementation of m-Bonsai which uses considerably less memory and is slightly faster than the original Bonsai.
△ Less
Submitted 19 April, 2017;
originally announced April 2017.
-
Universal Joint Image Clustering and Registration using Partition Information
Authors:
Ravi Kiran Raman,
Lav R. Varshney
Abstract:
We consider the problem of universal joint clustering and registration of images and define algorithms using multivariate information functionals. We first study registering two images using maximum mutual information and prove its asymptotic optimality. We then show the shortcomings of pairwise registration in multi-image registration, and design an asymptotically optimal algorithm based on multi…
▽ More
We consider the problem of universal joint clustering and registration of images and define algorithms using multivariate information functionals. We first study registering two images using maximum mutual information and prove its asymptotic optimality. We then show the shortcomings of pairwise registration in multi-image registration, and design an asymptotically optimal algorithm based on multiinformation. Further, we define a novel multivariate information functional to perform joint clustering and registration of images, and prove consistency of the algorithm. Finally, we consider registration and clustering of numerous limited-resolution images, defining algorithms that are order-optimal in scaling of number of pixels in each image with the number of images.
△ Less
Submitted 30 November, 2017; v1 submitted 10 January, 2017;
originally announced January 2017.
-
Universal Clustering via Crowdsourcing
Authors:
Ravi Kiran Raman,
Lav Varshney
Abstract:
Consider unsupervised clustering of objects drawn from a discrete set, through the use of human intelligence available in crowdsourcing platforms. This paper defines and studies the problem of universal clustering using responses of crowd workers, without knowledge of worker reliability or task difficulty. We model stochastic worker response distributions by incorporating traits of memory for simi…
▽ More
Consider unsupervised clustering of objects drawn from a discrete set, through the use of human intelligence available in crowdsourcing platforms. This paper defines and studies the problem of universal clustering using responses of crowd workers, without knowledge of worker reliability or task difficulty. We model stochastic worker response distributions by incorporating traits of memory for similar objects and traits of distance among differing objects. We are particularly interested in two limiting worker types---temporary workers who retain no memory of responses and long-term workers with memory. We first define clustering algorithms for these limiting cases and then integrate them into an algorithm for the unified worker model. We prove asymptotic consistency of the algorithms and establish sufficient conditions on the sample complexity of the algorithm. Converse arguments establish necessary conditions on sample complexity, proving that the defined algorithms are asymptotically order-optimal in cost.
△ Less
Submitted 5 October, 2016;
originally announced October 2016.
-
Significance of Natural Scene Statistics in Understanding the Anisotropies of Perceptual Filling-in at the Blind Spot
Authors:
Rajani Raman,
Sandip Sarkar
Abstract:
Psychophysical experiments reveal our horizontal preference in perceptual filling-in at the blind spot. On the other hand, vertical preference is exhibited in the case of tolerance in filling-in. What causes this anisotropy in our perception? Building upon the general notion, that the functional properties of the early visual system are shaped by the innate specification as well as the statistics…
▽ More
Psychophysical experiments reveal our horizontal preference in perceptual filling-in at the blind spot. On the other hand, vertical preference is exhibited in the case of tolerance in filling-in. What causes this anisotropy in our perception? Building upon the general notion, that the functional properties of the early visual system are shaped by the innate specification as well as the statistics of the environment, we reasoned that the anisotropy in filling-in could be understood in terms of anisotropy in orientation distribution inherent to natural scene statistics. We examined this proposition by investigating filling-in of bar stimuli on a Hierarchical Predictive Coding model network. In response to bar stimuli, the model network, trained with natural images, exhibited anisotropic filling-in performance at the blind spot similar to reported in psychophysical experiments i.e. horizontal preference in filling-in and vertical preference in tolerance of filling-in. We suggest that the over-representation of horizontal contours in the natural scene contribute to the observed horizontal superiority while the broader distribution of vertical contours contributes to the observed vertical superiority in tolerance. These results indicate that natural scene statistics plays a significant role in determining the filling-in performance at the blind spot and sha** the associated anisotropies.
△ Less
Submitted 12 January, 2017; v1 submitted 11 July, 2016;
originally announced July 2016.
-
Predictive coding: A Possible Explanation of Filling-in at the blind spot
Authors:
Rajani Raman,
Sandip Sarkar
Abstract:
Filling-in at the blind-spot is a perceptual phenomenon in which the visual system fills the informational void, which arises due to the absence of retinal input corresponding to the optic disc, with surrounding visual attributes. Though there are enough evidence to conclude that some kind of neural computation is involved in filling-in at the blind spot especially in the early visual cortex, the…
▽ More
Filling-in at the blind-spot is a perceptual phenomenon in which the visual system fills the informational void, which arises due to the absence of retinal input corresponding to the optic disc, with surrounding visual attributes. Though there are enough evidence to conclude that some kind of neural computation is involved in filling-in at the blind spot especially in the early visual cortex, the knowledge of the actual computational mechanism is far from complete. We have investigated the bar experiments and the associated filling-in phenomenon in the light of the hierarchical predictive coding framework, where the blind-spot was represented by the absence of early feed-forward connection. We recorded the responses of predictive estimator neurons at the blind-spot region in the V1 area of our three level (LGN-V1-V2) model network. These responses are in agreement with the results of earlier physiological studies and using the generative model we also showed that these response profiles indeed represent the filling-in completion. These demonstrate that predictive coding framework could account for the filling-in phenomena observed in several psychophysical and physiological experiments involving bar stimuli. These results suggest that the filling-in could naturally arise from the computational principle of hierarchical predictive coding (HPC) of natural images.
△ Less
Submitted 6 July, 2015;
originally announced July 2015.
-
Tree Compression with Top Trees Revisited
Authors:
Lorenz Hübschle-Schneider,
Rajeev Raman
Abstract:
We revisit tree compression with top trees (Bille et al, ICALP'13) and present several improvements to the compressor and its analysis. By significantly reducing the amount of information stored and guiding the compression step using a RePair-inspired heuristic, we obtain a fast compressor achieving good compression ratios, addressing an open problem posed by Bille et al. We show how, with relativ…
▽ More
We revisit tree compression with top trees (Bille et al, ICALP'13) and present several improvements to the compressor and its analysis. By significantly reducing the amount of information stored and guiding the compression step using a RePair-inspired heuristic, we obtain a fast compressor achieving good compression ratios, addressing an open problem posed by Bille et al. We show how, with relatively small overhead, the compressed file can be converted into an in-memory representation that supports basic navigation operations in worst-case logarithmic time without decompression. We also show a much improved worst-case bound on the size of the output of top-tree compression (answering an open question posed in a talk on this algorithm by Weimann in 2012).
△ Less
Submitted 15 June, 2015;
originally announced June 2015.
-
SEPIA: Search for Proofs Using Inferred Automata
Authors:
Thomas Gransden,
Neil Walkinshaw,
Rajeev Raman
Abstract:
This paper describes SEPIA, a tool for automated proof generation in Coq. SEPIA combines model inference with interactive theorem proving. Existing proof corpora are modelled using state-based models inferred from tactic sequences. These can then be traversed automatically to identify proofs. The SEPIA system is described and its performance evaluated on three Coq datasets. Our results show that S…
▽ More
This paper describes SEPIA, a tool for automated proof generation in Coq. SEPIA combines model inference with interactive theorem proving. Existing proof corpora are modelled using state-based models inferred from tactic sequences. These can then be traversed automatically to identify proofs. The SEPIA system is described and its performance evaluated on three Coq datasets. Our results show that SEPIA provides a useful complement to existing automated tactics in Coq.
△ Less
Submitted 29 May, 2015;
originally announced May 2015.
-
Detection of texts in natural images
Authors:
Gowtham Rangarajan Raman
Abstract:
A framework that makes use of Connected components and supervised Support machine to recognise texts is proposed. The image is preprocessed and and edge graph is calculated using a probabilistic framework to compensate for photometric noise. Connected components over the resultant image is calculated, which is bounded and then pruned using geometric constraints. Finally a Gabor Feature based SVM i…
▽ More
A framework that makes use of Connected components and supervised Support machine to recognise texts is proposed. The image is preprocessed and and edge graph is calculated using a probabilistic framework to compensate for photometric noise. Connected components over the resultant image is calculated, which is bounded and then pruned using geometric constraints. Finally a Gabor Feature based SVM is used to classify the presence of text in the candidates. The proposed method was tested with ICDAR 10 dataset and few other images available on the internet. It resulted in a recall and precision metric of 0.72 and 0.88 comfortably better than the benchmark Eiphstein's algorithm. The proposed method recorded a 0.70 and 0.74 in natural images which is significantly better than current methods on natural images. The proposed method also scales almost linearly for high resolution, cluttered images.
△ Less
Submitted 1 November, 2014;
originally announced November 2014.
-
On Succinct Representations of Binary Trees
Authors:
Pooya Davoodi,
Rajeev Raman,
Srinivasa Rao Satti
Abstract:
We observe that a standard transformation between \emph{ordinal} trees (arbitrary rooted trees with ordered children) and binary trees leads to interesting succinct binary tree representations. There are four symmetric versions of these transformations. Via these transformations we get four succinct representations of $n$-node binary trees that use $2n + n/(\log n)^{O(1)}$ bits and support (among…
▽ More
We observe that a standard transformation between \emph{ordinal} trees (arbitrary rooted trees with ordered children) and binary trees leads to interesting succinct binary tree representations. There are four symmetric versions of these transformations. Via these transformations we get four succinct representations of $n$-node binary trees that use $2n + n/(\log n)^{O(1)}$ bits and support (among other operations) navigation, inorder numbering, one of pre- or post-order numbering, subtree size and lowest common ancestor (LCA) queries. The ability to support inorder numbering is crucial for the well-known range-minimum query (RMQ) problem on an array $A$ of $n$ ordered values. While this functionality, and more, is also supported in $O(1)$ time using $2n + o(n)$ bits by Davoodi et al.'s (\emph{Phil. Trans. Royal Soc. A} \textbf{372} (2014)) extension of a representation by Farzan and Munro (\emph{Algorithmica} \textbf{6} (2014)), their \emph{redundancy}, or the $o(n)$ term, is much larger, and their approach may not be suitable for practical implementations.
One of these transformations is related to the Zaks' sequence (S.~Zaks, \emph{Theor. Comput. Sci.} \textbf{10} (1980)) for encoding binary trees, and we thus provide the first succinct binary tree representation based on Zaks' sequence. Another of these transformations is equivalent to Fischer and Heun's (\emph{SIAM J. Comput.} \textbf{40} (2011)) \minheap\ structure for this problem. Yet another variant allows an encoding of the Cartesian tree of $A$ to be constructed from $A$ using only $O(\sqrt{n} \log n)$ bits of working space.
△ Less
Submitted 18 October, 2014;
originally announced October 2014.
-
Mining State-Based Models from Proof Corpora
Authors:
Thomas Gransden,
Neil Walkinshaw,
Rajeev Raman
Abstract:
Interactive theorem provers have been used extensively to reason about various software/hardware systems and mathematical theorems. The key challenge when using an interactive prover is finding a suitable sequence of proof steps that will lead to a successful proof requires a significant amount of human intervention. This paper presents an automated technique that takes as input examples of succes…
▽ More
Interactive theorem provers have been used extensively to reason about various software/hardware systems and mathematical theorems. The key challenge when using an interactive prover is finding a suitable sequence of proof steps that will lead to a successful proof requires a significant amount of human intervention. This paper presents an automated technique that takes as input examples of successful proofs and infers an Extended Finite State Machine as output. This can in turn be used to generate proofs of new conjectures. Our preliminary experiments show that the inferred models are generally accurate (contain few false-positive sequences) and that representing existing proofs in such a way can be very useful when guiding new ones.
△ Less
Submitted 14 May, 2014;
originally announced May 2014.
-
QPTAS for Geometric Set-Cover Problems via Optimal Separators
Authors:
Nabil H. Mustafa,
Rajiv Raman,
Saurabh Ray
Abstract:
Weighted geometric set-cover problems arise naturally in several geometric and non-geometric settings (e.g. the breakthrough of Bansal-Pruhs (FOCS 2010) reduces a wide class of machine scheduling problems to weighted geometric set-cover). More than two decades of research has succeeded in settling the $(1+ε)$-approximability status for most geometric set-cover problems, except for four basic scena…
▽ More
Weighted geometric set-cover problems arise naturally in several geometric and non-geometric settings (e.g. the breakthrough of Bansal-Pruhs (FOCS 2010) reduces a wide class of machine scheduling problems to weighted geometric set-cover). More than two decades of research has succeeded in settling the $(1+ε)$-approximability status for most geometric set-cover problems, except for four basic scenarios which are still lacking. One is that of weighted disks in the plane for which, after a series of papers, Varadarajan (STOC 2010) presented a clever \emph{quasi-sampling} technique, which together with improvements by Chan \etal~(SODA 2012), yielded a $O(1)$-approximation algorithm. Even for the unweighted case, a PTAS for a fundamental class of objects called pseudodisks (which includes disks, unit-height rectangles, translates of convex sets etc.) is currently unknown. Another fundamental case is weighted halfspaces in $\Re^3$, for which a PTAS is currently lacking. In this paper, we present a QPTAS for all of these remaining problems. Our results are based on the separator framework of Adamaszek-Wiese (FOCS 2013, SODA 2014), who recently obtained a QPTAS for weighted independent set of polygonal regions. This rules out the possibility that these problems are APX-hard, assuming $\textbf{NP} \nsubseteq \textbf{DTIME}(2^{polylog(n)})$. Together with the recent work of Chan-Grant (CGTA 2014), this settles the APX-hardness status for all natural geometric set-cover problems.
△ Less
Submitted 5 April, 2014; v1 submitted 4 March, 2014;
originally announced March 2014.
-
Encoding Range Minimum Queries
Authors:
Pooya Davoodi,
Gonzalo Navarro,
Rajeev Raman,
S. Srinivasa Rao
Abstract:
We consider the problem of encoding range minimum queries (RMQs): given an array A[1..n] of distinct totally ordered values, to pre-process A and create a data structure that can answer the query RMQ(i,j), which returns the index containing the smallest element in A[i..j], without access to the array A at query time. We give a data structure whose space usage is 2n + o(n) bits, which is asymptotic…
▽ More
We consider the problem of encoding range minimum queries (RMQs): given an array A[1..n] of distinct totally ordered values, to pre-process A and create a data structure that can answer the query RMQ(i,j), which returns the index containing the smallest element in A[i..j], without access to the array A at query time. We give a data structure whose space usage is 2n + o(n) bits, which is asymptotically optimal for worst-case data, and answers RMQs in O(1) worst-case time. This matches the previous result of Fischer and Heun [SICOMP, 2011], but is obtained in a more natural way. Furthermore, our result can encode the RMQs of a random array A in 1.919n + o(n) bits in expectation, which is not known to hold for Fischer and Heun's result. We then generalize our result to the encoding range top-2 query (RT2Q) problem, which is like the encoding RMQ problem except that the query RT2Q(i,j) returns the indices of both the smallest and second-smallest elements of A[i..j]. We introduce a data structure using 3.272n+o(n) bits that answers RT2Qs in constant time, and also give lower bounds on the effective entropy} of RT2Q.
△ Less
Submitted 18 November, 2013;
originally announced November 2013.
-
Electronically Driven Fragmentation of Silver Nanocrystals Revealed by Ultrafast Electron Crystallography
Authors:
Ramani K. Raman,
Ryan A. Murdick,
Richard J. Worhatch,
Yoshie Murooka,
Subhendra D. Mahanti,
Tzong-Ru T. Han,
Chong-Yu Ruan
Abstract:
We report an ultrafast electron diffraction study of silver nanocrystals under surface plasmon resonance excitation, leading to a concerted fragmentation. By examining simultaneously transient structural, thermal, and Coulombic signatures of the prefragmented state, an electronically driven nonthermal fragmentation scenario is proposed.
We report an ultrafast electron diffraction study of silver nanocrystals under surface plasmon resonance excitation, leading to a concerted fragmentation. By examining simultaneously transient structural, thermal, and Coulombic signatures of the prefragmented state, an electronically driven nonthermal fragmentation scenario is proposed.
△ Less
Submitted 12 September, 2013;
originally announced September 2013.
-
Succinct Indices for Range Queries with applications to Orthogonal Range Maxima
Authors:
Arash Farzan,
J. Ian Munro,
Rajeev Raman
Abstract:
We consider the problem of preprocessing $N$ points in 2D, each endowed with a priority, to answer the following queries: given a axis-parallel rectangle, determine the point with the largest priority in the rectangle. Using the ideas of the \emph{effective entropy} of range maxima queries and \emph{succinct indices} for range maxima queries, we obtain a structure that uses O(N) words and answers…
▽ More
We consider the problem of preprocessing $N$ points in 2D, each endowed with a priority, to answer the following queries: given a axis-parallel rectangle, determine the point with the largest priority in the rectangle. Using the ideas of the \emph{effective entropy} of range maxima queries and \emph{succinct indices} for range maxima queries, we obtain a structure that uses O(N) words and answers the above query in $O(\log N \log \log N)$ time. This is a direct improvement of Chazelle's result from FOCS 1985 for this problem -- Chazelle required $O(N/ε)$ words to answer queries in $O((\log N)^{1+ε})$ time for any constant $ε> 0$.
△ Less
Submitted 21 April, 2012;
originally announced April 2012.
-
Encoding 2-D Range Maximum Queries
Authors:
Mordecai J. Golin,
John Iacono,
Danny Krizanc,
Rajeev Raman,
S. Srinivasa Rao,
Sunil Shende
Abstract:
We consider the \emph{two-dimensional range maximum query (2D-RMQ)} problem: given an array $A$ of ordered values, to pre-process it so that we can find the position of the smallest element in the sub-matrix defined by a (user-specified) range of rows and range of columns. We focus on determining the \emph{effective} entropy of 2D-RMQ, i.e., how many bits are needed to encode $A$ so that 2D-RMQ qu…
▽ More
We consider the \emph{two-dimensional range maximum query (2D-RMQ)} problem: given an array $A$ of ordered values, to pre-process it so that we can find the position of the smallest element in the sub-matrix defined by a (user-specified) range of rows and range of columns. We focus on determining the \emph{effective} entropy of 2D-RMQ, i.e., how many bits are needed to encode $A$ so that 2D-RMQ queries can be answered \emph{without} access to $A$. We give tight upper and lower bounds on the expected effective entropy for the case when $A$ contains independent identically-distributed random values, and new upper and lower bounds for arbitrary $A$, for the case when $A$ contains few rows. The latter results improve upon previous upper and lower bounds by Brodal et al. (ESA 2010). In some cases we also give data structures whose space usage is close to the effective entropy and answer 2D-RMQ queries rapidly.
△ Less
Submitted 25 April, 2012; v1 submitted 13 September, 2011;
originally announced September 2011.
-
Optimal Indexes for Sparse Bit Vectors
Authors:
Alexander Golynski,
Alessio Orlandi,
Rajeev Raman,
S. Srinivasa Rao
Abstract:
We consider the problem of supporting Rank() and Select() operations on a bit vector of length m with n 1 bits. The problem is considered in the succinct index model, where the bit vector is stored in "read-only" memory and an additional data structure, called the index, is created during pre-processing to help answer the above queries. We give asymptotically optimal density-sensitive trade-offs,…
▽ More
We consider the problem of supporting Rank() and Select() operations on a bit vector of length m with n 1 bits. The problem is considered in the succinct index model, where the bit vector is stored in "read-only" memory and an additional data structure, called the index, is created during pre-processing to help answer the above queries. We give asymptotically optimal density-sensitive trade-offs, involving both m and n, that relate the size of the index to the number of accesses to the bit vector (and processing time) needed to answer the above queries. The results are particularly interesting for the case where n = o(m).
△ Less
Submitted 10 August, 2011;
originally announced August 2011.
-
Succinct Representations of Permutations and Functions
Authors:
J. Ian Munro,
Rajeev Raman,
Venkatesh Raman,
S. Srinivasa Rao
Abstract:
We investigate the problem of succinctly representing an arbitrary permutation, π, on {0,...,n-1} so that π^k(i) can be computed quickly for any i and any (positive or negative) integer power k. A representation taking (1+ε) n lg n + O(1) bits suffices to compute arbitrary powers in constant time, for any positive constant ε<= 1. A representation taking the optimal \ceil{\lg n!} + o(n) bits can be…
▽ More
We investigate the problem of succinctly representing an arbitrary permutation, π, on {0,...,n-1} so that π^k(i) can be computed quickly for any i and any (positive or negative) integer power k. A representation taking (1+ε) n lg n + O(1) bits suffices to compute arbitrary powers in constant time, for any positive constant ε<= 1. A representation taking the optimal \ceil{\lg n!} + o(n) bits can be used to compute arbitrary powers in O(lg n / lg lg n) time.
We then consider the more general problem of succinctly representing an arbitrary function, f: [n] \rightarrow [n] so that f^k(i) can be computed quickly for any i and any integer power k. We give a representation that takes (1+ε) n lg n + O(1) bits, for any positive constant ε<= 1, and computes arbitrary positive powers in constant time. It can also be used to compute f^k(i), for any negative integer k, in optimal O(1+|f^k(i)|) time.
We place emphasis on the redundancy, or the space beyond the information-theoretic lower bound that the data structure uses in order to support operations efficiently. A number of lower bounds have recently been shown on the redundancy of data structures. These lower bounds confirm the space-time optimality of some of our solutions. Furthermore, the redundancy of one of our structures "surpasses" a recent lower bound by Golynski [Golynski, SODA 2009], thus demonstrating the limitations of this lower bound.
△ Less
Submitted 9 August, 2011;
originally announced August 2011.
-
Optimal Trade-Off for Succinct String Indexes
Authors:
Roberto Grossi,
Alessio Orlandi,
Rajeev Raman
Abstract:
Let s be a string whose symbols are solely available through access(i), a read-only operation that probes s and returns the symbol at position i in s. Many compressed data structures for strings, trees, and graphs, require two kinds of queries on s: select(c, j), returning the position in s containing the jth occurrence of c, and rank(c, p), counting how many occurrences of c are found in the firs…
▽ More
Let s be a string whose symbols are solely available through access(i), a read-only operation that probes s and returns the symbol at position i in s. Many compressed data structures for strings, trees, and graphs, require two kinds of queries on s: select(c, j), returning the position in s containing the jth occurrence of c, and rank(c, p), counting how many occurrences of c are found in the first p positions of s. We give matching upper and lower bounds for this problem, improving the lower bounds given by Golynski [Theor. Comput. Sci. 387 (2007)] [PhD thesis] and the upper bounds of Barbay et al. [SODA 2007]. We also present new results in another model, improving on Barbay et al. [SODA 2007] and matching a lower bound of Golynski [SODA 2009]. The main contribution of this paper is to introduce a general technique for proving lower bounds on succinct data structures, that is based on the access patterns of the supported operations, abstracting from the particular operations at hand. For this, it may find application to other interesting problems on succinct data structures.
△ Less
Submitted 28 June, 2010;
originally announced June 2010.
-
Random Access to Grammar Compressed Strings
Authors:
Philip Bille,
Gad M. Landau,
Rajeev Raman,
Kunihiko Sadakane,
Srinivasa Rao Satti,
Oren Weimann
Abstract:
Grammar based compression, where one replaces a long string by a small context-free grammar that generates the string, is a simple and powerful paradigm that captures many popular compression schemes. In this paper, we present a novel grammar representation that allows efficient random access to any character or substring without decompressing the string.
Let $S$ be a string of length $N$ compre…
▽ More
Grammar based compression, where one replaces a long string by a small context-free grammar that generates the string, is a simple and powerful paradigm that captures many popular compression schemes. In this paper, we present a novel grammar representation that allows efficient random access to any character or substring without decompressing the string.
Let $S$ be a string of length $N$ compressed into a context-free grammar $\mathcal{S}$ of size $n$. We present two representations of $\mathcal{S}$ achieving $O(\log N)$ random access time, and either $O(n\cdot α_k(n))$ construction time and space on the pointer machine model, or $O(n)$ construction time and space on the RAM. Here, $α_k(n)$ is the inverse of the $k^{th}$ row of Ackermann's function. Our representations also efficiently support decompression of any substring in $S$: we can decompress any substring of length $m$ in the same complexity as a single random access query and additional $O(m)$ time. Combining these results with fast algorithms for uncompressed approximate string matching leads to several efficient algorithms for approximate string matching on grammar-compressed strings without decompression. For instance, we can find all approximate occurrences of a pattern $P$ with at most $k$ errors in time $O(n(\min\{|P|k, k^4 + |P|\} + \log N) + occ)$, where $occ$ is the number of occurrences of $P$ in $S$. Finally, we generalize our results to navigation and other operations on grammar-compressed ordered trees.
All of the above bounds significantly improve the currently best known results. To achieve these bounds, we introduce several new techniques and data structures of independent interest, including a predecessor data structure, two "biased" weighted ancestor data structures, and a compact representation of heavy paths in grammars.
△ Less
Submitted 29 October, 2013; v1 submitted 11 January, 2010;
originally announced January 2010.