
Showing 1–38 of 38 results for author: Bedathur, S

  1. Robust Training of Temporal GNNs using Nearest Neighbours based Hard Negatives

    Authors: Shubham Gupta, Srikanta Bedathur

    Abstract: Temporal graph neural networks (TGNNs) have exhibited state-of-the-art performance in future-link prediction tasks. Training of these TGNNs is enumerated by uniform random sampling based unsupervised loss. During training, in the context of a positive example, the loss is computed over uninformative negatives, which introduces redundancy and sub-optimal performance. In this paper, we propose modified uns…

    Submitted 14 February, 2024; originally announced February 2024.

    Comments: 10 pages

  2. arXiv:2312.13616  [pdf, other]

    cs.LG cs.AI

    Navigating the Structured What-If Spaces: Counterfactual Generation via Structured Diffusion

    Authors: Nishtha Madaan, Srikanta Bedathur

    Abstract: Generating counterfactual explanations is one of the most effective approaches for uncovering the inner workings of black-box neural network models and building user trust. While remarkable strides have been made in generative modeling using diffusion models in domains like vision, their utility in generating counterfactual explanations in structured modalities remains unexplored. In this paper, w…

    Submitted 21 December, 2023; originally announced December 2023.

    Comments: 13 pages

  3. arXiv:2307.10305  [pdf, other]

    cs.CV cs.LG

    Tapestry of Time and Actions: Modeling Human Activity Sequences using Temporal Point Process Flows

    Authors: Vinayak Gupta, Srikanta Bedathur

    Abstract: Human beings always engage in a vast range of activities and tasks that demonstrate their ability to adapt to different scenarios. Any human activity can be represented as a temporal sequence of actions performed to achieve a certain goal. Unlike the time series datasets extracted from electronics or machines, these action sequences are highly disparate in their nature -- the time to finish a sequ…

    Submitted 13 July, 2023; originally announced July 2023.

    Comments: Extended version of Gupta and Bedathur [arXiv:2206.05291] (SIGKDD 2022). Under review in a journal

  4. arXiv:2307.09613  [pdf, other]

    cs.LG cs.IR

    Retrieving Continuous Time Event Sequences using Neural Temporal Point Processes with Learnable Hashing

    Authors: Vinayak Gupta, Srikanta Bedathur, Abir De

    Abstract: Temporal sequences have become pervasive in various real-world applications. Consequently, the volume of data generated in the form of continuous time-event sequence(s) or CTES(s) has increased exponentially in the past few years. Thus, a significant fraction of the ongoing research on CTES datasets involves designing models to address downstream tasks such as next-event prediction, long-term fore…

    Submitted 13 July, 2023; originally announced July 2023.

    Comments: Extended version of Gupta et al. [arXiv:2202.11485] (AAAI 2022). Under review in a journal

  5. arXiv:2306.03480  [pdf, other]

    cs.LG cs.AI

    GSHOT: Few-shot Generative Modeling of Labeled Graphs

    Authors: Sahil Manchanda, Shubham Gupta, Sayan Ranu, Srikanta Bedathur

    Abstract: Deep graph generative modeling has gained enormous attraction in recent years due to its impressive ability to directly learn the underlying hidden graph distribution. Despite their initial success, these techniques, like much of the existing deep generative methods, require a large number of training samples to learn a good model. Unfortunately, large number of training samples may not always be…

    Submitted 14 December, 2023; v1 submitted 6 June, 2023; originally announced June 2023.

    Comments: Accepted in Learning on Graph Conference (LOG, 2023), https://openreview.net/forum?id=Hy9K2WiVwW

  6. arXiv:2306.03447  [pdf, other]

    cs.LG cs.AI

    GRAFENNE: Learning on Graphs with Heterogeneous and Dynamic Feature Sets

    Authors: Shubham Gupta, Sahil Manchanda, Sayan Ranu, Srikanta Bedathur

    Abstract: Graph neural networks (GNNs), in general, are built on the assumption of a static set of features characterizing each node in a graph. This assumption is often violated in practice. Existing methods partly address this issue through feature imputation. However, these techniques (i) assume uniformity of feature set across nodes, (ii) are transductive by nature, and (iii) fail to work when features…

    Submitted 6 June, 2023; originally announced June 2023.

    Comments: 17 pages, 4 figures and 9 tables. Accepted in ICML 2023, DOI will be updated once it is available

  7. arXiv:2302.11777  [pdf, other]

    cs.LG cs.DB cs.IR

    Embeddings for Tabular Data: A Survey

    Authors: Rajat Singh, Srikanta Bedathur

    Abstract: Tabular data comprising rows (samples) with the same set of columns (attributes), is one of the most widely used data-type among various industries, including financial services, health care, research, retail, and logistics, to name a few. Tables are becoming the natural way of storing data among various industries and academia. The data stored in these tables serve as an essential source of inform…

    Submitted 22 February, 2023; originally announced February 2023.

  8. arXiv:2211.04250  [pdf, other]

    cs.LG cs.AI cs.CL

    DetAIL : A Tool to Automatically Detect and Analyze Drift In Language

    Authors: Nishtha Madaan, Adithya Manjunatha, Hrithik Nambiar, Aviral Kumar Goel, Harivansh Kumar, Diptikalyan Saha, Srikanta Bedathur

    Abstract: Machine learning and deep learning-based decision making has become part of today's software. The goal of this work is to ensure that machine learning and deep learning-based systems are as trusted as traditional software. Traditional software is made dependable by following rigorous practice like static analysis, testing, debugging, verifying, and repairing throughout the development and maintena…

    Submitted 3 November, 2022; originally announced November 2022.

  9. Modeling Spatial Trajectories using Coarse-Grained Smartphone Logs

    Authors: Vinayak Gupta, Srikanta Bedathur

    Abstract: Current approaches for points-of-interest (POI) recommendation learn the preferences of a user via the standard spatial features such as the POI coordinates, the social network, etc. These models ignore a crucial aspect of spatial mobility -- every user carries their smartphones wherever they go. In addition, with growing privacy concerns, users refrain from sharing their exact geographical coordi…

    Submitted 28 August, 2022; originally announced August 2022.

    Comments: IEEE Transactions on Big Data

  10. arXiv:2208.12126  [pdf, other]

    cs.LG

    A Survey on Temporal Graph Representation Learning and Generative Modeling

    Authors: Shubham Gupta, Srikanta Bedathur

    Abstract: Temporal graphs represent the dynamic relationships among entities and occur in many real-life applications like social networks, e-commerce, communication, road networks, biological systems, and many more. They necessitate research beyond the work related to static graphs in terms of their generative modeling and representation learning. In this survey, we comprehensively review the neural time de…

    Submitted 25 August, 2022; originally announced August 2022.

    Comments: 27 pages, 2 figures

  11. Modeling Continuous Time Sequences with Intermittent Observations using Marked Temporal Point Processes

    Authors: Vinayak Gupta, Srikanta Bedathur, Sourangshu Bhattacharya, Abir De

    Abstract: A large fraction of data generated via human activities such as online purchases, health records, spatial mobility etc. can be represented as a sequence of events over a continuous-time. Learning deep learning models over these continuous-time event sequences is a non-trivial task as it involves modeling the ever-increasing event timestamps, inter-event time gaps, event types, and the influences b…

    Submitted 23 June, 2022; originally announced June 2022.

    Comments: ACM TIST

  12. arXiv:2206.10429  [pdf, other]

    cs.CL cs.LG

    Plug and Play Counterfactual Text Generation for Model Robustness

    Authors: Nishtha Madaan, Srikanta Bedathur, Diptikalyan Saha

    Abstract: Generating counterfactual test-cases is an important backbone for testing NLP models and making them as robust and reliable as traditional software. In generating the test-cases, a desired property is the ability to control the test-case generation in a flexible manner to test for a large variety of failure cases and to explain and repair them in a targeted manner. In this direction, significant p…

    Submitted 21 June, 2022; originally announced June 2022.

  13. arXiv:2206.08081  [pdf, other]

    cs.CL cs.LG

    TransDrift: Modeling Word-Embedding Drift using Transformer

    Authors: Nishtha Madaan, Prateek Chaudhury, Nishant Kumar, Srikanta Bedathur

    Abstract: In modern NLP applications, word embeddings are a crucial backbone that can be readily shared across a number of tasks. However as the text distributions change and word semantics evolve over time, the downstream applications using the embeddings can suffer if the word representations do not conform to the data drift. Thus, maintaining word embeddings to be consistent with the underlying data dist…

    Submitted 16 June, 2022; originally announced June 2022.

    Comments: 10 pages

  14. ProActive: Self-Attentive Temporal Point Process Flows for Activity Sequences

    Authors: Vinayak Gupta, Srikanta Bedathur

    Abstract: Any human activity can be represented as a temporal sequence of actions performed to achieve a certain goal. Unlike machine-made time series, these action sequences are highly disparate as the time taken to finish a similar action might vary between different persons. Therefore, understanding the dynamics of these sequences is essential for many downstream tasks such as activity length prediction,…

    Submitted 10 June, 2022; originally announced June 2022.

    Comments: KDD 2022

  15. arXiv:2203.03564  [pdf, other]

    cs.LG cs.AI cs.IR cs.SI

    TIGGER: Scalable Generative Modelling for Temporal Interaction Graphs

    Authors: Shubham Gupta, Sahil Manchanda, Srikanta Bedathur, Sayan Ranu

    Abstract: There has been a recent surge in learning generative models for graphs. While impressive progress has been made on static graphs, work on generative modeling of temporal graphs is at a nascent stage with significant scope for improvement. First, existing generative models do not scale with either the time horizon or the number of nodes. Second, existing techniques are transductive in nature and th…

    Submitted 8 March, 2022; v1 submitted 7 March, 2022; originally announced March 2022.

    Comments: To be published in AAAI-2022, additionally contains technical appendices/supplementary material

    Journal ref: Proceedings of the AAAI Conference on Artificial Intelligence, 36(6), 6819-6828, 2022

  16. arXiv:2202.11485  [pdf, other]

    cs.IR cs.LG

    Learning Temporal Point Processes for Efficient Retrieval of Continuous Time Event Sequences

    Authors: Vinayak Gupta, Srikanta Bedathur, Abir De

    Abstract: Recent developments in predictive modeling using marked temporal point processes (MTPP) have enabled an accurate characterization of several real-world applications involving continuous-time event sequences (CTESs). However, the retrieval problem of such sequences remains largely unaddressed in literature. To tackle this, we propose NEUROSEQRET which learns to retrieve and rank a relevant set of c…

    Submitted 17 February, 2022; originally announced February 2022.

    Comments: AAAI 2022

  17. arXiv:2201.06095  [pdf, other]

    cs.LG cs.IR

    Doing More with Less: Overcoming Data Scarcity for POI Recommendation via Cross-Region Transfer

    Authors: Vinayak Gupta, Srikanta Bedathur

    Abstract: Variability in social app usage across regions results in a high skew of the quantity and the quality of check-in data collected, which in turn is a challenge for effective location recommender systems. In this paper, we present Axolotl (Automated cross Location-network Transfer Learning), a novel method aimed at transferring location preference models learned in a data-rich region to significantl…

    Submitted 16 January, 2022; originally announced January 2022.

    Comments: ACM TIST

  18. arXiv:2111.04190  [pdf, other]

    cs.HC cs.AI cs.LG

    VizAI : Selecting Accurate Visualizations of Numerical Data

    Authors: Ritvik Vij, Rohit Raj, Madhur Singhal, Manish Tanwar, Srikanta Bedathur

    Abstract: A good data visualization is not only a distortion-free graphical representation of data but also a way to reveal underlying statistical properties of the data. Despite its common use across various stages of data analysis, selecting a good visualization often is a manual process involving many iterations. Recently there has been interest in reducing this effort by developing models that can recom…

    Submitted 7 November, 2021; originally announced November 2021.

    Comments: Proc. of the ACM India Joint International Conference on Data Sciences and Management of Data (CODS-COMAD) 2022 (9th ACM IKDD CODS and 27th COMAD) - To Appear

  19. Region Invariant Normalizing Flows for Mobility Transfer

    Authors: Vinayak Gupta, Srikanta Bedathur

    Abstract: There exists a high variability in mobility data volumes across different regions, which deteriorates the performance of spatial recommender systems that rely on region-specific data. In this paper, we propose a novel transfer learning framework called REFORMD, for continuous-time location prediction for regions with sparse checkin data. Specifically, we model user-specific checkin-sequences in a…

    Submitted 13 September, 2021; originally announced September 2021.

    Comments: CIKM 2021

  20. arXiv:2108.07758  [pdf, other]

    cs.DB

    Computing and Maintaining Provenance of Query Result Probabilities in Uncertain Knowledge Graphs

    Authors: Garima Gaur, Abhishek Dang, Arnab Bhattacharya, Srikanta Bedathur

    Abstract: Knowledge graphs (KG) that model the relationships between entities as labeled edges (or facts) in a graph are mostly constructed using a suite of automated extractors, thereby inherently leading to uncertainty in the extracted facts. Modeling the uncertainty as probabilistic confidence scores results in a probabilistic knowledge graph. Graph queries over such probabilistic KGs require answer comp…

    Submitted 17 August, 2021; originally announced August 2021.

  21. arXiv:2104.14914  [pdf, other]

    cs.CL cs.DB cs.LG

    BERT Meets Relational DB: Contextual Representations of Relational Databases

    Authors: Siddhant Arora, Vinayak Gupta, Garima Gaur, Srikanta Bedathur

    Abstract: In this paper, we address the problem of learning low dimension representation of entities on relational databases consisting of multiple tables. Embeddings help to capture semantics encoded in the database and can be used in a variety of settings like auto-completion of tables, fully-neural query processing of relational joins queries, seamlessly handling missing values, and more. Current work is…

    Submitted 30 April, 2021; originally announced April 2021.

  22. arXiv:2104.07378  [pdf, other]

    cs.CL cs.AI cs.IR

    Tracking entities in technical procedures -- a new dataset and baselines

    Authors: Saransh Goyal, Pratyush Pandey, Garima Gaur, Subhalingam D, Srikanta Bedathur, Maya Ramanath

    Abstract: We introduce TechTrack, a new dataset for tracking entities in technical procedures. The dataset, prepared by annotating open domain articles from WikiHow, consists of 1351 procedures, e.g., "How to connect a printer", identifies more than 1200 unique entities with an average of 4.7 entities per procedure. We evaluate the performance of state-of-the-art models on the entity-tracking task and find…

    Submitted 15 April, 2021; originally announced April 2021.

  23. arXiv:2011.14317  [pdf, other]

    cs.LG stat.ML

    FROCC: Fast Random projection-based One-Class Classification

    Authors: Arindam Bhattacharya, Sumanth Varambally, Amitabha Bagchi, Srikanta Bedathur

    Abstract: We present Fast Random projection-based One-Class Classification (FROCC), an extremely efficient method for one-class classification. Our method is based on a simple idea of transforming the training data by projecting it onto a set of random unit vectors that are chosen uniformly and independently from the unit sphere, and bounding the regions based on separation of the data. FROCC can be natural…

    Submitted 30 June, 2021; v1 submitted 29 November, 2020; originally announced November 2020.

  24. arXiv:2009.14116  [pdf, other]

    cs.CL

    A Survey on Semantic Parsing from the perspective of Compositionality

    Authors: Pawan Kumar, Srikanta Bedathur

    Abstract: Different from previous surveys in semantic parsing (Kamath and Das, 2018) and knowledge base question answering (KBQA) (Chakraborty et al., 2019; Zhu et al., 2019; Hoffner et al., 2017) we try to take a different perspective on the study of semantic parsing. Specifically, we will focus on (a) meaning composition from syntactical structure (Partee, 1975), and (b) the ability of semantic parsers to ha…

    Submitted 29 September, 2020; originally announced September 2020.

  25. arXiv:2007.14864  [pdf, other]

    cs.DB

    How and Why is An Answer (Still) Correct? Maintaining Provenance in Dynamic Knowledge Graphs

    Authors: Garima Gaur, Arnab Bhattacharya, Srikanta Bedathur

    Abstract: Knowledge graphs (KGs) have increasingly become the backbone of many critical knowledge-centric applications. Most large-scale KGs used in practice are automatically constructed based on an ensemble of extraction techniques applied over diverse data sources. Therefore, it is important to establish the provenance of results for a query to determine how these were computed. Provenance is shown to be…

    Submitted 29 July, 2020; originally announced July 2020.

    Journal ref: CIKM 2020

  26. arXiv:2006.07580  [pdf, other]

    cs.SI physics.soc-ph

    Modeling Implicit Communities using Spatio-Temporal Point Processes from Geo-tagged Event Traces

    Authors: Ankita Likhyani, Vinayak Gupta, Srijith P. K., Deepak P., Srikanta Bedathur

    Abstract: The location check-ins of users through various location-based services such as Foursquare, Twitter, and Facebook Places, etc., generate large traces of geo-tagged events. These event-traces often manifest in hidden (possibly overlapping) communities of users with similar interests. Inferring these implicit communities is crucial for forming user profiles for improvements in recommendation and pre…

    Submitted 13 June, 2020; originally announced June 2020.

    Comments: 17 pages

  27. arXiv:2006.04509  [pdf, other]

    cs.AI cs.DB cs.LG stat.ML

    IterefinE: Iterative KG Refinement Embeddings using Symbolic Knowledge

    Authors: Siddhant Arora, Srikanta Bedathur, Maya Ramanath, Deepak Sharma

    Abstract: Knowledge Graphs (KGs) extracted from text sources are often noisy and lead to poor performance in downstream application tasks such as KG-based question answering. While much of the recent activity is focused on addressing the sparsity of KGs by using embeddings for inferring new facts, the issue of cleaning up of noise in KGs through KG refinement task is not as actively studied. Most successful…

    Submitted 3 June, 2020; originally announced June 2020.

    Comments: 16 pages, 7 figures, AKBC 2020 Conference

  28. arXiv:2005.06437  [pdf, other]

    cs.DB cs.LG

    On Embeddings in Relational Databases

    Authors: Siddhant Arora, Srikanta Bedathur

    Abstract: We address the problem of learning a distributed representation of entities in a relational database using a low-dimensional embedding. Low-dimensional embeddings aim to encapsulate a concise vector representation for an underlying dataset with minimum loss of information. Embeddings across entities in a relational database have been less explored due to the intricate data relations and representa…

    Submitted 13 May, 2020; originally announced May 2020.

    Comments: 9 pages, 6 Figures, Proceedings of Knowledge Representation & Reasoning Meets Machine Learning Workshop, NeurIPS 2019

  29. arXiv:2005.00480  [pdf, ps, other]

    cs.CL cs.LG

    Regex Queries over Incomplete Knowledge Bases

    Authors: Vaibhav Adlakha, Parth Shah, Srikanta Bedathur, Mausam

    Abstract: We propose the novel task of answering regular expression queries (containing disjunction ($\vee$) and Kleene plus ($+$) operators) over incomplete KBs. The answer set of these queries potentially has a large number of entities, hence previous works for single-hop queries in KBC that model a query as a point in high-dimensional space are not as effective. In response, we develop RotatE-Box -- a no…

    Submitted 16 September, 2021; v1 submitted 1 May, 2020; originally announced May 2020.

    Comments: AKBC 2021

  30. arXiv:2001.10781  [pdf, other]

    cs.IR

    Aspect-based Academic Search using Domain-specific KB

    Authors: Prajna Upadhyay, Srikanta Bedathur, Tanmoy Chakraborty, Maya Ramanath

    Abstract: Academic search engines allow scientists to explore related work relevant to a given query. Often, the user is also aware of the "aspect" to retrieve a relevant document. In such cases, existing search engines can be used by expanding the query with terms describing that aspect. However, this approach does not guarantee good results since plain keyword matches do not always imply relevance. To add…

    Submitted 29 January, 2020; originally announced January 2020.

  31. arXiv:1809.04487  [pdf, other]

    cs.LG stat.ML

    Discovering Topical Interactions in Text-based Cascades using Hidden Markov Hawkes Processes

    Authors: Srikanta Bedathur, Indrajit Bhattacharya, Jayesh Choudhari, Anirban Dasgupta

    Abstract: Social media conversations unfold based on complex interactions between users, topics and time. While recent models have been proposed to capture network strengths between users, users' topical preferences and temporal patterns between posting and response times, interaction patterns between topics have not been studied. We propose the Hidden Markov Hawkes Process (HMHP) that incorporates topical M…

    Submitted 12 September, 2018; originally announced September 2018.

    Comments: Accepted as a short paper at ICDM-2018

  32. arXiv:1801.10080  [pdf, other]

    cs.DL cs.CL

    A Machine Learning Approach to Quantitative Prosopography

    Authors: Aayushee Gupta, Haimonti Dutta, Srikanta Bedathur, Lipika Dey

    Abstract: Prosopography is an investigation of the common characteristics of a group of people in history, by a collective study of their lives. It involves a study of biographies to solve historical problems. If such biographies are unavailable, surviving documents and secondary biographical data are used. Quantitative prosopography involves analysis of information from a wide variety of sources about "ord…

    Submitted 30 January, 2018; originally announced January 2018.

  33. arXiv:1711.04971  [pdf, other]

    cs.AI cs.DB cs.HC

    DataVizard: Recommending Visual Presentations for Structured Data

    Authors: Rema Ananthanarayanan, Pranay Kr. Lohia, Srikanta Bedathur

    Abstract: Selecting the appropriate visual presentation of the data such that it preserves the semantics of the underlying data and at the same time provides an intuitive summary of the data is an important, often the final step of data analytics. Unfortunately, this is also a step involving significant human effort starting from selection of groups of columns in the structured results from analytics stages…

    Submitted 14 November, 2017; originally announced November 2017.

  34. arXiv:1710.07411  [pdf, other]

    cs.DB

    STREAK: An Efficient Engine for Processing Top-k SPARQL Queries with Spatial Filters

    Authors: Jyoti Leeka, Srikanta Bedathur, Debajyoti Bera, Sriram Lakshminarasimhan

    Abstract: The importance of geo-spatial data in critical applications such as emergency response, transportation, agriculture etc., has prompted the adoption of recent GeoSPARQL standard in many RDF processing engines. In addition to large repositories of geo-spatial data -- e.g., LinkedGeoData, OpenStreetMap, etc. -- spatial data is also routinely found in automatically constructed knowledgebases such as Y…

    Submitted 20 October, 2017; originally announced October 2017.

  35. Sampling and Reconstruction Using Bloom Filters

    Authors: Neha Sengupta, Amitabha Bagchi, Srikanta Bedathur, Maya Ramanath

    Abstract: In this paper, we address the problem of sampling from a set and reconstructing a set stored as a Bloom filter. To the best of our knowledge our work is the first to address this question. We introduce a novel hierarchical data structure called BloomSampleTree that helps us design efficient algorithms to extract an almost uniform sample from the set stored in a Bloom filter and also allows us to r…

    Submitted 6 September, 2017; v1 submitted 12 January, 2017; originally announced January 2017.

    Journal ref: IEEE T. Knowl. Data En. 30(7):1324-1337, July 2018

  36. arXiv:1312.4036  [pdf, ps, other]

    cs.IR

    Mind Your Language: Effects of Spoken Query Formulation on Retrieval Effectiveness

    Authors: Apoorv Narang, Srikanta Bedathur

    Abstract: Voice search is becoming a popular mode for interacting with search engines. As a result, research has gone into building better voice transcription engines, interfaces, and search engines that better handle inherent verbosity of queries. However, when one considers its use by non-native speakers of English, another aspect that becomes important is the formulation of the query by users. In this p…

    Submitted 14 December, 2013; originally announced December 2013.

  37. arXiv:1211.3375  [pdf, ps, other]

    cs.DB cs.SI

    High-Performance Reachability Query Processing under Index Size Restrictions

    Authors: Stephan Seufert, Avishek Anand, Srikanta Bedathur, Gerhard Weikum

    Abstract: In this paper, we propose a scalable and highly efficient index structure for the reachability problem over graphs. We build on the well-known node interval labeling scheme where the set of vertices reachable from a particular node is compactly encoded as a collection of node identifier ranges. We impose an explicit bound on the size of the index and flexibly assign approximate reachability ranges…

    Submitted 29 November, 2012; v1 submitted 14 November, 2012; originally announced November 2012.

    Comments: 30 pages

  38. arXiv:1207.4371  [pdf, other]

    cs.IR cs.DB cs.DC

    Computing n-Gram Statistics in MapReduce

    Authors: Klaus Berberich, Srikanta Bedathur

    Abstract: Statistics about n-grams (i.e., sequences of contiguous words or other tokens in text documents or other string data) are an important building block in information retrieval and natural language processing. In this work, we study how n-gram statistics, optionally restricted by a maximum n-gram length and minimum collection frequency, can be computed efficiently harnessing MapReduce for distribute…

    Submitted 18 July, 2012; originally announced July 2012.