Search | arXiv e-print repository

Data Collection and Analysis of French Dialects

Authors: Omar Shaur Choudhry, Paul Omara Odida, Joshua Reiner, Keiron Appleyard, Danielle Kushnir, William Toon

Abstract: This paper discusses creating and analysing a new dataset for data mining and text analytics research, contributing to a joint Leeds University research project for the Corpus of National Dialects. This report investigates machine learning classifiers to classify samples of French dialect text across various French-speaking countries. Following the steps of the CRISP-DM methodology, this report explores the data collection process, data quality issues and data conversion for text analysis. Finally, after applying suitable data mining techniques, the evaluation methods, best overall features and classifiers and conclusions are discussed.

Submitted 1 August, 2022; originally announced August 2022.

Comments: 4 pages plus 1 page for references. 4 figures including 1 image

arXiv:2003.10339 [pdf, other]

Diffusion-based Deep Active Learning

Authors: Dan Kushnir, Luca Venturi

Abstract: The remarkable performance of deep neural networks depends on the availability of massive labeled data. To alleviate the load of data annotation, active deep learning aims to select a minimal set of training points to be labelled which yields maximal model accuracy. Most existing approaches implement either an `exploration'-type selection criterion, which aims at exploring the joint distribution of data and labels, or a `refinement'-type criterion which aims at localizing the detected decision boundaries. We propose a versatile and efficient criterion that automatically switches from exploration to refinement when the distribution has been sufficiently mapped. Our criterion relies on a process of diffusing the existing label information over a graph constructed from the hidden representation of the data set as provided by the neural network. This graph representation captures the intrinsic geometry of the approximated labeling function. The diffusion-based criterion is shown to be advantageous as it outperforms existing criteria for deep active learning.

Submitted 23 March, 2020; originally announced March 2020.

arXiv:1801.05856 [pdf, other]

Active Community Detection with Maximal Expected Model Change

Authors: Dan Kushnir, Benjamin Mirabelli

Abstract: We present a novel active learning algorithm for community detection on networks. Our proposed algorithm uses a Maximal Expected Model Change (MEMC) criterion for querying network nodes label assignments. MEMC detects nodes that maximally change the community assignment likelihood model following a query. Our method is inspired by detection in the benchmark Stochastic Block Model (SBM), where we provide sample complexity analysis and empirical study with SBM and real network data for binary as well as for the multi-class settings. The analysis also covers the most challenging case of sparse degree and below-detection-threshold SBMs, where we observe a super-linear error reduction. MEMC is shown to be superior to the random selection baseline and other state-of-the-art active learners.

Submitted 20 March, 2020; v1 submitted 10 January, 2018; originally announced January 2018.

arXiv:1712.07242 [pdf, other]

Linear Time Clustering for High Dimensional Mixtures of Gaussian Clouds

Authors: Dan Kushnir, Shirin Jalali, Iraj Saniee

Abstract: Clustering mixtures of Gaussian distributions is a fundamental and challenging problem that is ubiquitous in various high-dimensional data processing tasks. While state-of-the-art work on learning Gaussian mixture models has focused primarily on improving separation bounds and their generalization to arbitrary classes of mixture models, less emphasis has been paid to practical computational efficiency of the proposed solutions. In this paper, we propose a novel and highly efficient clustering algorithm for $n$ points drawn from a mixture of two arbitrary Gaussian distributions in $\mathbb{R}^p$. The algorithm involves performing random 1-dimensional projections until a direction is found that yields a user-specified clustering error $e$. For a 1-dimensional separation parameter $γ$ satisfying $γ=Q^{-1}(e)$, the expected number of such projections is shown to be bounded by $o(\ln p)$, when $γ$ satisfies $γ\leq c\sqrt{\ln{\ln{p}}}$, with $c$ as the separability parameter of the two Gaussians in $\mathbb{R}^p$. Consequently, the expected overall running time of the algorithm is linear in $n$ and quasi-linear in $p$ at $o(\ln{p})O(np)$, and the sample complexity is independent of $p$. This result stands in contrast to prior works which provide polynomial, with at-best quadratic, running time in $p$ and $n$. We show that our bound on the expected number of 1-dimensional projections extends to the case of three or more Gaussian components, and we present a generalization of our results to mixture distributions beyond the Gaussian model.

Submitted 1 March, 2018; v1 submitted 19 December, 2017; originally announced December 2017.

arXiv:1602.02348 [pdf]

doi 10.1016/j.techfore.2017.04.007

Economic and Technological Complexity: A Model Study of Indicators of Knowledge-based Innovation Systems

Authors: Inga Ivanova, Oivind Strand, Duncan Kushnir, Loet Leydesdorff

Abstract: The Economic Complexity Index (ECI; Hidalgo & Hausmann, 2009) measures the complexity of national economies in terms of product groups. Analogously to ECI, a Patent Complexity Index (PatCI) can be developed on the basis of a matrix of nations versus patent classes. Using linear algebra, the three dimensions: countries, product groups, and patent classes can be combined into a measure of "Triple Helix" complexity (THCI) including the trilateral interaction terms between knowledge production, wealth generation, and (national) control. THCI can be expected to capture the extent of systems integration between the global dynamics of markets (ECI) and technologies (PatCI) in each national system of innovation. We measure ECI, PatCI, and THCI during the period 2000-2014 for the 34 OECD member states, the BRICS countries, and a group of emerging and affiliated economies (Argentina, Hong Kong, Indonesia, Malaysia, Romania, and Singapore). The three complexity indicators are correlated between themselves; but the correlations with GDP per capita are virtually absent. Of the world's major economies, Japan scores highest on all three indicators, while China has been increasingly successful in combining economic and technological complexity. We could not reproduce the correlation between ECI and average income that has been central to the argument about the fruitfulness of the economic complexity approach.

Submitted 7 December, 2016; v1 submitted 7 February, 2016; originally announced February 2016.

Journal ref: Technological Forecasting and Social Change 120 (July 2017) 77-89

arXiv:1512.04214 [pdf]

The Globalization of Academic Entrepreneurship? The Recent Growth (2009-2014) in University Patenting Decomposed

Authors: Loet Leydesdorff, Henry Etzkowitz, Duncan Kushnir

Abstract: The contribution of academia to US patents has become increasingly global. Following a pause, with a relatively flat rate, from 1998 to 2008, the long-term trend of university patenting rising as a share of all patenting has resumed, driven by the internationalization of academic entrepreneurship and the persistence of US university technology transfer. We disaggregate this recent growth in university patenting at the US Patent and Trademark Organization (USPTO) in terms of nations and patent classes. Foreign patenting in the US has almost doubled during the period 2009-2014, mainly due to patenting by universities in Taiwan, Korea, China, and Japan. These nations compete with the US in terms of patent portfolios, whereas most European countries--with the exception of the UK--have more specific portfolios, mainly in the bio-medical fields. In the case of China, Tsinghua University holds 63% of the university patents in USPTO, followed by King Fahd University with 55.2% of the national portfolio.

Submitted 14 December, 2015; originally announced December 2015.

arXiv:1210.6456 [pdf]

Interactive Overlay Maps for US Patent (USPTO) Data Based on International Patent Classifications (IPC)

Authors: Loet Leydesdorff, Duncan Kushnir, Ismael Rafols

Abstract: We report on the development of an interface to the US Patent and Trademark Office (USPTO) that allows for the mapping of patent portfolios as overlays to basemaps constructed from citation relations among all patents contained in this database during the period 1976-2011. Both the interface and the data are in the public domain; the freeware programs VOSViewer and/or Pajek can be used for the visualization. These basemaps and overlays can be generated at both the 3-digit and 4-digit levels of the International Patent Classifications (IPC) of the World Intellectual Property Organization (WIPO). The basemaps can provide a stable mental framework for analysts to follow developments over searches for different years, which can be animated. The full flexibility of the advanced search engines of USPTO is available for generating sets of patents and/or patent applications which can thus be visualized and compared. This instrument allows for addressing questions about technological distance, diversity in portfolios, and animating the developments of both technologies and technological capacities of organizations over time.

Submitted 18 November, 2012; v1 submitted 24 October, 2012; originally announced October 2012.

Comments: Scientometrics (forthcoming)

Showing 1–7 of 7 results for author: Kushnir, D