
Trustworthy Artificial Intelligence in the Context of Metrology

Authors: Tameem Adel, Sam Bilson, Mark Levene, Andrew Thompson

Abstract: We review research at the National Physical Laboratory (NPL) in the area of trustworthy artificial intelligence (TAI), and more specifically trustworthy machine learning (TML), in the context of metrology, the science of measurement. We describe three broad themes of TAI: technical, socio-technical and social, which play key roles in ensuring that the developed models are trustworthy and can be relied upon to make responsible decisions. From a metrology perspective we emphasise uncertainty quantification (UQ), and its importance within the framework of TAI to enhance transparency and trust in the outputs of AI systems. We then discuss three research areas within TAI that we are working on at NPL, and examine the certification of AI systems in terms of adherence to the characteristics of TAI.

Submitted 14 June, 2024; originally announced June 2024.

Journal ref: In Producing Artificial Intelligent Systems: The roles of Benchmarking, Standardisation and Certification, Studies in Computational Intelligence, edited by M. I. A. Ferreira, 2024, Springer

arXiv:2309.02188 [pdf, other]

Incorporating Dictionaries into a Neural Network Architecture to Extract COVID-19 Medical Concepts From Social Media

Authors: Abul Hasan, Mark Levene, David Weston

Abstract: We investigate the potential benefit of incorporating dictionary information into a neural network architecture for natural language processing. In particular, we make use of this architecture to extract several concepts related to COVID-19 from an on-line medical forum. We use a sample from the forum to manually curate one dictionary for each concept. In addition, we use MetaMap, which is a tool for extracting biomedical concepts, to identify a small number of semantic concepts. For a supervised concept extraction task on the forum data, our best model achieved a macro $F_1$ score of 90\%. A major difficulty in medical concept extraction is obtaining labelled data from which to build supervised models. We investigate the utility of our models to transfer to data derived from a different source in two ways: first, for producing labels via weak learning and, second, to perform concept extraction. The dataset we use in this case comprises COVID-19 related tweets, and we achieve an $F_1$ score of 81\% for symptom concept extraction trained on weakly labelled data. The utility of our dictionaries is compared with a COVID-19 symptom dictionary that was constructed directly from Twitter. Further experiments that incorporate BERT and a COVID-19 version of BERTweet demonstrate that the dictionaries provide a commensurate result. Our results show that incorporating small domain dictionaries into deep learning models can improve concept extraction tasks. Moreover, models built using dictionaries generalize well and are transferable to different datasets on a similar task.

Submitted 5 September, 2023; originally announced September 2023.

arXiv:2103.11850 [pdf, other]

Monitoring Covid-19 on social media using a novel triage and diagnosis approach

Authors: Abul Hasan, Mark Levene, David Weston, Renate Fromson, Nicolas Koslover, Tamara Levene

Abstract: Objective: This study aims to develop an end-to-end natural language processing pipeline for triage and diagnosis of COVID-19 from patient-authored social media posts, in order to provide researchers and public health practitioners with additional information on the symptoms, severity and prevalence of the disease rather than to provide an actionable decision at the individual level. Materials and Methods: The text processing pipeline first extracts COVID-19 symptoms and related concepts such as severity, duration, negations, and body parts from patients' posts using conditional random fields. An unsupervised rule-based algorithm is then applied to establish relations between concepts in the next step of the pipeline. The extracted concepts and relations are subsequently used to construct two different vector representations of each post. These vectors are applied separately to build support vector machine learning models to triage patients into three categories and diagnose them for COVID-19. Results: We report macro- and micro-averaged F1 scores in the range of 71-96% and 61-87%, respectively, for the triage and diagnosis of COVID-19, when the models are trained on human labelled data. Our experimental results indicate that similar performance can be achieved when the models are trained using predicted labels from concept extraction and rule-based classifiers, thus yielding end-to-end machine learning. Also, we highlight important features uncovered by our diagnostic machine learning models and compare them with the most frequent symptoms revealed in another COVID-19 dataset. In particular, we found that the most important features are not always the most frequent ones.

Submitted 7 January, 2022; v1 submitted 22 March, 2021; originally announced March 2021.

Comments: 13 pages, 6 figures

arXiv:2005.08959 [pdf, ps, other]

doi 10.1145/3350546.3352559

Potential gain as a centrality measure

Authors: Pasquale De Meo, Mark Levene, Alessandro Provetti

Abstract: Navigability is a distinctive feature of graphs associated with artificial or natural systems whose primary goal is the transportation of information or goods. We say that a graph $\mathcal{G}$ is navigable when an agent is able to efficiently reach any target node in $\mathcal{G}$ by means of local routing decisions. In a social network, navigability translates to the ability to reach an individual through personal contacts. Graph navigability is well-studied, but a fundamental question is still open: why are some individuals more likely than others to be reached via short, friend-of-a-friend, communication chains? In this article we answer the question above by proposing a novel centrality metric called the potential gain, which, in an informal sense, quantifies the ease with which a target node can be reached. We define two variants of the potential gain, called the geometric and the exponential potential gain, and present fast algorithms to compute them. The geometric and the exponential potential gain are the first instances of a novel class of composite centrality metrics, i.e., centrality metrics which combine the popularity of a node in $\mathcal{G}$ with its similarity to all other nodes. As shown in previous studies, popularity and similarity are two main criteria which regulate the way humans seek information in large networks such as Wikipedia. We give a formal proof that the potential gain of a node is always equivalent to the product of its degree centrality (which captures popularity) and its Katz centrality (which captures similarity).

Submitted 17 May, 2020; originally announced May 2020.

Comments: In Proceedings of Web Intelligence 2019 (WI19), the IEEE/WIC/ACM International Conference on Web Intelligence, pages 418--422. arXiv admin note: text overlap with arXiv:1812.08012

ACM Class: I.2.8; F.2.1

arXiv:2002.06450 [pdf, ps, other]

Supervised Phrase-boundary Embeddings

Authors: Manni Singh, David Weston, Mark Levene

Abstract: We propose a new word embedding model, called SPhrase, that incorporates supervised phrase information. Our method modifies traditional word embeddings by ensuring that all target words in a phrase have exactly the same context. We demonstrate that including this information within a context window produces superior embeddings for both intrinsic evaluation tasks and downstream extrinsic tasks.

Submitted 15 February, 2020; originally announced February 2020.

Comments: 12 pages, 3 figures, 4 tables

arXiv:1906.09822 [pdf, ps, other]

doi 10.1007/s11192-019-03151-7

Characterisation of the $χ$-index and the $rec$-index

Authors: Mark Levene, Trevor Fenner, Judit Bar-Ilan

Abstract: Axiomatic characterisation of a bibliometric index provides insight into the properties that the index satisfies and facilitates the comparison of different indices. A geometric generalisation of the $h$-index, called the $χ$-index, has recently been proposed to address some of the problems with the $h$-index, in particular, the fact that it is not scale invariant, i.e., multiplying the number of citations of each publication by a positive constant may change the relative ranking of two researchers. While the square of the $h$-index is the area of the largest square under the citation curve of a researcher, the square of the $χ$-index, which we call the $rec$-index (or {\em rectangle}-index), is the area of the largest rectangle under the citation curve. Our main contribution here is to provide a characterisation of the $rec$-index via three properties: {\em monotonicity}, {\em uniform citation} and {\em uniform equivalence}. Monotonicity is a natural property that we would expect any bibliometric index to satisfy, while the other two properties constrain the value of the $rec$-index to be the area of the largest rectangle under the citation curve. The $rec$-index also allows us to distinguish between {\em influential} researchers who have relatively few, but highly-cited, publications and {\em prolific} researchers who have many, but less-cited, publications.

Submitted 24 June, 2019; originally announced June 2019.

Comments: 14 pages, 3 figures. This is a pre-print of an article published in Scientometrics. The final authenticated version is available online at: https://doi.org/10.1007/s11192-019-03151-7
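The geometric definitions in the abstract above can be illustrated with a small sketch. It assumes the citation curve is simply the list of per-publication citation counts sorted in decreasing order; the function names `h_index`, `rec_index` and `chi_index` are illustrative and are not taken from the paper.

```python
import math


def h_index(citations):
    """Side of the largest square that fits under the citation curve."""
    ranked = sorted(citations, reverse=True)
    # Largest rank i such that the i-th most cited paper has at least i citations.
    return max((i for i, c in enumerate(ranked, start=1) if c >= i), default=0)


def rec_index(citations):
    """Area of the largest rectangle that fits under the citation curve."""
    ranked = sorted(citations, reverse=True)
    # A rectangle of width i (papers) can be at most ranked[i-1] citations high.
    return max((i * c for i, c in enumerate(ranked, start=1)), default=0)


def chi_index(citations):
    """Square root of the rec-index, as described in the abstract."""
    return math.sqrt(rec_index(citations))


if __name__ == "__main__":
    influential = [100, 2, 1]        # few papers, one highly cited
    prolific = [5, 5, 5, 5, 5]       # many papers, evenly cited
    for name, cites in (("influential", influential), ("prolific", prolific)):
        print(name, h_index(cites), rec_index(cites), chi_index(cites))
```

Under this reading, multiplying every citation count by a positive constant multiplies the rec-index by the same constant, which is one way to see why it cannot reorder two researchers the way the $h$-index can.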

arXiv:1903.05440 [pdf, other]

Market Trend Prediction using Sentiment Analysis: Lessons Learned and Paths Forward

Authors: Andrius Mudinas, Dell Zhang, Mark Levene

Abstract: Financial market forecasting is one of the most attractive practical applications of sentiment analysis. In this paper, we investigate the potential of using sentiment \emph{attitudes} (positive vs negative) and also sentiment \emph{emotions} (joy, sadness, etc.) extracted from financial news or tweets to help predict stock price movements. Our extensive experiments using the \emph{Granger-causality} test have revealed that (i) in general sentiment attitudes do not seem to Granger-cause stock price changes; and (ii) while on some specific occasions sentiment emotions do seem to Granger-cause stock price changes, the exhibited pattern is not universal and must be looked at on a case by case basis. Furthermore, it has been observed that at least for certain stocks, integrating sentiment emotions as additional features into the machine learning based market trend prediction model could improve its accuracy.

Submitted 13 March, 2019; originally announced March 2019.

Comments: 10 pages, 4 figures, 6 tables

arXiv:1812.08012 [pdf, other]

doi 10.1109/TKDE.2019.2947035

A general centrality framework based on node navigability

Authors: Pasquale De Meo, Mark Levene, Fabrizio Messina, Alessandro Provetti

Abstract: Centrality metrics are a popular tool in Network Science to identify important nodes within a graph. We introduce the Potential Gain as a centrality measure that unifies many walk-based centrality metrics in graphs and captures the notion of node navigability, interpreted as the property of being reachable from anywhere else (in the graph) through short walks. Two instances of the Potential Gain (called the Geometric and the Exponential Potential Gain) are presented and we describe scalable algorithms for computing them on large graphs. We also give a proof of the relationship between the new measures and established centralities. The geometric potential gain of a node can thus be characterized as the product of its Degree centrality by its Katz centrality scores. At the same time, the exponential potential gain of a node is proved to be the product of Degree centrality by its Communicability index. These formal results connect potential gain to both the "popularity" and "similarity" properties that are captured by the above centralities.

Submitted 12 March, 2020; v1 submitted 19 December, 2018; originally announced December 2018.

Comments: 26 pages, 11 figures. To be published in IEEE Transactions on Knowledge and Data Engineering

arXiv:1603.07150 [pdf, other]

The Anatomy of a Search and Mining System for Digital Archives

Authors: Martyn Harris, Mark Levene, Dell Zhang, Dan Levene

Abstract: Samtla (Search And Mining Tools with Linguistic Analysis) is a digital humanities system designed in collaboration with historians and linguists to assist them with their research work in quantifying the content of any textual corpora through approximate phrase search and document comparison. The retrieval engine uses a character-based n-gram language model rather than the conventional word-based one so as to achieve great flexibility in language agnostic query processing. The index is implemented as a space-optimised character-based suffix tree with an accompanying database of document content and metadata. A number of text mining tools are integrated into the system to allow researchers to discover textual patterns, perform comparative analysis, and find out what is currently popular in the research community. Herein we describe the system architecture, user interface, models and algorithms, and data storage of the Samtla system. We also present several case studies of its usage in practice together with an evaluation of the system's ranking performance through crowdsourcing.

Submitted 23 March, 2016; originally announced March 2016.

Comments: 49 pages

arXiv:1511.08712 [pdf, ps, other]

doi 10.1140/epjb/e2016-60926-8

A stochastic evolutionary model generating a mixture of exponential distributions

Authors: Trevor Fenner, Mark Levene, George Loizou

Abstract: Recent interest in human dynamics has stimulated the investigation of the stochastic processes that explain human behaviour in various contexts, such as mobile phone networks and social media. In this paper, we extend the stochastic urn-based model proposed in \cite{FENN15} so that it can generate mixture models, in particular, a mixture of exponential distributions. The model is designed to capture the dynamics of survival analysis, traditionally employed in clinical trials, reliability analysis in engineering, and more recently in the analysis of large data sets recording human dynamics. The mixture modelling approach, which is relatively simple and well understood, is very effective in capturing heterogeneity in data. We provide empirical evidence for the validity of the model, using a data set of popular search engine queries collected over a period of 114 months. We show that the survival function of these queries is closely matched by the exponential mixture solution for our model.

Submitted 14 January, 2016; v1 submitted 27 November, 2015; originally announced November 2015.

Comments: 14 pages. arXiv admin note: substantial text overlap with arXiv:1502.07558
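For readers unfamiliar with the term, the survival function of a mixture of exponential distributions has the standard textbook form $S(t) = \sum_i w_i e^{-\lambda_i t}$, with non-negative weights $w_i$ summing to one. The sketch below evaluates that generic form only; the parameter names `weights` and `rates` are illustrative assumptions, and the paper's urn-based model and its fitted parameters are not reproduced here.

```python
import math


def mixture_survival(t, weights, rates):
    """Survival function S(t) of a mixture of exponentials.

    weights: mixing proportions w_i (assumed non-negative, summing to 1)
    rates:   exponential rate parameters lambda_i (assumed positive)
    """
    if len(weights) != len(rates):
        raise ValueError("weights and rates must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    # Each component contributes its own exponential survival curve.
    return sum(w * math.exp(-lam * t) for w, lam in zip(weights, rates))


if __name__ == "__main__":
    # Hypothetical two-component mixture: a fast-decaying and a slow-decaying group.
    weights, rates = [0.3, 0.7], [1.0, 0.1]
    for t in (0.0, 1.0, 5.0, 20.0):
        print(t, mixture_survival(t, weights, rates))
```

A mixture with one fast and one slow component decays quickly at first and then flattens, which is the kind of heterogeneity a single exponential cannot capture.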

arXiv:1304.6945 [pdf, ps, other]

A bibliometric index based on the complete list of cited publications

Authors: Mark Levene, Trevor Fenner, Judit Bar-Ilan

Abstract: We propose a new index, the $j$-index, which is defined for an author as the sum of the square roots of the numbers of citations to each of the author's publications. The idea behind the $j$-index is to remedy a drawback of the $h$-index, namely that the $h$-index does not take into account the full citation record of a researcher. The square root function is motivated by our desire to avoid the possible bias that may occur with a simple sum when an author has several very highly cited papers. We compare the $j$-index to the $h$-index, the $g$-index and the total citation count for three subject areas using several association measures. Our results indicate that the association between the $j$-index and the other indices varies according to the subject area. One explanation of this variation may be the proportion of citations to publications of the researcher that are in the $h$-core. The $j$-index is {\em not} an $h$-index variant, and as such is intended to complement rather than necessarily replace the $h$-index and other bibliometric indicators, thus providing a more complete picture of a researcher's achievements.

Submitted 25 April, 2013; originally announced April 2013.

Comments: 12 pages
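The definition of the $j$-index given in the abstract above is simple enough to sketch directly. The function name `j_index` and the sample citation counts are illustrative assumptions, not data from the paper.

```python
import math


def j_index(citations):
    """Sum of the square roots of the citation counts of an author's publications."""
    return sum(math.sqrt(c) for c in citations)


if __name__ == "__main__":
    # Hypothetical authors with the same total citation count (400).
    one_hit = [400]          # a single very highly cited paper
    spread = [100] * 4       # the same citations spread over four papers
    print(sum(one_hit), j_index(one_hit))
    print(sum(spread), j_index(spread))
```

The two hypothetical authors have equal total citations, but the square root damps the single highly cited paper, so the author with the more evenly spread record gets the larger value. This is the bias-avoidance motivation the abstract describes.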

arXiv:1103.1530 [pdf, ps, other]

A Discrete Evolutionary Model for Chess Players' Ratings

Authors: Trevor Fenner, Mark Levene, George Loizou

Abstract: The Elo system for rating chess players, also used in other games and sports, was adopted by the World Chess Federation over four decades ago. Although not without controversy, it is accepted as generally reliable and provides a method for assessing players' strengths and ranking them in official tournaments. It is generally accepted that the distribution of players' rating data is approximately normal but, to date, no stochastic model of how the distribution might have arisen has been proposed. We propose such an evolutionary stochastic model, which models the arrival of players into the rating pool, the games they play against each other, and how the results of these games affect their ratings. Using a continuous approximation to the discrete model, we derive the distribution for players' ratings at time $t$ as a normal distribution, where the variance increases in time as a logarithmic function of $t$. We validate the model using published rating data from 2007 to 2010, showing that the parameters obtained from the data can be recovered through simulations of the stochastic model. The distribution of players' ratings is only approximately normal and has been shown to have a small negative skew. We show how to modify our evolutionary stochastic model to take this skewness into account, and we validate the modified model using the published official rating data.

Submitted 30 March, 2011; v1 submitted 8 March, 2011; originally announced March 2011.

Comments: 17 pages, 4 figures

arXiv:0904.2595 [pdf, ps, other]

A Methodology for Learning Players' Styles from Game Records

Authors: Mark Levene, Trevor Fenner

Abstract: We describe a preliminary investigation into learning a Chess player's style from game records. The method is based on attempting to learn features of a player's individual evaluation function using the method of temporal differences, with the aid of a conventional Chess engine architecture. Some encouraging results were obtained in learning the styles of two recent Chess world champions, and we report on our attempt to use the learnt styles to discriminate between the players from game records by trying to detect who was playing white and who was playing black. We also discuss some limitations of our approach and propose possible directions for future research. The method we have presented may also be applicable to other strategic games, and may even be generalisable to other domains where sequences of agents' actions are recorded.

Submitted 16 April, 2009; originally announced April 2009.

Comments: 15 pages, 3 figures

arXiv:cs/0610060 [pdf, ps, other]

Comparing Typical Opening Move Choices Made by Humans and Chess Engines

Authors: Mark Levene, Judit Bar-Ilan

Abstract: The opening book is an important component of a chess engine, and thus computer chess programmers have been developing automated methods to improve the quality of their books. For chess, which has a very rich opening theory, large databases of high-quality games can be used as the basis of an opening book, from which statistics relating to move choices from given positions can be collected. In order to find out whether the opening books used by modern chess engines in machine versus machine competitions are ``comparable'' to those used by chess players in human versus human competitions, we carried out analysis on 26 test positions using statistics from two opening books, one compiled from humans' games and the other from machines' games. Our analysis, using several nonparametric measures, shows that, overall, there is a strong association between humans' and machines' choices of opening moves when using a book to guide their choices.

Submitted 11 October, 2006; originally announced October 2006.

Comments: 12 pages, 1 figure, 6 tables

ACM Class: I.2.0

arXiv:cs/0606115 [pdf, ps, other]

Evaluating Variable Length Markov Chain Models for Analysis of User Web Navigation Sessions

Authors: Jose Borges, Mark Levene

Abstract: Markov models have been widely used to represent and analyse user web navigation data. In previous work we have proposed a method to dynamically extend the order of a Markov chain model and a complementary method for assessing the predictive power of such a variable length Markov chain. Herein, we review these two methods and propose a novel method for measuring the ability of a variable length Markov model to summarise user web navigation sessions up to a given length. While the summarisation ability of a model is important to enable the identification of user navigation patterns, the ability to make predictions is important in order to foresee the next link choice of a user after following a given trail so as, for example, to personalise a web site. We present an extensive experimental evaluation providing strong evidence that prediction accuracy increases linearly with summarisation ability.

Submitted 28 June, 2006; originally announced June 2006.

arXiv:cs/0505039 [pdf]

Methods for comparing rankings of search engine results

Authors: Judit Bar-Ilan, Mazlita Mat-Hassan, Mark Levene

Abstract: In this paper we present a number of measures that compare rankings of search engine results. We apply these measures to five queries that were monitored daily for two periods of about 21 days each. Rankings of the different search engines (Google, Yahoo and Teoma for text searches and Google, Yahoo and Picsearch for image searches) are compared on a daily basis, in addition to longitudinal comparisons of the same engine for the same query over time. The results and rankings of the two periods are compared as well.

Submitted 14 May, 2005; originally announced May 2005.

Comments: 19 pages, 4 figures, 8 tables

ACM Class: H.3.3

arXiv:cs/0503030 [pdf, ps, other]

A Suffix Tree Approach to Email Filtering

Authors: Rajesh M. Pampapathi, Boris Mirkin, Mark Levene

Abstract: We present an approach to email filtering based on the suffix tree data structure. A method for the scoring of emails using the suffix tree is developed and a number of scoring and score normalisation functions are tested. Our results show that the character level representation of emails and classes facilitated by the suffix tree can significantly improve classification accuracy when compared with the currently popular methods, such as naive Bayes. We believe the method can be extended to the classification of documents in other domains.

Submitted 6 December, 2005; v1 submitted 14 March, 2005; originally announced March 2005.

Comments: Revisions made in the light of reviewer comments. Main changes: (i) The extension and elaboration of section 4.4 which describes the scoring algorithm; (ii) Favouring the use of false positive and false negative performance measures over the use of precision and recall; (iii) The addition of ROC curves wherever possible; and (iv) Inclusion of performance statistics for algorithm. Re-submitted 5th August 2005

arXiv:cs/0412002 [pdf, ps, other]

Ranking Pages by Topology and Popularity within Web Sites

Authors: Jose Borges, Mark Levene

Abstract: We compare two link analysis ranking methods of web pages in a site. The first, called Site Rank, is an adaptation of PageRank to the granularity of a web site and the second, called Popularity Rank, is based on the frequencies of user clicks on the outlinks in a page that are captured by navigation sessions of users through the web site. We ran experiments on artificially created web sites of different sizes and on two real data sets, employing the relative entropy to compare the distributions of the two ranking methods. For the real data sets we also employ a nonparametric measure, called Spearman's footrule, which we use to compare the top-ten web pages ranked by the two methods. Our main result is that the distributions of the Popularity Rank and Site Rank are surprisingly close to each other, implying that the topology of a web site is very instrumental in guiding users through the site. Thus, in practice, the Site Rank provides a reasonable first order approximation of the aggregate behaviour of users within a web site given by the Popularity Rank.

Submitted 9 December, 2004; v1 submitted 1 December, 2004; originally announced December 2004.

Comments: 15 pages, 6 figures

arXiv:cs/0406032 [pdf]

A Dynamic Clustering-Based Markov Model for Web Usage Mining

Authors: José Borges, Mark Levene

Abstract: Markov models have been widely utilized for modelling user web navigation behaviour. In this work we propose a dynamic clustering-based method to increase a Markov model's accuracy in representing a collection of user web navigation sessions. The method makes use of the state cloning concept to duplicate states in a way that separates in-links whose corresponding second-order probabilities diverge. In addition, the new method incorporates a clustering technique which determines an efficient way to assign in-links with similar second-order probabilities to the same clone. We report on experiments conducted with both real and random data and we provide a comparison with the N-gram Markov concept. The results show that the number of additional states induced by the dynamic clustering method can be controlled through a threshold parameter, and suggest that the method's performance is linear time in the size of the model.

Submitted 17 June, 2004; originally announced June 2004.

arXiv:cs/0307073 [pdf, ps, other]

Search and Navigation in Relational Databases

Authors: Richard Wheeldon, Mark Levene, Kevin Keenoy

Abstract: We present a new application for keyword search within relational databases, which uses a novel algorithm to solve the join discovery problem by finding Memex-like trails through the graph of foreign key dependencies. It differs from previous efforts in the algorithms used, in the presentation mechanism and in the use of primary-key only database queries at query-time to maintain a fast response for users. We present examples using the DBLP data set.

Submitted 31 July, 2003; originally announced July 2003.

Comments: 12 pages, 7 figures

ACM Class: H.3; H.4; H.5

arXiv:cs/0306122 [pdf, ps, other]

The Best Trail Algorithm for Assisted Navigation of Web Sites

Authors: Richard Wheeldon, Mark Levene

Abstract: We present an algorithm called the Best Trail Algorithm, which helps solve the hypertext navigation problem by automating the construction of memex-like trails through the corpus. The algorithm performs a probabilistic best-first expansion of a set of navigation trees to find relevant and compact trails. We describe the implementation of the algorithm, scoring methods for trails, filtering algorithms and a new metric called \emph{potential gain} which measures the potential of a page for future navigation opportunities.

Submitted 22 June, 2003; originally announced June 2003.

Comments: 11 pages, 11 figures

ACM Class: H.3.3; H.5.4; G.2.2; F.2.2
