-
Metrics also Disagree in the Low Scoring Range: Revisiting Summarization Evaluation Metrics
Authors:
Manik Bhandari,
Pranav Gour,
Atabak Ashfaq,
Pengfei Liu
Abstract:
In text summarization, evaluating the efficacy of automatic metrics without human judgments has recently become popular. One exemplar work concludes that automatic metrics strongly disagree when ranking high-scoring summaries. In this paper, we revisit their experiments and find that their observations stem from the fact that metrics disagree in ranking summaries from any narrow scoring range. We hypothesize that this may be because summaries are similar to each other in a narrow scoring range and are thus difficult to rank. Apart from the width of the scoring range of summaries, we analyze three other properties that impact inter-metric agreement: Ease of Summarization, Abstractiveness, and Coverage. To encourage reproducible research, we make all our analysis code and data publicly available.
Submitted 8 November, 2020;
originally announced November 2020.
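The central claim of the entry above, that two metrics agree less when only a narrow scoring range is ranked, can be illustrated with a small simulation. This is a minimal sketch on synthetic data, not the authors' code or protocol: the names `kendall_tau` and `simulate`, the noise level, and the window bounds are all assumptions made for illustration, and Kendall's tau is used here simply as one common rank-agreement measure.

```python
import random


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(xs)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                score += 1  # pair ranked the same way by both metrics
            elif s < 0:
                score -= 1  # pair ranked in opposite order
    return score / (n * (n - 1) / 2)


def simulate(n=1000, noise=0.05, lo=0.45, hi=0.55, seed=0):
    """Return (tau over all summaries, tau inside a narrow score window).

    Hypothetical setup: each summary has a latent quality in [0, 1], and
    two metrics observe that quality with independent Gaussian noise.
    """
    rng = random.Random(seed)
    quality = [rng.random() for _ in range(n)]
    metric_a = [q + rng.gauss(0, noise) for q in quality]
    metric_b = [q + rng.gauss(0, noise) for q in quality]

    tau_full = kendall_tau(metric_a, metric_b)

    # Keep only summaries whose metric-A score falls in the narrow window.
    narrow = [(a, b) for a, b in zip(metric_a, metric_b) if lo <= a <= hi]
    tau_narrow = kendall_tau([a for a, _ in narrow], [b for _, b in narrow])
    return tau_full, tau_narrow


if __name__ == "__main__":
    full, narrow = simulate()
    print(f"tau over full range:    {full:.3f}")
    print(f"tau over narrow window: {narrow:.3f}")
```

Inside the window the spread of latent quality is small relative to the metric noise, so the two rankings are driven mostly by noise and agreement drops. Moving `lo` and `hi` to any other narrow interval would be expected to show the same effect, which is the point the abstract makes about narrow ranges in general.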
-
Re-evaluating Evaluation in Text Summarization
Authors:
Manik Bhandari,
Pranav Gour,
Atabak Ashfaq,
Pengfei Liu,
Graham Neubig
Abstract:
Automated evaluation metrics as a stand-in for manual evaluation are an essential part of the development of text-generation tasks such as text summarization. However, while the field has progressed, our standard metrics have not -- for nearly 20 years ROUGE has been the standard evaluation in most summarization papers. In this paper, we make an attempt to re-evaluate the evaluation method for text summarization: assessing the reliability of automatic metrics using top-scoring system outputs, both abstractive and extractive, on recently popular datasets for both system-level and summary-level evaluation settings. We find that conclusions about evaluation metrics on older datasets do not necessarily hold on modern datasets and systems.
Submitted 14 October, 2020;
originally announced October 2020.
-
Avoiding Improper Treatment of Persons with Dementia by Care Robots
Authors:
Martin Cooney,
Sepideh Pashami,
Eric Järpe,
Awais Ashfaq
Abstract:
The phrase "most cruel and revolting crimes" has been used to describe some poor historical treatment of vulnerable impaired persons by precisely those who should have had the responsibility of protecting and helping them. We believe we might be poised to see history repeat itself, as increasingly human-like aware robots become capable of engaging in behavior which we would consider immoral in a human--either unknowingly or deliberately. In the current paper we focus in particular on exploring some potential dangers affecting persons with dementia (PWD), which could arise from insufficient software or external factors, and describe a proposed solution involving rich causal models and accountability measures: Specifically, the Consequences of Needs-driven Dementia-compromised Behaviour model (C-NDB) could be adapted to be used with conversation topic detection, causal networks and multi-criteria decision making, alongside reports, audits, and deterrents. Our aim is that the considerations raised could help inform the design of care robots intended to support well-being in PWD.
Submitted 8 May, 2020;
originally announced May 2020.
-
Machine learning in healthcare -- a system's perspective
Authors:
Awais Ashfaq,
Slawomir Nowaczyk
Abstract:
A consequence of the fragmented and siloed healthcare landscape is that patient care (and data) is split across a multitude of different facilities and computer systems, and enabling interoperability between these systems is hard. The lack of interoperability not only hinders continuity of care and burdens providers, but also hinders effective application of Machine Learning (ML) algorithms. Thus, most current ML algorithms, designed to understand patient care and facilitate clinical decision-support, are trained on limited datasets. This approach is analogous to the Newtonian paradigm of Reductionism in which a system is broken down into elementary components and a description of the whole is formed by understanding those components individually. A key limitation of the reductionist approach is that it ignores the component-component interactions and dynamics within the system which are often of prime significance in understanding the overall behaviour of complex adaptive systems (CAS). Healthcare is a CAS.
Though the application of ML on health data has shown incremental improvements for clinical decision support, ML has a much broader potential to restructure care delivery as a whole and maximize care value. However, this ML potential remains largely untapped, primarily due to functional limitations of Electronic Health Records (EHR) and the inability to see the healthcare system as a whole. This viewpoint (i) articulates healthcare as a complex system which has a biological and an organizational perspective, (ii) motivates, with examples, the need for a system's approach when addressing healthcare challenges via ML, and (iii) emphasizes the need to unleash EHR functionality - while duly respecting all ethical and legal concerns - to reap the full benefits of ML.
Submitted 19 January, 2020; v1 submitted 14 September, 2019;
originally announced September 2019.
-
Properties for the Frechet Mean in Billera-Holmes-Vogtmann Treespace
Authors:
Maria Anaya,
Olga Anipchenko-Ulaj,
Aisha Ashfaq,
Joyce Chiu,
Mahedi Kaiser,
Max Shoji Ohsawa,
Megan Owen,
Ella Pavlechko,
Katherine St. John,
Shivam Suleria,
Keith Thompson,
Corrine Yap
Abstract:
The Billera-Holmes-Vogtmann (BHV) space of weighted trees can be embedded in Euclidean space, but the extrinsic Euclidean mean often lies outside of treespace. Sturm showed that the intrinsic Frechet mean exists and is unique in treespace. This Frechet mean can be approximated with an iterative algorithm, but bounds on the convergence of the algorithm are not known, and there is no other known polynomial algorithm for computing the Frechet mean nor even the edges present in the mean. We give the first necessary and sufficient conditions for an edge to be in the Frechet mean. The conditions are in the form of inequalities on the weights of the edges. These conditions provide a pre-processing step for finding the treespace orthant containing the Frechet mean. This work generalizes to orthant spaces.
Submitted 12 July, 2019;
originally announced July 2019.
-
A modified fuzzy C means algorithm for shading correction in craniofacial CBCT images
Authors:
Awais Ashfaq,
Jonas Adler
Abstract:
CBCT images suffer from acute shading artifacts primarily due to scatter. Numerous image-domain correction algorithms have been proposed in the literature that use patient-specific planning CT images to estimate shading contributions in CBCT images. However, in the context of radiosurgery applications such as gamma knife, planning images are often acquired through MRI which impedes the use of polynomial fitting approaches for shading correction. We present a new shading correction approach that is independent of planning CT images. Our algorithm is based on the assumption that true CBCT images follow a uniform volumetric intensity distribution per material, and scatter perturbs this uniform texture by contributing cupping and shading artifacts in the image domain. The framework is a combination of fuzzy C-means coupled with a neighborhood regularization term and Otsu's method. Experimental results on artificially simulated craniofacial CBCT images are provided to demonstrate the effectiveness of our algorithm. Spatial non-uniformity is reduced from 16% to 7% in soft tissue and from 44% to 8% in bone regions. With shading-correction, thresholding based segmentation accuracy for bone pixels is improved from 85% to 91% when compared to thresholding without shading-correction. The proposed algorithm is thus practical and qualifies as a plug and play extension into any CBCT reconstruction software for shading correction.
Submitted 17 January, 2018;
originally announced January 2018.
-
On Determining if Tree-based Networks Contain Fixed Trees
Authors:
Maria Anaya,
Olga Anipchenko-Ulaj,
Aisha Ashfaq,
Joyce Chiu,
Mahedi Kaiser,
Max Shoji Ohsawa,
Megan Owen,
Ella Pavlechko,
Katherine St. John,
Shivam Suleria,
Keith Thompson,
Corrine Yap
Abstract:
We address an open question of Francis and Steel about phylogenetic networks and trees. They give a polynomial time algorithm to decide if a phylogenetic network, N, is tree-based and pose the problem: given a fixed tree T and network N, is N based on T? We show that it is NP-hard to decide, by reduction from 3-Dimensional Matching (3DM), and further, that the problem is fixed parameter tractable.
Submitted 8 February, 2016;
originally announced February 2016.
-
Performance of a Large-Area GEM Detector Prototype for the Upgrade of the CMS Muon Endcap System
Authors:
D. Abbaneo,
M. Abbas,
M. Abbrescia,
A. A. Abdelalim,
M. Abi Akl,
W. Ahmed,
W. Ahmed,
P. Altieri,
R. Aly,
C. Asawatangtrakuldee,
A. Ashfaq,
P. Aspell,
Y. Assran,
I. Awan,
S. Bally,
Y. Ban,
S. Banerjee,
P. Barria,
L. Benussi,
V. Bhopatkar,
S. Bianco,
J. Bos,
O. Bouhali,
S. Braibant,
S. Buontempo
, et al. (113 additional authors not shown)
Abstract:
Gas Electron Multiplier (GEM) technology is being considered for the forward muon upgrade of the CMS experiment in Phase 2 of the CERN LHC. Its first implementation is planned for the GE1/1 system in the $1.5 < |η| < 2.2$ region of the muon endcap, mainly to control muon level-1 trigger rates after the second long LHC shutdown. A GE1/1 triple-GEM detector is read out by 3,072 radial strips with 455 $μ$rad pitch arranged in eight $η$-sectors. We assembled a full-size GE1/1 prototype of 1 m length at Florida Tech and tested it in 20-120 GeV hadron beams at Fermilab using Ar/CO$_{2}$ 70:30 and the RD51 scalable readout system. Four small GEM detectors with 2-D readout and an average measured azimuthal resolution of 36 $μ$rad provided precise reference tracks. Construction of this largest GEM detector built to date is described. Strip cluster parameters, detection efficiency, and spatial resolution are studied with position and high voltage scans. The plateau detection efficiency is [97.1 $\pm$ 0.2 (stat)]%. The azimuthal resolution is found to be [123.5 $\pm$ 1.6 (stat)] $μ$rad when operating in the center of the efficiency plateau and using full pulse height information. The resolution can be slightly improved by $\sim$ 10 $μ$rad when correcting for the bias due to discrete readout strips. The CMS upgrade design calls for readout electronics with binary hit output. When strip clusters are formed correspondingly without charge-weighting and with fixed hit thresholds, a position resolution of [136.8 $\pm$ 2.5 (stat)] $μ$rad is measured, consistent with the expected resolution of strip-pitch/$\sqrt{12}$ = 131.3 $μ$rad. Other $η$-sectors of the detector show similar response and performance.
Submitted 8 December, 2014; v1 submitted 30 November, 2014;
originally announced December 2014.
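The expected binary-readout resolution quoted in the entry above, strip-pitch/$\sqrt{12}$, follows from a standard argument that the abstract does not spell out. As a sketch, assume (this is the usual idealization, not a statement from the paper) that a hit is equally likely to fall anywhere across a strip of angular pitch $p$ and is always reported at the strip center. The residual $x$ is then uniform on $[-p/2, p/2]$, and its standard deviation is:

```latex
\sigma
  = \sqrt{\frac{1}{p}\int_{-p/2}^{p/2} x^{2}\,dx}
  = \sqrt{\frac{p^{2}}{12}}
  = \frac{p}{\sqrt{12}},
\qquad
\frac{455\ \mu\mathrm{rad}}{\sqrt{12}} \approx 131.3\ \mu\mathrm{rad}.
```

With the 455 $μ$rad pitch stated in the abstract, this reproduces the quoted 131.3 $μ$rad.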