Showing 1–26 of 26 results for author: Opitz, J

  1. arXiv:2405.05966

    cs.CL cs.AI

    Natural Language Processing RELIES on Linguistics

    Authors: Juri Opitz, Shira Wein, Nathan Schneider

    Abstract: Large Language Models (LLMs) have become capable of generating highly fluent text in certain languages, without modules specially designed to capture grammar or semantic coherence. What does this mean for the future of linguistic expertise in NLP? We highlight several aspects in which NLP (still) relies on linguistics, or where linguistic thinking can illuminate new directions. We argue our case a…

    Submitted 9 May, 2024; originally announced May 2024.

  2. A Closer Look at Classification Evaluation Metrics and a Critical Reflection of Common Evaluation Practice

    Authors: Juri Opitz

    Abstract: Classification systems are evaluated in a countless number of papers. However, we find that evaluation practice is often nebulous. Frequently, metrics are selected without arguments, and blurry terminology invites misconceptions. For instance, many works use so-called 'macro' metrics to rank systems (e.g., 'macro F1') but do not clearly specify what they would expect from such a 'macro' metric. Th…

    Submitted 2 July, 2024; v1 submitted 25 April, 2024; originally announced April 2024.

    Comments: appeared in the TACL journal. MIT Press publication available at https://doi.org/10.1162/tacl_a_00675

  3. arXiv:2404.03344

    cs.CL

    Schroedinger's Threshold: When the AUC doesn't predict Accuracy

    Authors: Juri Opitz

    Abstract: The Area Under Curve measure (AUC) seems apt to evaluate and compare diverse models, possibly without calibration. An important example of AUC application is the evaluation and benchmarking of models that predict faithfulness of generated text. But we show that the AUC yields an academic and optimistic notion of accuracy that can misalign with the actual accuracy observed in application, yielding…

    Submitted 27 May, 2024; v1 submitted 4 April, 2024; originally announced April 2024.

    Comments: LREC-COLING 2024, added more details on data setups, fixed typo

  4. arXiv:2404.01701

    cs.CL

    On the Role of Summary Content Units in Text Summarization Evaluation

    Authors: Marcel Nawrath, Agnieszka Nowak, Tristan Ratz, Danilo C. Walenta, Juri Opitz, Leonardo F. R. Ribeiro, João Sedoc, Daniel Deutsch, Simon Mille, Yixin Liu, Lining Zhang, Sebastian Gehrmann, Saad Mahamood, Miruna Clinciu, Khyathi Chandu, Yufang Hou

    Abstract: At the heart of the Pyramid evaluation method for text summarization lie human written summary content units (SCUs). These SCUs are concise sentences that decompose a summary into small facts. Such SCUs can be used to judge the quality of a candidate summary, possibly partially automated via natural language inference (NLI) systems. Interestingly, with the aim to fully automate the Pyramid evaluat…

    Submitted 2 April, 2024; originally announced April 2024.

    Comments: 10 Pages, 3 Figures, 3 Tables, camera ready version accepted at NAACL 2024

  5. arXiv:2310.19792

    cs.CL

    The Eval4NLP 2023 Shared Task on Prompting Large Language Models as Explainable Metrics

    Authors: Christoph Leiter, Juri Opitz, Daniel Deutsch, Yang Gao, Rotem Dror, Steffen Eger

    Abstract: With an increasing number of parameters and pre-training data, generative large language models (LLMs) have shown remarkable capabilities to solve tasks with minimal or no task-related examples. Notably, LLMs have been successfully employed as evaluation metrics in text generation tasks. Within this context, we introduce the Eval4NLP 2023 shared task that asks participants to explore prompting and…

    Submitted 30 October, 2023; originally announced October 2023.

  6. arXiv:2307.15002

    cs.CL

    Gzip versus bag-of-words for text classification

    Authors: Juri Opitz

    Abstract: The effectiveness of compression in text classification ('gzip') has recently garnered lots of attention. In this note we show that 'bag-of-words' approaches can achieve similar or better results, and are more efficient.

    Submitted 8 August, 2023; v1 submitted 27 July, 2023; originally announced July 2023.

    Comments: improved writing, extended with more results

  7. arXiv:2306.00936

    cs.CL cs.IR

    AMR4NLI: Interpretable and robust NLI measures from semantic graphs

    Authors: Juri Opitz, Shira Wein, Julius Steen, Anette Frank, Nathan Schneider

    Abstract: The task of natural language inference (NLI) asks whether a given premise (expressed in NL) entails a given NL hypothesis. NLI benchmarks contain human ratings of entailment, but the meaning relationships driving these ratings are not formalized. Can the underlying sentence pair relationships be made more explicit in an interpretable yet robust fashion? We compare semantic structures to represent…

    Submitted 5 September, 2023; v1 submitted 1 June, 2023; originally announced June 2023.

    Comments: International Conference on Computational Semantics (IWCS 2023); v2 fixes an imprecise sentence below Eq. 5

  8. arXiv:2305.16819

    cs.CL

    With a Little Push, NLI Models can Robustly and Efficiently Predict Faithfulness

    Authors: Julius Steen, Juri Opitz, Anette Frank, Katja Markert

    Abstract: Conditional language models still generate unfaithful output that is not supported by their input. These unfaithful generations jeopardize trust in real-world applications such as summarization or human-machine interaction, motivating a need for automatic faithfulness metrics. To implement such metrics, NLI models seem attractive, since they solve a strongly related task that comes with a wealth o…

    Submitted 26 May, 2023; originally announced May 2023.

    Comments: ACL 2023 (short paper)

  9. arXiv:2305.08495

    cs.CL cs.DB

    Similarity-weighted Construction of Contextualized Commonsense Knowledge Graphs for Knowledge-intense Argumentation Tasks

    Authors: Moritz Plenz, Juri Opitz, Philipp Heinisch, Philipp Cimiano, Anette Frank

    Abstract: Arguments often do not make explicit how a conclusion follows from its premises. To compensate for this lack, we enrich arguments with structured background knowledge to support knowledge-intense argumentation tasks. We present a new unsupervised method for constructing Contextualized Commonsense Knowledge Graphs (CCKGs) that selects contextually relevant knowledge from large knowledge graphs (KGs…

    Submitted 15 May, 2023; originally announced May 2023.

    Comments: Accepted at ACL 2023

  10. arXiv:2305.06993

    cs.CL cs.AI

    SMATCH++: Standardized and Extended Evaluation of Semantic Graphs

    Authors: Juri Opitz

    Abstract: The Smatch metric is a popular method for evaluating graph distances, as is necessary, for instance, to assess the performance of semantic graph parsing systems. However, we observe some issues in the metric that jeopardize meaningful evaluation. E.g., opaque pre-processing choices can affect results, and current graph-alignment solvers do not provide us with upper-bounds. Without upper-bounds, ho…

    Submitted 11 May, 2023; originally announced May 2023.

    Comments: EACL 2023 findings, Code: https://github.com/flipz357/smatchpp

  11. arXiv:2210.06461

    cs.CL cs.AI

    Better Smatch = Better Parser? AMR evaluation is not so simple anymore

    Authors: Juri Opitz, Anette Frank

    Abstract: Recently, astonishing advances have been observed in AMR parsing, as measured by the structural Smatch metric. In fact, today's systems achieve performance levels that seem to surpass estimates of human inter annotator agreement (IAA). Therefore, it is unclear how well Smatch (still) relates to human estimates of parse quality, as in this situation potentially fine-grained errors of similar weight…

    Submitted 12 October, 2022; originally announced October 2022.

    Comments: accepted at "Evaluation and Comparison of NLP Systems" Workshop (Eval4NLP 2022)

  12. arXiv:2206.07023

    cs.CL cs.AI

    SBERT studies Meaning Representations: Decomposing Sentence Embeddings into Explainable Semantic Features

    Authors: Juri Opitz, Anette Frank

    Abstract: Models based on large-pretrained language models, such as S(entence)BERT, provide effective and efficient sentence embeddings that show high correlation to human similarity ratings, but lack interpretability. On the other hand, graph metrics for graph-based meaning representations (e.g., Abstract Meaning Representation, AMR) can make explicit the semantic aspects in which two sentences are similar…

    Submitted 28 October, 2022; v1 submitted 14 June, 2022; originally announced June 2022.

    Comments: to appear in AACL 2022 (main)

  13. arXiv:2205.12176

    cs.CL

    A Dynamic, Interpreted CheckList for Meaning-oriented NLG Metric Evaluation -- through the Lens of Semantic Similarity Rating

    Authors: Laura Zeidler, Juri Opitz, Anette Frank

    Abstract: Evaluating the quality of generated text is difficult, since traditional NLG evaluation metrics, focusing more on surface form than meaning, often fail to assign appropriate scores. This is especially problematic for AMR-to-text evaluation, given the abstract nature of AMR. Our work aims to support the development and improvement of NLG evaluation metrics that focus on meaning, by developing a dyn…

    Submitted 24 May, 2022; originally announced May 2022.

    Comments: to appear in *SEM 2022

  14. arXiv:2203.13226

    cs.CL

    SMARAGD: Learning SMatch for Accurate and Rapid Approximate Graph Distance

    Authors: Juri Opitz, Philipp Meier, Anette Frank

    Abstract: The similarity of graph structures, such as Meaning Representations (MRs), is often assessed via structural matching algorithms, such as Smatch (Cai and Knight, 2013). However, Smatch involves a combinatorial problem that suffers from NP-completeness, making large-scale applications, e.g., graph clustering or search, infeasible. To alleviate this issue, we learn SMARAGD: Semantic Match for Accurat…

    Submitted 1 June, 2023; v1 submitted 24 March, 2022; originally announced March 2022.

    Comments: to appear at 15th International Conference on Computational Semantics (IWCS 2023)

  15. arXiv:2108.11949

    cs.CL cs.AI

    Weisfeiler-Leman in the BAMBOO: Novel AMR Graph Metrics and a Benchmark for AMR Graph Similarity

    Authors: Juri Opitz, Angel Daza, Anette Frank

    Abstract: Several metrics have been proposed for assessing the similarity of (abstract) meaning representations (AMRs), but little is known about how they relate to human similarity ratings. Moreover, the current metrics have complementary strengths and weaknesses: some emphasize speed, while others make the alignment of graph structures explicit, at the price of a costly alignment step. In this work we p…

    Submitted 26 August, 2021; originally announced August 2021.

    Comments: to appear in TACL, this is a pre-MIT Press publication version

  16. arXiv:2106.04565

    cs.CL cs.AI

    Translate, then Parse! A strong baseline for Cross-Lingual AMR Parsing

    Authors: Sarah Uhrig, Yoalli Rezepka Garcia, Juri Opitz, Anette Frank

    Abstract: In cross-lingual Abstract Meaning Representation (AMR) parsing, researchers develop models that project sentences from various languages onto their AMRs to capture their essential semantic structures: given a sentence in any language, we aim to capture its core semantic content through concepts connected by manifold types of semantic relations. Methods typically leverage large silver training data…

    Submitted 8 June, 2021; originally announced June 2021.

    Comments: IWPT 2021

  17. arXiv:2008.08896

    cs.CL

    Towards a Decomposable Metric for Explainable Evaluation of Text Generation from AMR

    Authors: Juri Opitz, Anette Frank

    Abstract: Systems that generate natural language text from abstract meaning representations such as AMR are typically evaluated using automatic surface matching metrics that compare the generated texts to reference texts from which the input meaning representations were constructed. We show that besides well-known issues from which such metrics suffer, an additional problem arises when applying these metric…

    Submitted 26 January, 2021; v1 submitted 20 August, 2020; originally announced August 2020.

    Comments: EACL 2021

  18. arXiv:2005.12187

    cs.CL cs.LG

    AMR Quality Rating with a Lightweight CNN

    Authors: Juri Opitz

    Abstract: Structured semantic sentence representations such as Abstract Meaning Representations (AMRs) are potentially useful in various NLP tasks. However, the quality of automatic parses can vary greatly and jeopardizes their usefulness. This can be mitigated by models that can accurately rate AMR quality in the absence of costly gold data, allowing us to inform downstream systems about an incorporated pa…

    Submitted 16 December, 2020; v1 submitted 25 May, 2020; originally announced May 2020.

    Comments: AACL-IJCNLP 2020

  19. AMR Similarity Metrics from Principles

    Authors: Juri Opitz, Letitia Parcalabescu, Anette Frank

    Abstract: Different metrics have been proposed to compare Abstract Meaning Representation (AMR) graphs. The canonical Smatch metric (Cai and Knight, 2013) aligns the variables of two graphs and assesses triple matches. The recent SemBleu metric (Song and Gildea, 2019) is based on the machine-translation metric Bleu (Papineni et al., 2002) and increases computational efficiency by ablating the variable-align…

    Submitted 17 September, 2020; v1 submitted 29 January, 2020; originally announced January 2020.

    Comments: TACL 2020 https://doi.org/10.1162/tacl_a_00329

  20. arXiv:1911.03347

    cs.LG stat.ML

    Macro F1 and Macro F1

    Authors: Juri Opitz, Sebastian Burst

    Abstract: The 'macro F1' metric is frequently used to evaluate binary, multi-class and multi-label classification problems. Yet, we find that there exist two different formulas to calculate this quantity. In this note, we show that only under rare circumstances the two computations can be considered equivalent. More specifically, one formula well 'rewards' classifiers which produce a skewed error type distr…

    Submitted 8 February, 2021; v1 submitted 8 November, 2019; originally announced November 2019.

    Comments: 6 pages (+ appendix), 6 figures, fixed typo

  21. arXiv:1909.09031

    cs.CL

    Argumentative Relation Classification as Plausibility Ranking

    Authors: Juri Opitz

    Abstract: We formulate argumentative relation classification (support vs. attack) as a text-plausibility ranking task. To this aim, we propose a simple reconstruction trick which enables us to build minimal pairs of plausible and implausible texts by simulating natural contexts in which two argumentative units are likely or unlikely to appear. We show that this method is competitive with previous work albei…

    Submitted 19 September, 2019; originally announced September 2019.

    Comments: 15th Conference on Natural Language Processing (KONVENS 2019)

  22. arXiv:1906.03338

    cs.CL

    Dissecting Content and Context in Argumentative Relation Analysis

    Authors: Juri Opitz, Anette Frank

    Abstract: When assessing relations between argumentative units (e.g., support or attack), computational systems often exploit disclosing indicators or markers that are not part of elementary argumentative units (EAUs) themselves, but are gained from their context (position in paragraph, preceding tokens, etc.). We show that this dependency is much stronger than previously assumed. In fact, we show that by c…

    Submitted 7 June, 2019; originally announced June 2019.

    Comments: accepted at 6th Workshop on Argument Mining

  23. arXiv:1904.08301

    cs.CL

    Automatic Accuracy Prediction for AMR Parsing

    Authors: Juri Opitz, Anette Frank

    Abstract: Abstract Meaning Representation (AMR) represents sentences as directed, acyclic and rooted graphs, aiming at capturing their meaning in a machine readable format. AMR parsing converts natural language sentences into such graphs. However, evaluating a parser on new data by means of comparison to manually created AMR graphs is very costly. Also, we would like to be able to detect parses of questiona…

    Submitted 17 April, 2019; originally announced April 2019.

    Comments: accepted at *SEM 2019

  24. arXiv:1902.01349

    cs.CL

    An Argument-Marker Model for Syntax-Agnostic Proto-Role Labeling

    Authors: Juri Opitz, Anette Frank

    Abstract: Semantic proto-role labeling (SPRL) is an alternative to semantic role labeling (SRL) that moves beyond a categorical definition of roles, following Dowty's feature-based view of proto-roles. This theory determines agenthood vs. patienthood based on a participant's instantiation of more or less typical agent vs. patient properties, such as, for example, volition in an event. To perform SPRL, we de…

    Submitted 12 April, 2019; v1 submitted 4 February, 2019; originally announced February 2019.

    Comments: accepted at *SEM 2019

  25. arXiv:1706.02256

    cs.CL stat.ML

    A Mention-Ranking Model for Abstract Anaphora Resolution

    Authors: Ana Marasović, Leo Born, Juri Opitz, Anette Frank

    Abstract: Resolving abstract anaphora is an important, but difficult task for text understanding. Yet, with recent advances in representation learning this task becomes a more tangible aim. A central property of abstract anaphora is that it establishes a relation between the anaphor embedded in the anaphoric sentence and its (typically non-nominal) antecedent. We propose a mention-ranking model that learns…

    Submitted 21 July, 2017; v1 submitted 7 June, 2017; originally announced June 2017.

    Comments: In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP). Copenhagen, Denmark

  26. arXiv:1012.1882

    cs.MM

    Evaluating Modelling Approaches for Medical Image Annotations

    Authors: Jasmin Opitz, Bijan Parsia, Ulrike Sattler

    Abstract: Information system designers face many challenges w.r.t. selecting appropriate semantic technologies and deciding on a modelling approach for their system. However, there is no clear methodology yet to evaluate "semantically enriched" information systems. In this paper we present a case study on different modelling approaches for annotating medical images and introduce a conceptual framework that…

    Submitted 8 December, 2010; originally announced December 2010.

    Comments: in Adrian Paschke, Albert Burger, Andrea Splendiani, M. Scott Marshall, Paolo Romano: Proceedings of the 3rd International Workshop on Semantic Web Applications and Tools for the Life Sciences, Berlin, Germany, December 8-10, 2010

    Report number: SWAT4LS 2010

    ACM Class: J.3