-
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Authors:
Gemini Team,
Petko Georgiev,
Ving Ian Lei,
Ryan Burnell,
Libin Bai,
Anmol Gulati,
Garrett Tanzer,
Damien Vincent,
Zhufeng Pan,
Shibo Wang,
Soroosh Mariooryad,
Yifan Ding,
Xinyang Geng,
Fred Alcober,
Roy Frostig,
Mark Omernick,
Lexi Walker,
Cosmin Paduraru,
Christina Sorokin,
Andrea Tacchetti,
Colin Gaffney,
Samira Daruki,
Olcan Sercinoglu,
Zach Gleicher,
Juliette Love
, et al. (1092 additional authors not shown)
Abstract:
In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February…
▽ More
In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.
△ Less
Submitted 14 June, 2024; v1 submitted 8 March, 2024;
originally announced March 2024.
-
Gemini: A Family of Highly Capable Multimodal Models
Authors:
Gemini Team,
Rohan Anil,
Sebastian Borgeaud,
Jean-Baptiste Alayrac,
Jiahui Yu,
Radu Soricut,
Johan Schalkwyk,
Andrew M. Dai,
Anja Hauth,
Katie Millican,
David Silver,
Melvin Johnson,
Ioannis Antonoglou,
Julian Schrittwieser,
Amelia Glaese,
Jilin Chen,
Emily Pitler,
Timothy Lillicrap,
Angeliki Lazaridou,
Orhan Firat,
James Molloy,
Michael Isard,
Paul R. Barham,
Tom Hennigan,
Benjamin Lee
, et al. (1325 additional authors not shown)
Abstract:
This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultr…
▽ More
This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI.
△ Less
Submitted 17 June, 2024; v1 submitted 18 December, 2023;
originally announced December 2023.
-
Identifying mechanisms driving the early response of triple negative breast cancer patients to neoadjuvant chemotherapy using a mechanistic model integrating in vitro and in vivo imaging data
Authors:
Guillermo Lorenzo,
Angela M. Jarrett,
Christian T. Meyer,
Vito Quaranta,
Darren R. Tyson,
Thomas E. Yankeelov
Abstract:
Neoadjuvant chemotherapy (NAC) is a standard-of-care treatment for locally advanced triple negative breast cancer (TNBC) before surgery. The early assessment of TNBC response to NAC would enable an oncologist to adapt the therapeutic plan of a non-responding patient, thereby improving treatment outcomes while preventing unnecessary toxicities. To this end, a promising approach consists of obtainin…
▽ More
Neoadjuvant chemotherapy (NAC) is a standard-of-care treatment for locally advanced triple negative breast cancer (TNBC) before surgery. The early assessment of TNBC response to NAC would enable an oncologist to adapt the therapeutic plan of a non-responding patient, thereby improving treatment outcomes while preventing unnecessary toxicities. To this end, a promising approach consists of obtaining \textsl{in silico} personalized forecasts of tumor response to NAC \textsl{via} computer simulation of mechanistic models constrained with patient-specific magnetic resonance imaging (MRI) data acquired early during NAC. Here, we present a model featuring the essential mechanisms of TNBC growth and response to NAC, including an explicit description of drug pharmacodynamics and pharmacokinetics. As longitudinal \textsl{in vivo} MRI data for model calibration is limited, we perform a sensitivity analysis to identify the model mechanisms driving the response to two NAC drug combinations: doxorubicin with cyclophosphamide, and paclitaxel with carboplatin. The model parameter space is constructed by combining patient-specific MRI-based \textsl{in silico} parameter estimates and \textit{in vitro} measurements of pharmacodynamic parameters obtained using time-resolved microscopy assays of several TNBC lines. The sensitivity analysis is run in two MRI-based scenarios corresponding to a well-perfused and a poorly-perfused tumor. Out of the 15 parameters considered herein, only the baseline tumor cell net proliferation rate along with the maximum concentrations and effects of doxorubicin, carboplatin, and paclitaxel exhibit a relevant impact on model forecasts (total effect index, $S_T>$0.1). These results dramatically limit the number of parameters that require \textsl{in vivo} MRI-constrained calibration, thereby facilitating the clinical application of our model.
△ Less
Submitted 8 December, 2022;
originally announced December 2022.
-
Gradual Soundness: Lessons from Static Python
Authors:
Kuang-Chen Lu,
Ben Greenman,
Carl Meyer,
Dino Viehland,
Aniket Panse,
Shriram Krishnamurthi
Abstract:
Context: Gradually-typed languages allow typed and untyped code to interoperate, but typically come with significant drawbacks. In some languages, the types are unreliable; in others, communication across type boundaries can be extremely expensive; and still others allow only limited forms of interoperability. The research community is actively seeking a sound, fast, and expressive approach to gra…
▽ More
Context: Gradually-typed languages allow typed and untyped code to interoperate, but typically come with significant drawbacks. In some languages, the types are unreliable; in others, communication across type boundaries can be extremely expensive; and still others allow only limited forms of interoperability. The research community is actively seeking a sound, fast, and expressive approach to gradual ty**.
Inquiry: This paper describes Static Python, a language developed by engineers at Instagram that has proven itself sound, fast, and reasonably expressive in production. Static Python's approach to gradual types is essentially a programmer-tunable combination of the concrete and transient approaches from the literature. Concrete types provide full soundness and low performance overhead, but impose nonlocal constraints. Transient types are sound in a shallow sense and easier to use; they help to bridge the gap between untyped code and typed concrete code.
Approach: We evaluate the language in its current state and develop a model that captures the essence of its approach to gradual types. We draw upon personal communication, bug reports, and the Static Python regression test suite to develop this model.
Knowledge: Our main finding is that the gradual soundness that arises from a mix of concrete and transient types is an effective way to lower the maintenance cost of the concrete approach. We also find that method-based JIT technology can eliminate the costs of the transient approach. On a more technical level, this paper describes two contributions: a model of Static Python and a performance evaluation of Static Python. The process of formalization found several errors in the implementation, including fatal errors.
Grounding: Our model of Static Python is implemented in PLT Redex and tested using property-based soundness tests and 265 tests from the Static Python regression suite. This paper includes a small core of the model to convey the main ideas of the Static Python approach and its soundness. Our performance claims are based on production experience in the Instagram web server. Migrations to Static Python in the server have caused a 3.7\% increase in requests handled per second at maximum CPU load.
Importance: Static Python is the first sound gradual language whose piece-meal application to a realistic codebase has consistently improved performance. Other language designers may wish to replicate its approach, especially those who currently maintain unsound gradual languages and are seeking a path to soundness.
△ Less
Submitted 28 June, 2022;
originally announced June 2022.
-
QU-BraTS: MICCAI BraTS 2020 Challenge on Quantifying Uncertainty in Brain Tumor Segmentation - Analysis of Ranking Scores and Benchmarking Results
Authors:
Raghav Mehta,
Angelos Filos,
Ujjwal Baid,
Chiharu Sako,
Richard McKinley,
Michael Rebsamen,
Katrin Datwyler,
Raphael Meier,
Piotr Radojewski,
Gowtham Krishnan Murugesan,
Sahil Nalawade,
Chandan Ganesh,
Ben Wagner,
Fang F. Yu,
Baowei Fei,
Ananth J. Madhuranthakam,
Joseph A. Maldjian,
Laura Daza,
Catalina Gomez,
Pablo Arbelaez,
Chengliang Dai,
Shuo Wang,
Hadrien Reynaud,
Yuan-han Mo,
Elsa Angelini
, et al. (67 additional authors not shown)
Abstract:
Deep learning (DL) models have provided state-of-the-art performance in various medical imaging benchmarking challenges, including the Brain Tumor Segmentation (BraTS) challenges. However, the task of focal pathology multi-compartment segmentation (e.g., tumor and lesion sub-regions) is particularly challenging, and potential errors hinder translating DL models into clinical workflows. Quantifying…
▽ More
Deep learning (DL) models have provided state-of-the-art performance in various medical imaging benchmarking challenges, including the Brain Tumor Segmentation (BraTS) challenges. However, the task of focal pathology multi-compartment segmentation (e.g., tumor and lesion sub-regions) is particularly challenging, and potential errors hinder translating DL models into clinical workflows. Quantifying the reliability of DL model predictions in the form of uncertainties could enable clinical review of the most uncertain regions, thereby building trust and paving the way toward clinical translation. Several uncertainty estimation methods have recently been introduced for DL medical image segmentation tasks. Develo** scores to evaluate and compare the performance of uncertainty measures will assist the end-user in making more informed decisions. In this study, we explore and evaluate a score developed during the BraTS 2019 and BraTS 2020 task on uncertainty quantification (QU-BraTS) and designed to assess and rank uncertainty estimates for brain tumor multi-compartment segmentation. This score (1) rewards uncertainty estimates that produce high confidence in correct assertions and those that assign low confidence levels at incorrect assertions, and (2) penalizes uncertainty measures that lead to a higher percentage of under-confident correct assertions. We further benchmark the segmentation uncertainties generated by 14 independent participating teams of QU-BraTS 2020, all of which also participated in the main BraTS segmentation task. Overall, our findings confirm the importance and complementary value that uncertainty estimates provide to segmentation algorithms, highlighting the need for uncertainty quantification in medical image analyses. Finally, in favor of transparency and reproducibility, our evaluation code is made publicly available at: https://github.com/RagMeh11/QU-BraTS.
△ Less
Submitted 23 August, 2022; v1 submitted 19 December, 2021;
originally announced December 2021.
-
General Cross-Architecture Distillation of Pretrained Language Models into Matrix Embeddings
Authors:
Lukas Galke,
Isabelle Cuber,
Christoph Meyer,
Henrik Ferdinand Nölscher,
Angelina Sonderecker,
Ansgar Scherp
Abstract:
Large pretrained language models (PreLMs) are revolutionizing natural language processing across all benchmarks. However, their sheer size is prohibitive for small laboratories or for deployment on mobile devices. Approaches like pruning and distillation reduce the model size but typically retain the same model architecture. In contrast, we explore distilling PreLMs into a different, more efficien…
▽ More
Large pretrained language models (PreLMs) are revolutionizing natural language processing across all benchmarks. However, their sheer size is prohibitive for small laboratories or for deployment on mobile devices. Approaches like pruning and distillation reduce the model size but typically retain the same model architecture. In contrast, we explore distilling PreLMs into a different, more efficient architecture, Continual Multiplication of Words (CMOW), which embeds each word as a matrix and uses matrix multiplication to encode sequences. We extend the CMOW architecture and its CMOW/CBOW-Hybrid variant with a bidirectional component for more expressive power, per-token representations for a general (task-agnostic) distillation during pretraining, and a two-sequence encoding scheme that facilitates downstream tasks on sentence pairs, such as sentence similarity and natural language inference. Our matrix-based bidirectional CMOW/CBOW-Hybrid model is competitive to DistilBERT on question similarity and recognizing textual entailment, but uses only half of the number of parameters and is three times faster in terms of inference speed. We match or exceed the scores of ELMo for all tasks of the GLUE benchmark except for the sentiment analysis task SST-2 and the linguistic acceptability task CoLA. However, compared to previous cross-architecture distillation approaches, we demonstrate a doubling of the scores on detecting linguistic acceptability. This shows that matrix-based embeddings can be used to distill large PreLM into competitive models and motivates further research in this direction.
△ Less
Submitted 28 July, 2022; v1 submitted 17 September, 2021;
originally announced September 2021.
-
Deep Learning methods for automatic evaluation of delayed enhancement-MRI. The results of the EMIDEC challenge
Authors:
Alain Lalande,
Zhihao Chen,
Thibaut Pommier,
Thomas Decourselle,
Abdul Qayyum,
Michel Salomon,
Dominique Ginhac,
Youssef Skandarani,
Arnaud Boucher,
Khawla Brahim,
Marleen de Bruijne,
Robin Camarasa,
Teresa M. Correia,
Xue Feng,
Kibrom B. Girum,
Anja Hennemuth,
Markus Huellebrand,
Raabid Hussain,
Matthias Ivantsits,
Jun Ma,
Craig Meyer,
Rishabh Sharma,
Jixi Shi,
Nikolaos V. Tsekos,
Marta Varela
, et al. (8 additional authors not shown)
Abstract:
A key factor for assessing the state of the heart after myocardial infarction (MI) is to measure whether the myocardium segment is viable after reperfusion or revascularization therapy. Delayed enhancement-MRI or DE-MRI, which is performed several minutes after injection of the contrast agent, provides high contrast between viable and nonviable myocardium and is therefore a method of choice to eva…
▽ More
A key factor for assessing the state of the heart after myocardial infarction (MI) is to measure whether the myocardium segment is viable after reperfusion or revascularization therapy. Delayed enhancement-MRI or DE-MRI, which is performed several minutes after injection of the contrast agent, provides high contrast between viable and nonviable myocardium and is therefore a method of choice to evaluate the extent of MI. To automatically assess myocardial status, the results of the EMIDEC challenge that focused on this task are presented in this paper. The challenge's main objectives were twofold. First, to evaluate if deep learning methods can distinguish between normal and pathological cases. Second, to automatically calculate the extent of myocardial infarction. The publicly available database consists of 150 exams divided into 50 cases with normal MRI after injection of a contrast agent and 100 cases with myocardial infarction (and then with a hyperenhanced area on DE-MRI), whatever their inclusion in the cardiac emergency department. Along with MRI, clinical characteristics are also provided. The obtained results issued from several works show that the automatic classification of an exam is a reachable task (the best method providing an accuracy of 0.92), and the automatic segmentation of the myocardium is possible. However, the segmentation of the diseased area needs to be improved, mainly due to the small size of these areas and the lack of contrast with the surrounding structures.
△ Less
Submitted 10 August, 2021; v1 submitted 9 August, 2021;
originally announced August 2021.
-
Traversing Steep and Granular Martian Analog Slopes With a Dynamic Quadrupedal Robot
Authors:
Hendrik Kolvenbach,
Philip Arm,
Elias Hampp,
Alexander Dietsche,
Valentin Bickel,
Benjamin Sun,
Christoph Meyer,
Marco Hutter
Abstract:
Celestial bodies such as the Moon and Mars are mainly covered by loose, granular soil, a notoriously challenging terrain to traverse with (wheeled) robotic systems. Here, we present experimental work on traversing steep, granular slopes with the dynamically walking quadrupedal robot SpaceBok. To adapt to the challenging environment, we developed passive-adaptive planar feet and optimized grouser p…
▽ More
Celestial bodies such as the Moon and Mars are mainly covered by loose, granular soil, a notoriously challenging terrain to traverse with (wheeled) robotic systems. Here, we present experimental work on traversing steep, granular slopes with the dynamically walking quadrupedal robot SpaceBok. To adapt to the challenging environment, we developed passive-adaptive planar feet and optimized grouser pads to reduce sinkage and increase traction on planar and inclined granular soil. Single-foot experiments revealed that a large surface area of 110cm2 per foot reduces sinkage to an acceptable level even on highly collapsible soil (ES-1). Implementing several 12mm grouser blades increases traction by 22% to 66% on granular media compared to grouser-less designs. Together with a terrain-adapting walking controller, we validate - for the first time - static and dynamic locomotion on Mars analog slopes of up to 25°(the maximum of the testbed). We evaluated the performance between point- and planar feet and static and dynamic gaits regarding stability (safety), velocity, and energy consumption. We show that dynamic gaits are energetically more efficient than static gaits but are riskier on steep slopes. Our tests also revealed that planar feet's energy consumption drastically increases when the slope inclination approaches the soil's angle of internal friction due to shearing. Point feet are less affected by slippage due to their excessive sinkage, but in turn, are prone to instabilities and trip**. We present and discuss safe and energy-efficient global path-planning strategies for accessing steep topography on Mars based on our findings.
△ Less
Submitted 3 June, 2021;
originally announced June 2021.
-
Autotuning Benchmarking Techniques: A Roofline Model Case Study
Authors:
Jacob Odgård Tørring,
Jan Christian Meyer,
Anne C. Elster
Abstract:
Peak performance metrics published by vendors often do not correspond to what can be achieved in practice. It is therefore of great interest to do extensive benchmarking on core applications and library routines. Since DGEMM is one of the most used in compute-intensive numerical codes, it is typically highly vendor optimized and of great interest for empirical benchmarks. In this paper we show how…
▽ More
Peak performance metrics published by vendors often do not correspond to what can be achieved in practice. It is therefore of great interest to do extensive benchmarking on core applications and library routines. Since DGEMM is one of the most used in compute-intensive numerical codes, it is typically highly vendor optimized and of great interest for empirical benchmarks. In this paper we show how to build a novel tool that autotunes the benchmarking process for the Roofline model. Our novel approach can efficiently and reliably find optimal configurations for any target hardware. Results of our tool on a range of hardware architectures and comparisons to theoretical peak performance are included.
Our tool autotunes the benchmarks for the target architecture by deciding the optimal parameters through state space reductions and exhaustive search. Our core idea includes calculating the confidence interval using the variance and mean and comparing it against the current optimum solution. We can then terminate the evaluation process early if the confidence interval's maximum is lower than the current optimum solution. This dynamic approach yields a search time improvement of up to 116.33x for the DGEMM benchmarking process compared to a traditional fixed sample-size methodology. Our tool produces the same benchmarking result with an error of less than 2% for each of the optimization techniques we apply, while providing a great reduction in search time. We compare these results against hand-tuned benchmarking parameters. Results from the memory-intensive TRIAD benchmark, and some ideas for future directions are also included.
△ Less
Submitted 18 March, 2021; v1 submitted 15 March, 2021;
originally announced March 2021.
-
Empowering Active Learning to Jointly Optimize System and User Demands
Authors:
Ji-Ung Lee,
Christian M. Meyer,
Iryna Gurevych
Abstract:
Existing approaches to active learning maximize the system performance by sampling unlabeled instances for annotation that yield the most efficient training. However, when active learning is integrated with an end-user application, this can lead to frustration for participating users, as they spend time labeling instances that they would not otherwise be interested in reading. In this paper, we pr…
▽ More
Existing approaches to active learning maximize the system performance by sampling unlabeled instances for annotation that yield the most efficient training. However, when active learning is integrated with an end-user application, this can lead to frustration for participating users, as they spend time labeling instances that they would not otherwise be interested in reading. In this paper, we propose a new active learning approach that jointly optimizes the seemingly counteracting objectives of the active learning system (training efficiently) and the user (receiving useful instances). We study our approach in an educational application, which particularly benefits from this technique as the system needs to rapidly learn to predict the appropriateness of an exercise to a particular user, while the users should receive only exercises that match their skills. We evaluate multiple learning strategies and user types with data from real users and find that our joint approach better satisfies both objectives when alternative methods lead to many unsuitable exercises for end users.
△ Less
Submitted 11 May, 2020; v1 submitted 9 May, 2020;
originally announced May 2020.
-
RVSDG: An Intermediate Representation for Optimizing Compilers
Authors:
Nico Reissmann,
Jan Christian Meyer,
Helge Bahmann,
Magnus Själander
Abstract:
Intermediate Representations (IRs) are central to optimizing compilers as the way the program is represented may enhance or limit analyses and transformations. Suitable IRs focus on exposing the most relevant information and establish invariants that different compiler passes can rely on. While control-flow centric IRs appear to be a natural fit for imperative programming languages, analyses requi…
▽ More
Intermediate Representations (IRs) are central to optimizing compilers as the way the program is represented may enhance or limit analyses and transformations. Suitable IRs focus on exposing the most relevant information and establish invariants that different compiler passes can rely on. While control-flow centric IRs appear to be a natural fit for imperative programming languages, analyses required by compilers have increasingly shifted to understand data dependencies and work at multiple abstraction layers at the same time. This is partially evidenced in recent developments such as the MLIR proposed by Google. However, rigorous use of data flow centric IRs in general purpose compilers has not been evaluated for feasibility and usability as previous works provide no practical implementations. We present the Regionalized Value State Dependence Graph (RVSDG) IR for optimizing compilers. The RVSDG is a data flow centric IR where nodes represent computations, edges represent computational dependencies, and regions capture the hierarchical structure of programs. It represents programs in demand-dependence form, implicitly supports structured control flow, and models entire programs within a single IR. We provide a complete specification of the RVSDG, construction and destruction methods, as well as exemplify its utility by presenting Dead Node and Common Node Elimination optimizations. We implemented a prototype compiler and evaluate it in terms of performance, code size, compilation time, and representational overhead. Our results indicate that the RVSDG can serve as a competitive IR in optimizing compilers while reducing complexity.
△ Less
Submitted 17 March, 2020; v1 submitted 10 December, 2019;
originally announced December 2019.
-
When is ACL's Deadline? A Scientific Conversational Agent
Authors:
Mohsen Mesgar,
Paul Youssef,
Lin Li,
Dominik Bierwirth,
Yihao Li,
Christian M. Meyer,
Iryna Gurevych
Abstract:
Our conversational agent UKP-ATHENA assists NLP researchers in finding and exploring scientific literature, identifying relevant authors, planning or post-processing conference visits, and preparing paper submissions using a unified interface based on natural language inputs and responses. UKP-ATHENA enables new access paths to our swiftly evolving research area with its massive amounts of scienti…
▽ More
Our conversational agent UKP-ATHENA assists NLP researchers in finding and exploring scientific literature, identifying relevant authors, planning or post-processing conference visits, and preparing paper submissions using a unified interface based on natural language inputs and responses. UKP-ATHENA enables new access paths to our swiftly evolving research area with its massive amounts of scientific information and high turnaround times. UKP-ATHENA's responses connect information from multiple heterogeneous sources which researchers currently have to explore manually one after another. Unlike a search engine, UKP-ATHENA maintains the context of a conversation to allow for efficient information access on papers, researchers, and conferences. Our architecture consists of multiple components with reference implementations that can be easily extended by new skills and domains. Our user-based evaluation shows that UKP-ATHENA already responds 45% of different formulations of defined intents with 37% information coverage rate.
△ Less
Submitted 23 November, 2019;
originally announced November 2019.
-
A Dynamic Preference Logic for reasoning about Agent Programming
Authors:
Marlo Souza,
Álvaro Moreira,
Renata Vieira,
John-Jules Ch. Meyer
Abstract:
In this work, we investigate the use of Dynamic Preference Logic to encode BDI mental attitudes. Further, exploring this codification and the representation of preferences over possible worlds by preferences over propositional formulas, here called priority graphs, we comment on how to interpret BDI agent programs in this logic. Also, using the connection between dynamic operations defined over pr…
▽ More
In this work, we investigate the use of Dynamic Preference Logic to encode BDI mental attitudes. Further, exploring this codification and the representation of preferences over possible worlds by preferences over propositional formulas, here called priority graphs, we comment on how to interpret BDI agent programs in this logic. Also, using the connection between dynamic operations defined over preference models and their encoding as transformations on priority graphs, we show how our logic can be used not only to reason about agent programs, but as a tool to specify reasoning mechanisms to guarantee certain properties in the theory of rationality for the programming language.
△ Less
Submitted 13 November, 2019;
originally announced November 2019.
-
Harmonise and integrate heterogeneous areal data with the R package arealDB
Authors:
Steffen Ehrmann,
Ralf Seppelt,
Carsten Meyer
Abstract:
Many relevant applications in the environmental and socioeconomic sciences use areal data, such as biodiversity checklists, agricultural statistics, or socioeconomic surveys. For applications that surpass the spatial, temporal or thematic scope of any single data source, data must be integrated from several heterogeneous sources. Inconsistent concepts, definitions, or messy data tables make this a…
▽ More
Many relevant applications in the environmental and socioeconomic sciences use areal data, such as biodiversity checklists, agricultural statistics, or socioeconomic surveys. For applications that surpass the spatial, temporal or thematic scope of any single data source, data must be integrated from several heterogeneous sources. Inconsistent concepts, definitions, or messy data tables make this a tedious and error-prone process. To date, a dedicated tool to address these challenges is still lacking. Here, we introduce the R package arealDB that integrates heterogeneous areal data and associated geometries into a consistent database, in an easy-to-use workflow. It is useful for harmonising language and semantics of variables, relating data to geometries, and documenting metadata and provenance. We illustrate the functionality by integrating two disparate datasets (Brazil, USA) on the harvested area of soybean. arealDB promises quality-improvements to downstream scientific, monitoring, and management applications but also substantial time-savings to database collation efforts.
△ Less
Submitted 14 July, 2020; v1 submitted 14 September, 2019;
originally announced September 2019.
-
MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance
Authors:
Wei Zhao,
Maxime Peyrard,
Fei Liu,
Yang Gao,
Christian M. Meyer,
Steffen Eger
Abstract:
A robust evaluation metric has a profound impact on the development of text generation systems. A desirable metric compares system output against references based on their semantics rather than surface forms. In this paper we investigate strategies to encode system and reference texts to devise a metric that shows a high correlation with human judgment of text quality. We validate our new metric,…
▽ More
A robust evaluation metric has a profound impact on the development of text generation systems. A desirable metric compares system output against references based on their semantics rather than surface forms. In this paper we investigate strategies to encode system and reference texts to devise a metric that shows a high correlation with human judgment of text quality. We validate our new metric, namely MoverScore, on a number of text generation tasks including summarization, machine translation, image captioning, and data-to-text generation, where the outputs are produced by a variety of neural and non-neural systems. Our findings suggest that metrics combining contextualized representations with a distance measure perform the best. Such metrics also demonstrate strong generalization capability across tasks. For ease-of-use we make our metrics available as web service.
△ Less
Submitted 26 September, 2019; v1 submitted 5 September, 2019;
originally announced September 2019.
-
Better Rewards Yield Better Summaries: Learning to Summarise Without References
Authors:
Florian Böhm,
Yang Gao,
Christian M. Meyer,
Ori Shapira,
Ido Dagan,
Iryna Gurevych
Abstract:
Reinforcement Learning (RL) based document summarisation systems yield state-of-the-art performance in terms of ROUGE scores, because they directly use ROUGE as the rewards during training. However, summaries with high ROUGE scores often receive low human judgement. To find a better reward function that can guide RL to generate human-appealing summaries, we learn a reward function from human ratin…
▽ More
Reinforcement Learning (RL) based document summarisation systems yield state-of-the-art performance in terms of ROUGE scores, because they directly use ROUGE as the rewards during training. However, summaries with high ROUGE scores often receive low human judgement. To find a better reward function that can guide RL to generate human-appealing summaries, we learn a reward function from human ratings on 2,500 summaries. Our reward function only takes the document and system summary as input. Hence, once trained, it can be used to train RL-based summarisation systems without using any reference summaries. We show that our learned rewards have significantly higher correlation with human ratings than previous approaches. Human evaluation experiments show that, compared to the state-of-the-art supervised-learning systems and ROUGE-as-rewards RL summarisation systems, the RL systems using our learned rewards during training generate summarieswith higher human ratings. The learned reward function and our source code are available at https://github.com/yg211/summary-reward-no-reference.
△ Less
Submitted 3 September, 2019;
originally announced September 2019.
-
FAMULUS: Interactive Annotation and Feedback Generation for Teaching Diagnostic Reasoning
Authors:
Jonas Pfeiffer,
Christian M. Meyer,
Claudia Schulz,
Jan Kiesewetter,
Jan Zottmann,
Michael Sailer,
Elisabeth Bauer,
Frank Fischer,
Martin R. Fischer,
Iryna Gurevych
Abstract:
Our proposed system FAMULUS helps students learn to diagnose based on automatic feedback in virtual patient simulations, and it supports instructors in labeling training data.
Diagnosing is an exceptionally difficult skill to obtain but vital for many different professions (e.g., medical doctors, teachers).
Previous case simulation systems are limited to multiple-choice questions and thus cann…
▽ More
Our proposed system FAMULUS helps students learn to diagnose based on automatic feedback in virtual patient simulations, and it supports instructors in labeling training data.
Diagnosing is an exceptionally difficult skill to obtain but vital for many different professions (e.g., medical doctors, teachers).
Previous case simulation systems are limited to multiple-choice questions and thus cannot give constructive individualized feedback on a student's diagnostic reasoning process.
Given initially only limited data, we leverage a (replaceable) NLP model to both support experts in their further data annotation with automatic suggestions, and we provide automatic feedback for students.
We argue that because the central model consistently improves, our interactive approach encourages both students and instructors to recurrently use the tool, and thus accelerate the speed of data creation and annotation.
We show results from two user studies on diagnostic reasoning in medicine and teacher education and outline how our system can be extended to further use cases.
△ Less
Submitted 29 August, 2019;
originally announced August 2019.
-
Reward Learning for Efficient Reinforcement Learning in Extractive Document Summarisation
Authors:
Yang Gao,
Christian M. Meyer,
Mohsen Mesgar,
Iryna Gurevych
Abstract:
Document summarisation can be formulated as a sequential decision-making problem, which can be solved by Reinforcement Learning (RL) algorithms. The predominant RL paradigm for summarisation learns a cross-input policy, which requires considerable time, data and parameter tuning due to the huge search spaces and the delayed rewards. Learning input-specific RL policies is a more efficient alternati…
▽ More
Document summarisation can be formulated as a sequential decision-making problem, which can be solved by Reinforcement Learning (RL) algorithms. The predominant RL paradigm for summarisation learns a cross-input policy, which requires considerable time, data and parameter tuning due to the huge search spaces and the delayed rewards. Learning input-specific RL policies is a more efficient alternative but so far depends on handcrafted rewards, which are difficult to design and yield poor performance. We propose RELIS, a novel RL paradigm that learns a reward function with Learning-to-Rank (L2R) algorithms at training time and uses this reward function to train an input-specific RL policy at test time. We prove that RELIS guarantees to generate near-optimal summaries with appropriate L2R and RL algorithms. Empirically, we evaluate our approach on extractive multi-document summarisation. We show that RELIS reduces the training time by two orders of magnitude compared to the state-of-the-art models while performing on par with them.
△ Less
Submitted 30 July, 2019;
originally announced July 2019.
-
Manipulating the Difficulty of C-Tests
Authors:
Ji-Ung Lee,
Erik Schwan,
Christian M. Meyer
Abstract:
We propose two novel manipulation strategies for increasing and decreasing the difficulty of C-tests automatically. This is a crucial step towards generating learner-adaptive exercises for self-directed language learning and preparing language assessment tests. To reach the desired difficulty level, we manipulate the size and the distribution of gaps based on absolute and relative gap difficulty p…
▽ More
We propose two novel manipulation strategies for increasing and decreasing the difficulty of C-tests automatically. This is a crucial step towards generating learner-adaptive exercises for self-directed language learning and preparing language assessment tests. To reach the desired difficulty level, we manipulate the size and the distribution of gaps based on absolute and relative gap difficulty predictions. We evaluate our approach in corpus-based experiments and in a user study with 60 participants. We find that both strategies are able to generate C-tests with the desired difficulty level.
△ Less
Submitted 2 July, 2019; v1 submitted 17 June, 2019;
originally announced June 2019.
-
Preference-based Interactive Multi-Document Summarisation
Authors:
Yang Gao,
Christian M. Meyer,
Iryna Gurevych
Abstract:
Interactive NLP is a promising paradigm to close the gap between automatic NLP systems and the human upper bound. Preference-based interactive learning has been successfully applied, but the existing methods require several thousand interaction rounds even in simulations with perfect user feedback. In this paper, we study preference-based interactive summarisation. To reduce the number of interact…
▽ More
Interactive NLP is a promising paradigm to close the gap between automatic NLP systems and the human upper bound. Preference-based interactive learning has been successfully applied, but the existing methods require several thousand interaction rounds even in simulations with perfect user feedback. In this paper, we study preference-based interactive summarisation. To reduce the number of interaction rounds, we propose the Active Preference-based ReInforcement Learning (APRIL) framework. APRIL uses Active Learning to query the user, Preference Learning to learn a summary ranking function from the preferences, and neural Reinforcement Learning to efficiently search for the (near-)optimal summary. Our results show that users can easily provide reliable preferences over summaries and that APRIL outperforms the state-of-the-art preference-based interactive method in both simulation and real-user experiments.
△ Less
Submitted 7 June, 2019;
originally announced June 2019.
-
Analysis of Automatic Annotation Suggestions for Hard Discourse-Level Tasks in Expert Domains
Authors:
Claudia Schulz,
Christian M. Meyer,
Jan Kiesewetter,
Michael Sailer,
Elisabeth Bauer,
Martin R. Fischer,
Frank Fischer,
Iryna Gurevych
Abstract:
Many complex discourse-level tasks can aid domain experts in their work but require costly expert annotations for data creation. To speed up and ease annotations, we investigate the viability of automatically generated annotation suggestions for such tasks. As an example, we choose a task that is particularly hard for both humans and machines: the segmentation and classification of epistemic activ…
▽ More
Many complex discourse-level tasks can aid domain experts in their work but require costly expert annotations for data creation. To speed up and ease annotations, we investigate the viability of automatically generated annotation suggestions for such tasks. As an example, we choose a task that is particularly hard for both humans and machines: the segmentation and classification of epistemic activities in diagnostic reasoning texts. We create and publish a new dataset covering two domains and carefully analyse the suggested annotations. We find that suggestions have positive effects on annotation speed and performance, while not introducing noteworthy biases. Envisioning suggestion models that improve with newly annotated texts, we contrast methods for continuous model adjustment and suggest the most effective setup for suggestions in future expert tasks.
△ Less
Submitted 6 June, 2019;
originally announced June 2019.
-
Prostate Segmentation from 3D MRI Using a Two-Stage Model and Variable-Input Based Uncertainty Measure
Authors:
Huitong Pan,
Yushan Feng,
Quan Chen,
Craig Meyer,
Xue Feng
Abstract:
This paper proposes a two-stage segmentation model, variable-input based uncertainty measures and an uncertainty-guided post-processing method for prostate segmentation on 3D magnetic resonance images (MRI). The two-stage model was based on 3D dilated U-Nets with the first stage to localize the prostate and the second stage to obtain an accurate segmentation from cropped images. For data augmentat…
▽ More
This paper proposes a two-stage segmentation model, variable-input based uncertainty measures and an uncertainty-guided post-processing method for prostate segmentation on 3D magnetic resonance images (MRI). The two-stage model was based on 3D dilated U-Nets with the first stage to localize the prostate and the second stage to obtain an accurate segmentation from cropped images. For data augmentation, we proposed the variable-input method which crops the region of interest with additional random variations. Similar to other deep learning models, the proposed model also faced the challenge of suboptimal performance in certain testing cases due to varied training and testing image characteristics. Therefore, it is valuable to evaluate the confidence and performance of the network using uncertainty measures, which are often calculated from the probability maps or their standard deviations with multiple model outputs for the same testing case. However, few studies have quantitatively compared different methods of uncertainty calculation. Furthermore, unlike the commonly used Bayesian dropout during testing, we developed uncertainty measures based on the variable input images at the second stage and evaluated its performance by calculating the correlation with ground-truth-based performance metrics, such as Dice score. For performance estimation, we predicted Dice scores and Hausdorff distance with the most correlated uncertainty measure. For post-processing, we performed Gaussian filter on the underperformed slices to improve segmentation quality. Using PROMISE-12 data, we demonstrated the robustness of the two-stage model and showed high correlation of the proposed variable-input based uncertainty measures with GT-based performance. The uncertainty-guided post-processing method significantly improved label smoothness.
△ Less
Submitted 6 March, 2019;
originally announced March 2019.
-
Truly Visual Polymorphic Algebraic Data Structures through Maramafication
Authors:
Chide Groenouwe,
Jesse Nortier,
John-Jules Ch. Meyer
Abstract:
This paper presents a so-called maramafication of an essential part of functional programming languages such as Haskell or Clean: the construction of fully polymorphic well-typed algebraic data structures based on type definitions with at most one type parameter. As such, this work extends our previous work, in which only a very limited form of polymorphism was present. Maramafication means the de…
▽ More
This paper presents a so-called maramafication of an essential part of functional programming languages such as Haskell or Clean: the construction of fully polymorphic well-typed algebraic data structures based on type definitions with at most one type parameter. As such, this work extends our previous work, in which only a very limited form of polymorphism was present. Maramafication means the design of visual 'twins' of existing programming constructs using spatial metaphors rooted in common sense or inborn spatial intuition, to achieve self-explanatoriness. This is, among others, useful to considerably reduce the gap between programmers and non-programmers in the creation of programs, for educational purposes, for inclusion of non-typical programmers and for invoking enthusiasm among non-programmers.
△ Less
Submitted 14 December, 2018;
originally announced December 2018.
-
Cerebrovascular Network Segmentation on MRA Images with Deep Learning
Authors:
Pedro Sanches,
Cyril Meyer,
Vincent Vigon,
Benoît Naegel
Abstract:
Deep learning has been shown to produce state of the art results in many tasks in biomedical imaging, especially in segmentation. Moreover, segmentation of the cerebrovascular structure from magnetic resonance angiography is a challenging problem because its complex geometry and topology have a large inter-patient variability. Therefore, in this work, we present a convolutional neural network appr…
▽ More
Deep learning has been shown to produce state of the art results in many tasks in biomedical imaging, especially in segmentation. Moreover, segmentation of the cerebrovascular structure from magnetic resonance angiography is a challenging problem because its complex geometry and topology have a large inter-patient variability. Therefore, in this work, we present a convolutional neural network approach for this problem. Particularly, a new network topology inspired by the U-net 3D and by the Inception modules, entitled Uception. In addition, a discussion about the best objective function for sparse data also guided most choices during the project. State of the art models are also implemented for a comparison purpose and final results show that the proposed architecture has the best performance in this particular context.
△ Less
Submitted 4 December, 2018;
originally announced December 2018.
-
Brain Tumor Segmentation using an Ensemble of 3D U-Nets and Overall Survival Prediction using Radiomic Features
Authors:
Xue Feng,
Nicholas Tustison,
Craig Meyer
Abstract:
Accurate segmentation of different sub-regions of gliomas including peritumoral edema, necrotic core, enhancing and non-enhancing tumor core from multimodal MRI scans has important clinical relevance in diagnosis, prognosis and treatment of brain tumors. However, due to the highly heterogeneous appearance and shape, segmentation of the sub-regions is very challenging. Recent development using deep…
▽ More
Accurate segmentation of different sub-regions of gliomas including peritumoral edema, necrotic core, enhancing and non-enhancing tumor core from multimodal MRI scans has important clinical relevance in diagnosis, prognosis and treatment of brain tumors. However, due to the highly heterogeneous appearance and shape, segmentation of the sub-regions is very challenging. Recent development using deep learning models has proved its effectiveness in the past several brain segmentation challenges as well as other semantic and medical image segmentation problems. Most models in brain tumor segmentation use a 2D/3D patch to predict the class label for the center voxel and variant patch sizes and scales are used to improve the model performance. However, it has low computation efficiency and also has limited receptive field. U-Net is a widely used network structure for end-to-end segmentation and can be used on the entire image or extracted patches to provide classification labels over the entire input voxels so that it is more efficient and expect to yield better performance with larger input size. Furthermore, instead of picking the best network structure, an ensemble of multiple models, trained on different dataset or different hyper-parameters, can generally improve the segmentation performance. In this study we propose to use an ensemble of 3D U-Nets with different hyper-parameters for brain tumor segmentation. Preliminary results showed effectiveness of this model. In addition, we developed a linear model for survival prediction using extracted imaging and non-imaging features, which, despite the simplicity, can effectively reduce overfitting and regression errors.
△ Less
Submitted 3 December, 2018;
originally announced December 2018.
-
Challenges in the Automatic Analysis of Students' Diagnostic Reasoning
Authors:
Claudia Schulz,
Christian M. Meyer,
Michael Sailer,
Jan Kiesewetter,
Elisabeth Bauer,
Frank Fischer,
Martin R. Fischer,
Iryna Gurevych
Abstract:
Diagnostic reasoning is a key component of many professions. To improve students' diagnostic reasoning skills, educational psychologists analyse and give feedback on epistemic activities used by these students while diagnosing, in particular, hypothesis generation, evidence generation, evidence evaluation, and drawing conclusions. However, this manual analysis is highly time-consuming. We aim to e…
▽ More
Diagnostic reasoning is a key component of many professions. To improve students' diagnostic reasoning skills, educational psychologists analyse and give feedback on epistemic activities used by these students while diagnosing, in particular, hypothesis generation, evidence generation, evidence evaluation, and drawing conclusions. However, this manual analysis is highly time-consuming. We aim to enable the large-scale adoption of diagnostic reasoning analysis and feedback by automating the epistemic activity identification. We create the first corpus for this task, comprising diagnostic reasoning self-explanations of students from two domains annotated with epistemic activities. Based on insights from the corpus creation and the task's characteristics, we discuss three challenges for the automatic identification of epistemic activities using AI methods: the correct identification of epistemic activity spans, the reliable distinction of similar epistemic activities, and the detection of overlap** epistemic activities. We propose a separate performance metric for each challenge and thus provide an evaluation framework for future research. Indeed, our evaluation of various state-of-the-art recurrent neural network architectures reveals that current techniques fail to address some of these challenges.
△ Less
Submitted 26 November, 2018;
originally announced November 2018.
-
A Self-Adaptive Network For Multiple Sclerosis Lesion Segmentation From Multi-Contrast MRI With Various Imaging Protocols
Authors:
Yushan Feng,
Huitong Pan,
Craig Meyer,
Xue Feng
Abstract:
Deep neural networks (DNN) have shown promises in the lesion segmentation of multiple sclerosis (MS) from multicontrast MRI including T1, T2, proton density (PD) and FLAIR sequences. However, one challenge in deploying such networks into clinical practice is the variability of imaging protocols, which often differ from the training dataset as certain MRI sequences may be unavailable or unusable. T…
▽ More
Deep neural networks (DNN) have shown promises in the lesion segmentation of multiple sclerosis (MS) from multicontrast MRI including T1, T2, proton density (PD) and FLAIR sequences. However, one challenge in deploying such networks into clinical practice is the variability of imaging protocols, which often differ from the training dataset as certain MRI sequences may be unavailable or unusable. Therefore, trained networks need to adapt to practical situations when imaging protocols are different in deployment. In this paper, we propose a DNN-based MS lesion segmentation framework with a novel technique called sequence dropout which can adapt to various combinations of input MRI sequences during deployment and achieve the maximal possible performance from the given input. In addition, with this framework, we studied the quantitative impact of each MRI sequence on the MS lesion segmentation task without training separate networks. Experiments were performed using the IEEE ISBI 2015 Longitudinal MS Lesion Challenge dataset and our method is currently ranked 2nd with a Dice similarity coefficient of 0.684. Furthermore, we showed our network achieved the maximal possible performance when one sequence is unavailable during deployment by comparing with separate networks trained on the corresponding input MRI sequences. In particular, we discovered T1 and PD have minor impact on segmentation performance while FLAIR is the predominant sequence. Experiments with multiple missing sequences were also performed and showed the robustness of our network.
△ Less
Submitted 18 November, 2018;
originally announced November 2018.
-
Identifying the Best Machine Learning Algorithms for Brain Tumor Segmentation, Progression Assessment, and Overall Survival Prediction in the BRATS Challenge
Authors:
Spyridon Bakas,
Mauricio Reyes,
Andras Jakab,
Stefan Bauer,
Markus Rempfler,
Alessandro Crimi,
Russell Takeshi Shinohara,
Christoph Berger,
Sung Min Ha,
Martin Rozycki,
Marcel Prastawa,
Esther Alberts,
Jana Lipkova,
John Freymann,
Justin Kirby,
Michel Bilello,
Hassan Fathallah-Shaykh,
Roland Wiest,
Jan Kirschke,
Benedikt Wiestler,
Rivka Colen,
Aikaterini Kotrotsou,
Pamela Lamontagne,
Daniel Marcus,
Mikhail Milchenko
, et al. (402 additional authors not shown)
Abstract:
Gliomas are the most common primary brain malignancies, with different degrees of aggressiveness, variable prognosis and various heterogeneous histologic sub-regions, i.e., peritumoral edematous/invaded tissue, necrotic core, active and non-enhancing core. This intrinsic heterogeneity is also portrayed in their radio-phenotype, as their sub-regions are depicted by varying intensity profiles dissem…
▽ More
Gliomas are the most common primary brain malignancies, with different degrees of aggressiveness, variable prognosis and various heterogeneous histologic sub-regions, i.e., peritumoral edematous/invaded tissue, necrotic core, active and non-enhancing core. This intrinsic heterogeneity is also portrayed in their radio-phenotype, as their sub-regions are depicted by varying intensity profiles disseminated across multi-parametric magnetic resonance imaging (mpMRI) scans, reflecting varying biological properties. Their heterogeneous shape, extent, and location are some of the factors that make these tumors difficult to resect, and in some cases inoperable. The amount of resected tumor is a factor also considered in longitudinal scans, when evaluating the apparent tumor for potential diagnosis of progression. Furthermore, there is mounting evidence that accurate segmentation of the various tumor sub-regions can offer the basis for quantitative image analysis towards prediction of patient overall survival. This study assesses the state-of-the-art machine learning (ML) methods used for brain tumor image analysis in mpMRI scans, during the last seven instances of the International Brain Tumor Segmentation (BraTS) challenge, i.e., 2012-2018. Specifically, we focus on i) evaluating segmentations of the various glioma sub-regions in pre-operative mpMRI scans, ii) assessing potential tumor progression by virtue of longitudinal growth of tumor sub-regions, beyond use of the RECIST/RANO criteria, and iii) predicting the overall survival from pre-operative mpMRI scans of patients that underwent gross total resection. Finally, we investigate the challenge of identifying the best ML algorithms for each of these tasks, considering that apart from being diverse on each instance of the challenge, the multi-institutional mpMRI BraTS dataset has also been a continuously evolving/growing dataset.
△ Less
Submitted 23 April, 2019; v1 submitted 5 November, 2018;
originally announced November 2018.
-
Deep learning to achieve clinically applicable segmentation of head and neck anatomy for radiotherapy
Authors:
Stanislav Nikolov,
Sam Blackwell,
Alexei Zverovitch,
Ruheena Mendes,
Michelle Livne,
Jeffrey De Fauw,
Yojan Patel,
Clemens Meyer,
Harry Askham,
Bernardino Romera-Paredes,
Christopher Kelly,
Alan Karthikesalingam,
Carlton Chu,
Dawn Carnell,
Cheng Boon,
Derek D'Souza,
Syed Ali Moinuddin,
Bethany Garie,
Yasmin McQuinlan,
Sarah Ireland,
Kiarna Hampton,
Krystle Fuller,
Hugh Montgomery,
Geraint Rees,
Mustafa Suleyman
, et al. (4 additional authors not shown)
Abstract:
Over half a million individuals are diagnosed with head and neck cancer each year worldwide. Radiotherapy is an important curative treatment for this disease, but it requires manual time consuming delineation of radio-sensitive organs at risk (OARs). This planning process can delay treatment, while also introducing inter-operator variability with resulting downstream radiation dose differences. Wh…
▽ More
Over half a million individuals are diagnosed with head and neck cancer each year worldwide. Radiotherapy is an important curative treatment for this disease, but it requires manual time consuming delineation of radio-sensitive organs at risk (OARs). This planning process can delay treatment, while also introducing inter-operator variability with resulting downstream radiation dose differences. While auto-segmentation algorithms offer a potentially time-saving solution, the challenges in defining, quantifying and achieving expert performance remain. Adopting a deep learning approach, we demonstrate a 3D U-Net architecture that achieves expert-level performance in delineating 21 distinct head and neck OARs commonly segmented in clinical practice. The model was trained on a dataset of 663 deidentified computed tomography (CT) scans acquired in routine clinical practice and with both segmentations taken from clinical practice and segmentations created by experienced radiographers as part of this research, all in accordance with consensus OAR definitions. We demonstrate the model's clinical applicability by assessing its performance on a test set of 21 CT scans from clinical practice, each with the 21 OARs segmented by two independent experts. We also introduce surface Dice similarity coefficient (surface DSC), a new metric for the comparison of organ delineation, to quantify deviation between OAR surface contours rather than volumes, better reflecting the clinical task of correcting errors in the automated organ segmentations. The model's generalisability is then demonstrated on two distinct open source datasets, reflecting different centres and countries to model training. With appropriate validation studies and regulatory approvals, this system could improve the efficiency, consistency, and safety of radiotherapy pathways.
△ Less
Submitted 13 January, 2021; v1 submitted 12 September, 2018;
originally announced September 2018.
-
APRIL: Interactively Learning to Summarise by Combining Active Preference Learning and Reinforcement Learning
Authors:
Yang Gao,
Christian M. Meyer,
Iryna Gurevych
Abstract:
We propose a method to perform automatic document summarisation without using reference summaries. Instead, our method interactively learns from users' preferences. The merit of preference-based interactive summarisation is that preferences are easier for users to provide than reference summaries. Existing preference-based interactive learning methods suffer from high sample complexity, i.e. they…
▽ More
We propose a method to perform automatic document summarisation without using reference summaries. Instead, our method interactively learns from users' preferences. The merit of preference-based interactive summarisation is that preferences are easier for users to provide than reference summaries. Existing preference-based interactive learning methods suffer from high sample complexity, i.e. they need to interact with the oracle for many rounds in order to converge. In this work, we propose a new objective function, which enables us to leverage active learning, preference learning and reinforcement learning techniques in order to reduce the sample complexity. Both simulation and real-user experiments suggest that our method significantly advances the state of the art. Our source code is freely available at https://github.com/UKPLab/emnlp2018-april.
△ Less
Submitted 29 August, 2018;
originally announced August 2018.
-
A Retrospective Analysis of the Fake News Challenge Stance Detection Task
Authors:
Andreas Hanselowski,
Avinesh PVS,
Benjamin Schiller,
Felix Caspelherr,
Debanjan Chaudhuri,
Christian M. Meyer,
Iryna Gurevych
Abstract:
The 2017 Fake News Challenge Stage 1 (FNC-1) shared task addressed a stance classification task as a crucial first step towards detecting fake news. To date, there is no in-depth analysis paper to critically discuss FNC-1's experimental setup, reproduce the results, and draw conclusions for next-generation stance classification methods. In this paper, we provide such an in-depth analysis for the t…
▽ More
The 2017 Fake News Challenge Stage 1 (FNC-1) shared task addressed a stance classification task as a crucial first step towards detecting fake news. To date, there is no in-depth analysis paper to critically discuss FNC-1's experimental setup, reproduce the results, and draw conclusions for next-generation stance classification methods. In this paper, we provide such an in-depth analysis for the three top-performing systems. We first find that FNC-1's proposed evaluation metric favors the majority class, which can be easily classified, and thus overestimates the true discriminative power of the methods. Therefore, we propose a new F1-based metric yielding a changed system ranking. Next, we compare the features and architectures used, which leads to a novel feature-rich stacked LSTM model that performs on par with the best systems, but is superior in predicting minority classes. To understand the methods' ability to generalize, we derive a new dataset and perform both in-domain and cross-domain experiments. Our qualitative and quantitative study helps interpreting the original FNC-1 scores and understand which features help improving performance and why. Our new dataset and all source code used during the reproduction study are publicly available for future research.
△ Less
Submitted 13 June, 2018;
originally announced June 2018.
-
A Probabilistic U-Net for Segmentation of Ambiguous Images
Authors:
Simon A. A. Kohl,
Bernardino Romera-Paredes,
Clemens Meyer,
Jeffrey De Fauw,
Joseph R. Ledsam,
Klaus H. Maier-Hein,
S. M. Ali Eslami,
Danilo Jimenez Rezende,
Olaf Ronneberger
Abstract:
Many real-world vision problems suffer from inherent ambiguities. In clinical applications for example, it might not be clear from a CT scan alone which particular region is cancer tissue. Therefore a group of graders typically produces a set of diverse but plausible segmentations. We consider the task of learning a distribution over segmentations given an input. To this end we propose a generativ…
▽ More
Many real-world vision problems suffer from inherent ambiguities. In clinical applications for example, it might not be clear from a CT scan alone which particular region is cancer tissue. Therefore a group of graders typically produces a set of diverse but plausible segmentations. We consider the task of learning a distribution over segmentations given an input. To this end we propose a generative segmentation model based on a combination of a U-Net with a conditional variational autoencoder that is capable of efficiently producing an unlimited number of plausible hypotheses. We show on a lung abnormalities segmentation task and on a Cityscapes segmentation task that our model reproduces the possible segmentation variants as well as the frequencies with which they occur, doing so significantly better than published approaches. These models could have a high impact in real-world applications, such as being used as clinical decision-making algorithms accounting for multiple plausible semantic segmentation hypotheses to provide possible diagnoses and recommend further actions to resolve the present ambiguities.
△ Less
Submitted 29 January, 2019; v1 submitted 13 June, 2018;
originally announced June 2018.
-
Proof-of-Concept Examples of Performance-Transparent Programming Models
Authors:
Benjamin Andreassen Bjørnseth,
Jan Christian Meyer,
Lasse Natvig
Abstract:
Machine-specific optimizations command the machine to behave in a specific way. As current programming models largely leave machine details unexposed, they cannot accommodate direct encoding of such commands. In previous work we have proposed the design of performance-transparent programming models to facilitate this use-case; this report contains proof-of-concept examples of such programming mode…
▽ More
Machine-specific optimizations command the machine to behave in a specific way. As current programming models largely leave machine details unexposed, they cannot accommodate direct encoding of such commands. In previous work we have proposed the design of performance-transparent programming models to facilitate this use-case; this report contains proof-of-concept examples of such programming models. We demonstrate how programming model abstractions may reveal the memory footprint, vector unit utilization and data reuse of an application, with prediction accuracy ranging from 0 to 25 \%.
△ Less
Submitted 29 March, 2018;
originally announced March 2018.
-
Clip** free attacks against artificial neural networks
Authors:
Boussad Addad,
Jerome Kodjabachian,
Christophe Meyer
Abstract:
During the last years, a remarkable breakthrough has been made in AI domain thanks to artificial deep neural networks that achieved a great success in many machine learning tasks in computer vision, natural language processing, speech recognition, malware detection and so on. However, they are highly vulnerable to easily crafted adversarial examples. Many investigations have pointed out this fact…
▽ More
During the last years, a remarkable breakthrough has been made in AI domain thanks to artificial deep neural networks that achieved a great success in many machine learning tasks in computer vision, natural language processing, speech recognition, malware detection and so on. However, they are highly vulnerable to easily crafted adversarial examples. Many investigations have pointed out this fact and different approaches have been proposed to generate attacks while adding a limited perturbation to the original data. The most robust known method so far is the so called C&W attack [1]. Nonetheless, a countermeasure known as feature squeezing coupled with ensemble defense showed that most of these attacks can be destroyed [6]. In this paper, we present a new method we call Centered Initial Attack (CIA) whose advantage is twofold : first, it insures by construction the maximum perturbation to be smaller than a threshold fixed beforehand, without the clip** process that degrades the quality of attacks. Second, it is robust against recently introduced defenses such as feature squeezing, JPEG encoding and even against a voting ensemble of defenses. While its application is not limited to images, we illustrate this using five of the current best classifiers on ImageNet dataset among which two are adversarialy retrained on purpose to be robust against attacks. With a fixed maximum perturbation of only 1.5% on any pixel, around 80% of attacks (targeted) fool the voting ensemble defense and nearly 100% when the perturbation is only 6%. While this shows how it is difficult to defend against CIA attacks, the last section of the paper gives some guidelines to limit their impact.
△ Less
Submitted 28 March, 2018; v1 submitted 26 March, 2018;
originally announced March 2018.
-
RUSSE: The First Workshop on Russian Semantic Similarity
Authors:
Alexander Panchenko,
Natalia Loukachevitch,
Dmitry Ustalov,
Denis Paperno,
Christian Meyer,
Natalia Konstantinova
Abstract:
The paper gives an overview of the Russian Semantic Similarity Evaluation (RUSSE) shared task held in conjunction with the Dialogue 2015 conference. There exist a lot of comparative studies on semantic similarity, yet no analysis of such measures was ever performed for the Russian language. Exploring this problem for the Russian language is even more interesting, because this language has features…
▽ More
The paper gives an overview of the Russian Semantic Similarity Evaluation (RUSSE) shared task held in conjunction with the Dialogue 2015 conference. There exist a lot of comparative studies on semantic similarity, yet no analysis of such measures was ever performed for the Russian language. Exploring this problem for the Russian language is even more interesting, because this language has features, such as rich morphology and free word order, which make it significantly different from English, German, and other well-studied languages. We attempt to bridge this gap by proposing a shared task on the semantic similarity of Russian nouns. Our key contribution is an evaluation methodology based on four novel benchmark datasets for the Russian language. Our analysis of the 105 submissions from 19 teams reveals that successful approaches for English, such as distributional and skip-gram models, are directly applicable to Russian as well. On the one hand, the best results in the contest were obtained by sophisticated supervised models that combine evidence from different sources. On the other hand, completely unsupervised approaches, such as a skip-gram model estimated on a large-scale corpus, were able score among the top 5 systems.
△ Less
Submitted 15 March, 2018;
originally announced March 2018.
-
Live Blog Corpus for Summarization
Authors:
Avinesh P. V. S.,
Maxime Peyrard,
Christian M. Meyer
Abstract:
Live blogs are an increasingly popular news format to cover breaking news and live events in online journalism. Online news websites around the world are using this medium to give their readers a minute by minute update on an event. Good summaries enhance the value of the live blogs for a reader but are often not available. In this paper, we study a way of collecting corpora for automatic live blo…
▽ More
Live blogs are an increasingly popular news format to cover breaking news and live events in online journalism. Online news websites around the world are using this medium to give their readers a minute by minute update on an event. Good summaries enhance the value of the live blogs for a reader but are often not available. In this paper, we study a way of collecting corpora for automatic live blog summarization. In an empirical evaluation using well-known state-of-the-art summarization systems, we show that live blogs corpus poses new challenges in the field of summarization. We make our tools publicly available to reconstruct the corpus to encourage the research community and replicate our results.
△ Less
Submitted 27 February, 2018;
originally announced February 2018.
-
A Study of Energy and Locality Effects using Space-filling Curves
Authors:
Nico Reissmann,
Jan Christian Meyer,
Magnus Jahre
Abstract:
The cost of energy is becoming an increasingly important driver for the operating cost of HPC systems, adding yet another facet to the challenge of producing efficient code. In this paper, we investigate the energy implications of trading computation for locality using Hilbert and Morton space-filling curves with dense matrix-matrix multiplication. The advantage of these curves is that they exhibi…
▽ More
The cost of energy is becoming an increasingly important driver for the operating cost of HPC systems, adding yet another facet to the challenge of producing efficient code. In this paper, we investigate the energy implications of trading computation for locality using Hilbert and Morton space-filling curves with dense matrix-matrix multiplication. The advantage of these curves is that they exhibit an inherent tiling effect without requiring specific architecture tuning. By accessing the matrices in the order determined by the space-filling curves, we can trade computation for locality. The index computation overhead of the Morton curve is found to be balanced against its locality and energy efficiency, while the overhead of the Hilbert curve outweighs its improvements on our test system.
△ Less
Submitted 20 June, 2016;
originally announced June 2016.
-
Multimodal MRI Neuroimaging with Motion Compensation Based on Particle Filtering
Authors:
Yu-Hui Chen,
Roni Mittelman,
Boklye Kim,
Charles Meyer,
Alfred Hero
Abstract:
Head movement during scanning impedes activation detection in fMRI studies. Head motion in fMRI acquired using slice-based Echo Planar Imaging (EPI) can be estimated and compensated by aligning the images onto a reference volume through image registration. However, registering EPI images volume to volume fails to consider head motion between slices, which may lead to severely biased head motion es…
▽ More
Head movement during scanning impedes activation detection in fMRI studies. Head motion in fMRI acquired using slice-based Echo Planar Imaging (EPI) can be estimated and compensated by aligning the images onto a reference volume through image registration. However, registering EPI images volume to volume fails to consider head motion between slices, which may lead to severely biased head motion estimates. Slice-to-volume registration can be used to estimate motion parameters for each slice by more accurately representing the image acquisition sequence. However, accurate slice to volume map** is dependent on the information content of the slices: middle slices are information rich, while edge slides are information poor and more prone to distortion. In this work, we propose a Gaussian particle filter based head motion tracking algorithm to reduce the image misregistration errors. The algorithm uses a dynamic state space model of head motion with an observation equation that models continuous slice acquisition of the scanner. Under this model the particle filter provides more accurate motion estimates and voxel position estimates. We demonstrate significant performance improvement of the proposed approach as compared to registration-only methods of head motion estimation and brain activation detection.
△ Less
Submitted 10 November, 2015;
originally announced November 2015.
-
HFirst: A Temporal Approach to Object Recognition
Authors:
Garrick Orchard,
Cedric Meyer,
Ralph Etienne-Cummings,
Christoph Posch,
Nitish Thakor,
Ryad Benosman
Abstract:
This paper introduces a spiking hierarchical model for object recognition which utilizes the precise timing information inherently present in the output of biologically inspired asynchronous Address Event Representation (AER) vision sensors. The asynchronous nature of these systems frees computation and communication from the rigid predetermined timing enforced by system clocks in conventional sys…
▽ More
This paper introduces a spiking hierarchical model for object recognition which utilizes the precise timing information inherently present in the output of biologically inspired asynchronous Address Event Representation (AER) vision sensors. The asynchronous nature of these systems frees computation and communication from the rigid predetermined timing enforced by system clocks in conventional systems. Freedom from rigid timing constraints opens the possibility of using true timing to our advantage in computation. We show not only how timing can be used in object recognition, but also how it can in fact simplify computation. Specifically, we rely on a simple temporal-winner-take-all rather than more computationally intensive synchronous operations typically used in biologically inspired neural networks for object recognition. This approach to visual computation represents a major paradigm shift from conventional clocked systems and can find application in other sensory modalities and computational tasks. We showcase effectiveness of the approach by achieving the highest reported accuracy to date (97.5\%$\pm$3.5\%) for a previously published four class card pip recognition task and an accuracy of 84.9\%$\pm$1.9\% for a new more difficult 36 class character recognition task.
△ Less
Submitted 5 August, 2015;
originally announced August 2015.
-
A Case Study in Text Mining: Interpreting Twitter Data From World Cup Tweets
Authors:
Daniel Godfrey,
Caley Johns,
Carl Meyer,
Shaina Race,
Carol Sadek
Abstract:
Cluster analysis is a field of data analysis that extracts underlying patterns in data. One application of cluster analysis is in text-mining, the analysis of large collections of text to find similarities between documents. We used a collection of about 30,000 tweets extracted from Twitter just before the World Cup started. A common problem with real world text data is the presence of linguistic…
▽ More
Cluster analysis is a field of data analysis that extracts underlying patterns in data. One application of cluster analysis is in text-mining, the analysis of large collections of text to find similarities between documents. We used a collection of about 30,000 tweets extracted from Twitter just before the World Cup started. A common problem with real world text data is the presence of linguistic noise. In our case it would be extraneous tweets that are unrelated to dominant themes. To combat this problem, we created an algorithm that combined the DBSCAN algorithm and a consensus matrix. This way we are left with the tweets that are related to those dominant themes. We then used cluster analysis to find those topics that the tweets describe. We clustered the tweets using k-means, a commonly used clustering algorithm, and Non-Negative Matrix Factorization (NMF) and compared the results. The two algorithms gave similar results, but NMF proved to be faster and provided more easily interpreted results. We explored our results using two visualization tools, Gephi and Wordle.
△ Less
Submitted 21 August, 2014;
originally announced August 2014.
-
A Flexible Iterative Framework for Consensus Clustering
Authors:
Shaina Race,
Carl Meyer
Abstract:
A novel framework for consensus clustering is presented which has the ability to determine both the number of clusters and a final solution using multiple algorithms. A consensus similarity matrix is formed from an ensemble using multiple algorithms and several values for k. A variety of dimension reduction techniques and clustering algorithms are considered for analysis. For noisy or high-dimensi…
▽ More
A novel framework for consensus clustering is presented which has the ability to determine both the number of clusters and a final solution using multiple algorithms. A consensus similarity matrix is formed from an ensemble using multiple algorithms and several values for k. A variety of dimension reduction techniques and clustering algorithms are considered for analysis. For noisy or high-dimensional data, an iterative technique is presented to refine this consensus matrix in way that encourages algorithms to agree upon a common solution. We utilize the theory of nearly uncoupled Markov chains to determine the number, k , of clusters in a dataset by considering a random walk on the graph defined by the consensus matrix. The eigenvalues of the associated transition probability matrix are used to determine the number of clusters. This method succeeds at determining the number of clusters in many datasets where previous methods fail. On every considered dataset, our consensus method provides a final result with accuracy well above the average of the individual algorithms.
△ Less
Submitted 5 August, 2014;
originally announced August 2014.
-
Determining the Number of Clusters via Iterative Consensus Clustering
Authors:
Shaina Race,
Carl Meyer,
Kevin Valakuzhy
Abstract:
We use a cluster ensemble to determine the number of clusters, k, in a group of data. A consensus similarity matrix is formed from the ensemble using multiple algorithms and several values for k. A random walk is induced on the graph defined by the consensus matrix and the eigenvalues of the associated transition probability matrix are used to determine the number of clusters. For noisy or high-di…
▽ More
We use a cluster ensemble to determine the number of clusters, k, in a group of data. A consensus similarity matrix is formed from the ensemble using multiple algorithms and several values for k. A random walk is induced on the graph defined by the consensus matrix and the eigenvalues of the associated transition probability matrix are used to determine the number of clusters. For noisy or high-dimensional data, an iterative technique is presented to refine this consensus matrix in way that encourages a block-diagonal form. It is shown that the resulting consensus matrix is generally superior to existing similarity matrices for this type of spectral analysis.
△ Less
Submitted 5 August, 2014;
originally announced August 2014.
-
Algorithms, Initializations, and Convergence for the Nonnegative Matrix Factorization
Authors:
Amy N. Langville,
Carl D. Meyer,
Russell Albright,
James Cox,
David Duling
Abstract:
It is well known that good initializations can improve the speed and accuracy of the solutions of many nonnegative matrix factorization (NMF) algorithms. Many NMF algorithms are sensitive with respect to the initialization of W or H or both. This is especially true of algorithms of the alternating least squares (ALS) type, including the two new ALS algorithms that we present in this paper. We comp…
▽ More
It is well known that good initializations can improve the speed and accuracy of the solutions of many nonnegative matrix factorization (NMF) algorithms. Many NMF algorithms are sensitive with respect to the initialization of W or H or both. This is especially true of algorithms of the alternating least squares (ALS) type, including the two new ALS algorithms that we present in this paper. We compare the results of six initialization procedures (two standard and four new) on our ALS algorithms. Lastly, we discuss the practical issue of choosing an appropriate convergence criterion.
△ Less
Submitted 27 July, 2014;
originally announced July 2014.
-
Coherent Knowledge Processing at Maximum Entropy by SPIRIT
Authors:
Wilhelm Roedder,
Carl-Heinz Meyer
Abstract:
SPIRIT is an expert system shell for probabilistic knowledge bases. Knowledge acquisition is performed by processing facts and rules on discrete variables in a rich syntax. The shell generates a probability distribution which respects all acquired facts and rules and which maximizes entropy. The user-friendly devices of SPIRIT to define variables, formulate rules and create the knowledge base a…
▽ More
SPIRIT is an expert system shell for probabilistic knowledge bases. Knowledge acquisition is performed by processing facts and rules on discrete variables in a rich syntax. The shell generates a probability distribution which respects all acquired facts and rules and which maximizes entropy. The user-friendly devices of SPIRIT to define variables, formulate rules and create the knowledge base are revealed in detail. Inductive learning is possible. Medium sized applications show the power of the system.
△ Less
Submitted 13 February, 2013;
originally announced February 2013.
-
Data Clustering via Principal Direction Gap Partitioning
Authors:
Ralph Abbey,
Jeremy Diepenbrock,
Amy Langville,
Carl Meyer,
Shaina Race,
Dexin Zhou
Abstract:
We explore the geometrical interpretation of the PCA based clustering algorithm Principal Direction Divisive Partitioning (PDDP). We give several examples where this algorithm breaks down, and suggest a new method, gap partitioning, which takes into account natural gaps in the data between clusters. Geometric features of the PCA space are derived and illustrated and experimental results are given…
▽ More
We explore the geometrical interpretation of the PCA based clustering algorithm Principal Direction Divisive Partitioning (PDDP). We give several examples where this algorithm breaks down, and suggest a new method, gap partitioning, which takes into account natural gaps in the data between clusters. Geometric features of the PCA space are derived and illustrated and experimental results are given which show our method is comparable on the datasets used in the original paper on PDDP.
△ Less
Submitted 17 November, 2012;
originally announced November 2012.
-
Agent Programming with Declarative Goals
Authors:
F. S. de Boer,
K. V. Hindriks,
W. van der Hoek,
J. -J. Ch. Meyer
Abstract:
A long and lasting problem in agent research has been to close the gap between agent logics and agent programming frameworks. The main reason for this problem of establishing a link between agent logics and agent programming frameworks is identified and explained by the fact that agent programming frameworks have not incorporated the concept of a `declarative goal'. Instead, such frameworks have…
▽ More
A long and lasting problem in agent research has been to close the gap between agent logics and agent programming frameworks. The main reason for this problem of establishing a link between agent logics and agent programming frameworks is identified and explained by the fact that agent programming frameworks have not incorporated the concept of a `declarative goal'. Instead, such frameworks have focused mainly on plans or `goals-to-do' instead of the end goals to be realised which are also called `goals-to-be'. In this paper, a new programming language called GOAL is introduced which incorporates such declarative goals. The notion of a `commitment strategy' - one of the main theoretical insights due to agent logics, which explains the relation between beliefs and goals - is used to construct a computational semantics for GOAL. Finally, a proof theory for proving properties of GOAL agents is introduced. Thus, we offer a complete theory of agent programming in the sense that our theory provides both for a programming framework and a programming logic for such agents. An example program is proven correct by using this programming logic.
△ Less
Submitted 3 July, 2002;
originally announced July 2002.