-
Open Government Data Corpus for Table Search
Authors:
Michael Glass,
Sugato Bagchi,
Oktie Hassanzadeh,
Gaetano Rossiello,
Alfio Gliozzo
Abstract:
Increasing amounts of structured data can provide value for research and business if the relevant data can be located. Often the data is in a data lake without a consistent schema, making locating useful data challenging. Table search is a growing research area, but existing benchmarks have been limited to displayed tables. Tables sized and formatted for display in a Wikipedia page or ArXiv paper are considerably different from data tables in both scale and style. By using metadata associated with open data from government portals, we create the first dataset to benchmark search over data tables at scale. We demonstrate three styles of table-to-table related table search. The three notions of table relatedness are: tables produced by the same organization, tables distributed as part of the same dataset, and tables with a high degree of overlap in the annotated tags. The keyword tags provided with the metadata also permit the automatic creation of a keyword search over tables benchmark. We provide baselines on this dataset using existing methods including traditional and neural approaches.
Submitted 24 August, 2023;
originally announced August 2023.
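The third notion of relatedness in the abstract above, a high degree of overlap in the annotated tags, can be illustrated with a small sketch. The abstract does not say how overlap is measured, so the Jaccard similarity, the 0.5 threshold, and the names `tag_overlap` and `related_tables` are assumptions made for illustration only, not the authors' method.

```python
def tag_overlap(tags_a, tags_b):
    """Jaccard similarity between two tag collections (0.0 when both are empty)."""
    a, b = set(tags_a), set(tags_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def related_tables(query_id, table_tags, threshold=0.5):
    """Return ids of tables whose tag overlap with the query table meets the threshold."""
    query_tags = table_tags[query_id]
    scored = [
        (other_id, tag_overlap(query_tags, tags))
        for other_id, tags in table_tags.items()
        if other_id != query_id
    ]
    # Highest overlap first; ties are broken by id so the order is stable.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return [other_id for other_id, score in scored if score >= threshold]


# Hypothetical tag metadata for three tables.
table_tags = {
    "t1": ["health", "covid", "county"],
    "t2": ["health", "covid", "state"],
    "t3": ["transport", "bus"],
}
print(related_tables("t1", table_tags))
```

Here `t1` and `t2` share two of four distinct tags, so only `t2` meets the assumed threshold.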
-
Retrieval-Based Transformer for Table Augmentation
Authors:
Michael Glass,
Xueqing Wu,
Ankita Rajaram Naik,
Gaetano Rossiello,
Alfio Gliozzo
Abstract:
Data preparation, also called data wrangling, is considered one of the most expensive and time-consuming steps when performing analytics or building machine learning models. Preparing data typically involves collecting and merging data from complex, heterogeneous, and often large-scale data sources, such as data lakes. In this paper, we introduce a novel approach toward automatic data wrangling in an attempt to alleviate the effort of end-users, e.g. data analysts, in structuring dynamic views from data lakes in the form of tabular data. We aim to address table augmentation tasks, including row/column population and data imputation. Given a corpus of tables, we propose a retrieval augmented self-trained transformer model. Our self-learning strategy consists in randomly ablating tables from the corpus and training the retrieval-based model to reconstruct the original values or headers given the partial tables as input. We adopt this strategy to first train the dense neural retrieval model encoding table-parts to vectors, and then the end-to-end model trained to perform table augmentation tasks. We test on EntiTables, the standard benchmark for table augmentation, as well as introduce a new benchmark to advance further research: WebTables. Our model consistently and substantially outperforms both supervised statistical methods and the current state-of-the-art transformer-based models.
Submitted 20 June, 2023;
originally announced June 2023.
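The self-learning strategy described in the abstract above, ablating part of a table and training a model to reconstruct it, can be sketched as a routine that builds one training example. This is only an illustration under assumptions: the function name `ablate_cell`, the use of `None` as the mask, and the choice to ablate a single cell are hypothetical, and the paper's actual ablation scheme (which also covers headers, rows, and columns) may differ.

```python
import random


def ablate_cell(header, rows, rng):
    """Mask one randomly chosen cell.

    Returns (partial_rows, (row_index, col_index), original_value).
    """
    r = rng.randrange(len(rows))
    c = rng.randrange(len(header))
    partial = [list(row) for row in rows]  # copy so the source table is untouched
    target = partial[r][c]
    partial[r][c] = None  # None marks the value the model must reconstruct
    return partial, (r, c), target


# Hypothetical table used only for illustration.
header = ["city", "country", "population"]
rows = [
    ["Paris", "France", "2.1M"],
    ["Rome", "Italy", "2.8M"],
]

rng = random.Random(0)  # seeded so the example is repeatable
partial, position, target = ablate_cell(header, rows, rng)
print(partial, position, target)
```

The pair (`partial`, `target`) plays the role of input and label; no manual annotation is needed because the label comes from the table itself.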
-
Knowledge-Driven New Drug Recommendation
Authors:
Zhenbang Wu,
Huaxiu Yao,
Zhe Su,
David M Liebovitz,
Lucas M Glass,
James Zou,
Chelsea Finn,
Jimeng Sun
Abstract:
Drug recommendation assists doctors in prescribing personalized medications to patients based on their health conditions. Existing drug recommendation solutions adopt the supervised multi-label classification setup and only work with existing drugs with sufficient prescription data from many patients. However, newly approved drugs do not have much historical prescription data and cannot leverage existing drug recommendation methods. To address this, we formulate the new drug recommendation as a few-shot learning problem. Yet, directly applying existing few-shot learning algorithms faces two challenges: (1) complex relations among diseases and drugs and (2) numerous false-negative patients who were eligible but did not yet use the new drugs. To tackle these challenges, we propose EDGE, which can quickly adapt to the recommendation for a new drug with limited prescription data from a few support patients. EDGE maintains a drug-dependent multi-phenotype few-shot learner to bridge the gap between existing and new drugs. Specifically, EDGE leverages the drug ontology to link new drugs to existing drugs with similar treatment effects and learns ontology-based drug representations. Such drug representations are used to customize the metric space of the phenotype-driven patient representations, which are composed of a set of phenotypes capturing complex patient health status. Lastly, EDGE eliminates the false-negative supervision signal using an external drug-disease knowledge base. We evaluate EDGE on two real-world datasets: the public EHR data (MIMIC-IV) and private industrial claims data. Results show that EDGE achieves 7.3% improvement on the ROC-AUC score over the best baseline.
Submitted 11 October, 2022;
originally announced October 2022.
-
Artificial Intelligence for In Silico Clinical Trials: A Review
Authors:
Zifeng Wang,
Chufan Gao,
Lucas M. Glass,
Jimeng Sun
Abstract:
A clinical trial is an essential step in drug development, which is often costly and time-consuming. In silico trials are clinical trials conducted digitally through simulation and modeling as an alternative to traditional clinical trials. AI-enabled in silico trials can increase the case group size by creating virtual cohorts as controls. In addition, it also enables automation and optimization of trial design and predicts the trial success rate. This article systematically reviews papers under three main topics: clinical simulation, individualized predictive modeling, and computer-aided trial design. We focus on how machine learning (ML) may be applied in these applications. In particular, we present the machine learning problem formulation and available data sources for each task. We end with discussing the challenges and opportunities of AI for in silico trials in real-world applications.
Submitted 16 September, 2022;
originally announced September 2022.
-
Re2G: Retrieve, Rerank, Generate
Authors:
Michael Glass,
Gaetano Rossiello,
Md Faisal Mahbub Chowdhury,
Ankita Rajaram Naik,
Pengshan Cai,
Alfio Gliozzo
Abstract:
As demonstrated by GPT-3 and T5, transformers grow in capability as parameter spaces become larger and larger. However, for tasks that require a large amount of knowledge, non-parametric memory allows models to grow dramatically with a sub-linear increase in computational cost and GPU memory requirements. Recent models such as RAG and REALM have introduced retrieval into conditional generation. These models incorporate neural initial retrieval from a corpus of passages. We build on this line of research, proposing Re2G, which combines both neural initial retrieval and reranking into a BART-based sequence-to-sequence generation. Our reranking approach also permits merging retrieval results from sources with incomparable scores, enabling an ensemble of BM25 and neural initial retrieval. To train our system end-to-end, we introduce a novel variation of knowledge distillation to train the initial retrieval, reranker, and generation using only ground truth on the target sequence output. We find large gains in four diverse tasks: zero-shot slot filling, question answering, fact-checking, and dialog, with relative gains of 9% to 34% over the previous state-of-the-art on the KILT leaderboard. We make our code available as open source at https://github.com/IBM/kgi-slot-filling/tree/re2g.
Submitted 13 July, 2022;
originally announced July 2022.
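The abstract above notes that reranking permits merging retrieval results from sources with incomparable scores, such as BM25 and a neural retriever. A minimal sketch of that idea: pool the candidates, discard the source scores, and order the pool with a single scorer. The function names and the token-overlap scorer are hypothetical stand-ins; Re2G itself uses a trained neural reranker, which is not reproduced here.

```python
def merge_and_rerank(query, bm25_hits, dense_hits, rerank_score, top_k=3):
    """Pool candidates from two retrievers and order them with one shared scorer.

    bm25_hits and dense_hits are lists of (passage, source_score). The source
    scores are ignored because they are not comparable across retrievers.
    """
    candidates = []
    seen = set()
    for passage, _source_score in list(bm25_hits) + list(dense_hits):
        if passage not in seen:  # keep each passage once
            seen.add(passage)
            candidates.append(passage)
    ranked = sorted(candidates, key=lambda p: rerank_score(query, p), reverse=True)
    return ranked[:top_k]


def toy_rerank_score(query, passage):
    """Stand-in scorer: fraction of query tokens that appear in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


# Hypothetical retrieval results; note the very different score scales.
bm25_hits = [("paris is the capital of france", 12.3), ("france borders spain", 9.1)]
dense_hits = [
    ("the capital of france is paris", 0.83),
    ("paris is the capital of france", 0.80),
    ("berlin is in germany", 0.10),
]
print(merge_and_rerank("capital of france", bm25_hits, dense_hits, toy_rerank_score))
```

Because every candidate is scored by the same function, the BM25 score of 12.3 and the dense score of 0.83 never need to be compared directly.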
-
KGI: An Integrated Framework for Knowledge Intensive Language Tasks
Authors:
Md Faisal Mahbub Chowdhury,
Michael Glass,
Gaetano Rossiello,
Alfio Gliozzo,
Nandana Mihindukulasooriya
Abstract:
In this paper, we present a system to showcase the capabilities of the latest state-of-the-art retrieval augmented generation models trained on knowledge-intensive language tasks, such as slot filling, open domain question answering, dialogue, and fact-checking. Moreover, given a user query, we show how the output from these different models can be combined to cross-examine the outputs of each other. Particularly, we show how accuracy in dialogue can be improved using the question answering model. We are also releasing all models used in the demo as a contribution of this paper. A short video demonstrating the system is available at https://ibm.box.com/v/emnlp2022-demo.
Submitted 21 September, 2022; v1 submitted 8 April, 2022;
originally announced April 2022.
-
End-to-End Table Question Answering via Retrieval-Augmented Generation
Authors:
Feifei Pan,
Mustafa Canim,
Michael Glass,
Alfio Gliozzo,
James Hendler
Abstract:
Most existing end-to-end Table Question Answering (Table QA) models consist of a two-stage framework with a retriever to select relevant table candidates from a corpus and a reader to locate the correct answers from table candidates. Even though the accuracy of the reader models is significantly improved with the recent transformer-based approaches, the overall performance of such frameworks still suffers from the poor accuracy of using traditional information retrieval techniques as retrievers. To alleviate this problem, we introduce T-RAG, an end-to-end Table QA model, where a non-parametric dense vector index is fine-tuned jointly with BART, a parametric sequence-to-sequence model to generate answer tokens. Given any natural language question, T-RAG utilizes a unified pipeline to automatically search through a table corpus to directly locate the correct answer from the table cells. We apply T-RAG to recent open-domain Table QA benchmarks and demonstrate that the fine-tuned T-RAG model is able to achieve state-of-the-art performance in both the end-to-end Table QA and the table retrieval tasks.
Submitted 30 March, 2022;
originally announced March 2022.
-
AutoMap: Automatic Medical Code Mapping for Clinical Prediction Model Deployment
Authors:
Zhenbang Wu,
Cao Xiao,
Lucas M Glass,
David M Liebovitz,
Jimeng Sun
Abstract:
Given a deep learning model trained on data from a source site, how to deploy the model to a target hospital automatically? How to accommodate heterogeneous medical coding systems across different hospitals? Standard approaches rely on existing medical code mapping tools, which have significant practical limitations.
To tackle this problem, we propose AutoMap to automatically map the medical codes across different EHR systems in a coarse-to-fine manner: (1) Ontology-level Alignment: We leverage the ontology structure to learn a coarse alignment between the source and target medical coding systems; (2) Code-level Refinement: We refine the alignment at a fine-grained code level for the downstream tasks using a teacher-student framework.
We evaluate AutoMap using several deep learning models with two real-world EHR datasets: eICU and MIMIC-III. Results show that AutoMap achieves relative improvements up to 3.9% (AUC-ROC) and 8.7% (AUC-PR) for mortality prediction, and up to 4.7% (AUC-ROC) and 3.7% (F1) for length-of-stay estimation. Further, we show that AutoMap can provide accurate mapping across coding systems. Lastly, we demonstrate that AutoMap can adapt to the two challenging scenarios: (1) mapping between completely different coding systems and (2) between completely different hospitals.
Submitted 4 March, 2022;
originally announced March 2022.
-
Multi-Objective Design Space Exploration for the Optimization of the HEVC Mode Decision Process
Authors:
Christian Herglotz,
Rafael Rosales,
Michael Glass,
Jürgen Teich,
André Kaup
Abstract:
Finding the best possible encoding decisions for compressing a video sequence is a highly complex problem. In this work, we propose a multi-objective Design Space Exploration (DSE) method to automatically find HEVC encoder implementations that are optimized for several different criteria. The DSE shall optimize the coding mode evaluation order of the mode decision process and jointly explore early skip conditions to minimize the four objectives a) bitrate, b) distortion, c) encoding time, and d) decoding energy. In this context, we use a SystemC-based actor model of the HM test model encoder for the evaluation of each explored solution. The evaluation that is based on real measurements shows that our framework can automatically generate encoder solutions that save more than 60% of encoding time or 3% of decoding energy when accepting bitrate increases of around 3%.
Submitted 3 March, 2022;
originally announced March 2022.
-
PopNet: Real-Time Population-Level Disease Prediction with Data Latency
Authors:
Junyi Gao,
Cao Xiao,
Lucas M. Glass,
Jimeng Sun
Abstract:
Population-level disease prediction estimates the number of potential patients of particular diseases in some location at a future time based on (frequently updated) historical disease statistics. Existing approaches often assume the existing disease statistics are reliable and will not change. However, in practice, data collection is often time-consuming and has time delays, with both historical and current disease statistics being updated continuously. In this work, we propose a real-time population-level disease prediction model which captures data latency (PopNet) and incorporates the updated data for improved predictions. To achieve this goal, PopNet models real-time data and updated data using two separate systems, each capturing spatial and temporal effects using hybrid graph attention networks and recurrent neural networks. PopNet then fuses the two systems using both spatial and temporal latency-aware attentions in an end-to-end manner. We evaluate PopNet on real-world disease datasets and show that PopNet consistently outperforms all baseline disease prediction and general spatial-temporal prediction models, achieving up to 47% lower root mean squared error and 24% lower mean absolute error compared with the best baselines.
Submitted 7 February, 2022;
originally announced February 2022.
-
Applying a Generic Sequence-to-Sequence Model for Simple and Effective Keyphrase Generation
Authors:
Md Faisal Mahbub Chowdhury,
Gaetano Rossiello,
Michael Glass,
Nandana Mihindukulasooriya,
Alfio Gliozzo
Abstract:
In recent years, a number of keyphrase generation (KPG) approaches were proposed consisting of complex model architectures, dedicated training paradigms and decoding strategies. In this work, we opt for simplicity and show how a commonly used seq2seq language model, BART, can be easily adapted to generate keyphrases from the text in a single batch computation using a simple training procedure. Empirical results on five benchmarks show that our approach is as good as the existing state-of-the-art KPG systems, but using a much simpler and easy to deploy framework.
Submitted 13 January, 2022;
originally announced January 2022.
-
Robust Retrieval Augmented Generation for Zero-shot Slot Filling
Authors:
Michael Glass,
Gaetano Rossiello,
Md Faisal Mahbub Chowdhury,
Alfio Gliozzo
Abstract:
Automatically inducing high quality knowledge graphs from a given collection of documents still remains a challenging problem in AI. One way to make headway for this problem is through advancements in a related task known as slot filling. In this task, given an entity query in form of [Entity, Slot, ?], a system is asked to fill the slot by generating or extracting the missing value exploiting evidence extracted from relevant passage(s) in the given document collection. The recent works in the field try to solve this task in an end-to-end fashion using retrieval-based language models. In this paper, we present a novel approach to zero-shot slot filling that extends dense passage retrieval with hard negatives and robust training procedures for retrieval augmented generation models. Our model reports large improvements on both T-REx and zsRE slot filling datasets, improving both passage retrieval and slot value generation, and ranking at the top-1 position in the KILT leaderboard. Moreover, we demonstrate the robustness of our system showing its domain adaptation capability on a new variant of the TACRED dataset for slot filling, through a combination of zero/few-shot learning. We release the source code and pre-trained models.
Submitted 13 September, 2021; v1 submitted 31 August, 2021;
originally announced August 2021.
-
AIT-QA: Question Answering Dataset over Complex Tables in the Airline Industry
Authors:
Yannis Katsis,
Saneem Chemmengath,
Vishwajeet Kumar,
Samarth Bharadwaj,
Mustafa Canim,
Michael Glass,
Alfio Gliozzo,
Feifei Pan,
Jaydeep Sen,
Karthik Sankaranarayanan,
Soumen Chakrabarti
Abstract:
Recent advances in transformers have enabled Table Question Answering (Table QA) systems to achieve high accuracy and SOTA results on open domain datasets like WikiTableQuestions and WikiSQL. Such transformers are frequently pre-trained on open-domain content such as Wikipedia, where they effectively encode questions and corresponding tables from Wikipedia as seen in Table QA dataset. However, web tables in Wikipedia are notably flat in their layout, with the first row as the sole column header. The layout lends to a relational view of tables where each row is a tuple. Whereas, tables in domain-specific business or scientific documents often have a much more complex layout, including hierarchical row and column headers, in addition to having specialized vocabulary terms from that domain.
To address this problem, we introduce the domain-specific Table QA dataset AIT-QA (Airline Industry Table QA). The dataset consists of 515 questions authored by human annotators on 116 tables extracted from public U.S. SEC filings (publicly available at: https://www.sec.gov/edgar.shtml) of major airline companies for the fiscal years 2017-2019. We also provide annotations pertaining to the nature of questions, marking those that require hierarchical headers, domain-specific terminology, and paraphrased forms. Our zero-shot baseline evaluation of three transformer-based SOTA Table QA methods - TaPAS (end-to-end), TaBERT (semantic parsing-based), and RCI (row-column encoding-based) - clearly exposes the limitation of these methods in this practical setting, with the best accuracy at just 51.8% (RCI). We also present pragmatic table preprocessing steps used to pivot and project these complex tables into a layout suitable for the SOTA Table QA models.
Submitted 24 June, 2021;
originally announced June 2021.
-
CLTR: An End-to-End, Transformer-Based System for Cell Level Table Retrieval and Table Question Answering
Authors:
Feifei Pan,
Mustafa Canim,
Michael Glass,
Alfio Gliozzo,
Peter Fox
Abstract:
We present the first end-to-end, transformer-based table question answering (QA) system that takes natural language questions and massive table corpus as inputs to retrieve the most relevant tables and locate the correct table cells to answer the question. Our system, CLTR, extends the current state-of-the-art QA over tables model to build an end-to-end table QA architecture. This system has successfully tackled many real-world table QA problems with a simple, unified pipeline. Our proposed system can also generate a heatmap of candidate columns and rows over complex tables and allow users to quickly identify the correct cells to answer questions. In addition, we introduce two new open-domain benchmarks, E2E_WTQ and E2E_GNQ, consisting of 2,005 natural language questions over 76,242 tables. The benchmarks are designed to validate CLTR as well as accommodate future table retrieval and end-to-end table QA research and experiments. Our experiments demonstrate that our system is the current state-of-the-art model on the table retrieval task and produces promising results for end-to-end table QA.
Submitted 9 June, 2021; v1 submitted 8 June, 2021;
originally announced June 2021.
-
Machine Learning Applications for Therapeutic Tasks with Genomics Data
Authors:
Kexin Huang,
Cao Xiao,
Lucas M. Glass,
Cathy W. Critchlow,
Greg Gibson,
Jimeng Sun
Abstract:
Thanks to the increasing availability of genomics and other biomedical data, many machine learning approaches have been proposed for a wide range of therapeutic discovery and development tasks. In this survey, we review the literature on machine learning applications for genomics through the lens of therapeutic development. We investigate the interplay among genomics, compounds, proteins, electronic health records (EHR), cellular images, and clinical texts. We identify twenty-two machine learning in genomics applications across the entire therapeutics pipeline, from discovering novel targets, personalized medicine, developing gene-editing tools all the way to clinical trials and post-market studies. We also pinpoint seven important challenges in this field with opportunities for expansion and impact. This survey overviews recent research at the intersection of machine learning, genomics, and therapeutic development.
Submitted 3 May, 2021;
originally announced May 2021.
-
Zero-shot Slot Filling with DPR and RAG
Authors:
Michael Glass,
Gaetano Rossiello,
Alfio Gliozzo
Abstract:
The ability to automatically extract Knowledge Graphs (KG) from a given collection of documents is a long-standing problem in Artificial Intelligence. One way to assess this capability is through the task of slot filling. Given an entity query in form of [Entity, Slot, ?], a system is asked to "fill" the slot by generating or extracting the missing value from a relevant passage or passages. This capability is crucial to create systems for automatic knowledge base population, which is in ever-increasing demand, especially in enterprise applications. Recently, there has been a promising direction in evaluating language models in the same way we would evaluate knowledge bases, and the task of slot filling is the most suitable to this intent. The recent advancements in the field try to solve this task in an end-to-end fashion using retrieval-based language models. Models like Retrieval Augmented Generation (RAG) show surprisingly good performance without involving complex information extraction pipelines. However, the results achieved by these models on the two slot filling tasks in the KILT benchmark are still not at the level required by real-world information extraction systems. In this paper, we describe several strategies we adopted to improve the retriever and the generator of RAG in order to make it a better slot filler. Our KGI0 system (available at https://github.com/IBM/retrieve-write-slot-filling) reached the top-1 position on the KILT leaderboard on both the T-REx and zsRE datasets by a large margin.
Submitted 17 April, 2021;
originally announced April 2021.
-
Capturing Row and Column Semantics in Transformer Based Question Answering over Tables
Authors:
Michael Glass,
Mustafa Canim,
Alfio Gliozzo,
Saneem Chemmengath,
Vishwajeet Kumar,
Rishav Chakravarti,
Avi Sil,
Feifei Pan,
Samarth Bharadwaj,
Nicolas Rodolfo Fauceglia
Abstract:
Transformer based architectures are recently used for the task of answering questions over tables. In order to improve the accuracy on this task, specialized pre-training techniques have been developed and applied on millions of open-domain web tables. In this paper, we propose two novel approaches demonstrating that one can achieve superior performance on table QA task without even using any of these specialized pre-training techniques. The first model, called RCI interaction, leverages a transformer based architecture that independently classifies rows and columns to identify relevant cells. While this model yields extremely high accuracy at finding cell values on recent benchmarks, a second model we propose, called RCI representation, provides a significant efficiency advantage for online QA systems over tables by materializing embeddings for existing tables. Experiments on recent benchmarks prove that the proposed methods can effectively locate cell values on tables (up to ~98% Hit@1 accuracy on WikiSQL lookup questions). Also, the interaction model outperforms the state-of-the-art transformer based approaches, pre-trained on very large table corpora (TAPAS and TaBERT), achieving ~3.4% and ~18.86% additional precision improvement on the standard WikiSQL benchmark.
Submitted 26 April, 2021; v1 submitted 16 April, 2021;
originally announced April 2021.
-
HINT: Hierarchical Interaction Network for Trial Outcome Prediction Leveraging Web Data
Authors:
Tianfan Fu,
Kexin Huang,
Cao Xiao,
Lucas M. Glass,
Jimeng Sun
Abstract:
Clinical trials are crucial for drug development but are time consuming, expensive, and often burdensome on patients. More importantly, clinical trials face uncertain outcomes due to issues with efficacy, safety, or problems with patient recruitment. If we were better at predicting the results of clinical trials, we could avoid having to run trials that will inevitably fail, and more resources could be devoted to trials that are likely to succeed. In this paper, we propose Hierarchical INteraction Network (HINT) for more general, clinical trial outcome predictions for all diseases based on a comprehensive and diverse set of web data including molecule information of the drugs, target disease information, trial protocol and biomedical knowledge. HINT first encodes these multi-modal data into latent embeddings, where an imputation module is designed to handle missing data. Next, these embeddings will be fed into the knowledge embedding module to generate knowledge embeddings that are pretrained using external knowledge on pharmaco-kinetic properties and trial risk from the web. Then the interaction graph module will connect all the embeddings via domain knowledge to fully capture various trial components and their complex relations as well as their influences on trial outcomes. Finally, HINT learns a dynamic attentive graph neural network to predict trial outcome. Comprehensive experimental results show that HINT achieves strong predictive performance, obtaining 0.772, 0.607, 0.623, 0.703 on PR-AUC for Phase I, II, III, and indication outcome prediction, respectively. It also consistently outperforms the best baseline method by up to 12.4% on PR-AUC.
Submitted 12 March, 2022; v1 submitted 8 February, 2021;
originally announced February 2021.
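The PR-AUC figures quoted in the HINT abstract above summarize a precision-recall curve. As a minimal sketch, and not the authors' evaluation code, the Python below computes average precision, one common estimator of PR-AUC. The function name and the toy labels and scores are hypothetical.

```python
def average_precision(labels, scores):
    """Mean of precision@k over the ranks k at which a positive example appears."""
    # Rank examples from the highest to the lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    if hits == 0:
        raise ValueError("at least one positive label is required")
    return precision_sum / hits


# Hypothetical toy data: 1 = trial succeeded, 0 = trial failed.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
print(average_precision(toy_labels, toy_scores))  # about 0.833
```

Tied scores are not treated specially here; a library implementation may resolve ties differently.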
-
STELAR: Spatio-temporal Tensor Factorization with Latent Epidemiological Regularization
Authors:
Nikos Kargas,
Cheng Qian,
Nicholas D. Sidiropoulos,
Cao Xiao,
Lucas M. Glass,
Jimeng Sun
Abstract:
Accurate prediction of the transmission of epidemic diseases such as COVID-19 is crucial for implementing effective mitigation measures. In this work, we develop a tensor method to predict the evolution of epidemic trends for many regions simultaneously. We construct a 3-way spatio-temporal tensor (location, attribute, time) of case counts and propose a nonnegative tensor factorization with latent epidemiological model regularization named STELAR. Unlike standard tensor factorization methods which cannot predict slabs ahead, STELAR enables long-term prediction by incorporating latent temporal regularization through a system of discrete-time difference equations of a widely adopted epidemiological model. We use latent instead of location/attribute-level epidemiological dynamics to capture common epidemic profile sub-types and improve collaborative learning and prediction. We conduct experiments using both county- and state-level COVID-19 data and show that our model can identify interesting latent patterns of the epidemic. Finally, we evaluate the predictive ability of our method and show superior performance compared to the baselines, achieving up to 21% lower root mean square error and 25% lower mean absolute error for county-level prediction.
Submitted 17 March, 2021; v1 submitted 8 December, 2020;
originally announced December 2020.
-
FLANNEL: Focal Loss Based Neural Network Ensemble for COVID-19 Detection
Authors:
Zhi Qiao,
Austin Bae,
Lucas M. Glass,
Cao Xiao,
Jimeng Sun
Abstract:
This work tests the possibility of differentiating chest x-ray images of COVID-19 from those of other pneumonia and healthy patients using deep neural networks. We construct the X-ray imaging data from two publicly available sources, which include 5508 chest x-ray images across 2874 patients with four classes: normal, bacterial pneumonia, non-COVID-19 viral pneumonia, and COVID-19. To identify COVID-19, we propose a Focal Loss Based Neural Ensemble Network (FLANNEL), a flexible module to ensemble several convolutional neural network (CNN) models and fuse with a focal loss for accurate COVID-19 detection on class-imbalanced data. FLANNEL consistently outperforms baseline models on the COVID-19 identification task in all metrics. Compared with the best baseline, FLANNEL shows a higher macro-F1 score, with a 6% relative increase on the COVID-19 identification task, where it achieves 0.7833 (0.07) in precision, 0.8609 (0.03) in recall, and 0.8168 (0.03) in F1 score.
Submitted 29 October, 2020;
originally announced October 2020.
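The FLANNEL abstract above relies on a focal loss to cope with class imbalance. The sketch below shows the standard per-example focal loss, -(1 - p_t)^gamma * log(p_t), where p_t is the probability assigned to the true class. The abstract does not give the exact variant used in FLANNEL, so the function name, the default `gamma`, and the toy probabilities are assumptions for illustration only.

```python
import math


def focal_loss(probs, target, gamma=2.0):
    """Standard focal loss for one example.

    probs  : predicted class probabilities (assumed to sum to 1)
    target : index of the true class
    gamma  : focusing parameter; gamma = 0 reduces to cross-entropy
    """
    p_t = probs[target]
    # The (1 - p_t)^gamma factor shrinks the loss of well-classified examples.
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# Hypothetical four-class outputs (normal, bacterial, viral, COVID-19).
confident = [0.05, 0.03, 0.02, 0.90]
uncertain = [0.40, 0.30, 0.20, 0.10]
print(focal_loss(confident, target=3))  # small: easy example is down-weighted
print(focal_loss(uncertain, target=3))  # large: hard example dominates
```

With `gamma=0.0` the function returns the ordinary cross-entropy, which makes the effect of the focusing term easy to compare.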
-
UNITE: Uncertainty-based Health Risk Prediction Leveraging Multi-sourced Data
Authors:
Chacha Chen,
Junjie Liang,
Fenglong Ma,
Lucas M. Glass,
Jimeng Sun,
Cao Xiao
Abstract:
Successful health risk prediction demands accuracy and reliability of the model. Existing predictive models mainly depend on mining electronic health records (EHR) with advanced deep learning techniques to improve model accuracy. However, they all ignore the importance of publicly available online health data, especially socioeconomic status, environmental factors, and detailed demographic information for each location, which are all strong predictive signals and can definitely augment precision medicine. To achieve model reliability, the model needs to provide both an accurate prediction and an uncertainty score for the prediction. However, existing uncertainty estimation approaches often fail in handling high-dimensional data, which are present in multi-sourced data. To fill the gap, we propose the UNcertaInTy-based hEalth risk prediction (UNITE) model. Building upon an adaptive multimodal deep kernel and a stochastic variational inference module, UNITE provides accurate disease risk prediction and uncertainty estimation leveraging multi-sourced health data including EHR data, patient demographics, and public health data collected from the web. We evaluate UNITE on real-world disease risk prediction tasks: nonalcoholic fatty liver disease (NASH) and Alzheimer's disease (AD). UNITE achieves up to 0.841 in F1 score for AD detection and up to 0.609 in PR-AUC for NASH detection, and outperforms various state-of-the-art baselines, by up to 19% over the best baseline. We also show UNITE can model meaningful uncertainties and can provide evidence-based clinical support by clustering similar patients.
Submitted 25 April, 2021; v1 submitted 21 October, 2020;
originally announced October 2020.
-
MolDesigner: Interactive Design of Efficacious Drugs with Deep Learning
Authors:
Kexin Huang,
Tianfan Fu,
Dawood Khan,
Ali Abid,
Ali Abdalla,
Abubakar Abid,
Lucas M. Glass,
Marinka Zitnik,
Cao Xiao,
Jimeng Sun
Abstract:
The efficacy of a drug depends on its binding affinity to the therapeutic target and pharmacokinetics. Deep learning (DL) has demonstrated remarkable progress in predicting drug efficacy. We develop MolDesigner, a human-in-the-loop web user-interface (UI), to help drug developers leverage DL predictions to design more effective drugs. A developer can draw a drug molecule in the interface. In the backend, more than 17 state-of-the-art DL models generate predictions on important indices that are crucial for a drug's efficacy. Based on these predictions, drug developers can edit the drug molecule and iterate until satisfied. MolDesigner can make predictions in real-time with a latency of less than a second.
Submitted 5 October, 2020;
originally announced October 2020.
-
MIMOSA: Multi-constraint Molecule Sampling for Molecule Optimization
Authors:
Tianfan Fu,
Cao Xiao,
Xinhao Li,
Lucas M. Glass,
Jimeng Sun
Abstract:
Molecule optimization is a fundamental task for accelerating drug discovery, with the goal of generating new valid molecules that maximize multiple drug properties while maintaining similarity to the input molecule. Existing generative models and reinforcement learning approaches achieved initial success, but still face difficulties in simultaneously optimizing multiple drug properties. To address such challenges, we propose the MultI-constraint MOlecule SAmpling (MIMOSA) approach, a sampling framework that uses the input molecule as an initial guess and samples molecules from the target distribution. MIMOSA first pretrains two property-agnostic graph neural networks (GNNs) for molecule topology and substructure-type prediction, where a substructure can be either an atom or a single ring. In each iteration, MIMOSA uses the GNNs' predictions and employs three basic substructure operations (add, replace, delete) to generate new molecules and associated weights. The weights can encode multiple constraints including similarity and drug property constraints, upon which we select promising molecules for the next iteration. MIMOSA enables flexible encoding of multiple property and similarity constraints and can efficiently generate new molecules that satisfy various property constraints, achieving up to 49.6% relative improvement over the best baseline in terms of success rate. The code repository (including readme file, data preprocessing and model construction, evaluation) is available at https://github.com/futianfan/MIMOSA.
Submitted 30 June, 2024; v1 submitted 5 October, 2020;
originally announced October 2020.
-
SumGNN: Multi-typed Drug Interaction Prediction via Efficient Knowledge Graph Summarization
Authors:
Yue Yu,
Kexin Huang,
Chao Zhang,
Lucas M. Glass,
Jimeng Sun,
Cao Xiao
Abstract:
Thanks to the increasing availability of drug-drug interactions (DDI) datasets and large biomedical knowledge graphs (KGs), accurate detection of adverse DDI using machine learning models becomes possible. However, it remains largely an open problem how to effectively utilize large and noisy biomedical KGs for DDI detection. Due to the sheer size and amount of noise in KGs, it is often less beneficial to directly integrate KGs with other smaller but higher quality data (e.g., experimental data). Most of the existing approaches ignore KGs altogether. Some try to directly integrate KGs with other data via graph neural networks with limited success. Furthermore, most previous works focus on binary DDI prediction whereas the multi-typed DDI pharmacological effect prediction is a more meaningful but harder task. To fill the gaps, we propose a new method SumGNN: knowledge summarization graph neural network, which is enabled by a subgraph extraction module that can efficiently anchor on relevant subgraphs from a KG, a self-attention based subgraph summarization scheme to generate a reasoning path within the subgraph, and a multi-channel knowledge and data integration module that utilizes massive external biomedical knowledge for significantly improved multi-typed DDI predictions. SumGNN outperforms the best baseline by up to 5.54%, and the performance gain is particularly significant in low-data relation types. In addition, SumGNN provides interpretable prediction via the generated reasoning paths for each prediction.
Submitted 6 May, 2021; v1 submitted 3 October, 2020;
originally announced October 2020.
-
STAN: Spatio-Temporal Attention Network for Pandemic Prediction Using Real World Evidence
Authors:
Junyi Gao,
Rakshith Sharma,
Cheng Qian,
Lucas M. Glass,
Jeffrey Spaeder,
Justin Romberg,
Jimeng Sun,
Cao Xiao
Abstract:
Objective: The COVID-19 pandemic has created many challenges that need immediate attention. Various epidemiological and deep learning models have been developed to predict the COVID-19 outbreak, but all have limitations that affect the accuracy and robustness of the predictions. Our method aims at addressing these limitations and making earlier and more accurate pandemic outbreak predictions by (1) using patients' EHR data from different counties and states that encode local disease status and medical resource utilization condition; (2) considering demographic similarity and geographical proximity between locations; and (3) integrating pandemic transmission dynamics into deep learning models. Materials and Methods: We proposed a spatio-temporal attention network (STAN) for pandemic prediction. It uses an attention-based graph convolutional network to capture geographical and temporal trends and predict the number of cases for a fixed number of days into the future. We also designed a physical law-based loss term for enhancing long-term prediction. STAN was tested using both massive real-world patient data and open source COVID-19 statistics provided by Johns Hopkins University across all U.S. counties. Results: STAN outperforms epidemiological modeling methods such as SIR and SEIR and deep learning models on both long-term and short-term predictions, achieving up to 87% lower mean squared error compared to the best baseline prediction model. Conclusions: By using information from real-world patient data and geographical data, STAN can better capture the disease status and medical resource utilization information and thus provides more accurate pandemic modeling. With pandemic transmission law based regularization, STAN also achieves good long-term prediction performance.
Submitted 7 December, 2020; v1 submitted 23 July, 2020;
originally announced August 2020.
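The STAN abstract above names SIR as one of the epidemiological baselines and refers to pandemic transmission dynamics. As a rough illustration of what such dynamics look like, the sketch below steps a discrete-time SIR model forward one day at a time. It is not STAN's loss term, whose exact form the abstract does not state; the function name and all parameter values are hypothetical.

```python
def simulate_sir(s0, i0, r0, beta, gamma, days):
    """Discrete-time SIR model with a fixed population.

    beta  : transmission rate per day (assumed constant)
    gamma : recovery rate per day (assumed constant)
    Returns a list of (S, I, R) tuples, one per day including day 0.
    """
    population = s0 + i0 + r0
    s, i, r = float(s0), float(i0), float(r0)
    history = [(s, i, r)]
    for _ in range(days):
        new_infections = beta * s * i / population
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return history


# Hypothetical county of 1,000 people with 10 initial cases.
trajectory = simulate_sir(s0=990, i0=10, r0=0, beta=0.3, gamma=0.1, days=30)
print(trajectory[-1])
```

Each step only moves people between compartments, so S + I + R stays equal to the initial population throughout the simulation.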
-
COMPOSE: Cross-Modal Pseudo-Siamese Network for Patient Trial Matching
Authors:
Junyi Gao,
Cao Xiao,
Lucas M. Glass,
Jimeng Sun
Abstract:
Clinical trials play important roles in drug development but often suffer from expensive, inaccurate and insufficient patient recruitment. The availability of massive electronic health records (EHR) data and trial eligibility criteria (EC) brings a new opportunity to data-driven patient recruitment. One key task, named patient-trial matching, is to find qualified patients for clinical trials given structured EHR and unstructured EC text (both inclusion and exclusion criteria). How to match complex EC text with longitudinal patient EHRs? How to embed many-to-many relationships between patients and trials? How to explicitly handle the difference between inclusion and exclusion criteria? In this paper, we propose CrOss-Modal PseudO-SiamEse network (COMPOSE) to address these challenges for patient-trial matching. One path of the network encodes EC using a convolutional highway network. The other path processes EHR with a multi-granularity memory network that encodes structured patient records into multiple levels based on medical ontology. Using the EC embedding as query, COMPOSE performs attentional record alignment and thus enables dynamic patient-trial matching. COMPOSE also introduces a composite loss term to maximize the similarity between patient records and inclusion criteria while minimizing the similarity to the exclusion criteria. Experiment results show COMPOSE can reach 98.0% AUC on patient-criteria matching and 83.7% accuracy on patient-trial matching, a 24.3% improvement over the best baseline on real-world patient-trial matching tasks.
Submitted 15 June, 2020;
originally announced June 2020.
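The COMPOSE abstract above describes a composite loss that pulls patient records toward inclusion criteria and pushes them away from exclusion criteria. The sketch below shows one simple way such a loss could be written with cosine similarity. The specific form (1 - similarity for inclusion, a hinge on similarity for exclusion) is an assumption, not the formulation from the paper, and all names and vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def composite_loss(patient, criterion, is_inclusion):
    """Assumed loss: reward similarity to inclusion, penalize similarity to exclusion."""
    similarity = cosine(patient, criterion)
    if is_inclusion:
        return 1.0 - similarity      # zero when the embeddings align
    return max(0.0, similarity)      # zero once the embeddings are dissimilar


# Hypothetical 2-d embeddings of a patient record and two criteria.
patient_embedding = [1.0, 0.0]
print(composite_loss(patient_embedding, [1.0, 0.0], is_inclusion=True))
print(composite_loss(patient_embedding, [1.0, 0.0], is_inclusion=False))
```

In a trained model the embeddings would come from the two network paths described in the abstract; here they are fixed vectors so the behaviour of the loss is easy to inspect.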
-
CLARA: Clinical Report Auto-completion
Authors:
Siddharth Biswal,
Cao Xiao,
Lucas M. Glass,
M. Brandon Westover,
Jimeng Sun
Abstract:
Generating clinical reports from raw recordings such as X-rays and electroencephalogram (EEG) is an essential and routine task for doctors. However, it is often time-consuming to write accurate and detailed reports. Most existing methods try to generate whole reports from the raw input with limited success because 1) generated reports often contain errors that need manual review and correction, 2) they do not save time when doctors want to write additional information into the report, and 3) the generated reports are not customized based on individual doctors' preferences. We propose CLinicAl Report Auto-completion (CLARA), an interactive method that generates reports in a sentence-by-sentence fashion based on doctors' anchor words and partially completed sentences. CLARA searches for the most relevant sentences from existing reports as the template for the current report. The retrieved sentences are sequentially modified by combining with the input feature representations to create the final report. In our experimental evaluation, CLARA achieved 0.393 CIDEr and 0.248 BLEU-4 on X-ray reports and 0.482 CIDEr and 0.491 BLEU-4 for EEG reports for sentence-level generation, which is up to 35% improvement over the best baseline. Also, via our qualitative evaluation, CLARA is shown to produce reports which have a significantly higher level of approval by doctors in a user study (3.74 out of 5 for CLARA vs 2.52 out of 5 for the baseline).
Submitted 4 March, 2020; v1 submitted 26 February, 2020;
originally announced February 2020.
-
StageNet: Stage-Aware Neural Networks for Health Risk Prediction
Authors:
Junyi Gao,
Cao Xiao,
Yasha Wang,
Wen Tang,
Lucas M. Glass,
Jimeng Sun
Abstract:
Deep learning has demonstrated success in health risk prediction especially for patients with chronic and progressing conditions. Most existing works focus on learning disease Network (StageNet) model to extract disease stage information from patient data and integrate it into risk prediction. StageNet is enabled by (1) a stage-aware long short-term memory (LSTM) module that extracts health stage variations in an unsupervised manner; (2) a stage-adaptive convolutional module that incorporates stage-related progression patterns into risk prediction. We evaluate StageNet on two real-world datasets and show that StageNet outperforms state-of-the-art models on the risk prediction and patient subtyping tasks. Compared to the best baseline model, StageNet achieves up to 12% higher AUPRC for the risk prediction task on two real-world patient datasets. StageNet also achieves over 58% higher Calinski-Harabasz score (a cluster quality metric) for the patient subtyping task.
Submitted 24 January, 2020;
originally announced January 2020.
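The StageNet abstract above reports the Calinski-Harabasz score as its cluster quality metric. For readers unfamiliar with it, the sketch below computes the score as the ratio of between-cluster dispersion (divided by k - 1) to within-cluster dispersion (divided by n - k), for n points in k clusters. The function name and the toy points are hypothetical, and this is not the authors' evaluation code.

```python
def calinski_harabasz(points, labels):
    """Calinski-Harabasz score: higher means tighter, better separated clusters."""
    n = len(points)
    dim = len(points[0])
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("need at least 2 clusters and fewer clusters than points")

    def centroid(group):
        return [sum(p[d] for p in group) / len(group) for d in range(dim)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = centroid(points)
    between = 0.0  # dispersion of cluster centroids around the global centroid
    within = 0.0   # dispersion of points around their own cluster centroid
    for group in clusters.values():
        center = centroid(group)
        between += len(group) * sq_dist(center, overall)
        within += sum(sq_dist(p, center) for p in group)
    if within == 0.0:
        raise ValueError("within-cluster dispersion is zero; score is undefined")
    return (between / (k - 1)) / (within / (n - k))


# Hypothetical 2-d patient embeddings forming two well-separated groups.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
print(calinski_harabasz(toy_points, [0, 0, 1, 1]))  # 200.0
print(calinski_harabasz(toy_points, [0, 1, 0, 1]))  # 0.02
```

The second call assigns the same points to clusters that cut across the natural groups, which is why its score is far lower.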
-
DeepEnroll: Patient-Trial Matching with Deep Embedding and Entailment Prediction
Authors:
Xingyao Zhang,
Cao Xiao,
Lucas M. Glass,
Jimeng Sun
Abstract:
Clinical trials are essential for drug development but often suffer from expensive, inaccurate and insufficient patient recruitment. The core problem of patient-trial matching is to find qualified patients for a trial, where patient information is stored in electronic health records (EHR) while trial eligibility criteria (EC) are described in text documents available on the web. How to represent longitudinal patient EHR? How to extract complex logical rules from EC? Most existing works rely on manual rule-based extraction, which is time consuming and inflexible for complex inference. To address these challenges, we propose DeepEnroll, a cross-modal inference learning model to jointly encode enrollment criteria (text) and patient records (tabular data) into a shared latent space for matching inference. DeepEnroll applies a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model to encode clinical trial information into sentence embeddings, and uses a hierarchical embedding model to represent patient longitudinal EHR. In addition, DeepEnroll is augmented by a numerical information embedding and entailment module to reason over numerical information in both EC and EHR. These encoders are trained jointly to optimize the patient-trial matching score. We evaluated DeepEnroll on the trial-patient matching task using real-world datasets. DeepEnroll outperformed the best baseline by up to 12.4% in average F1.
Submitted 22 January, 2020; v1 submitted 22 January, 2020;
originally announced January 2020.
-
Scalable Hierarchical Clustering with Tree Grafting
Authors:
Nicholas Monath,
Ari Kobren,
Akshay Krishnamurthy,
Michael Glass,
Andrew McCallum
Abstract:
We introduce Grinch, a new algorithm for large-scale, non-greedy hierarchical clustering with general linkage functions that compute arbitrary similarity between two point sets. The key components of Grinch are its rotate and graft subroutines that efficiently reconfigure the hierarchy as new points arrive, supporting discovery of clusters with complex structure. Grinch is motivated by a new notion of separability for clustering with linkage functions: we prove that when the model is consistent with a ground-truth clustering, Grinch is guaranteed to produce a cluster tree containing the ground-truth, independent of data arrival order. Our empirical results on benchmark and author coreference datasets (with standard and learned linkage functions) show that Grinch is more accurate than other scalable methods, and orders of magnitude faster than hierarchical agglomerative clustering.
Submitted 31 December, 2019;
originally announced January 2020.
-
CONAN: Complementary Pattern Augmentation for Rare Disease Detection
Authors:
Limeng Cui,
Siddharth Biswal,
Lucas M. Glass,
Greg Lever,
Jimeng Sun,
Cao Xiao
Abstract:
Rare diseases affect hundreds of millions of people worldwide but are hard to detect since they have extremely low prevalence rates (varying from 1/1,000 to 1/200,000 patients) and are massively underdiagnosed. How do we reliably detect rare diseases with such low prevalence rates? How to further leverage patients with possibly uncertain diagnosis to improve detection? In this paper, we propose a Complementary pattern Augmentation (CONAN) framework for rare disease detection. CONAN combines ideas from both adversarial training and max-margin classification. It first learns self-attentive and hierarchical embedding for patient pattern characterization. Then, we develop a complementary generative adversarial network (GAN) model to generate candidate positive and negative samples from the uncertain patients by encouraging a max-margin between classes. In addition, CONAN has a disease detector that serves as the discriminator during the adversarial training for identifying rare diseases. We evaluated CONAN on two disease detection tasks. For low prevalence inflammatory bowel disease (IBD) detection, CONAN achieved 0.96 precision-recall area under the curve (PR-AUC) and 50.1% relative improvement over the best baseline. For rare disease idiopathic pulmonary fibrosis (IPF) detection, CONAN achieved 0.22 PR-AUC with 41.3% relative improvement over the best baseline.
Submitted 26 November, 2019;
originally announced November 2019.
-
Doctor2Vec: Dynamic Doctor Representation Learning for Clinical Trial Recruitment
Authors:
Siddharth Biswal,
Cao Xiao,
Lucas M. Glass,
Elizabeth Milkovits,
Jimeng Sun
Abstract:
Massive electronic health records (EHRs) enable the success of learning accurate patient representations to support various predictive health applications. In contrast, doctor representation has not been well studied even though doctors play pivotal roles in healthcare. How to construct the right doctor representations? How to use doctor representation to solve important health analytic problems? In this work, we study the problem of clinical trial recruitment, which is about identifying the right doctors to help conduct the trials based on the trial description and patient EHR data of those doctors. We propose doctor2vec, which simultaneously learns 1) doctor representations from EHR data and 2) trial representations from the description and categorical information about the trials. In particular, doctor2vec utilizes a dynamic memory network where the doctor's experience with patients is stored in the memory bank and the network dynamically assigns weights based on the trial representation via an attention mechanism. Validated on large real-world trials and EHR data including 2,609 trials, 25K doctors and 430K patients, doctor2vec demonstrated improved performance over the best baseline by up to 8.7% in PR-AUC. We also demonstrated that the doctor2vec embedding can be transferred to benefit data insufficiency settings including trial recruitment in less populated/newly explored countries with 13.7% improvement or for rare diseases with 8.1% improvement in PR-AUC.
Submitted 23 November, 2019;
originally announced November 2019.
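The Doctor2Vec abstract above describes a memory bank of patient experience that is weighted by the trial representation through an attention mechanism. The sketch below illustrates the general idea with plain dot-product attention followed by a softmax. The scoring function, names, and toy vectors are assumptions for illustration and are not taken from the paper.

```python
import math


def attend(query, memory):
    """Dot-product attention: weight each memory slot by its match with the query.

    query  : trial representation (list of floats)
    memory : list of memory slots, each the same length as the query
    Returns (weights, summary), where summary is the weighted sum of slots.
    """
    scores = [sum(q * m for q, m in zip(query, slot)) for slot in memory]
    # Subtract the maximum score before exponentiating for numerical stability.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(memory[0])
    summary = [sum(w * slot[d] for w, slot in zip(weights, memory)) for d in range(dim)]
    return weights, summary


# Hypothetical trial embedding and three patient-experience memory slots.
trial = [1.0, 0.0]
memory_bank = [[2.0, 0.0], [0.0, 2.0], [-1.0, 0.0]]
weights, doctor_vector = attend(trial, memory_bank)
print(weights)
print(doctor_vector)
```

The slot most aligned with the trial embedding receives the largest weight, so the resulting doctor vector changes with the trial being considered.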
-
CASTER: Predicting Drug Interactions with Chemical Substructure Representation
Authors:
Kexin Huang,
Cao Xiao,
Trong Nghia Hoang,
Lucas M. Glass,
Jimeng Sun
Abstract:
Adverse drug-drug interactions (DDIs) remain a leading cause of morbidity and mortality. Identifying potential DDIs during the drug design process is critical for patients and society. Although several computational models have been proposed for DDI prediction, there are still limitations: (1) specialized design of drug representation for DDI predictions is lacking; (2) predictions are based on limited labelled data and do not generalize well to unseen drugs or DDIs; and (3) models are characterized by a large number of parameters and are thus hard to interpret. In this work, we develop a ChemicAl SubstrucTurE Representation (CASTER) framework that predicts DDIs given chemical structures of drugs. CASTER aims to mitigate these limitations via (1) a sequential pattern mining module rooted in the DDI mechanism to efficiently characterize functional sub-structures of drugs; (2) an auto-encoding module that leverages both labelled and unlabelled chemical structure data to improve predictive accuracy and generalizability; and (3) a dictionary learning module that explains the prediction via a small set of coefficients which measure the relevance of each input sub-structure to the DDI outcome. We evaluated CASTER on two real-world DDI datasets and showed that it performed better than state-of-the-art baselines and provided interpretable predictions.
Submitted 19 November, 2019; v1 submitted 14 November, 2019;
originally announced November 2019.
-
Frustratingly Easy Natural Question Answering
Authors:
Lin Pan,
Rishav Chakravarti,
Anthony Ferritto,
Michael Glass,
Alfio Gliozzo,
Salim Roukos,
Radu Florian,
Avirup Sil
Abstract:
Existing literature on Question Answering (QA) mostly focuses on algorithmic novelty, data augmentation, or increasingly large pre-trained language models like XLNet and RoBERTa. Additionally, many systems on the QA leaderboards do not have the associated research documentation needed to successfully replicate their experiments. In this paper, we outline these algorithmic components such as Attention-over-Attention, coupled with data augmentation and ensembling strategies that have been shown to yield state-of-the-art results on benchmark datasets like SQuAD, even achieving super-human performance. Contrary to these prior results, when we evaluate on the recently proposed Natural Questions benchmark dataset, we find that an incredibly simple approach of transfer learning from BERT outperforms the previous state-of-the-art system trained on 4 million more examples than ours by 1.9 F1 points. Adding ensembling strategies further improves that number by 2.3 F1 points.
Submitted 11 September, 2019;
originally announced September 2019.
-
Span Selection Pre-training for Question Answering
Authors:
Michael Glass,
Alfio Gliozzo,
Rishav Chakravarti,
Anthony Ferritto,
Lin Pan,
G P Shrivatsa Bhargav,
Dinesh Garg,
Avirup Sil
Abstract:
BERT (Bidirectional Encoder Representations from Transformers) and related pre-trained Transformers have provided large gains across many language understanding tasks, achieving a new state-of-the-art (SOTA). BERT is pre-trained on two auxiliary tasks: Masked Language Model and Next Sentence Prediction. In this paper we introduce a new pre-training task inspired by reading comprehension to better align the pre-training from memorization to understanding. Span Selection Pre-Training (SSPT) poses cloze-like training instances, but rather than draw the answer from the model's parameters, it is selected from a relevant passage. We find significant and consistent improvements over both BERT-BASE and BERT-LARGE on multiple reading comprehension (MRC) datasets. Specifically, our proposed model has strong empirical evidence as it obtains SOTA results on Natural Questions, a new benchmark MRC dataset, outperforming BERT-LARGE by 3 F1 points on short answer prediction. We also show significant impact in HotpotQA, improving answer prediction F1 by 4 points and supporting fact prediction F1 by 1 point and outperforming the previous best system. Moreover, we show that our pre-training approach is particularly effective when training data is limited, improving the learning curve by a large amount.
Submitted 18 June, 2020; v1 submitted 9 September, 2019;
originally announced September 2019.
-
Populating Web Scale Knowledge Graphs using Distantly Supervised Relation Extraction and Validation
Authors:
Sarthak Dash,
Michael R. Glass,
Alfio Gliozzo,
Mustafa Canim
Abstract:
In this paper, we propose a fully automated system to extend knowledge graphs using external information from web-scale corpora. The designed system leverages a deep learning based technology for relation extraction that can be trained by a distantly supervised approach. In addition, the system uses a deep learning approach for knowledge base completion by utilizing the global structure information of the induced KG to further refine the confidence of the newly discovered relations. The designed system does not require any effort for adaptation to new languages and domains as it does not use any hand-labeled data, NLP analytics, or inference rules. Our experiments, performed on a popular academic benchmark, demonstrate that the suggested system boosts the performance of relation extraction by a wide margin, reporting error reductions of 50%, resulting in relative improvement of up to 100%. Also, a web-scale experiment conducted to extend DBPedia with knowledge from Common Crawl shows that our system is not only scalable but also does not require any adaptation cost, while yielding substantial accuracy gain.
Submitted 10 September, 2019; v1 submitted 21 August, 2019;
originally announced August 2019.
-
P2L: Predicting Transfer Learning for Images and Semantic Relations
Authors:
Bishwaranjan Bhattacharjee,
John R. Kender,
Matthew Hill,
Parijat Dube,
Siyu Huo,
Michael R. Glass,
Brian Belgodere,
Sharath Pankanti,
Noel Codella,
Patrick Watson
Abstract:
Transfer learning enhances learning across tasks, by leveraging previously learned representations -- if they are properly chosen. We describe an efficient method to accurately estimate the appropriateness of a previously trained model for use in a new learning task. We use this measure, which we call "Predict To Learn" ("P2L"), in the two very different domains of images and semantic relations, where it predicts, from a set of "source" models, the one model most likely to produce effective transfer for training a given "target" model. We validate our approach thoroughly, by assembling a collection of candidate source models, then fine-tuning each candidate to perform each of a collection of target tasks, and finally measuring how well transfer has been enhanced. Across 95 tasks within multiple domains (image classification and semantic relations), the P2L approach was able to select the best transfer learning model on average, while the heuristic of choosing the model trained on the largest data set selected the best model in only 55 cases. These results suggest that P2L captures important information in common between source and target tasks, and that this shared informational structure contributes to successful transfer learning more than simple data size.
Submitted 15 October, 2020; v1 submitted 20 August, 2019;
originally announced August 2019.
-
CFO: A Framework for Building Production NLP Systems
Authors:
Rishav Chakravarti,
Cezar Pendus,
Andrzej Sakrajda,
Anthony Ferritto,
Lin Pan,
Michael Glass,
Vittorio Castelli,
J. William Murdock,
Radu Florian,
Salim Roukos,
Avirup Sil
Abstract:
This paper introduces a novel orchestration framework, called CFO (COMPUTATION FLOW ORCHESTRATOR), for building, experimenting with, and deploying interactive NLP (Natural Language Processing) and IR (Information Retrieval) systems to production environments. We then demonstrate a question answering system built using this framework which incorporates state-of-the-art BERT based MRC (Machine Reading Comprehension) with IR components to enable end-to-end answer retrieval. Results from the demo system are shown to be high quality in both academic and industry domain specific settings. Finally, we discuss best practices when (pre-)training BERT based MRC models for production systems.
Submitted 19 June, 2020; v1 submitted 16 August, 2019;
originally announced August 2019.
-
A hierarchical approach for modelling X-ray beamlines. Application to a coherent beamline
Authors:
Manuel Sanchez del Rio,
Rafael Celestre,
Mark Glass,
Giovanni Pirro,
Juan Reyes-Herrera,
Ray Barrett,
Julio Cesar da Silva,
Peter Cloetens,
Xianbo Shi,
Luca Rebuffi
Abstract:
We consider different approaches to simulate a modern X-ray beamline. Several methodologies with increasing complexity are applied to discuss the relevant parameters that quantify the beamline performance. Parameters such as flux, dimensions and intensity distribution of the focused beam and coherence properties are obtained using methods ranging from simple analytical calculations to sophisticated computer simulations using ray-tracing and wave optics techniques. A latest-generation X-ray nanofocusing beamline for coherent applications (ID16A at the ESRF) has been chosen to study in detail the issues related to highly demagnifying synchrotron sources and exploiting the beam coherence. The performance of the beamline is studied for two storage rings: the old ESRF-1 (emittance 4000 pm) and the new ESRF-EBS (emittance 150 pm). In addition to traditional results in terms of flux and beam sizes, an innovative study on the partial coherence properties based on the propagation of coherent modes is presented. The different algorithms and methodologies are implemented in the software suite OASYS. These are discussed with emphasis placed upon the benefits and limitations of each.
Submitted 17 June, 2019;
originally announced June 2019.
-
A Design-Time/Run-Time Application Mapping Methodology for Predictable Execution Time in MPSoCs
Authors:
Andreas Weichslgartner,
Stefan Wildermann,
Deepak Gangadharan,
Michael Glaß,
Jürgen Teich
Abstract:
Executing multiple applications on a single MPSoC brings the major challenge of satisfying multiple quality requirements regarding real-time, energy, etc. Hybrid application mapping denotes the combination of design-time analysis with run-time application mapping. In this article, we present such a methodology, which comprises a design space exploration coupled with a formal performance analysis. This results in several resource reservation configurations, optimized for multiple objectives, with verified real-time guarantees for each individual application. The Pareto-optimal configurations are handed over to run-time management which searches for a suitable mapping according to this information. To provide any real-time guarantees, the performance analysis needs to be composable and the influence of the applications on each other has to be bounded. We achieve this either by spatial or a novel temporal isolation for tasks and by exploiting composable NoCs. With the proposed temporal isolation, tasks of different applications can be mapped to the same resource while with spatial isolation, one computing resource can be exclusively used by only one application. The experiments reveal that the success rate in finding feasible application mappings can be increased by the proposed temporal isolation by up to 30% and energy consumption can be reduced compared to spatial isolation.
Submitted 16 November, 2017;
originally announced November 2017.
-
Language Independent Acquisition of Abbreviations
Authors:
Michael R. Glass,
Md Faisal Mahbub Chowdhury,
Alfio M. Gliozzo
Abstract:
This paper addresses automatic extraction of abbreviations (encompassing acronyms and initialisms) and corresponding long-form expansions from plain unstructured text. We create and are going to release a multilingual resource for abbreviations and their corresponding expansions, built automatically by exploiting Wikipedia redirect and disambiguation pages, that can be used as a benchmark for evaluation. We address a shortcoming of previous work where only the redirect pages were used, and so every abbreviation had only a single expansion, even though multiple different expansions are possible for many of the abbreviations. We also develop a principled machine learning based approach to scoring expansion candidates using different techniques such as indicators of near synonymy, topical relatedness, and surface similarity. We show improved performance over seven languages, including two with a non-Latin alphabet, relative to strong baselines.
Submitted 23 September, 2017;
originally announced September 2017.
-
Coherent modes of X-ray beams emitted by undulators in new storage rings
Authors:
Mark Glass,
Manuel Sanchez del Rio
Abstract:
Synchrotron radiation emitted by electrons passing through an undulator placed in a storage ring is decomposed into coherent modes. The case of ultimate storage rings, where the electron emittance is comparable to the emittance of the photon fan, is analyzed by means of the cross spectral density and the coherent mode spectrum. The proposed method naturally permits the statistical analysis and propagation of the cross spectral density along the beamline optics. The coherence properties of the X-ray beam at any point of the beamline are completely given in terms of the eigenvalues and coherent modes of the cross spectral density.
Submitted 14 June, 2017;
originally announced June 2017.
-
Recognizing the real line
Authors:
A. M. W. Glass,
John S. Wilson
Abstract:
Let $(Ω, \leq)$ be a totally ordered set. We prove that if Aut$(Ω,\leq)$ is transitive and satisfies the same first-order sentences as the automorphism group of the real line (in the language of groups) then $Ω$ and the real line are isomorphic ordered sets. This improvement of a theorem of Gurevich and Holland is obtained as a consequence of a study of centralizers associated with certain transitive subgroups of Aut$(Ω,\leq)$.
Submitted 25 January, 2017;
originally announced January 2017.
-
The first-order theory of $\ell$-permutation groups
Authors:
A. M. W. Glass,
John S. Wilson
Abstract:
Let $(Ω, \leq)$ be a totally ordered set. We prove that if $\operatorname{Aut}(Ω,\leq)$ is transitive and satisfies the same first-order sentences as $\operatorname{Aut}(\mathbb{R},\leq)$ (in the language of lattice-ordered groups) then $Ω$ and $\mathbb{R}$ are isomorphic ordered sets. This improvement of a theorem of Gurevich and Holland is obtained as one of many consequences of a study of centralizers and coloured chains associated with certain transitive subgroups of $\operatorname{Aut}(Ω,\leq)$.
Submitted 1 June, 2016;
originally announced June 2016.
-
Applying Deep Learning to Answer Selection: A Study and An Open Task
Authors:
Minwei Feng,
Bing Xiang,
Michael R. Glass,
Lidan Wang,
Bowen Zhou
Abstract:
We apply a general deep learning framework to address the non-factoid question answering task. Our approach does not rely on any linguistic tools and can be applied to different languages or domains. Various architectures are presented and compared. We create and release a QA corpus and set up a new QA task in the insurance domain. Experimental results demonstrate superior performance compared to the baseline methods, and various technologies give further improvements. For this highly challenging task, the top-1 accuracy can reach up to 65.3% on a test set, which indicates a great potential for practical use.
Submitted 2 October, 2015; v1 submitted 6 August, 2015;
originally announced August 2015.
-
Towards Cross-layer Reliability Analysis of Transient and Permanent Faults
Authors:
Hananeh Aliee,
Liang Chen,
Mojtaba Ebrahimi,
Michael Glaß,
Faramarz Khosravi,
Mehdi B. Tahoori
Abstract:
Due to the increasing complexity of Multi-Processor Systems on Chip (MPSoCs), system-level design methodologies have received considerable attention in recent years. However, the significant gap between the system-level reliability analysis and the level where the actual faults occur necessitates a cross-layer approach in which sufficient data about the effects of faults at low levels are passed to the system level. So far, cross-layer reliability analysis techniques have focused on a specific type of fault, e.g., either permanent or transient faults. In this work, we aim at proposing a cross-layer reliability analysis which considers different fault types concurrently and connects reliability analysis techniques at different levels of abstraction using adapters.
Submitted 12 May, 2014;
originally announced May 2014.
-
Residual nilpotence and ordering in one-relator groups and knot groups
Authors:
I. M. Chiswell,
A. M. W. Glass,
John S. Wilson
Abstract:
Let $G=\langle x,t\mid w\rangle$ be a one-relator group, where $w$ is a word in $x,t$. If $w$ is a product of conjugates of $x$ then, associated with $w$, there is a polynomial $A_w(X)$ over the integers, which, in the case when $G$ is a knot group, is the Alexander polynomial of the knot. We prove, subject to certain restrictions on $w$, that if all roots of $A_w(X)$ are real and positive then $G$ is bi-orderable, and that if $G$ is bi-orderable then at least one root is real and positive. This sheds light on the bi-orderability of certain knot groups and on a question of Clay and Rolfsen. One of the results relies on an extension of work of G. Baumslag on adjunction of roots to groups, and this may have independent interest.
Submitted 2 November, 2014; v1 submitted 5 May, 2014;
originally announced May 2014.
-
A finitely presented orderable group with insoluble word problem
Authors:
V. V. Bludov,
A. M. W. Glass
Abstract:
We construct a finitely presented (two-sided) totally orderable group with insoluble word problem.
Submitted 3 August, 2010;
originally announced August 2010.
-
Reducing the Risk of Spreadsheet Usage - a Case Study
Authors:
Mel Glass,
David Ford,
Sebastian Dewhurst
Abstract:
The frequency with which spreadsheets are used and the associated risk are well known. Many tools and techniques have been developed which help reduce risks associated with creating and maintaining spreadsheets. However, little consideration has been given to reducing the risks of routine usage by the "consumers" - for example when entering and editing data. EASA's solution, available commercially, ensures that any routine process involving spreadsheets can be executed rapidly and without errors by the end-users, often with a significant reduction in manual effort. Specifically, the technology enables the rapid creation and deployment of web-based applications, connected to one or more centralized spreadsheets; this ensures version control, easy and error free usage, and security of intellectual property contained in spreadsheets.
Submitted 11 August, 2009;
originally announced August 2009.
-
Unsolved problems in ordered and orderable groups
Authors:
V. V. Bludov,
A. M. W. Glass,
V. M. Kopytov,
N. Ya. Medvedev
Abstract:
We provide a list of (mainly unsolved) problems in ordered and orderable groups. These were originally compiled 10 years ago by the last two authors. New problems have been added to the list. Progress on some of these is noted and references provided. A few have been solved and their solutions are noted and referenced. We hope that this submission will act as a spur to mathematicians to solve some of them!
Submitted 15 June, 2009;
originally announced June 2009.