Search | arXiv e-print repository

Towards Accurate and Efficient Document Analytics with Large Language Models

Authors: Yiming Lin, Madelon Hulsebos, Ruiying Ma, Shreya Shankar, Sepanta Zeighami, Aditya G. Parameswaran, Eugene Wu

Abstract: Unstructured data formats account for over 80% of the data currently stored, and extracting value from such formats remains a considerable challenge. In particular, current approaches for managing unstructured documents do not support ad-hoc analytical queries on document collections. Moreover, Large Language Models (LLMs) directly applied to the documents themselves, or on portions of documents through a process of Retrieval-Augmented Generation (RAG), fail to provide high accuracy query results, and in the LLM-only case, additionally incur high costs. Since many unstructured documents in a collection often follow similar templates that impart a common semantic structure, we introduce ZenDB, a document analytics system that leverages this semantic structure, coupled with LLMs, to answer ad-hoc SQL queries on document collections. ZenDB efficiently extracts semantic hierarchical structures from such templatized documents, and introduces a novel query engine that leverages these structures for accurate and cost-effective query execution. Users can impose a schema on their documents, and query it, all via SQL. Extensive experiments on three real-world document collections demonstrate ZenDB's benefits, achieving up to 30% cost savings compared to LLM-based baselines, while maintaining or improving accuracy, and surpassing RAG-based baselines by up to 61% in precision and 80% in recall, at a marginally higher cost.

Submitted 7 May, 2024; originally announced May 2024.

arXiv:2405.04042 [pdf, other]

Space-time Reinforcement Network for Video Object Segmentation

Authors: Yadang Chen, Wentao Zhu, Zhi-Xin Yang, Enhua Wu

Abstract: Recently, video object segmentation (VOS) networks typically use memory-based methods: for each query frame, the mask is predicted by space-time matching to memory frames. Despite these methods having superior performance, they suffer from two issues: 1) Challenging data can destroy the space-time coherence between adjacent video frames. 2) Pixel-level matching will lead to undesired mismatching caused by noise or distractors. To address the aforementioned issues, we first propose to generate an auxiliary frame between adjacent frames, serving as an implicit short-temporal reference for the query one. Next, we learn a prototype for each video object, so that prototype-level matching can be implemented between the query and memory. The experiments demonstrate that our network outperforms the state-of-the-art method on DAVIS 2017, achieving a J&F score of 86.4%, and attains a competitive result of 85.0% on YouTube VOS 2018. In addition, our network exhibits a high inference speed of 32+ FPS.

Submitted 7 May, 2024; originally announced May 2024.

Comments: Accepted by ICME 2024. 6 pages, 10 figures

arXiv:2404.12552 [pdf, other]

Cocoon: Semantic Table Profiling Using Large Language Models

Authors: Zezhou Huang, Eugene Wu

Abstract: Data profilers play a crucial role in the preprocessing phase of data analysis by identifying quality issues such as missing, extreme, or erroneous values. Traditionally, profilers have relied solely on statistical methods, which lead to high false positives and false negatives. For example, they may incorrectly flag missing values where such absences are expected and normal based on the data's semantic context. To address these, we introduce Cocoon, a data profiling system that integrates LLMs to imbue statistical profiling with semantics. Cocoon enhances traditional profiling methods by adding a three-step process: Semantic Context, Semantic Profile, and Semantic Review. Our user studies show that Cocoon is highly effective at accurately discerning whether anomalies are genuine errors requiring correction or acceptable variations based on the semantics for real-world datasets.

Submitted 18 April, 2024; originally announced April 2024.

arXiv:2404.10198 [pdf, other]

ClashEval: Quantifying the tug-of-war between an LLM's internal prior and external evidence

Authors: Kevin Wu, Eric Wu, James Zou

Abstract: Retrieval augmented generation (RAG) is frequently used to mitigate hallucinations and provide up-to-date knowledge for large language models (LLMs). However, given that document retrieval is an imprecise task and sometimes results in erroneous or even harmful content being presented in context, this raises the question of how LLMs handle retrieved information: If the provided content is incorrect, does the model know to ignore it, or does it recapitulate the error? Conversely, when the model's initial response is incorrect, does it always know to use the retrieved information to correct itself, or does it insist on its wrong prior response? To answer this, we curate a dataset of over 1200 questions across six domains (e.g., drug dosages, Olympic records, locations) along with content relevant to answering each question. We further apply precise perturbations to the answers in the content that range from subtle to blatant errors. We benchmark six top-performing LLMs, including GPT-4o, on this dataset and find that LLMs are susceptible to adopting incorrect retrieved content, overriding their own correct prior knowledge over 60% of the time. However, the more unrealistic the retrieved content is (i.e. more deviated from truth), the less likely the model is to adopt it. Also, the less confident a model is in its initial response (via measuring token probabilities), the more likely it is to adopt the information in the retrieved content. We exploit this finding and demonstrate simple methods for improving model accuracy where there is conflicting retrieved content. Our results highlight a difficult task and benchmark for LLMs -- namely, their ability to correctly discern when it is wrong in light of correct retrieved content and to reject cases when the provided content is incorrect.

Submitted 10 June, 2024; v1 submitted 15 April, 2024; originally announced April 2024.

Comments: Revised June 9 2024

arXiv:2403.04261 [pdf]

Advancing Biomedical Text Mining with Community Challenges

Authors: Hui Zong, Rongrong Wu, Jiaxue Cha, Erman Wu, Jiakun Li, Liang Tao, Zuofeng Li, Buzhou Tang, Bairong Shen

Abstract: The field of biomedical research has witnessed a significant increase in the accumulation of vast amounts of textual data from various sources such as scientific literature, electronic health records, clinical trial reports, and social media. However, manually processing and analyzing these extensive and complex resources is time-consuming and inefficient. To address this challenge, biomedical text mining, also known as biomedical natural language processing, has garnered great attention. Community challenge evaluation competitions have played an important role in promoting technology innovation and interdisciplinary collaboration in biomedical text mining research. These challenges provide platforms for researchers to develop state-of-the-art solutions for data mining and information processing in biomedical research. In this article, we review the recent advances in community challenges specific to Chinese biomedical text mining. Firstly, we collect the information of these evaluation tasks, such as data sources and task types. Secondly, we conduct a systematic summary and comparative analysis, including named entity recognition, entity normalization, attribute extraction, relation extraction, event extraction, text classification, text similarity, knowledge graph construction, question answering, text generation, and large language model evaluation. Then, we summarize the potential clinical applications of these community challenge tasks from a translational informatics perspective. Finally, we discuss the contributions and limitations of these community challenges, while highlighting future directions in the era of large language models.

Submitted 7 March, 2024; originally announced March 2024.

arXiv:2402.05160 [pdf, other]

What's documented in AI? Systematic Analysis of 32K AI Model Cards

Authors: Weixin Liang, Nazneen Rajani, Xinyu Yang, Ezinwanne Ozoani, Eric Wu, Yiqun Chen, Daniel Scott Smith, James Zou

Abstract: The rapid proliferation of AI models has underscored the importance of thorough documentation, as it enables users to understand, trust, and effectively utilize these models in various applications. Although developers are encouraged to produce model cards, it's not clear how much information or what information these cards contain. In this study, we conduct a comprehensive analysis of 32,111 AI model documentations on Hugging Face, a leading platform for distributing and deploying AI models. Our investigation sheds light on the prevailing model card documentation practices. Most of the AI models with substantial downloads provide model cards, though the cards have uneven informativeness. We find that sections addressing environmental impact, limitations, and evaluation exhibit the lowest filled-out rates, while the training section is the most consistently filled-out. We analyze the content of each section to characterize practitioners' priorities. Interestingly, there are substantial discussions of data, sometimes with equal or even greater emphasis than the model itself. To evaluate the impact of model cards, we conducted an intervention study by adding detailed model cards to 42 popular models which had no or sparse model cards previously. We find that adding model cards is moderately correlated with an increase in weekly download rates. Our study opens up a new perspective for analyzing community norms and practices for model documentation through large-scale data science and linguistics analysis.

Submitted 7 February, 2024; originally announced February 2024.

arXiv:2402.02008 [pdf, other]

How well do LLMs cite relevant medical references? An evaluation framework and analyses

Authors: Kevin Wu, Eric Wu, Ally Cassasola, Angela Zhang, Kevin Wei, Teresa Nguyen, Sith Riantawan, Patricia Shi Riantawan, Daniel E. Ho, James Zou

Abstract: Large language models (LLMs) are currently being used to answer medical questions across a variety of clinical domains. Recent top-performing commercial LLMs, in particular, are also capable of citing sources to support their responses. In this paper, we ask: do the sources that LLMs generate actually support the claims that they make? To answer this, we propose three contributions. First, as expert medical annotations are an expensive and time-consuming bottleneck for scalable evaluation, we demonstrate that GPT-4 is highly accurate in validating source relevance, agreeing 88% of the time with a panel of medical doctors. Second, we develop an end-to-end, automated pipeline called SourceCheckup and use it to evaluate five top-performing LLMs on a dataset of 1200 generated questions, totaling over 40K pairs of statements and sources. Interestingly, we find that between ~50% to 90% of LLM responses are not fully supported by the sources they provide. We also evaluate GPT-4 with retrieval augmented generation (RAG) and find that, even still, around 30% of individual statements are unsupported, while nearly half of its responses are not fully supported. Third, we open-source our curated dataset of medical questions and expert annotations for future evaluations. Given the rapid pace of LLM development and the potential harms of incorrect or outdated medical information, it is crucial to also understand and quantify their capability to produce relevant, trustworthy medical references.

Submitted 2 February, 2024; originally announced February 2024.

arXiv:2401.03038 [pdf, other]

SPADE: Synthesizing Data Quality Assertions for Large Language Model Pipelines

Authors: Shreya Shankar, Haotian Li, Parth Asawa, Madelon Hulsebos, Yiming Lin, J. D. Zamfirescu-Pereira, Harrison Chase, Will Fu-Hinthorn, Aditya G. Parameswaran, Eugene Wu

Abstract: Large language models (LLMs) are being increasingly deployed as part of pipelines that repeatedly process or generate data of some sort. However, a common barrier to deployment is the frequent and often unpredictable errors that plague LLMs. Acknowledging the inevitability of these errors, we propose data quality assertions to identify when LLMs may be making mistakes. We present SPADE, a method for automatically synthesizing data quality assertions that identify bad LLM outputs. We make the observation that developers often identify data quality issues during prototyping prior to deployment, and attempt to address them by adding instructions to the LLM prompt over time. SPADE therefore analyzes histories of prompt versions over time to create candidate assertion functions and then selects a minimal set that fulfills both coverage and accuracy requirements. In testing across nine different real-world LLM pipelines, SPADE efficiently reduces the number of assertions by 14% and decreases false failures by 21% when compared to simpler baselines. SPADE has been deployed as an offering within LangSmith, LangChain's LLM pipeline hub, and has been used to generate data quality assertions for over 2000 pipelines across a spectrum of industries.

Submitted 31 March, 2024; v1 submitted 5 January, 2024; originally announced January 2024.

Comments: 17 pages, 6 figures

arXiv:2401.01456 [pdf, other]

ColorizeDiffusion: Adjustable Sketch Colorization with Reference Image and Text

Authors: Dingkun Yan, Liang Yuan, Erwin Wu, Yuma Nishioka, Issei Fujishiro, Suguru Saito

Abstract: Diffusion models have recently demonstrated their effectiveness in generating extremely high-quality images and are now utilized in a wide range of applications, including automatic sketch colorization. Although many methods have been developed for guided sketch colorization, there has been limited exploration of the potential conflicts between image prompts and sketch inputs, which can lead to severe deterioration in the results. Therefore, this paper exhaustively investigates reference-based sketch colorization models that aim to colorize sketch images using reference color images. We specifically investigate two critical aspects of reference-based diffusion models: the "distribution problem", which is a major shortcoming compared to text-based counterparts, and the capability in zero-shot sequential text-based manipulation. We introduce two variations of an image-guided latent diffusion model utilizing different image tokens from the pre-trained CLIP image encoder and propose corresponding manipulation methods to adjust their results sequentially using weighted text inputs. We conduct comprehensive evaluations of our models through qualitative and quantitative experiments as well as a user study.

Submitted 2 July, 2024; v1 submitted 2 January, 2024; originally announced January 2024.

arXiv:2312.14943 [pdf, other]

Flood Event Extraction from News Media to Support Satellite-Based Flood Insurance

Authors: Tejit Pabari, Beth Tellman, Giannis Karamanolakis, Mitchell Thomas, Max Mauerman, Eugene Wu, Upmanu Lall, Marco Tedesco, Michael S Steckler, Paolo Colosio, Daniel E Osgood, Melody Braun, Jens de Bruijn, Shammun Islam

Abstract: Floods cause large losses to property, life, and livelihoods across the world every year, hindering sustainable development. Safety nets to help absorb financial shocks in disasters, such as insurance, are often unavailable in regions of the world most vulnerable to floods, like Bangladesh. Index-based insurance has emerged as an affordable solution, which considers weather data or information from satellites to create a "flood index" that should correlate with the damage insured. However, existing flood event databases are often incomplete, and satellite sensors are not reliable under extreme weather conditions (e.g., because of clouds), which limits the spatial and temporal resolution of current approaches for index-based insurance. In this work, we explore a novel approach for supporting satellite-based flood index insurance by extracting high-resolution spatio-temporal information from news media. First, we publish a dataset consisting of 40,000 news articles covering flood events in Bangladesh by 10 prominent news sources, and inundated area estimates for each division in Bangladesh collected from a satellite radar sensor. Second, we show that keyword-based models are not adequate for this novel application, while context-based classifiers cover complex and implicit flood-related patterns. Third, we show that time series extracted from news media have substantial correlation (Spearman's rho = 0.70) with satellite estimates of inundated area. Our work demonstrates that news media is a promising source for improving the temporal resolution and expanding the spatial coverage of the available flood damage data.

Submitted 5 December, 2023; originally announced December 2023.

arXiv:2310.18742 [pdf, other]

Data Ambiguity Strikes Back: How Documentation Improves GPT's Text-to-SQL

Authors: Zezhou Huang, Pavan Kalyan Damalapati, Eugene Wu

Abstract: Text-to-SQL allows experts to use databases without in-depth knowledge of them. However, real-world tasks have both query and data ambiguities. Most works on Text-to-SQL focused on query ambiguities and designed chat interfaces for experts to provide clarifications. In contrast, the data management community has long studied data ambiguities, but mainly addresses error detection and correction, rather than documenting them for disambiguation in data tasks. This work delves into these data ambiguities in real-world datasets. We have identified prevalent data ambiguities of value consistency, data coverage, and data granularity that affect tasks. We examine how documentation, originally made to help humans to disambiguate data, can help GPT-4 with Text-to-SQL tasks. By offering documentation on these, we found GPT-4's performance improved by 28.9%.

Submitted 28 October, 2023; originally announced October 2023.

arXiv:2310.00902 [pdf, other]

DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models

Authors: Yongchan Kwon, Eric Wu, Kevin Wu, James Zou

Abstract: Quantifying the impact of training data points is crucial for understanding the outputs of machine learning models and for improving the transparency of the AI pipeline. The influence function is a principled and popular data attribution method, but its computational cost often makes it challenging to use. This issue becomes more pronounced in the setting of large language models and text-to-image models. In this work, we propose DataInf, an efficient influence approximation method that is practical for large-scale generative AI models. Leveraging an easy-to-compute closed-form expression, DataInf outperforms existing influence computation algorithms in terms of computational and memory efficiency. Our theoretical analysis shows that DataInf is particularly well-suited for parameter-efficient fine-tuning techniques such as LoRA. Through systematic empirical evaluations, we show that DataInf accurately approximates influence scores and is orders of magnitude faster than existing methods. In applications to RoBERTa-large, Llama-2-13B-chat, and stable-diffusion-v1.5 models, DataInf effectively identifies the most influential fine-tuning examples better than other approximate influence scores. Moreover, it can help to identify which data points are mislabeled.

Submitted 13 March, 2024; v1 submitted 2 October, 2023; originally announced October 2023.

Comments: ICLR 2024

arXiv:2308.12480 [pdf, other]

Lightweight Materialization for Fast Dashboards Over Joins

Authors: Zezhou Huang, Eugene Wu

Abstract: Dashboards are vital in modern business intelligence tools, providing non-technical users with an interface to access comprehensive business data. With the rise of cloud technology, there is an increased number of data sources to provide enriched contexts for various analytical tasks, leading to a demand for interactive dashboards over a large number of joins. Nevertheless, joins are among the most expensive operations in DBMSes, making the support of interactive dashboards over joins challenging. In this paper, we present Treant, a dashboard accelerator for queries over large joins. Treant uses factorized query execution to handle aggregation queries over large joins, which alone is still insufficient for interactive speeds. To address this, we exploit the incremental nature of user interactions using Calibrated Junction Hypertree (CJT), a novel data structure that applies lightweight materialization of the intermediates during factorized execution. CJT ensures that the work needed to compute a query is proportional to how different it is from the previous query, rather than the overall complexity. Treant manages CJTs to share work between queries and performs materialization offline or during user "think-times." Implemented as a middleware that rewrites SQL, Treant is portable to any SQL-based DBMS. Our experiments on single node and cloud DBMSes show that Treant improves dashboard interactions by two orders of magnitude, and provides 10x improvement for ML augmentation compared to SOTA factorized ML system.

Submitted 23 August, 2023; originally announced August 2023.

Journal ref: SIGMOD 2024

arXiv:2308.05637 [pdf, other]

The Fast and the Private: Task-based Dataset Search

Authors: Zezhou Huang, Jiaxiang Liu, Haonan Wang, Eugene Wu

Abstract: Modern dataset search platforms employ ML task-based utility metrics instead of relying on metadata-based keywords to comb through extensive dataset repositories. In this setup, requesters provide an initial dataset, and the platform identifies complementary datasets to augment (join or union) the requester's dataset such that the ML model (e.g., linear regression) performance is improved most. Although effective, current task-based data searches are stymied by (1) high latency which deters users, (2) privacy concerns for regulatory standards, and (3) low data quality which provides low utility. We introduce Mileena, a fast, private, and high-quality task-based dataset search platform. At its heart, Mileena is built on pre-computed semi-ring sketches for efficient ML training and evaluation. Based on semi-ring, we develop a novel Factorized Privacy Mechanism that makes the search differentially private and scales to arbitrary corpus sizes and numbers of requests without major quality degradation. We also demonstrate the early promise in using LLM-based agents for automatic data transformation and applying semi-rings to support causal discovery and treatment effect estimation.

Submitted 20 August, 2023; v1 submitted 10 August, 2023; originally announced August 2023.

arXiv:2307.03336 [pdf, other]

doi 10.1145/3597465.3605223

DIG: The Data Interface Grammar

Authors: Yiru Chen, Jeffery Tao, Eugene Wu

Abstract: Building interactive data interfaces is hard because the design of an interface depends on the data processing needs for the underlying analysis task, yet we do not have a good representation for analysis tasks. To fill this gap, this paper advocates for a Data Interface Grammar (DIG) as an intermediate representation of analysis tasks. We show that DIG is compatible with existing data engineering practices, compact to represent any analysis, simple to translate into an interface design, and amenable to offline analysis. We further illustrate the potential benefits of this abstraction, such as automatic interface generation, automatic interface backend optimization, tutorial generation, and workload generation.

Submitted 16 July, 2023; v1 submitted 6 July, 2023; originally announced July 2023.

Comments: 7 pages, Workshop on Human-In-the-Loop Data Analytics (HILDA) at SIGMOD 2023

ACM Class: H.2; H.5.2; H.2.3

arXiv:2307.00432 [pdf, other]

Saibot: A Differentially Private Data Search Platform

Authors: Zezhou Huang, Jiaxiang Liu, Daniel Alabi, Raul Castro Fernandez, Eugene Wu

Abstract: Recent data search platforms use ML task-based utility measures rather than metadata-based keywords, to search large dataset corpora. Requesters submit a training dataset and these platforms search for augmentations (join or union compatible datasets) that, when used to augment the requester's dataset, most improve model (e.g., linear regression) performance. Although effective, providers that manage personally identifiable data demand differential privacy (DP) guarantees before granting these platforms data access. Unfortunately, making data search differentially private is nontrivial, as a single search can involve training and evaluating datasets hundreds or thousands of times, quickly depleting privacy budgets. We present Saibot, a differentially private data search platform that employs Factorized Privacy Mechanism (FPM), a novel DP mechanism, to calculate sufficient semi-ring statistics for ML over different combinations of datasets. These statistics are privatized once, and can be freely reused for the search. This allows Saibot to scale to arbitrary numbers of datasets and requests, while minimizing the amount that DP noise affects search results. We optimize the sensitivity of FPM for common augmentation operations, and analyze its properties with respect to linear regression. Specifically, we develop an unbiased estimator for many-to-many joins, prove its bounds, and develop an optimization to redistribute DP noise to minimize the impact on the model. Our evaluation on a real-world dataset corpus of 329 datasets demonstrates that Saibot can return augmentations that achieve model accuracy within 50 to 90% of non-private search, while the leading alternative DP mechanisms (TPM, APM, shuffling) are several orders of magnitude worse.

Submitted 1 July, 2023; originally announced July 2023.

Journal ref: VLDB 2023

arXiv:2307.00422 [pdf, other]

JoinBoost: Grow Trees Over Normalized Data Using Only SQL

Authors: Zezhou Huang, Rathijit Sen, Jiaxiang Liu, Eugene Wu

Abstract: Although dominant for tabular data, ML libraries that train tree models over normalized databases (e.g., LightGBM, XGBoost) require the data to be denormalized as a single table, materialized, and exported. This process is slow, does not scale, and poses security risks. In-DB ML aims to train models within DBMSes to avoid data movement and provide data governance. Rather than modify a DBMS to support In-DB ML, is it possible to offer competitive tree training performance to specialized ML libraries...with only SQL? We present JoinBoost, a Python library that rewrites tree training algorithms over normalized databases into pure SQL. It is portable to any DBMS, offers performance competitive with specialized ML libraries, and scales with the underlying DBMS capabilities. JoinBoost extends prior work from both algorithmic and systems perspectives. Algorithmically, we support factorized gradient boosting, by updating the $Y$ variable to the residual in the non-materialized join result. Although this view update problem is generally ambiguous, we identify addition-to-multiplication preserving, the key property of the variance semi-ring to support rmse, the most widely used criterion. System-wise, we identify residual updates as a performance bottleneck. Such overhead can be natively minimized on columnar DBMSes by creating a new column of residual values and adding it as a projection. We validate this with two implementations on DuckDB, with no or minimal modifications to its internals for portability.
Our experiment shows that JoinBoost is 3x (1.1x) faster for random forests (gradient boosting) compared to LightGBM, and over an order of magnitude faster than state-of-the-art In-DB ML systems. Further, JoinBoost scales well beyond LightGBM in terms of the number of features, DB size (TPC-DS SF=1000), and join graph complexity (galaxy schemas).

Submitted 1 July, 2023; originally announced July 2023.

Journal ref: VLDB 2023

arXiv:2307.00417 [pdf, other]

doi 10.1145/3597465.3605224

Aggregation Consistency Errors in Semantic Layers and How to Avoid Them

Authors: Zezhou Huang, Pavan Kalyan Damalapati, Eugene Wu

Abstract: Analysts often struggle with analyzing data from multiple tables in a database due to their lack of knowledge on how to join and aggregate the data. To address this, data engineers pre-specify "semantic layers" which include the join conditions and "metrics" of interest with aggregation functions and expressions. However, joins can cause "aggregation consistency issues". For example, analysts may observe inflated total revenue caused by double counting from join fanouts. Existing BI tools rely on heuristics for deduplication, resulting in imprecise and challenging-to-understand outcomes. To overcome these challenges, we propose "weighing" as a core primitive to counteract join fanouts. "Weighing" has been used in various areas, such as market attribution and order management, ensuring metrics consistency (e.g., total revenue remains the same) even for many-to-many joins. The idea is to assign equal weight to each join key group (rather than each tuple) and then distribute the weights among tuples. Implementing weighing techniques necessitates user input; therefore, we recommend a human-in-the-loop framework that enables users to iteratively explore different strategies and visualize the results.

Submitted 1 July, 2023; originally announced July 2023.

Journal ref: Proceedings of the Workshop on Human-In-the-Loop Data Analytics 2023
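
As a rough illustration of the weighing idea described in the abstract above, the following sketch uses two hypothetical tables (`orders` and `shipments`, which are not taken from the paper). It assumes the simplest possible strategy, an equal split of each join key group's weight among its tuples; the paper may define or choose weights differently.

```python
from collections import Counter

# Hypothetical data: each order has one revenue value; an order may have several shipments.
orders = [("o1", 100.0), ("o2", 50.0)]
shipments = [("o1", "s1"), ("o1", "s2"), ("o2", "s3")]

# Join orders with shipments on the order id; order o1 fans out into two rows.
joined = [
    (order_id, revenue, shipment_id)
    for order_id, revenue in orders
    for shipment_order_id, shipment_id in shipments
    if shipment_order_id == order_id
]

# Naive aggregation over the join counts o1's revenue twice.
naive_total = sum(revenue for _, revenue, _ in joined)

# Weighing (assumed equal split): each join key group has total weight 1,
# shared equally among the tuples in that group.
group_size = Counter(order_id for order_id, _, _ in joined)
weighted_total = sum(revenue / group_size[order_id] for order_id, revenue, _ in joined)

# Reference value computed on the base table, before any join.
true_total = sum(revenue for _, revenue in orders)
print(naive_total, weighted_total, true_total)
```

With this data the naive sum is inflated by the fanout, while the weighted sum matches the total computed on `orders` alone. An order with no matching shipment would drop out of this inner join entirely, which weighing by itself does not repair.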

arXiv:2306.14525 [pdf, other]

ParameterNet: Parameters Are All You Need

Authors: Kai Han, Yunhe Wang, Jianyuan Guo, Enhua Wu

Abstract: Large-scale visual pretraining has significantly improved the performance of large vision models. However, we observe the low FLOPs pitfall: existing low-FLOPs models cannot benefit from large-scale pretraining. In this paper, we introduce a novel design principle, termed ParameterNet, aimed at augmenting the number of parameters in large-scale visual pretraining models while minimizing the increase in FLOPs. We leverage dynamic convolutions to incorporate additional parameters into the networks with only a marginal rise in FLOPs. The ParameterNet approach allows low-FLOPs networks to take advantage of large-scale visual pretraining. Furthermore, we extend the ParameterNet concept to the language domain to enhance inference results while preserving inference speed. Experiments on the large-scale ImageNet-22K have shown the superiority of our ParameterNet scheme. For example, ParameterNet-600M can achieve higher accuracy on ImageNet than the widely-used Swin Transformer (81.6% vs. 80.9%) and has much lower FLOPs (0.6G vs. 4.5G). In the language domain, LLaMA-1B enhanced with ParameterNet achieves 2% higher accuracy over vanilla LLaMA. The code will be released at https://parameternet.github.io/.

Submitted 14 January, 2024; v1 submitted 26 June, 2023; originally announced June 2023.

Comments: https://parameternet.github.io/

arXiv:2305.10419 [pdf, other]

Kitana: Efficient Data Augmentation Search for AutoML

Authors: Zezhou Huang, Pranav Subramaniam, Raul Castro Fernandez, Eugene Wu

Abstract: AutoML services provide a way for non-expert users to benefit from high-quality ML models without worrying about model design and deployment, in exchange for a charge per hour ($21.252 for VertexAI). However, existing AutoML services are model-centric, in that they are limited to extracting features and searching for models from initial training data; they are only as effective as the initial training data quality. With the increasing volume of tabular data available, there is a huge opportunity for data augmentation. For instance, vertical augmentation adds predictive features, while horizontal augmentation adds examples. This augmented training data yields potentially much better AutoML models at a lower cost. However, existing systems either forgo augmentation opportunities and thus produce poor models, or apply expensive augmentation search techniques that drain users' budgets. Kitana is a data-centric AutoML system that also searches for new tabular datasets that can augment the tabular training data with new features and/or examples. Kitana manages a corpus of datasets, exposes an AutoML interface to users and searches for augmentation with datasets in the corpus to improve AutoML performance. To accelerate search, Kitana applies aggressive pre-computation to train a factorized proxy model and evaluate each candidate augmentation within 0.1s. Kitana also uses a cost model to limit the time spent on augmentation search, supports expressive data access controls, and performs request caching to benefit from past similar requests.
Using a corpus of 518 open-source datasets, Kitana produces higher quality models than existing AutoML systems in orders of magnitude less time. Across different user requests, Kitana increases the model R2 from 0.16 to 0.66 while reducing the cost by >100x compared to the naive factorized learning and SOTA data augmentation search.

Submitted 17 May, 2023; originally announced May 2023.

arXiv:2304.11840 [pdf, other]

Robust and Efficient Memory Network for Video Object Segmentation

Authors: Yadang Chen, Dingwei Zhang, Zhi-xin Yang, Enhua Wu

Abstract: This paper proposes a Robust and Efficient Memory Network, referred to as REMN, for studying semi-supervised video object segmentation (VOS). Memory-based methods have recently achieved outstanding VOS performance by performing non-local pixel-wise matching between the query and memory. However, these methods have two limitations. 1) Non-local matching could cause distractor objects in the background to be incorrectly segmented. 2) Memory features with high temporal redundancy consume significant computing resources. For limitation 1, we introduce a local attention mechanism that tackles the background distraction by enhancing the features of foreground objects with the previous mask. For limitation 2, we first adaptively decide whether to update the memory features depending on the variation of foreground objects to reduce temporal redundancy. Second, we employ a dynamic memory bank, which uses a lightweight and differentiable soft modulation gate to decide how many memory features need to be removed in the temporal dimension. Experiments demonstrate that our REMN achieves state-of-the-art results on DAVIS 2017, with a $\mathcal{J\&F}$ score of 86.3% and on YouTube-VOS 2018, with a $\mathcal{G}$ over mean of 85.5%. Furthermore, our network shows a high inference speed of 25+ FPS and uses relatively few computing resources.

Submitted 24 April, 2023; originally announced April 2023.

Comments: Accepted by ICME 2023. 6 pages, 6 figures

arXiv:2304.02819 [pdf, other]

GPT detectors are biased against non-native English writers

Authors: Weixin Liang, Mert Yuksekgonul, Yining Mao, Eric Wu, James Zou

Abstract: The rapid adoption of generative language models has brought about substantial advancements in digital communication, while simultaneously raising concerns regarding the potential misuse of AI-generated content. Although numerous detection methods have been proposed to differentiate between AI and human-generated content, the fairness and robustness of these detectors remain underexplored. In this study, we evaluate the performance of several widely-used GPT detectors using writing samples from native and non-native English writers. Our findings reveal that these detectors consistently misclassify non-native English writing samples as AI-generated, whereas native writing samples are accurately identified. Furthermore, we demonstrate that simple prompting strategies can not only mitigate this bias but also effectively bypass GPT detectors, suggesting that GPT detectors may unintentionally penalize writers with constrained linguistic expressions. Our results call for a broader conversation about the ethical implications of deploying ChatGPT content detectors and caution against their use in evaluative or educational settings, particularly when they may inadvertently penalize or exclude non-native English speakers from the global discourse. The published version of this study can be accessed at: www.cell.com/patterns/fulltext/S2666-3899(23)00130-7

Submitted 10 July, 2023; v1 submitted 5 April, 2023; originally announced April 2023.

arXiv:2303.07080 [pdf, other]

Bag of Tricks with Quantized Convolutional Neural Networks for image classification

Authors: Jie Hu, Mengze Zeng, Enhua Wu

Abstract: Deep neural networks have been proven effective in a wide range of tasks. However, their high computational and memory costs make them impractical to deploy on resource-constrained devices. To address this issue, quantization schemes have been proposed to reduce the memory footprint and improve inference speed. While numerous quantization methods have been proposed, they lack systematic analysis of their effectiveness. To bridge this gap, we collect and improve existing quantization methods and propose a gold guideline for post-training quantization. We evaluate the effectiveness of our proposed method with two popular models, ResNet50 and MobileNetV2, on the ImageNet dataset. By following our guidelines, no accuracy degradation occurs even after directly quantizing the model to 8-bits without additional training. A quantization-aware training based on the guidelines can further improve the accuracy in lower-bits quantization. Moreover, we have integrated a multi-stage fine-tuning strategy that works harmoniously with existing pruning techniques to reduce costs even further. Remarkably, our results reveal that a quantized MobileNetV2 with 30% sparsity actually surpasses the performance of the equivalent full-precision model, underscoring the effectiveness and resilience of our proposed scheme.

Submitted 13 March, 2023; originally announced March 2023.

Comments: ICASSP 2023

arXiv:2302.08760 [pdf, other]

3D Human Pose Lifting with Grid Convolution

Authors: Yangyuxuan Kang, Yuyang Liu, Anbang Yao, Shandong Wang, Enhua Wu

Abstract: Existing lifting networks for regressing 3D human poses from 2D single-view poses are typically constructed with linear layers based on graph-structured representation learning. In sharp contrast to them, this paper presents Grid Convolution (GridConv), mimicking the wisdom of regular convolution operations in image space. GridConv is based on a novel Semantic Grid Transformation (SGT) which leverages a binary assignment matrix to map the irregular graph-structured human pose onto a regular weave-like grid pose representation joint by joint, enabling layer-wise feature learning with GridConv operations. We provide two ways to implement SGT, including handcrafted and learnable designs. Surprisingly, both designs turn out to achieve promising results and the learnable one is better, demonstrating the great potential of this new lifting representation learning formulation. To improve the ability of GridConv to encode contextual cues, we introduce an attention module over the convolutional kernel, making grid convolution operations input-dependent, spatial-aware and grid-specific. We show that our fully convolutional grid lifting network outperforms state-of-the-art methods with noticeable margins under (1) conventional evaluation on Human3.6M and (2) cross-evaluation on MPI-INF-3DHP. Code is available at https://github.com/OSVAI/GridConv

Submitted 17 February, 2023; originally announced February 2023.

Comments: Oral paper at AAAI 2023. Project website: https://github.com/OSVAI/GridConv

arXiv:2301.01367 [pdf, other]

The Price of Anarchy of the Asymmetric One-Sided Allocation Problem

Authors: Sissi Jiang, Ndiame Ndiaye, Adrian Vetta, Eggie Wu

Abstract: We study fair mechanisms for the (asymmetric) one-sided allocation problem with m items and n multi-unit demand agents with additive, unit-sum valuations. The symmetric case (m=n), the one-sided matching problem, has been studied extensively for the class of unit demand agents, in particular with respect to the folklore Random Priority mechanism and the Probabilistic Serial mechanism, introduced by Bogomolnaia and Moulin. Under the assumption of unit-sum valuation functions, Christodoulou et al. proved that the price of anarchy is $\Theta(\sqrt{n})$ in the one-sided matching problem for both the Random Priority and Probabilistic Serial mechanisms. Whilst both Random Priority and Probabilistic Serial are ordinal mechanisms, these approximation guarantees are the best possible even for the broader class of cardinal mechanisms. To extend these results to the general setting there are two technical obstacles. One, asymmetry ($m\neq n$) is problematic especially when the number of items is much greater than the number of agents. Two, it is necessary to study multi-unit demand agents rather than simply unit demand agents. Our approach is to study a cardinal mechanism variant of Probabilistic Serial, which we call Cardinal Probabilistic Serial. We present structural theorems for this mechanism and use them to obtain bounds on the price of anarchy. Our first main result is an upper bound of $O(\sqrt{n}\cdot \log m)$ on the price of anarchy for the asymmetric one-sided allocation problem with multi-unit demand agents.
This upper bound applies to Probabilistic Serial as well and there is a complementary lower bound of $\Omega(\sqrt{n})$ for any fair mechanism. Our second main result is that the price of anarchy degrades with the number of items. Specifically, a logarithmic dependence on the number of items is necessary for both mechanisms.

Submitted 13 May, 2023; v1 submitted 3 January, 2023; originally announced January 2023.

Comments: 32 pages, 4 figures

arXiv:2212.10200 [pdf, other]

Redistribution of Weights and Activations for AdderNet Quantization

Authors: Ying Nie, Kai Han, Haikang Diao, Chuanjian Liu, Enhua Wu, Yunhe Wang

Abstract: Adder Neural Network (AdderNet) provides a new way for developing energy-efficient neural networks by replacing the expensive multiplications in convolution with cheaper additions (i.e., the l1-norm). To achieve higher hardware efficiency, it is necessary to further study the low-bit quantization of AdderNet. Due to the limitation that the commutative law in multiplication does not hold in l1-norm, the well-established quantization methods on convolutional networks cannot be applied on AdderNets. Thus, the existing AdderNet quantization techniques propose to use only one shared scale to quantize both the weights and activations simultaneously. Admittedly, such an approach can keep the commutative law in the l1-norm quantization process, while the accuracy drop after low-bit quantization cannot be ignored. To this end, we first thoroughly analyze the difference on distributions of weights and activations in AdderNet and then propose a new quantization algorithm by redistributing the weights and the activations. Specifically, the pre-trained full-precision weights in different kernels are clustered into different groups, then the intra-group sharing and inter-group independent scales can be adopted. To further compensate for the accuracy drop caused by the distribution difference, we then develop a lossless range clamp scheme for weights and a simple yet effective outliers clamp strategy for activations. Thus, the functionality of full-precision weights and the representation ability of full-precision activations can be fully preserved.
The effectiveness of the proposed quantization method for AdderNet is well verified on several benchmarks, e.g., our 4-bit post-training quantized adder ResNet-18 achieves a 66.5% top-1 accuracy on the ImageNet with comparable energy efficiency, which is about 8.5% higher than that of the previous AdderNet quantization methods.

Submitted 20 December, 2022; originally announced December 2022.

arXiv:2210.03851 [pdf, other]

Calibration: A Simple Trick for Wide-table Delta Analytics

Authors: Zezhou Huang, Eugene Wu

Abstract: Data analytics over normalized databases typically requires computing and materializing expensive joins (wide-tables). Factorized query execution models execution as message passing between relations in the join graph and pushes aggregations through joins to reduce intermediate result sizes. Although this accelerates query execution, it only optimizes a single wide-table query. In contrast, wide-table analytics is usually interactive and users want to apply delta to the initial query structure. For instance, users want to slice, dice and drill-down dimensions, update part of the tables and join with new tables for enrichment. Such Wide-table Delta Analytics offers novel work-sharing opportunities. This work shows that carefully materializing messages during query execution can accelerate Wide-table Delta Analytics by >10^5x as compared to factorized execution, and only incurs a constant factor overhead. The key challenge is that messages are sensitive to the message passing ordering. To address this challenge, we borrow the concept of calibration in probabilistic graphical models to materialize sufficient messages to support any ordering. We manifest these ideas in the novel Calibrated Junction Hypertree (CJT) data structure, which is fast to build, aggressively re-uses messages to accelerate future queries, and is incrementally maintainable under updates. We further show how CJTs benefit applications such as OLAP, query explanation, streaming data, and data augmentation for ML.
Our experiments evaluate three versions of the CJT that run in a single-threaded custom engine, on cloud DBs, and in Pandas, and show 30x - 10^5x improvements over state-of-the-art factorized execution algorithms on the above applications.

Submitted 7 October, 2022; originally announced October 2022.
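
To make the idea of pushing aggregations through joins concrete, here is a minimal sketch over two assumed toy relations, `R(a, b)` and `S(b, c)` (both hypothetical). It computes COUNT(*) over their join twice: once by enumerating the join result, and once by passing a per-key count "message" from `S` to `R`. It illustrates factorized execution in general and is not an implementation of the CJT data structure.

```python
from collections import Counter

# Hypothetical relations R(a, b) and S(b, c), joined on b.
R = [(1, "x"), (2, "x"), (3, "y")]
S = [("x", 10), ("x", 20), ("y", 30), ("z", 40)]

# Baseline: enumerate the join result, then count its rows.
materialized_count = sum(1 for _, r_b in R for s_b, _ in S if r_b == s_b)

# Factorized: aggregate S down to one count per join key (the "message"),
# then combine the message with R without ever building the join.
message = Counter(s_b for s_b, _ in S)
factorized_count = sum(message[r_b] for _, r_b in R)

print(materialized_count, factorized_count)
```

Both counts agree, but the second never builds the joined rows. The message depends only on `S`, so it could be kept and reused by a later query that changes only `R`, which is the kind of work sharing the abstract refers to.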

arXiv:2209.15611 [pdf, other]

Protein structure generation via folding diffusion

Authors: Kevin E. Wu, Kevin K. Yang, Rianne van den Berg, James Y. Zou, Alex X. Lu, Ava P. Amini

Abstract: The ability to computationally generate novel yet physically foldable protein structures could lead to new biological discoveries and new treatments targeting yet incurable diseases. Despite recent advances in protein structure prediction, directly generating diverse, novel protein structures from neural networks remains difficult. In this work, we present a new diffusion-based generative model that designs protein backbone structures via a procedure that mirrors the native folding process. We describe protein backbone structure as a series of consecutive angles capturing the relative orientation of the constituent amino acid residues, and generate new structures by denoising from a random, unfolded state towards a stable folded structure. Not only does this mirror how proteins biologically twist into energetically favorable conformations, the inherent shift and rotational invariance of this representation crucially alleviates the need for complex equivariant networks. We train a denoising diffusion probabilistic model with a simple transformer backbone and demonstrate that our resulting model unconditionally generates highly realistic protein structures with complexity and structural patterns akin to those of naturally-occurring proteins. As a useful resource, we release the first open-source codebase and trained models for protein structure diffusion.

Submitted 23 November, 2022; v1 submitted 30 September, 2022; originally announced September 2022.

ACM Class: I.2.0; J.3

arXiv:2209.15278 [pdf, other]

Rethinking skip connection model as a learnable Markov chain

Authors: Dengsheng Chen, Jie Hu, Wenwen Qiang, Xiaoming Wei, Enhua Wu

Abstract: Over the past few years since the birth of ResNet, skip connection has become the de facto standard for the design of modern architectures due to its widespread adoption, easy optimization and proven performance. Prior work has explained the effectiveness of the skip connection mechanism from different perspectives. In this work, we take a deep dive into the behavior of models with skip connections, which can be formulated as a learnable Markov chain. An efficient Markov chain is preferred as it always maps the input data to the target domain in a better way. However, while a model can be explained as a Markov chain, existing SGD-based optimizers, which are prone to getting trapped in local optima, do not guarantee that it is optimized as an efficient Markov chain. To move towards a more efficient Markov chain, we propose a simple routine of penal connection to make any residual-like model become a learnable Markov chain. Aside from that, the penal connection can also be viewed as a particular model regularization and can be easily implemented with one line of code in the most popular deep learning frameworks (source code: https://github.com/densechen/penal-connection). The encouraging experimental results in multi-modal translation and image recognition empirically confirm our conjecture of the learnable Markov chain view and demonstrate the superiority of the proposed penal connection.

Submitted 2 March, 2023; v1 submitted 30 September, 2022; originally announced September 2022.

Comments: 12 pages, 4 figures

arXiv:2209.08834 [pdf, other]

NL2INTERFACE: Interactive Visualization Interface Generation from Natural Language Queries

Authors: Yiru Chen, Ryan Li, Austin Mac, Tianbao Xie, Tao Yu, Eugene Wu

Abstract: We develop NL2INTERFACE to explore the potential of generating usable interactive multi-visualization interfaces from natural language queries. With NL2INTERFACE, users can directly write natural language queries to automatically generate a fully interactive multi-visualization interface without any extra effort of learning a tool or programming language. Further, users can interact with the interfaces to easily transform the data and quickly see the results in the visualizations.

Submitted 24 September, 2022; v1 submitted 19 September, 2022; originally announced September 2022.

Comments: 5 pages, IEEE Visualization Conference NLVIZ Workshop 2022

ACM Class: H.5.2; H.2; I.2.7

Journal ref: IEEE Visualization Conference NLVIZ Workshop 2022

arXiv:2206.11448 [pdf, ps, other]

Efficient Adaptive Federated Optimization of Federated Learning for IoT

Authors: Zunming Chen, Hongyan Cui, Ensen Wu, Yu Xi

Abstract: The proliferation of the Internet of Things (IoT) and widespread use of devices with sensing, computing, and communication capabilities have motivated intelligent applications empowered by artificial intelligence. Classical artificial intelligence algorithms require centralized data collection and processing, which are challenging in realistic intelligent IoT applications due to growing data privacy concerns and distributed datasets. Federated Learning (FL) has emerged as a distributed privacy-preserving learning framework that enables IoT devices to train a global model by sharing model parameters. However, inefficiency due to frequent parameter transmissions significantly reduces FL performance. Existing acceleration algorithms fall into two main types: local update, which considers trade-offs between communication and computation, and parameter compression, which considers trade-offs between communication and precision. Jointly considering these two trade-offs and adaptively balancing their impacts on convergence has remained unresolved. To solve the problem, this paper proposes a novel efficient adaptive federated optimization (EAFO) algorithm to improve the efficiency of FL, which minimizes the learning error by jointly considering two variables, local update and parameter compression, and enables FL to adaptively adjust the two variables and balance trade-offs among computation, communication and precision.
The experimental results illustrate that, compared with state-of-the-art algorithms, the proposed EAFO can achieve higher accuracies faster.

Submitted 22 June, 2022; originally announced June 2022.

arXiv:2206.00272 [pdf, other]

Vision GNN: An Image is Worth Graph of Nodes

Authors: Kai Han, Yunhe Wang, Jianyuan Guo, Yehui Tang, Enhua Wu

Abstract: Network architecture plays a key role in the deep learning-based computer vision system. The widely-used convolutional neural network and transformer treat the image as a grid or sequence structure, which is not flexible enough to capture irregular and complex objects. In this paper, we propose to represent the image as a graph structure and introduce a new Vision GNN (ViG) architecture to extract graph-level features for visual tasks. We first split the image into a number of patches which are viewed as nodes, and construct a graph by connecting the nearest neighbors. Based on the graph representation of images, we build our ViG model to transform and exchange information among all the nodes. ViG consists of two basic modules: Grapher module with graph convolution for aggregating and updating graph information, and FFN module with two linear layers for node feature transformation. Both isotropic and pyramid architectures of ViG are built with different model sizes. Extensive experiments on image recognition and object detection tasks demonstrate the superiority of our ViG architecture. We hope this pioneering study of GNN on general visual tasks will provide useful inspiration and experience for future research. The PyTorch code is available at https://github.com/huawei-noah/Efficient-AI-Backbones and the MindSpore code is available at https://gitee.com/mindspore/models.

Submitted 4 November, 2022; v1 submitted 1 June, 2022; originally announced June 2022.

Comments: NeurIPS 2022

arXiv:2205.04148 [pdf, other]

Productive Performance Engineering for Weather and Climate Modeling with Python

Authors: Tal Ben-Nun, Linus Groner, Florian Deconinck, Tobias Wicky, Eddie Davis, Johann Dahm, Oliver D. Elbert, Rhea George, Jeremy McGibbon, Lukas Trümper, Elynn Wu, Oliver Fuhrer, Thomas Schulthess, Torsten Hoefler

Abstract: Earth system models are developed with a tight coupling to target hardware, often containing specialized code predicated on processor characteristics. This coupling stems from using imperative languages that hard-code computation schedules and layout. We present a detailed account of optimizing the Finite Volume Cubed-Sphere Dynamical Core (FV3), improving productivity and performance. By using a declarative Python-embedded stencil domain-specific language and data-centric optimization, we abstract hardware-specific details and define a semi-automated workflow for analyzing and optimizing weather and climate applications. The workflow utilizes both local and full-program optimization, as well as user-guided fine-tuning. To prune the infeasible global optimization space, we automatically utilize repeating code motifs via a novel transfer tuning approach. On the Piz Daint supercomputer, we scale to 2,400 GPUs, achieving speedups of up to 3.92x over the tuned production implementation at a fraction of the original code.

Submitted 25 August, 2022; v1 submitted 9 May, 2022; originally announced May 2022.

arXiv:2205.01283 [pdf, other]

Extending the View Composition Algebra to Hierarchical Data

Authors: Eugene Wu

Abstract: Comparison is a core task in visual analysis. Although there are numerous guidelines to help users design effective visualizations to aid known comparison tasks, there are few formalisms that define the semantics of comparison operations in a way that can serve as the basis for a grammar of comparison interactions. Recent work proposed a formalism called View Composition Algebra (VCA) that enables ad hoc comparisons between any combination of marks, trends, or charts in a visualization interface. However, VCA limits comparisons to visual representations of data that have an identical schema, or where the schemas form a strict subset relationship (e.g., comparing price per state with price, but not with price per county). In contrast, the majority of real-world data - temporal, geographical, organizational - are hierarchical. To bridge this gap, this paper presents an extension to VCA (called VCAH) that enables ad hoc comparisons between visualizations of hierarchical data. VCAH leverages known hierarchical relationships to enable ad hoc comparison of data at different hierarchical granularities. We illustrate applications to hierarchical and Tableau visualizations.

Submitted 2 May, 2022; originally announced May 2022.

arXiv:2205.01263 [pdf, other]

How Do Captions Affect Visualization Reading?

Authors: Hanxiu 'Hazel' Zhu, Shelly Shiying Cheng, Eugene Wu

Abstract: Captions help readers better understand visualizations. However, if the visualization is intended to communicate specific features, should the caption be statistical, and focus on specific values, or perceptual, and focus on general patterns? Prior work has shown that when captions mention visually salient features, readers tend to recall those features. Still, we lack explicit guidelines for how to compose the appropriate caption. Further, what if the author wishes to emphasize a less salient feature? In this paper, we study how the visual salience of the feature described in a caption, and the semantic level of the caption description, affect a reader's takeaways from line charts. For each single- or multi-line chart, we generate 4 captions that 1) describe either the primary or secondary salient feature in a chart, and 2) describe the feature either at the statistical or perceptual levels. We then show participants random chart-caption pairs and record their takeaways. We find that the primary salient feature is more memorable for single-line charts when the caption is expressed at the statistical level; for primary and secondary features in multi-line charts, the perceptual level is more memorable. We also find that many readers will tend to recall y-axis numerical values when a caption is present.

Submitted 26 September, 2022; v1 submitted 2 May, 2022; originally announced May 2022.

ACM Class: H.5.2

arXiv:2204.14267 [pdf, other]

A Grammar of Hypotheses for Visualization, Data, and Analysis

Authors: Ashley Suh, Ab Mosca, Eugene Wu, Remco Chang

Abstract: We present a grammar for expressing hypotheses in visual data analysis to formalize the previously abstract notion of "analysis tasks." Through the lens of our grammar, we lay the groundwork for how a user's data analysis questions can be operationalized and automated as a set of hypotheses (a hypothesis space). We demonstrate that our grammar-based approach for analysis tasks can provide a systematic method towards unifying three disparate spaces in visualization research: the hypotheses a dataset can express (a data hypothesis space), the hypotheses a user would like to refine or verify through analysis (an analysis hypothesis space), and the hypotheses a visualization design is capable of supporting (a visualization hypothesis space). We illustrate how the formalization of these three spaces can inform future research in visualization evaluation, knowledge elicitation, analytic provenance, and visualization recommendation by using a shared language for hypotheses. Finally, we compare our proposed grammar-based approach with existing visual analysis models and discuss the potential of a new hypothesis-driven theory of visual analytics.

Submitted 3 April, 2023; v1 submitted 29 April, 2022; originally announced April 2022.

arXiv:2202.07836 [pdf, other]

View Composition Algebra for Ad Hoc Comparison

Authors: Eugene Wu

Abstract: Comparison is a core task in visual analysis. Although there are numerous guidelines to help users design effective visualizations to aid known comparison tasks, there are few techniques available when users want to make ad hoc comparisons between marks, trends, or charts during data exploration and visual analysis. For instance, a user may want to compare voting count maps from different years, two stock trends in a line chart, or a scatterplot of country GDPs with a textual summary of the average GDP. Ideally, users can directly select the comparison targets and compare them; however, what elements of a visualization should be candidate targets, which combinations of targets are safe to compare, and what comparison operations make sense? This paper proposes a conceptual model that lets users compose combinations of values, marks, legend elements, and charts using a set of composition operators that summarize, compute differences, merge, and model their operands. We further define a View Composition Algebra (VCA) that is compatible with datacube-based visualizations, derive an interaction design based on this algebra that supports ad hoc visual comparisons, and illustrate its utility through several use cases.

Submitted 15 February, 2022; originally announced February 2022.

arXiv:2201.05664 [pdf, other]

doi 10.1145/3514221.3520153

Demonstration of PI2: Interactive Visualization Interface Generation for SQL Analysis in Notebook

Authors: Jeffrey Tao, Yiru Chen, Eugene Wu

Abstract: We demonstrate PI2, the first notebook extension that can automatically generate interactive visualization interfaces during SQL-based analyses.

Submitted 14 January, 2022; originally announced January 2022.

Comments: arXiv admin note: text overlap with arXiv:2107.08203

ACM Class: H.2; H.5.2

Journal ref: SIGMOD '22: Proceedings of the 2022 International Conference on Management of Data

arXiv:2201.03297 [pdf, other]

doi 10.1007/s11263-022-01575-y

GhostNets on Heterogeneous Devices via Cheap Operations

Authors: Kai Han, Yunhe Wang, Chang Xu, Jianyuan Guo, Chunjing Xu, Enhua Wu, Qi Tian

Abstract: Deploying convolutional neural networks (CNNs) on mobile devices is difficult due to the limited memory and computation resources. We aim to design efficient neural networks for heterogeneous devices including CPU and GPU, by exploiting the redundancy in feature maps, which has rarely been investigated in neural architecture design. For CPU-like devices, we propose a novel CPU-efficient Ghost (C-Ghost) module to generate more feature maps from cheap operations. Based on a set of intrinsic feature maps, we apply a series of cheap linear transformations to generate many ghost feature maps that could fully reveal information underlying intrinsic features. The proposed C-Ghost module can be taken as a plug-and-play component to upgrade existing convolutional neural networks. C-Ghost bottlenecks are designed to stack C-Ghost modules, and then the lightweight C-GhostNet can be easily established. We further consider efficient networks for GPU devices. Without involving too many GPU-inefficient operations (e.g., depth-wise convolution) in a building stage, we propose to utilize the stage-wise feature redundancy to formulate a GPU-efficient Ghost (G-Ghost) stage structure. The features in a stage are split into two parts, where the first part is processed using the original block with fewer output channels for generating intrinsic features, and the other is generated using cheap operations by exploiting stage-wise redundancy. Experiments conducted on benchmarks demonstrate the effectiveness of the proposed C-Ghost module and the G-Ghost stage. C-GhostNet and G-GhostNet can achieve the optimal trade-off of accuracy and latency for CPU and GPU, respectively. Code is available at https://github.com/huawei-noah/CV-Backbones.

Submitted 10 January, 2022; originally announced January 2022.

Comments: Accepted by IJCV 2022. Extension of GhostNet CVPR2020 paper (arXiv:1911.11907). arXiv admin note: substantial text overlap with arXiv:1911.11907

arXiv:2112.15594 [pdf, other]

doi 10.1073/pnas.2123433119

A Neural Network Solves, Explains, and Generates University Math Problems by Program Synthesis and Few-Shot Learning at Human Level

Authors: Iddo Drori, Sarah Zhang, Reece Shuttleworth, Leonard Tang, Albert Lu, Elizabeth Ke, Kevin Liu, Linda Chen, Sunny Tran, Newman Cheng, Roman Wang, Nikhil Singh, Taylor L. Patti, Jayson Lynch, Avi Shporer, Nakul Verma, Eugene Wu, Gilbert Strang

Abstract: We demonstrate that a neural network pre-trained on text and fine-tuned on code solves mathematics course problems, explains solutions, and generates new questions at a human level. We automatically synthesize programs using few-shot learning and OpenAI's Codex transformer and execute them to solve course problems at 81% automatic accuracy. We curate a new dataset of questions from MIT's largest mathematics courses (Single Variable and Multivariable Calculus, Differential Equations, Introduction to Probability and Statistics, Linear Algebra, and Mathematics for Computer Science) and Columbia University's Computational Linear Algebra. We solve questions from a MATH dataset (on Prealgebra, Algebra, Counting and Probability, Intermediate Algebra, Number Theory, and Precalculus), the latest benchmark of advanced mathematics problems designed to assess mathematical reasoning. We randomly sample questions and generate solutions with multiple modalities, including numbers, equations, and plots. The latest GPT-3 language model pre-trained on text automatically solves only 18.8% of these university questions using zero-shot learning and 30.8% using few-shot learning and the most recent chain of thought prompting. In contrast, program synthesis with few-shot learning using Codex fine-tuned on code generates programs that automatically solve 81% of these questions. Our approach improves the previous state-of-the-art automatic solution accuracy on the benchmark topics from 8.8% to 81.1%. We perform a survey to evaluate the quality and difficulty of generated questions. This work is the first to automatically solve university-level mathematics course questions at a human level and the first work to explain and generate university-level mathematics course questions at scale, a milestone for higher education.

Submitted 30 May, 2022; v1 submitted 31 December, 2021; originally announced December 2021.

Comments: 181 pages, 8 figures, 280 tables

arXiv:2112.10149 [pdf, other]

Elastic-Link for Binarized Neural Network

Authors: Jie Hu, Ziheng Wu, Vince Tan, Zhilin Lu, Mengze Zeng, Enhua Wu

Abstract: Recent work has shown that Binarized Neural Networks (BNNs) are able to greatly reduce computational costs and memory footprints, facilitating model deployment on resource-constrained devices. However, in comparison to their full-precision counterparts, BNNs suffer from severe accuracy degradation. Research aiming to reduce this accuracy gap has thus far largely focused on specific network architectures with few or no 1x1 convolutional layers, for which standard binarization methods do not work well. Because 1x1 convolutions are common in the design of modern architectures (e.g. GoogleNet, ResNet, DenseNet), it is crucial to develop a method to binarize them effectively for BNNs to be more widely adopted. In this work, we propose an "Elastic-Link" (EL) module to enrich information flow within a BNN by adaptively adding real-valued input features to the subsequent convolutional output features. The proposed EL module is easily implemented and can be used in conjunction with other methods for BNNs. We demonstrate that adding EL to BNNs produces a significant improvement on the challenging large-scale ImageNet dataset. For example, we raise the top-1 accuracy of binarized ResNet26 from 57.9% to 64.0%. EL also aids convergence in the training of binarized MobileNet, for which a top-1 accuracy of 56.4% is achieved. Finally, with the integration of ReActNet, it yields a new state-of-the-art result of 71.9% top-1 accuracy.

Submitted 17 February, 2023; v1 submitted 19 December, 2021; originally announced December 2021.

Comments: AAAI2022

arXiv:2111.08168 [pdf, other]

Explaining medical AI performance disparities across sites with confounder Shapley value analysis

Authors: Eric Wu, Kevin Wu, James Zou

Abstract: Medical AI algorithms can often experience degraded performance when evaluated on previously unseen sites. Addressing cross-site performance disparities is key to ensuring that AI is equitable and effective when deployed on diverse patient populations. Multi-site evaluations are key to diagnosing such disparities as they can test algorithms across a broader range of potential biases such as patient demographics, equipment types, and technical parameters. However, such tests do not explain why the model performs worse. Our framework provides a method for quantifying the marginal and cumulative effect of each type of bias on the overall performance difference when a model is evaluated on external data. We demonstrate its usefulness in a case study of a deep learning model trained to detect the presence of pneumothorax, where our framework can help explain up to 60% of the discrepancy in performance across different sites with known biases like disease comorbidities and imaging parameters.

Submitted 12 November, 2021; originally announced November 2021.

Comments: Machine Learning for Health (ML4H) - Extended Abstract

arXiv:2109.09618 [pdf, other]

Automatic Y-axis Rescaling in Dynamic Visualizations

Authors: Jacob Fisher, Remco Chang, Eugene Wu

Abstract: Animated and interactive data visualizations dynamically change the data rendered in a visualization (e.g., bar chart). As the data changes, the y-axis may need to be rescaled as the domain of the data changes. Each axis rescaling potentially improves the readability of the current chart, but may also disorient the user. In contrast to static visualizations, where there is considerable literature to help choose the appropriate y-axis scale, there is a lack of guidance about how and when rescaling should be used in dynamic visualizations. Existing visualization systems and libraries adopt a fixed global y-axis, or rescale every time the data changes. Yet, professional visualizations, such as in data journalism, do not adopt either strategy. They instead carefully and manually choose when to rescale based on the analysis task and data. To this end, we conduct a series of Mechanical Turk experiments to study the potential of dynamic axis rescaling and the factors that affect its effectiveness. We find that the appropriate rescaling policy is both task- and data-dependent, and we do not find one clear policy choice for all situations.

Submitted 20 September, 2021; originally announced September 2021.

Comments: 5 pages, 7 figures, to be published in IEEE VIS 2021

arXiv:2109.09310 [pdf, other]

doi 10.1109/TPAMI.2021.3114368

Learning Versatile Convolution Filters for Efficient Visual Recognition

Authors: Kai Han, Yunhe Wang, Chang Xu, Chunjing Xu, Enhua Wu, Dacheng Tao

Abstract: This paper introduces versatile filters to construct efficient convolutional neural networks that are widely used in various visual recognition tasks. Considering the demands of efficient deep learning techniques running on cost-effective hardware, a number of methods have been developed to learn compact neural networks. Most of these works aim to slim down filters in different ways, e.g., investigating small, sparse or quantized filters. In contrast, we treat filters from an additive perspective. A series of secondary filters can be derived from a primary filter with the help of binary masks. These secondary filters are all inherited from the primary filter without occupying more storage, but once unfolded in computation they could significantly enhance the capability of the filter by integrating information extracted from different receptive fields. Besides spatial versatile filters, we additionally investigate versatile filters from the channel perspective. Binary masks can be further customized for different primary filters under orthogonal constraints. We conduct theoretical analysis on network complexity and an efficient convolution scheme is introduced. Experimental results on benchmark datasets and neural networks demonstrate that our versatile filters are able to achieve accuracy comparable to that of original filters, but require less memory and computation cost.

Submitted 20 September, 2021; originally announced September 2021.

Comments: Accepted by TPAMI. Extended version of NeurIPS 2018 paper

arXiv:2108.11884 [pdf]

Enabling SQL-based Training Data Debugging for Federated Learning

Authors: Yejia Liu, Weiyuan Wu, Lampros Flokas, Jiannan Wang, Eugene Wu

Abstract: How can we debug a logistic regression model in a federated learning setting when seeing the model behave unexpectedly (e.g., the model rejects all high-income customers' loan applications)? The SQL-based training data debugging framework has proved effective to fix this kind of issue in a non-federated learning setting. Given an unexpected query result over model predictions, this framework automatically removes the label errors from training data such that the unexpected behavior disappears in the retrained model. In this paper, we enable this powerful framework for federated learning. The key challenge is how to develop a security protocol for federated debugging which is proved to be secure, efficient, and accurate. Achieving this goal requires us to investigate how to seamlessly integrate the techniques from multiple fields (Databases, Machine Learning, and Cybersecurity). We first propose FedRain, which extends Rain, the state-of-the-art SQL-based training data debugging framework, to our federated learning setting. We address several technical challenges to make FedRain work and analyze its security guarantee and time complexity. The analysis results show that FedRain falls short in terms of both efficiency and security. To overcome these limitations, we redesign our security protocol and propose Frog, a novel SQL-based training data debugging framework tailored for federated learning. Our theoretical analysis shows that Frog is more secure, more accurate, and more efficient than FedRain. We conduct extensive experiments using several real-world datasets and a case study. The experimental results are consistent with our theoretical analysis and validate the effectiveness of Frog in practice.

Submitted 26 August, 2021; originally announced August 2021.

arXiv:2108.08202 [pdf, other]

Overfitting the Data: Compact Neural Video Delivery via Content-aware Feature Modulation

Authors: Jiaming Liu, Ming Lu, Kaixin Chen, Xiaoqi Li, Shizun Wang, Zhaoqing Wang, Enhua Wu, Yurong Chen, Chuang Zhang, Ming Wu

Abstract: Internet video delivery has undergone a tremendous explosion of growth over the past few years. However, the quality of video delivery systems greatly depends on the Internet bandwidth. Deep Neural Networks (DNNs) have recently been utilized to improve the quality of video delivery. These methods divide a video into chunks, and stream LR video chunks and corresponding content-aware models to the client. The client runs the inference of models to super-resolve the LR chunks. Consequently, a large number of models are streamed in order to deliver a video. In this paper, we first carefully study the relation between models of different chunks, then we tactfully design a joint training framework along with the Content-aware Feature Modulation (CaFM) layer to compress these models for neural video delivery. With our method, each video chunk only requires less than 1% of original parameters to be streamed, achieving even better SR performance. We conduct extensive experiments across various SR backbones, video time lengths, and scaling factors to demonstrate the advantages of our method. Besides, our method can also be viewed as a new approach to video coding. Our primary experiments achieve better video quality compared with the commercial H.264 and H.265 standards under the same storage cost, showing the great potential of the proposed method. Code is available at: https://github.com/Neural-video-delivery/CaFM-Pytorch-ICCV2021

Submitted 17 September, 2021; v1 submitted 18 August, 2021; originally announced August 2021.

Comments: Accepted by ICCV 2021

arXiv:2107.08203 [pdf, other]

doi 10.1145/3514221.3526166

PI2: End-to-end Interactive Visualization Interface Generation from Queries

Authors: Yiru Chen, Eugene Wu

Abstract: Interactive visual analysis interfaces are critical in nearly every data task. However, creating new interfaces is deeply challenging, as it requires the developer to understand the queries needed to express the desired analysis task, design the appropriate interface to express those queries for the task, and implement the interface using a combination of visualization, browser, server, and database technologies. Although prior work generates a set of interactive widgets that can express an input query log, this paper presents PI2, the first system to generate fully functional visual analysis interfaces from an example sequence of analysis queries. PI2 analyzes queries syntactically and represents a set of queries using a novel Difftree structure that encodes systematic variations between query abstract syntax trees. PI2 then maps each Difftree to a visualization that renders its results, the variations in each Difftree to interactions, and generates a good layout for the interface. We show that PI2 can express data-oriented interactions in existing visualization interaction taxonomies, reproduce or improve several real-world visual analysis interfaces, generate interfaces in 2-19s (median 6s), and scale linearly with the number of queries.

Submitted 19 September, 2022; v1 submitted 17 July, 2021; originally announced July 2021.

Comments: 16 pages

ACM Class: H.2; H.5.2

arXiv:2106.02898 [pdf, other]

Dynamic Resolution Network

Authors: Mingjian Zhu, Kai Han, Enhua Wu, Qiulin Zhang, Ying Nie, Zhenzhong Lan, Yunhe Wang

Abstract: Deep convolutional neural networks (CNNs) are often of sophisticated design, with numerous learnable parameters for the sake of accuracy. To alleviate the expensive costs of deploying them on mobile devices, recent works have made huge efforts to excavate redundancy in pre-defined architectures. Nevertheless, the redundancy in the input resolution of modern CNNs has not been fully investigated, i.e., the resolution of the input image is fixed. In this paper, we observe that the smallest resolution for accurately predicting a given image differs from image to image, even with the same neural network. To this end, we propose a novel dynamic-resolution network (DRNet) in which the input resolution is determined dynamically based on each input sample. A resolution predictor with negligible computational cost is explored and optimized jointly with the desired network. Specifically, the predictor learns the smallest resolution that can retain and even exceed the original recognition accuracy for each image. During inference, each input image is resized to its predicted resolution to minimize the overall computation burden. We then conduct extensive experiments on several benchmark networks and datasets. The results show that our DRNet can be embedded in any off-the-shelf network architecture to obtain a considerable reduction in computational complexity. For instance, DR-ResNet-50 achieves similar performance with about a 34% computation reduction, while gaining a 1.4% accuracy increase with a 10% computation reduction, compared to the original ResNet-50 on ImageNet.

Submitted 6 November, 2021; v1 submitted 5 June, 2021; originally announced June 2021.

Comments: Accepted by NeurIPS 2021

arXiv:2103.07037 [pdf, other]

Reptile: Aggregation-level Explanations for Hierarchical Data

Authors: Zezhou Huang, Eugene Wu

Abstract: Recent query explanation systems help users understand anomalies in aggregation results by proposing predicates that describe input records that, if deleted, would resolve the anomalies. However, it can be difficult for users to understand how a predicate was chosen, and these approaches are limited to errors that can be resolved through deletion. In contrast, data errors may be due to group-wise errors, such as missing records or systematic value errors. This paper presents Reptile, an explanation system for hierarchical data. Given an anomalous aggregate query result, Reptile recommends the next drill-down attribute, and ranks the drill-down groups based on the extent to which repairing the group's statistics to its expected values resolves the anomaly. Reptile efficiently trains a multi-level model that leverages the data's hierarchy to estimate the expected values, and uses a factorised representation of the feature matrix to remove redundancies due to the data's hierarchical structure. We further extend model training to support factorised data, and develop a suite of optimizations that leverage the data's hierarchical structure. Reptile reduces end-to-end runtimes by more than 6 times compared to a Matlab-based implementation, correctly identifies 21/30 data errors in Johns Hopkins' COVID-19 data, and correctly resolves 20/22 complaints in a user study using data and researchers from Columbia University's Financial Instruments Sector Team.

Submitted 11 March, 2021; originally announced March 2021.

arXiv:2103.00112 [pdf, other]

Transformer in Transformer

Authors: Kai Han, An Xiao, Enhua Wu, Jianyuan Guo, Chunjing Xu, Yunhe Wang

Abstract: Transformer is a new kind of neural architecture which encodes the input data as powerful features via the attention mechanism. Basically, visual transformers first divide the input images into several local patches and then calculate both representations and their relationship. Since natural images are of high complexity with abundant detail and color information, the granularity of the patch dividing is not fine enough for excavating features of objects at different scales and locations. In this paper, we point out that the attention inside these local patches is also essential for building visual transformers with high performance, and we explore a new architecture, namely, Transformer iN Transformer (TNT). Specifically, we regard the local patches (e.g., 16×16) as "visual sentences" and propose to further divide them into smaller patches (e.g., 4×4) as "visual words". The attention of each word is calculated with other words in the given visual sentence with negligible computational cost. Features of both words and sentences are aggregated to enhance the representation ability. Experiments on several benchmarks demonstrate the effectiveness of the proposed TNT architecture, e.g., we achieve an 81.5% top-1 accuracy on ImageNet, which is about 1.7% higher than that of the state-of-the-art visual transformer with similar computational cost. The PyTorch code is available at https://github.com/huawei-noah/CV-Backbones, and the MindSpore code is available at https://gitee.com/mindspore/models/tree/master/research/cv/TNT.

Submitted 25 October, 2021; v1 submitted 26 February, 2021; originally announced March 2021.

Comments: Accepted by NeurIPS 2021
