Search | arXiv e-print repository

How We Built Cedar: A Verification-Guided Approach

Authors: Craig Disselkoen, Aaron Eline, Shaobo He, Kyle Headley, Michael Hicks, Kesha Hietala, John Kastner, Anwar Mamat, Matt McCutchen, Neha Rungta, Bhakti Shah, Emina Torlak, Andrew Wells

Abstract: This paper presents verification-guided development (VGD), a software engineering process we used to build Cedar, a new policy language for expressive, fast, safe, and analyzable authorization. Develo** a system with VGD involves writing an executable model of the system and mechanically proving properties about the model; writing production code for the system and using differential random test… ▽ More This paper presents verification-guided development (VGD), a software engineering process we used to build Cedar, a new policy language for expressive, fast, safe, and analyzable authorization. Develo** a system with VGD involves writing an executable model of the system and mechanically proving properties about the model; writing production code for the system and using differential random testing (DRT) to check that the production code matches the model; and using property-based testing (PBT) to check properties of unmodeled parts of the production code. Using VGD for Cedar, we can build fast, idiomatic production code, prove our model correct, and find and fix subtle implementation bugs that evade code reviews and unit testing. While carrying out proofs, we found and fixed 4 bugs in Cedar's policy validator, and DRT and PBT helped us find and fix 21 additional bugs in various parts of Cedar. △ Less

Submitted 1 July, 2024; originally announced July 2024.

arXiv:2405.06563 [pdf, other]

What Can Natural Language Processing Do for Peer Review?

Authors: Ilia Kuznetsov, Osama Mohammed Afzal, Koen Dercksen, Nils Dycke, Alexander Goldberg, Tom Hope, Dirk Hovy, Jonathan K. Kummerfeld, Anne Lauscher, Kevin Leyton-Brown, Sheng Lu, Mausam, Margot Mieskes, Aurélie Névéol, Danish Pruthi, Lizhen Qu, Roy Schwartz, Noah A. Smith, Thamar Solorio, **gyan Wang, Xiaodan Zhu, Anna Rogers, Nihar B. Shah, Iryna Gurevych

Abstract: The number of scientific articles produced every year is growing rapidly. Providing quality control over them is crucial for scientists and, ultimately, for the public good. In modern science, this process is largely delegated to peer review -- a distributed procedure in which each submission is evaluated by several independent experts in the field. Peer review is widely used, yet it is hard, time… ▽ More The number of scientific articles produced every year is growing rapidly. Providing quality control over them is crucial for scientists and, ultimately, for the public good. In modern science, this process is largely delegated to peer review -- a distributed procedure in which each submission is evaluated by several independent experts in the field. Peer review is widely used, yet it is hard, time-consuming, and prone to error. Since the artifacts involved in peer review -- manuscripts, reviews, discussions -- are largely text-based, Natural Language Processing has great potential to improve reviewing. As the emergence of large language models (LLMs) has enabled NLP assistance for many new tasks, the discussion on machine-assisted peer review is picking up the pace. Yet, where exactly is help needed, where can NLP help, and where should it stand aside? The goal of our paper is to provide a foundation for the future efforts in NLP for peer-reviewing assistance. We discuss peer review as a general process, exemplified by reviewing at AI conferences. We detail each step of the process from manuscript submission to camera-ready revision, and discuss the associated challenges and opportunities for NLP assistance, illustrated by existing work. We then turn to the big challenges in NLP for peer review as a whole, including data acquisition and licensing, operationalization and experimentation, and ethical issues. To help consolidate community efforts, we create a companion repository that aggregates key datasets pertaining to peer review. Finally, we issue a detailed call for action for the scientific community, NLP and AI researchers, policymakers, and funding bodies to help bring the research in NLP for peer review forward. We hope that our work will help set the agenda for research in machine-assisted scientific quality control in the age of AI, within the NLP community and beyond. △ Less

Submitted 10 May, 2024; originally announced May 2024.

arXiv:2404.08163 [pdf, other]

ViCAR: Visualizing Categories with Automated Rewriting in Coq

Authors: Bhakti Shah, William Spencer, Laura Zielinski, Ben Caldwell, Adrian Lehmann, Robert Rand

Abstract: We present ViCAR, a library for working with monoidal categories in the Coq proof assistant. ViCAR provides definitions for categorical structures that users can instantiate with their own verification projects. Upon verifying relevant coherence conditions, ViCAR gives a set of lemmas and tactics for manipulating categorical structures. We also provide a visualizer that can display any composition… ▽ More We present ViCAR, a library for working with monoidal categories in the Coq proof assistant. ViCAR provides definitions for categorical structures that users can instantiate with their own verification projects. Upon verifying relevant coherence conditions, ViCAR gives a set of lemmas and tactics for manipulating categorical structures. We also provide a visualizer that can display any composition and tensor product of morphisms as a string diagram, showing its categorical structure. This enables graphical reasoning and automated rewriting for Coq projects with monoidal structures. △ Less

Submitted 11 April, 2024; originally announced April 2024.

Comments: 13 pages, 10 figures

arXiv:2403.01015 [pdf, other]

A Randomized Controlled Trial on Anonymizing Reviewers to Each Other in Peer Review Discussions

Authors: Charvi Rastogi, Xiangchen Song, Zhi**g **, Ivan Stelmakh, Hal Daumé III, Kun Zhang, Nihar B. Shah

Abstract: Peer review often involves reviewers submitting their independent reviews, followed by a discussion among reviewers of each paper. A question among policymakers is whether the reviewers of a paper should be anonymous to each other during the discussion. We shed light on this by conducting a randomized controlled trial at the UAI 2022 conference. We randomly split the reviewers and papers into two… ▽ More Peer review often involves reviewers submitting their independent reviews, followed by a discussion among reviewers of each paper. A question among policymakers is whether the reviewers of a paper should be anonymous to each other during the discussion. We shed light on this by conducting a randomized controlled trial at the UAI 2022 conference. We randomly split the reviewers and papers into two conditions--one with anonymous discussions and the other with non-anonymous discussions, and conduct an anonymous survey of all reviewers, to address the following questions: 1. Do reviewers discuss more in one of the conditions? Marginally more in anonymous (n = 2281, p = 0.051). 2. Does seniority have more influence on final decisions when non-anonymous? Yes, the decisions are closer to senior reviewers' scores in the non-anonymous condition than in anonymous (n = 484, p = 0.04). 3. Are reviewers more polite in one of the conditions? No significant difference in politeness of reviewers' text-based responses (n = 1125, p = 0.72). 4. Do reviewers' self-reported experiences differ across the two conditions? No significant difference for each of the five questions asked (n = 132 and p > 0.3). 5. Do reviewers prefer one condition over the other? Yes, there is a weak preference for anonymous discussions (n = 159 and Cohen's d= 0.25). 6. What do reviewers consider important to make policy on anonymity among reviewers? Reviewers' feeling of safety in expressing their opinions was rated most important, while polite communication among reviewers was rated least important (n = 159). 7. Have reviewers experienced dishonest behavior due to non-anonymity in discussions? Yes, roughly 7% of respondents answered affirmatively (n = 167). Overall, this experiment reveals evidence supporting an anonymous discussion setup in the peer-review process, in terms of the evaluation criteria considered. △ Less

Submitted 1 March, 2024; originally announced March 2024.

Comments: 18 pages, 4 figures, 3 tables

arXiv:2402.07860 [pdf, other]

On the Detection of Reviewer-Author Collusion Rings From Paper Bidding

Authors: Steven Jecmen, Nihar B. Shah, Fei Fang, Leman Akoglu

Abstract: A major threat to the peer-review systems of computer science conferences is the existence of "collusion rings" between reviewers. In such collusion rings, reviewers who have also submitted their own papers to the conference work together to manipulate the conference's paper assignment, with the aim of being assigned to review each other's papers. The most straightforward way that colluding review… ▽ More A major threat to the peer-review systems of computer science conferences is the existence of "collusion rings" between reviewers. In such collusion rings, reviewers who have also submitted their own papers to the conference work together to manipulate the conference's paper assignment, with the aim of being assigned to review each other's papers. The most straightforward way that colluding reviewers can manipulate the paper assignment is by indicating their interest in each other's papers through strategic paper bidding. One potential approach to solve this important problem would be to detect the colluding reviewers from their manipulated bids, after which the conference can take appropriate action. While prior work has developed effective techniques to detect other kinds of fraud, no research has yet established that detecting collusion rings is even possible. In this work, we tackle the question of whether it is feasible to detect collusion rings from the paper bidding. To answer this question, we conduct empirical analysis of two realistic conference bidding datasets, including evaluations of existing algorithms for fraud detection in other applications. We find that collusion rings can achieve considerable success at manipulating the paper assignment while remaining hidden from detection: for example, in one dataset, undetected colluders are able to achieve assignment to up to 30% of the papers authored by other colluders. In addition, when 10 colluders bid on all of each other's papers, no detection algorithm outputs a group of reviewers with more than 31% overlap with the true colluders. These results suggest that collusion cannot be effectively detected from the bidding using popular existing tools, demonstrating the need to develop more complex detection algorithms as well as those that leverage additional metadata (e.g., reviewer-paper text-similarity scores). △ Less

Submitted 10 March, 2024; v1 submitted 12 February, 2024; originally announced February 2024.

arXiv:2402.01021 [pdf, other]

Towards Understanding the Challenges of Bug Localization in Deep Learning Systems

Authors: Sigma Jahan, Mehil B. Shah, Mohammad Masudur Rahman

Abstract: Software bugs cost the global economy billions of dollars annually and claim ~50\% of the programming time from software developers. Locating these bugs is crucial for their resolution but challenging. It is even more challenging in deep-learning systems due to their black-box nature. Bugs in these systems are also hidden not only in the code but also in the models and training data, which might m… ▽ More Software bugs cost the global economy billions of dollars annually and claim ~50\% of the programming time from software developers. Locating these bugs is crucial for their resolution but challenging. It is even more challenging in deep-learning systems due to their black-box nature. Bugs in these systems are also hidden not only in the code but also in the models and training data, which might make traditional debugging methods less effective. In this article, we conduct a large-scale empirical study to better understand the challenges of localizing bugs in deep-learning systems. First, we determine the bug localization performance of four existing techniques using 2,365 bugs from deep-learning systems and 2,913 from traditional software. We found these techniques significantly underperform in localizing deep-learning system bugs. Second, we evaluate how different bug types in deep learning systems impact bug localization. We found that the effectiveness of localization techniques varies with bug type due to their unique challenges. For example, tensor bugs were more accessible to locate due to their structural nature, while all techniques struggled with GPU bugs due to their external dependencies. Third, we investigate the impact of bugs' extrinsic nature on localization in deep-learning systems. We found that deep learning bugs are often extrinsic and thus connected to artifacts other than source code (e.g., GPU, training data), contributing to the poor performance of existing localization methods. △ Less

Submitted 1 February, 2024; originally announced February 2024.

arXiv:2401.03069 [pdf, other]

Towards Enhancing the Reproducibility of Deep Learning Bugs: An Empirical Study

Authors: Mehil B. Shah, Mohammad Masudur Rahman, Foutse Khomh

Abstract: Context: Deep learning has achieved remarkable progress in various domains. However, like any software system, deep learning systems contain bugs, some of which can have severe impacts, as evidenced by crashes involving autonomous vehicles. Despite substantial advancements in deep learning techniques, little research has focused on reproducing deep learning bugs, which is an essential step for the… ▽ More Context: Deep learning has achieved remarkable progress in various domains. However, like any software system, deep learning systems contain bugs, some of which can have severe impacts, as evidenced by crashes involving autonomous vehicles. Despite substantial advancements in deep learning techniques, little research has focused on reproducing deep learning bugs, which is an essential step for their resolution. Existing literature suggests that only 3% of deep learning bugs are reproducible, underscoring the need for further research. Objective: This paper examines the reproducibility of deep learning bugs. We identify edit actions and useful information that could improve the reproducibility of deep learning bugs. Method: First, we construct a dataset of 668 deep-learning bugs from Stack Overflow and GitHub across three frameworks and 22 architectures. Second, out of the 668 bugs, we select 165 bugs using stratified sampling and attempt to determine their reproducibility. While reproducing these bugs, we identify edit actions and useful information for their reproduction. Third, we used the Apriori algorithm to identify useful information and edit actions required to reproduce specific types of bugs. Finally, we conducted a user study involving 22 developers to assess the effectiveness of our findings in real-life settings. Results: We successfully reproduced 148 out of 165 bugs attempted. We identified ten edit actions and five useful types of component information that can help us reproduce the deep learning bugs. With the help of our findings, the developers were able to reproduce 22.92% more bugs and reduce their reproduction time by 24.35%. Conclusions: Our research addresses the critical issue of deep learning bug reproducibility. Practitioners and researchers can leverage our findings to improve deep learning bug reproducibility. △ Less

Submitted 18 June, 2024; v1 submitted 5 January, 2024; originally announced January 2024.

Comments: Under Major Revision at the EMSE (Empirical Software Engineering) Journal

arXiv:2311.11571 [pdf, other]

VyZX: Formal Verification of a Graphical Quantum Language

Authors: Adrian Lehmann, Ben Caldwell, Bhakti Shah, Robert Rand

Abstract: Mathematical representations of graphs often resemble adjacency matrices or lists, representations that facilitate whiteboard reasoning and algorithm design. In the realm of proof assistants, inductive representations effectively define semantics for formal reasoning. This highlights a gap where algorithm design and proof assistants require a fundamentally different structure of graphs, particular… ▽ More Mathematical representations of graphs often resemble adjacency matrices or lists, representations that facilitate whiteboard reasoning and algorithm design. In the realm of proof assistants, inductive representations effectively define semantics for formal reasoning. This highlights a gap where algorithm design and proof assistants require a fundamentally different structure of graphs, particularly for process theories which represent programs using graphs. To address this gap, we present VyZX, a verified library for reasoning about inductively defined graphical languages. These inductive constructs arise naturally from category theory definitions. A key goal for VyZX is to Verify the ZX-calculus, a graphical language for reasoning about quantum computation. The ZX-calculus comes with a collection of diagrammatic rewrite rules that preserve the graph's semantic interpretation. We show how inductive graphs in VyZX are used to prove the correctness of the ZX-calculus rewrite rules and apply them in practice using standard proof assistant techniques. VyZX integrates easily with the proof engineer's workflow through visualization and automation. △ Less

Submitted 20 November, 2023; originally announced November 2023.

Comments: 28 pages, 26 figures

arXiv:2311.09497 [pdf, other]

Peer Reviews of Peer Reviews: A Randomized Controlled Trial and Other Experiments

Authors: Alexander Goldberg, Ivan Stelmakh, Kyunghyun Cho, Alice Oh, Alekh Agarwal, Danielle Belgrave, Nihar B. Shah

Abstract: Is it possible to reliably evaluate the quality of peer reviews? We study this question driven by two primary motivations -- incentivizing high-quality reviewing using assessed quality of reviews and measuring changes to review quality in experiments. We conduct a large scale study at the NeurIPS 2022 conference, a top-tier conference in machine learning, in which we invited (meta)-reviewers and a… ▽ More Is it possible to reliably evaluate the quality of peer reviews? We study this question driven by two primary motivations -- incentivizing high-quality reviewing using assessed quality of reviews and measuring changes to review quality in experiments. We conduct a large scale study at the NeurIPS 2022 conference, a top-tier conference in machine learning, in which we invited (meta)-reviewers and authors to evaluate reviews given to submitted papers. First, we conduct a RCT to examine bias due to the length of reviews. We generate elongated versions of reviews by adding substantial amounts of non-informative content. Participants in the control group evaluate the original reviews, whereas participants in the experimental group evaluate the artificially lengthened versions. We find that lengthened reviews are scored (statistically significantly) higher quality than the original reviews. Additionally, in analysis of observational data we find that authors are positively biased towards reviews recommending acceptance of their own papers, even after controlling for confounders of review length, quality, and different numbers of papers per author. We also measure disagreement rates between multiple evaluations of the same review of 28%-32%, which is comparable to that of paper reviewers at NeurIPS. Further, we assess the amount of miscalibration of evaluators of reviews using a linear model of quality scores and find that it is similar to estimates of miscalibration of paper reviewers at NeurIPS. Finally, we estimate the amount of variability in subjective opinions around how to map individual criteria to overall scores of review quality and find that it is roughly the same as that in the review of papers. Our results suggest that the various problems that exist in reviews of papers -- inconsistency, bias towards irrelevant factors, miscalibration, subjectivity -- also arise in reviewing of reviews. △ Less

Submitted 15 November, 2023; originally announced November 2023.

arXiv:2309.06195 [pdf, other]

Optimization Guarantees of Unfolded ISTA and ADMM Networks With Smooth Soft-Thresholding

Authors: Shaik Basheeruddin Shah, Pradyumna Pradhan, Wei Pu, Ramunaidu Randhi, Miguel R. D. Rodrigues, Yonina C. Eldar

Abstract: Solving linear inverse problems plays a crucial role in numerous applications. Algorithm unfolding based, model-aware data-driven approaches have gained significant attention for effectively addressing these problems. Learned iterative soft-thresholding algorithm (LISTA) and alternating direction method of multipliers compressive sensing network (ADMM-CSNet) are two widely used such approaches, ba… ▽ More Solving linear inverse problems plays a crucial role in numerous applications. Algorithm unfolding based, model-aware data-driven approaches have gained significant attention for effectively addressing these problems. Learned iterative soft-thresholding algorithm (LISTA) and alternating direction method of multipliers compressive sensing network (ADMM-CSNet) are two widely used such approaches, based on ISTA and ADMM algorithms, respectively. In this work, we study optimization guarantees, i.e., achieving near-zero training loss with the increase in the number of learning epochs, for finite-layer unfolded networks such as LISTA and ADMM-CSNet with smooth soft-thresholding in an over-parameterized (OP) regime. We achieve this by leveraging a modified version of the Polyak-Lojasiewicz, denoted PL$^*$, condition. Satisfying the PL$^*$ condition within a specific region of the loss landscape ensures the existence of a global minimum and exponential convergence from initialization using gradient descent based methods. Hence, we provide conditions, in terms of the network width and the number of training samples, on these unfolded networks for the PL$^*$ condition to hold. We achieve this by deriving the Hessian spectral norm of these networks. Additionally, we show that the threshold on the number of training samples increases with the increase in the network width. Furthermore, we compare the threshold on training samples of unfolded networks with that of a standard fully-connected feed-forward network (FFNN) with smooth soft-thresholding non-linearity. We prove that unfolded networks have a higher threshold value than FFNN. Consequently, one can expect a better expected error for unfolded networks than FFNN. △ Less

Submitted 12 September, 2023; originally announced September 2023.

arXiv:2309.04635 [pdf, other]

Can NLP Models 'Identify', 'Distinguish', and 'Justify' Questions that Don't have a Definitive Answer?

Authors: Ayushi Agarwal, Nisarg Patel, Neeraj Varshney, Mihir Parmar, Pavan Mallina, Aryan Bhavin Shah, Srihari Raju Sangaraju, Tirth Patel, Nihar Thakkar, Chitta Baral

Abstract: Though state-of-the-art (SOTA) NLP systems have achieved remarkable performance on a variety of language understanding tasks, they primarily focus on questions that have a correct and a definitive answer. However, in real-world applications, users often ask questions that don't have a definitive answer. Incorrectly answering such questions certainly hampers a system's reliability and trustworthine… ▽ More Though state-of-the-art (SOTA) NLP systems have achieved remarkable performance on a variety of language understanding tasks, they primarily focus on questions that have a correct and a definitive answer. However, in real-world applications, users often ask questions that don't have a definitive answer. Incorrectly answering such questions certainly hampers a system's reliability and trustworthiness. Can SOTA models accurately identify such questions and provide a reasonable response? To investigate the above question, we introduce QnotA, a dataset consisting of five different categories of questions that don't have definitive answers. Furthermore, for each QnotA instance, we also provide a corresponding QA instance i.e. an alternate question that ''can be'' answered. With this data, we formulate three evaluation tasks that test a system's ability to 'identify', 'distinguish', and 'justify' QnotA questions. Through comprehensive experiments, we show that even SOTA models including GPT-3 and Flan T5 do not fare well on these tasks and lack considerably behind the human performance baseline. We conduct a thorough analysis which further leads to several interesting findings. Overall, we believe our work and findings will encourage and facilitate further research in this important area and help develop more robust models. △ Less

Submitted 8 September, 2023; originally announced September 2023.

Comments: TrustNLP Workshop at ACL 2023

arXiv:2307.05443 [pdf, other]

Testing for Reviewer Anchoring in Peer Review: A Randomized Controlled Trial

Authors: Ryan Liu, Steven Jecmen, Vincent Conitzer, Fei Fang, Nihar B. Shah

Abstract: Peer review frequently follows a process where reviewers first provide initial reviews, authors respond to these reviews, then reviewers update their reviews based on the authors' response. There is mixed evidence regarding whether this process is useful, including frequent anecdotal complaints that reviewers insufficiently update their scores. In this study, we aim to investigate whether reviewer… ▽ More Peer review frequently follows a process where reviewers first provide initial reviews, authors respond to these reviews, then reviewers update their reviews based on the authors' response. There is mixed evidence regarding whether this process is useful, including frequent anecdotal complaints that reviewers insufficiently update their scores. In this study, we aim to investigate whether reviewers anchor to their original scores when updating their reviews, which serves as a potential explanation for the lack of updates in reviewer scores. We design a novel randomized controlled trial to test if reviewers exhibit anchoring. In the experimental condition, participants initially see a flawed version of a paper that is later corrected, while in the control condition, participants only see the correct version. We take various measures to ensure that in the absence of anchoring, reviewers in the experimental group should revise their scores to be identically distributed to the scores from the control group. Furthermore, we construct the reviewed paper to maximize the difference between the flawed and corrected versions, and employ deception to hide the true experiment purpose. Our randomized controlled trial consists of 108 researchers as participants. First, we find that our intervention was successful at creating a difference in perceived paper quality between the flawed and corrected versions: Using a permutation test with the Mann-Whitney U statistic, we find that the experimental group's initial scores are lower than the control group's scores in both the Evaluation category (Vargha-Delaney A=0.64, p=0.0096) and Overall score (A=0.59, p=0.058). Next, we test for anchoring by comparing the experimental group's revised scores with the control group's scores. We find no significant evidence of anchoring in either the Overall (A=0.50, p=0.61) or Evaluation category (A=0.49, p=0.61). △ Less

Submitted 11 July, 2023; originally announced July 2023.

Comments: 14 pages (19 including references and appendix), 2 figures

arXiv:2306.13675 [pdf, other]

doi 10.18653/v1/2023.clinicalnlp-1.39

Intersectionality and Testimonial Injustice in Medical Records

Authors: Kenya S. Andrews, Bhuvani Shah, Lu Cheng

Abstract: Detecting testimonial injustice is an essential element of addressing inequities and promoting inclusive healthcare practices, many of which are life-critical. However, using a single demographic factor to detect testimonial injustice does not fully encompass the nuanced identities that contribute to a patient's experience. Further, some injustices may only be evident when examining the nuances th… ▽ More Detecting testimonial injustice is an essential element of addressing inequities and promoting inclusive healthcare practices, many of which are life-critical. However, using a single demographic factor to detect testimonial injustice does not fully encompass the nuanced identities that contribute to a patient's experience. Further, some injustices may only be evident when examining the nuances that arise through the lens of intersectionality. Ignoring such injustices can result in poor quality of care or life-endangering events. Thus, considering intersectionality could result in more accurate classifications and just decisions. To illustrate this, we use real-world medical data to determine whether medical records exhibit words that could lead to testimonial injustice, employ fairness metrics (e.g. demographic parity, differential intersectional fairness, and subgroup fairness) to assess the severity to which subgroups are experiencing testimonial injustice, and analyze how the intersectionality of demographic features (e.g. gender and race) make a difference in uncovering testimonial injustice. From our analysis, we found that with intersectionality we can better see disparities in how subgroups are treated and there are differences in how someone is treated based on the intersection of their demographic attributes. This has not been previously studied in clinical records, nor has it been proven through empirical study. △ Less

Submitted 20 June, 2023; originally announced June 2023.

Journal ref: In Proceedings of the 5th ClinicalNLP Workshop, pages 358-372, Toronto, Canada. Association for Computational Linguistics (2023)

arXiv:2306.06084 [pdf]

doi 10.1109/M2VIP55626.2022.10041089

Machine Vision Using Cellphone Camera: A Comparison of deep networks for classifying three challenging denominations of Indian Coins

Authors: Keyur D. Joshi, Dhruv Shah, Varshil Shah, Nilay Gandhi, Sanket J. Shah, Sanket B. Shah

Abstract: Indian currency coins come in a variety of denominations. Off all the varieties Rs.1, RS.2, and Rs.5 have similar diameters. Majority of the coin styles in market circulation for denominations of Rs.1 and Rs.2 coins are nearly the same except for numerals on its reverse side. If a coin is resting on its obverse side, the correct denomination is not distinguishable by humans. Therefore, it was hypo… ▽ More Indian currency coins come in a variety of denominations. Off all the varieties Rs.1, RS.2, and Rs.5 have similar diameters. Majority of the coin styles in market circulation for denominations of Rs.1 and Rs.2 coins are nearly the same except for numerals on its reverse side. If a coin is resting on its obverse side, the correct denomination is not distinguishable by humans. Therefore, it was hypothesized that a digital image of a coin resting on its either size could be classified into its correct denomination by training a deep neural network model. The digital images were generated by using cheap cell phone cameras. To find the most suitable deep neural network architecture, four were selected based on the preliminary analysis carried out for comparison. The results confirm that two of the four deep neural network models can classify the correct denomination from either side of a coin with an accuracy of 97%. △ Less

Submitted 12 May, 2023; originally announced June 2023.

Comments: 6 Pages, 4 Figures, 6 Tables, Conference paper

arXiv:2306.00622 [pdf, other]

ReviewerGPT? An Exploratory Study on Using Large Language Models for Paper Reviewing

Authors: Ryan Liu, Nihar B. Shah

Abstract: Given the rapid ascent of large language models (LLMs), we study the question: (How) can large language models help in reviewing of scientific papers or proposals? We first conduct some pilot studies where we find that (i) GPT-4 outperforms other LLMs (Bard, Vicuna, Koala, Alpaca, LLaMa, Dolly, OpenAssistant, StableLM), and (ii) prompting with a specific question (e.g., to identify errors) outperf… ▽ More Given the rapid ascent of large language models (LLMs), we study the question: (How) can large language models help in reviewing of scientific papers or proposals? We first conduct some pilot studies where we find that (i) GPT-4 outperforms other LLMs (Bard, Vicuna, Koala, Alpaca, LLaMa, Dolly, OpenAssistant, StableLM), and (ii) prompting with a specific question (e.g., to identify errors) outperforms prompting to simply write a review. With these insights, we study the use of LLMs (specifically, GPT-4) for three tasks: 1. Identifying errors: We construct 13 short computer science papers each with a deliberately inserted error, and ask the LLM to check for the correctness of these papers. We observe that the LLM finds errors in 7 of them, spanning both mathematical and conceptual errors. 2. Verifying checklists: We task the LLM to verify 16 closed-ended checklist questions in the respective sections of 15 NeurIPS 2022 papers. We find that across 119 {checklist question, paper} pairs, the LLM had an 86.6% accuracy. 3. Choosing the "better" paper: We generate 10 pairs of abstracts, deliberately designing each pair in such a way that one abstract was clearly superior than the other. The LLM, however, struggled to discern these relatively straightforward distinctions accurately, committing errors in its evaluations for 6 out of the 10 pairs. Based on these experiments, we think that LLMs have a promising use as reviewing assistants for specific reviewing tasks, but not (yet) for complete evaluations of papers or proposals. △ Less

Submitted 1 June, 2023; originally announced June 2023.

arXiv:2305.17339 [pdf, other]

Counterfactual Evaluation of Peer-Review Assignment Policies

Authors: Martin Saveski, Steven Jecmen, Nihar B. Shah, Johan Ugander

Abstract: Peer review assignment algorithms aim to match research papers to suitable expert reviewers, working to maximize the quality of the resulting reviews. A key challenge in designing effective assignment policies is evaluating how changes to the assignment algorithm map to changes in review quality. In this work, we leverage recently proposed policies that introduce randomness in peer-review assignme… ▽ More Peer review assignment algorithms aim to match research papers to suitable expert reviewers, working to maximize the quality of the resulting reviews. A key challenge in designing effective assignment policies is evaluating how changes to the assignment algorithm map to changes in review quality. In this work, we leverage recently proposed policies that introduce randomness in peer-review assignment--in order to mitigate fraud--as a valuable opportunity to evaluate counterfactual assignment policies. Specifically, we exploit how such randomized assignments provide a positive probability of observing the reviews of many assignment policies of interest. To address challenges in applying standard off-policy evaluation methods, such as violations of positivity, we introduce novel methods for partial identification based on monotonicity and Lipschitz smoothness assumptions for the map** between reviewer-paper covariates and outcomes. We apply our methods to peer-review data from two computer science venues: the TPDP'21 workshop (95 papers and 35 reviewers) and the AAAI'22 conference (8,450 papers and 3,145 reviewers). We consider estimates of (i) the effect on review quality when changing weights in the assignment algorithm, e.g., weighting reviewers' bids vs. textual similarity (between the review's past papers and the submission), and (ii) the "cost of randomization", capturing the difference in expected quality between the perturbed and unperturbed optimal match. We find that placing higher weight on text similarity results in higher review quality and that introducing randomization in the reviewer-paper assignment only marginally reduces the review quality. Our methods for partial identification may be of independent interest, while our off-policy approach can likely find use evaluating a broad class of algorithmic matching systems. △ Less

Submitted 26 May, 2023; originally announced May 2023.

arXiv:2303.16750 [pdf, other]

A Gold Standard Dataset for the Reviewer Assignment Problem

Authors: Ivan Stelmakh, John Wieting, Graham Neubig, Nihar B. Shah

Abstract: Many peer-review venues are either using or looking to use algorithms to assign submissions to reviewers. The crux of such automated approaches is the notion of the "similarity score"--a numerical estimate of the expertise of a reviewer in reviewing a paper--and many algorithms have been proposed to compute these scores. However, these algorithms have not been subjected to a principled comparison,… ▽ More Many peer-review venues are either using or looking to use algorithms to assign submissions to reviewers. The crux of such automated approaches is the notion of the "similarity score"--a numerical estimate of the expertise of a reviewer in reviewing a paper--and many algorithms have been proposed to compute these scores. However, these algorithms have not been subjected to a principled comparison, making it difficult for stakeholders to choose the algorithm in an evidence-based manner. The key challenge in comparing existing algorithms and develo** better algorithms is the lack of the publicly available gold-standard data that would be needed to perform reproducible research. We address this challenge by collecting a novel dataset of similarity scores that we release to the research community. Our dataset consists of 477 self-reported expertise scores provided by 58 researchers who evaluated their expertise in reviewing papers they have read previously. We use this data to compare several popular algorithms employed in computer science conferences and come up with recommendations for stakeholders. Our main findings are as follows. First, all algorithms make a non-trivial amount of error. For the task of ordering two papers in terms of their relevance for a reviewer, the error rates range from 12%-30% in easy cases to 36%-43% in hard cases, highlighting the vital need for more research on the similarity-computation problem. Second, most existing algorithms are designed to work with titles and abstracts of papers, and in this regime the Specter+MFR algorithm performs best. Third, to improve performance, it may be important to develop modern deep-learning based algorithms that can make use of the full texts of papers: the classical TD-IDF algorithm enhanced with full texts of papers is on par with the deep-learning based Specter+MFR that cannot make use of this information. △ Less

Submitted 23 March, 2023; originally announced March 2023.

arXiv:2303.09192 [pdf, other]

Metric-Free Exploration for Topological Map** by Task and Motion Imitation in Feature Space

Authors: Yuhang He, Irving Fang, Yiming Li, Rushi Bhavesh Shah, Chen Feng

Abstract: We propose DeepExplorer, a simple and lightweight metric-free exploration method for topological map** of unknown environments. It performs task and motion planning (TAMP) entirely in image feature space. The task planner is a recurrent network using the latest image observation sequence to hallucinate a feature as the next-best exploration goal. The motion planner then utilizes the current and… ▽ More We propose DeepExplorer, a simple and lightweight metric-free exploration method for topological map** of unknown environments. It performs task and motion planning (TAMP) entirely in image feature space. The task planner is a recurrent network using the latest image observation sequence to hallucinate a feature as the next-best exploration goal. The motion planner then utilizes the current and the hallucinated features to generate an action taking the agent towards that goal. The two planners are jointly trained via deeply-supervised imitation learning from expert demonstrations. During exploration, we iteratively call the two planners to predict the next action, and the topological map is built by constantly appending the latest image observation and action to the map and using visual place recognition (VPR) for loop closing. The resulting topological map efficiently represents an environment's connectivity and traversability, so it can be used for tasks such as visual navigation. We show DeepExplorer's exploration efficiency and strong sim2sim generalization capability on large-scale simulation datasets like Gibson and MP3D. Its effectiveness is further validated via the image-goal navigation performance on the resulting topological map. We further show its strong zero-shot sim2real generalization capability in real-world experiments. The source code is available at \url{https://ai4ce.github.io/DeepExplorer/}. △ Less

Submitted 16 March, 2023; originally announced March 2023.

arXiv:2302.08450 [pdf, other]

Assisting Human Decisions in Document Matching

Authors: Joon Sik Kim, Valerie Chen, Danish Pruthi, Nihar B. Shah, Ameet Talwalkar

Abstract: Many practical applications, ranging from paper-reviewer assignment in peer review to job-applicant matching for hiring, require human decision makers to identify relevant matches by combining their expertise with predictions from machine learning models. In many such model-assisted document matching tasks, the decision makers have stressed the need for assistive information about the model output… ▽ More Many practical applications, ranging from paper-reviewer assignment in peer review to job-applicant matching for hiring, require human decision makers to identify relevant matches by combining their expertise with predictions from machine learning models. In many such model-assisted document matching tasks, the decision makers have stressed the need for assistive information about the model outputs (or the data) to facilitate their decisions. In this paper, we devise a proxy matching task that allows us to evaluate which kinds of assistive information improve decision makers' performance (in terms of accuracy and time). Through a crowdsourced (N=271 participants) study, we find that providing black-box model explanations reduces users' accuracy on the matching task, contrary to the commonly-held belief that they can be helpful by allowing better understanding of the model. On the other hand, custom methods that are designed to closely attend to some task-specific desiderata are found to be effective in improving user performance. Surprisingly, we also find that the users' perceived utility of assistive information is misaligned with their objective utility (measured through their task performance). △ Less

Submitted 16 February, 2023; originally announced February 2023.

arXiv:2301.00221 [pdf, ps, other]

doi 10.1371/journal.pone.0286206

The Role of Author Identities in Peer Review

Authors: Nihar B. Shah

Abstract: There is widespread debate on whether to anonymize author identities in peer review. The key argument for anonymization is to mitigate bias, whereas arguments against anonymization posit various uses of author identities in the review process. The Innovations in Theoretical Computer Science (ITCS) 2023 conference adopted a middle ground by initially anonymizing the author identities from reviewers… ▽ More There is widespread debate on whether to anonymize author identities in peer review. The key argument for anonymization is to mitigate bias, whereas arguments against anonymization posit various uses of author identities in the review process. The Innovations in Theoretical Computer Science (ITCS) 2023 conference adopted a middle ground by initially anonymizing the author identities from reviewers, revealing them after the reviewer had submitted their initial reviews, and allowing the reviewer to change their review subsequently. We present an analysis of the reviews pertaining to the identification and use of author identities. Our key findings are: (I) A majority of reviewers self-report not knowing and being unable to guess the authors' identities for the papers they were reviewing. (II) After the initial submission of reviews, 7.1% of reviews changed their overall merit score and 3.8% changed their self-reported reviewer expertise. (III) There is a very weak and statistically insignificant correlation of the rank of authors' affiliations with the change in overall merit; there is a weak but statistically significant correlation with respect to change in reviewer expertise. We also conducted an anonymous survey to obtain opinions from reviewers and authors. The main findings from the 200 survey responses are: (i) A vast majority of participants favor anonymizing author identities in some form. (ii) The "middle-ground" initiative of ITCS 2023 was appreciated. (iii) Detecting conflicts of interest is a challenge that needs to be addressed if author identities are anonymized. Overall, these findings support anonymization of author identities in some form (e.g., as was done in ITCS 2023), as long as there is a robust and efficient way to check conflicts of interest. △ Less

Submitted 25 June, 2023; v1 submitted 31 December, 2022; originally announced January 2023.

arXiv:2211.14446 [pdf, other]

Sign Language to Text Conversion in Real Time using Transfer Learning

Authors: Shubham Thakar, Samveg Shah, Bhavya Shah, Anant V. Nimkar

Abstract: The people in the world who are hearing impaired face many obstacles in communication and require an interpreter to comprehend what a person is saying. There has been constant scientific research and the existing models lack the ability to make accurate predictions. So we propose a deep learning model trained on ASL i.e. American Sign Language which will take actions in the form of ASL as input an… ▽ More The people in the world who are hearing impaired face many obstacles in communication and require an interpreter to comprehend what a person is saying. There has been constant scientific research and the existing models lack the ability to make accurate predictions. So we propose a deep learning model trained on ASL i.e. American Sign Language which will take actions in the form of ASL as input and translate it into text. To achieve the translation a Convolution Neural Network model and a transfer learning model based on the VGG16 architecture are used. There has been an improvement in accuracy from 94% of CNN to 98.7% of Transfer Learning, an improvement of 5%. An application with the deep learning model integrated has also been built. △ Less

Submitted 7 December, 2022; v1 submitted 13 November, 2022; originally announced November 2022.

Comments: Will be published in IEEE Explore; Typos corrected

arXiv:2211.12966 [pdf, other]

How do Authors' Perceptions of their Papers Compare with Co-authors' Perceptions and Peer-review Decisions?

Authors: Charvi Rastogi, Ivan Stelmakh, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, Jennifer Wortman Vaughan, Zhenyu Xue, Hal Daumé III, Emma Pierson, Nihar B. Shah

Abstract: How do author perceptions match up to the outcomes of the peer-review process and perceptions of others? In a top-tier computer science conference (NeurIPS 2021) with more than 23,000 submitting authors and 9,000 submitted papers, we survey the authors on three questions: (i) their predicted probability of acceptance for each of their papers, (ii) their perceived ranking of their own papers based… ▽ More How do author perceptions match up to the outcomes of the peer-review process and perceptions of others? In a top-tier computer science conference (NeurIPS 2021) with more than 23,000 submitting authors and 9,000 submitted papers, we survey the authors on three questions: (i) their predicted probability of acceptance for each of their papers, (ii) their perceived ranking of their own papers based on scientific contribution, and (iii) the change in their perception about their own papers after seeing the reviews. The salient results are: (1) Authors have roughly a three-fold overestimate of the acceptance probability of their papers: The median prediction is 70% for an approximately 25% acceptance rate. (2) Female authors exhibit a marginally higher (statistically significant) miscalibration than male authors; predictions of authors invited to serve as meta-reviewers or reviewers are similarly calibrated, but better than authors who were not invited to review. (3) Authors' relative ranking of scientific contribution of two submissions they made generally agree (93%) with their predicted acceptance probabilities, but there is a notable 7% responses where authors think their better paper will face a worse outcome. (4) The author-provided rankings disagreed with the peer-review decisions about a third of the time; when co-authors ranked their jointly authored papers, co-authors disagreed at a similar rate -- about a third of the time. (5) At least 30% of respondents of both accepted and rejected papers said that their perception of their own paper improved after the review process. The stakeholders in peer review should take these findings into account in setting their expectations from peer review. △ Less

Submitted 22 November, 2022; originally announced November 2022.

arXiv:2211.12686 [pdf, other]

doi 10.1145/3579335

Batching of Tasks by Users of Pseudonymous Forums: Anonymity Compromise and Protection

Authors: Alexander Goldberg, Giulia Fanti, Nihar B. Shah

Abstract: There are a number of forums where people participate under pseudonyms. One example is peer review, where the identity of reviewers for any paper is confidential. When participating in these forums, people frequently engage in "batching": executing multiple related tasks (e.g., commenting on multiple papers) at nearly the same time. Our empirical analysis shows that batching is common in two appli… ▽ More There are a number of forums where people participate under pseudonyms. One example is peer review, where the identity of reviewers for any paper is confidential. When participating in these forums, people frequently engage in "batching": executing multiple related tasks (e.g., commenting on multiple papers) at nearly the same time. Our empirical analysis shows that batching is common in two applications we consider $\unicode{x2013}$ peer review and Wikipedia edits. In this paper, we identify and address the risk of deanonymization arising from linking batched tasks. To protect against linkage attacks, we take the approach of adding delay to the posting time of batched tasks. We first show that under some natural assumptions, no delay mechanism can provide a meaningful differential privacy guarantee. We therefore propose a "one-sided" formulation of differential privacy for protecting against linkage attacks. We design a mechanism that adds zero-inflated uniform delay to events and show it can preserve privacy. We prove that this noise distribution is in fact optimal in minimizing expected delay among mechanisms adding independent noise to each event, thereby establishing the Pareto frontier of the trade-off between the expected delay for batched and unbatched events. Finally, we conduct a series of experiments on Wikipedia and Bitcoin data that corroborate the practical utility of our algorithm in obfuscating batching without introducing onerous delay to a system. △ Less

Submitted 11 September, 2023; v1 submitted 22 November, 2022; originally announced November 2022.

arXiv:2211.08268 [pdf, other]

A Comparative Study of Machine Learning and Deep Learning Techniques for Prediction of Co2 Emission in Cars

Authors: Samveg Shah, Shubham Thakar, Kashish Jain, Bhavya Shah, Sudhir Dhage

Abstract: The most recent concern of all people on Earth is the increase in the concentration of greenhouse gas in the atmosphere. The concentration of these gases has risen rapidly over the last century and if the trend continues it can cause many adverse climatic changes. There have been ways implemented to curb this by the government by limiting processes that emit a higher amount of CO2, one such greenh… ▽ More The most recent concern of all people on Earth is the increase in the concentration of greenhouse gas in the atmosphere. The concentration of these gases has risen rapidly over the last century and if the trend continues it can cause many adverse climatic changes. There have been ways implemented to curb this by the government by limiting processes that emit a higher amount of CO2, one such greenhouse gas. However, there is mounting evidence that the CO2 numbers supplied by the government do not accurately reflect the performance of automobiles on the road. Our proposal of using artificial intelligence techniques to improve a previously rudimentary process takes a radical tack, but it fits the bill given the situation. To determine which algorithms and models produce the greatest outcomes, we compared them all and explored a novel method of ensembling them. Further, this can be used to foretell the rise in global temperature and to ground crucial policy decisions like the adoption of electric vehicles. To estimate emissions from vehicles, we used machine learning, deep learning, and ensemble learning on a massive dataset. △ Less

Submitted 15 November, 2022; originally announced November 2022.

Comments: Will be published in Springer L.N.N.S. Data set used https://www.eea.europa.eu/data-and-maps/data/co2-cars-emission-22

arXiv:2209.08665 [pdf, other]

Allocation Schemes in Analytic Evaluation: Applicant-Centric Holistic or Attribute-Centric Segmented?

Authors: **gyan Wang, Carmel Baharav, Nihar B. Shah, Anita Williams Woolley, R Ravi

Abstract: Many applications such as hiring and university admissions involve evaluation and selection of applicants. These tasks are fundamentally difficult, and require combining evidence from multiple different aspects (what we term "attributes"). In these applications, the number of applicants is often large, and a common practice is to assign the task to multiple evaluators in a distributed fashion. Spe… ▽ More Many applications such as hiring and university admissions involve evaluation and selection of applicants. These tasks are fundamentally difficult, and require combining evidence from multiple different aspects (what we term "attributes"). In these applications, the number of applicants is often large, and a common practice is to assign the task to multiple evaluators in a distributed fashion. Specifically, in the often-used holistic allocation, each evaluator is assigned a subset of the applicants, and is asked to assess all relevant information for their assigned applicants. However, such an evaluation process is subject to issues such as miscalibration (evaluators see only a small fraction of the applicants and may not get a good sense of relative quality), and discrimination (evaluators are influenced by irrelevant information about the applicants). We identify that such attribute-based evaluation allows alternative allocation schemes. Specifically, we consider assigning each evaluator more applicants but fewer attributes per applicant, termed segmented allocation. We compare segmented allocation to holistic allocation on several dimensions via theoretical and experimental methods. We establish various tradeoffs between these two approaches, and identify conditions under which one approach results in more accurate evaluation than the other. △ Less

Submitted 18 September, 2022; originally announced September 2022.

arXiv:2208.13982 [pdf]

Performance Study of Time Series Databases

Authors: Bonil Shah, P. M. Jat, Kalyan Sashidhar

Abstract: The growth of big-data sectors such as the Internet of Things (IoT) generates enormous volumes of data. As IoT devices generate a vast volume of time-series data, the Time Series Database (TSDB) popularity has grown alongside the rise of IoT. Time series databases are developed to manage and analyze huge amounts of time series data. However, it is not easy to choose the best one from them. The mos… ▽ More The growth of big-data sectors such as the Internet of Things (IoT) generates enormous volumes of data. As IoT devices generate a vast volume of time-series data, the Time Series Database (TSDB) popularity has grown alongside the rise of IoT. Time series databases are developed to manage and analyze huge amounts of time series data. However, it is not easy to choose the best one from them. The most popular benchmarks compare the performance of different databases to each other but use random or synthetic data that applies to only one domain. As a result, these benchmarks may not always accurately represent real-world performance. It is required to comprehensively compare the performance of time series databases with real datasets. The experiment shows significant performance differences for data injection time and query execution time when comparing real and synthetic datasets. The results are reported and analyzed. △ Less

Submitted 30 August, 2022; originally announced August 2022.

arXiv:2207.11315 [pdf, other]

Tradeoffs in Preventing Manipulation in Paper Bidding for Reviewer Assignment

Authors: Steven Jecmen, Nihar B. Shah, Fei Fang, Vincent Conitzer

Abstract: Many conferences rely on paper bidding as a key component of their reviewer assignment procedure. These bids are then taken into account when assigning reviewers to help ensure that each reviewer is assigned to suitable papers. However, despite the benefits of using bids, reliance on paper bidding can allow malicious reviewers to manipulate the paper assignment for unethical purposes (e.g., gettin… ▽ More Many conferences rely on paper bidding as a key component of their reviewer assignment procedure. These bids are then taken into account when assigning reviewers to help ensure that each reviewer is assigned to suitable papers. However, despite the benefits of using bids, reliance on paper bidding can allow malicious reviewers to manipulate the paper assignment for unethical purposes (e.g., getting assigned to a friend's paper). Several different approaches to preventing this manipulation have been proposed and deployed. In this paper, we enumerate certain desirable properties that algorithms for addressing bid manipulation should satisfy. We then offer a high-level analysis of various approaches along with directions for future investigation. △ Less

Submitted 22 July, 2022; originally announced July 2022.

arXiv:2207.02303 [pdf, other]

A Dataset on Malicious Paper Bidding in Peer Review

Authors: Steven Jecmen, Minji Yoon, Vincent Conitzer, Nihar B. Shah, Fei Fang

Abstract: In conference peer review, reviewers are often asked to provide "bids" on each submitted paper that express their interest in reviewing that paper. A paper assignment algorithm then uses these bids (along with other data) to compute a high-quality assignment of reviewers to papers. However, this process has been exploited by malicious reviewers who strategically bid in order to unethically manipul… ▽ More In conference peer review, reviewers are often asked to provide "bids" on each submitted paper that express their interest in reviewing that paper. A paper assignment algorithm then uses these bids (along with other data) to compute a high-quality assignment of reviewers to papers. However, this process has been exploited by malicious reviewers who strategically bid in order to unethically manipulate the paper assignment, crucially undermining the peer review process. For example, these reviewers may aim to get assigned to a friend's paper as part of a quid-pro-quo deal. A critical impediment towards creating and evaluating methods to mitigate this issue is the lack of any publicly-available data on malicious paper bidding. In this work, we collect and publicly release a novel dataset to fill this gap, collected from a mock conference activity where participants were instructed to bid either honestly or maliciously. We further provide a descriptive analysis of the bidding behavior, including our categorization of different strategies employed by participants. Finally, we evaluate the ability of each strategy to manipulate the assignment, and also evaluate the performance of some simple algorithms meant to detect malicious bidding. The performance of these detection algorithms can be taken as a baseline for future research on detecting malicious bidding. △ Less

Submitted 10 March, 2023; v1 submitted 24 June, 2022; originally announced July 2022.

arXiv:2204.03505 [pdf, other]

Integrating Rankings into Quantized Scores in Peer Review

Authors: Yusha Liu, Yichong Xu, Nihar B. Shah, Aarti Singh

Abstract: In peer review, reviewers are usually asked to provide scores for the papers. The scores are then used by Area Chairs or Program Chairs in various ways in the decision-making process. The scores are usually elicited in a quantized form to accommodate the limited cognitive ability of humans to describe their opinions in numerical values. It has been found that the quantized scores suffer from a lar… ▽ More In peer review, reviewers are usually asked to provide scores for the papers. The scores are then used by Area Chairs or Program Chairs in various ways in the decision-making process. The scores are usually elicited in a quantized form to accommodate the limited cognitive ability of humans to describe their opinions in numerical values. It has been found that the quantized scores suffer from a large number of ties, thereby leading to a significant loss of information. To mitigate this issue, conferences have started to ask reviewers to additionally provide a ranking of the papers they have reviewed. There are however two key challenges. First, there is no standard procedure for using this ranking information and Area Chairs may use it in different ways (including simply ignoring them), thereby leading to arbitrariness in the peer-review process. Second, there are no suitable interfaces for judicious use of this data nor methods to incorporate it in existing workflows, thereby leading to inefficiencies. We take a principled approach to integrate the ranking information into the scores. The output of our method is an updated score pertaining to each review that also incorporates the rankings. Our approach addresses the two aforementioned challenges by: (i) ensuring that rankings are incorporated into the updates scores in the same manner for all papers, thereby mitigating arbitrariness, and (ii) allowing to seamlessly use existing interfaces and workflows designed for scores. We empirically evaluate our method on synthetic datasets as well as on peer reviews from the ICLR 2017 conference, and find that it reduces the error by approximately 30% as compared to the best performing baseline on the ICLR 2017 data. △ Less

Submitted 5 April, 2022; originally announced April 2022.

Comments: 14 main pages, 7 appendix pages

arXiv:2204.00977 [pdf]

Deep Speech Based End-to-End Automated Speech Recognition (ASR) for Indian-English Accents

Authors: Priyank Dubey, Bilal Shah

Abstract: Automated Speech Recognition (ASR) is an interdisciplinary application of computer science and linguistics that enable us to derive the transcription from the uttered speech waveform. It finds several applications in Military like High-performance fighter aircraft, helicopters, air-traffic controller. Other than military speech recognition is used in healthcare, persons with disabilities and many… ▽ More Automated Speech Recognition (ASR) is an interdisciplinary application of computer science and linguistics that enable us to derive the transcription from the uttered speech waveform. It finds several applications in Military like High-performance fighter aircraft, helicopters, air-traffic controller. Other than military speech recognition is used in healthcare, persons with disabilities and many more. ASR has been an active research area. Several models and algorithms for speech to text (STT) have been proposed. One of the most recent is Mozilla Deep Speech, it is based on the Deep Speech research paper by Baidu. Deep Speech is a state-of-art speech recognition system is developed using end-to-end deep learning, it is trained using well-optimized Recurrent Neural Network (RNN) training system utilizing multiple Graphical Processing Units (GPUs). This training is mostly done using American-English accent datasets, which results in poor generalizability to other English accents. India is a land of vast diversity. This can even be seen in the speech, there are several English accents which vary from state to state. In this work, we have used transfer learning approach using most recent Deep Speech model i.e., deepspeech-0.9.3 to develop an end-to-end speech recognition system for Indian-English accents. This work utilizes fine-tuning and data argumentation to further optimize and improve the Deep Speech ASR system. Indic TTS data of Indian-English accents is used for transfer learning and fine-tuning the pre-trained Deep Speech model. A general comparison is made among the untrained model, our trained model and other available speech recognition services for Indian-English Accents. △ Less

Submitted 2 April, 2022; originally announced April 2022.

arXiv:2203.17259 [pdf, other]

To ArXiv or not to ArXiv: A Study Quantifying Pros and Cons of Posting Preprints Online

Authors: Charvi Rastogi, Ivan Stelmakh, Xinwei Shen, Marina Meila, Federico Echenique, Shuchi Chawla, Nihar B. Shah

Abstract: Double-blind conferences have engaged in debates over whether to allow authors to post their papers online on arXiv or elsewhere during the review process. Independently, some authors of research papers face the dilemma of whether to put their papers on arXiv due to its pros and cons. We conduct a study to substantiate this debate and dilemma via quantitative measurements. Specifically, we conduct… ▽ More Double-blind conferences have engaged in debates over whether to allow authors to post their papers online on arXiv or elsewhere during the review process. Independently, some authors of research papers face the dilemma of whether to put their papers on arXiv due to its pros and cons. We conduct a study to substantiate this debate and dilemma via quantitative measurements. Specifically, we conducted surveys of reviewers in two top-tier double-blind computer science conferences -- ICML 2021 (5361 submissions and 4699 reviewers) and EC 2021 (498 submissions and 190 reviewers). Our two main findings are as follows. First, more than a third of the reviewers self-report searching online for a paper they are assigned to review. Second, outside the review process, we find that preprints from better-ranked affiliations see a weakly higher visibility, with a correlation of 0.06 in ICML and 0.05 in EC. In particular, papers associated with the top-10-ranked affiliations had a visibility of approximately 11% in ICML and 22% in EC, whereas the remaining papers had a visibility of 7% and 18% respectively. △ Less

Submitted 11 June, 2022; v1 submitted 31 March, 2022; originally announced March 2022.

Comments: 17 pages, 3 figures

arXiv:2203.17239 [pdf, other]

doi 10.1371/journal.pone.0283980

Cite-seeing and Reviewing: A Study on Citation Bias in Peer Review

Authors: Ivan Stelmakh, Charvi Rastogi, Ryan Liu, Shuchi Chawla, Federico Echenique, Nihar B. Shah

Abstract: Citations play an important role in researchers' careers as a key factor in evaluation of scientific impact. Many anecdotes advice authors to exploit this fact and cite prospective reviewers to try obtaining a more positive evaluation for their submission. In this work, we investigate if such a citation bias actually exists: Does the citation of a reviewer's own work in a submission cause them to… ▽ More Citations play an important role in researchers' careers as a key factor in evaluation of scientific impact. Many anecdotes advice authors to exploit this fact and cite prospective reviewers to try obtaining a more positive evaluation for their submission. In this work, we investigate if such a citation bias actually exists: Does the citation of a reviewer's own work in a submission cause them to be positively biased towards the submission? In conjunction with the review process of two flagship conferences in machine learning and algorithmic economics, we execute an observational study to test for citation bias in peer review. In our analysis, we carefully account for various confounding factors such as paper quality and reviewer expertise, and apply different modeling techniques to alleviate concerns regarding the model mismatch. Overall, our analysis involves 1,314 papers and 1,717 reviewers and detects citation bias in both venues we consider. In terms of the effect size, by citing a reviewer's work, a submission has a non-trivial chance of getting a higher score from the reviewer: an expected increase in the score is approximately 0.23 on a 5-point Likert item. For reference, a one-point increase of a score by a single reviewer improves the position of a submission by 11% on average. △ Less

Submitted 31 March, 2022; originally announced March 2022.

Comments: 19 pages, 3 figures

arXiv:2201.11308 [pdf, other]

Calibration with Privacy in Peer Review

Authors: Wenxin Ding, Gautam Kamath, Weina Wang, Nihar B. Shah

Abstract: Reviewers in peer review are often miscalibrated: they may be strict, lenient, extreme, moderate, etc. A number of algorithms have previously been proposed to calibrate reviews. Such attempts of calibration can however leak sensitive information about which reviewer reviewed which paper. In this paper, we identify this problem of calibration with privacy, and provide a foundational building block… ▽ More Reviewers in peer review are often miscalibrated: they may be strict, lenient, extreme, moderate, etc. A number of algorithms have previously been proposed to calibrate reviews. Such attempts of calibration can however leak sensitive information about which reviewer reviewed which paper. In this paper, we identify this problem of calibration with privacy, and provide a foundational building block to address it. Specifically, we present a theoretical study of this problem under a simplified-yet-challenging model involving two reviewers, two papers, and an MAP-computing adversary. Our main results establish the Pareto frontier of the tradeoff between privacy (preventing the adversary from inferring reviewer identity) and utility (accepting better papers), and design explicit computationally-efficient algorithms that we prove are Pareto optimal. △ Less

Submitted 26 January, 2022; originally announced January 2022.

Comments: 31 pages, 6 figures

arXiv:2201.10631 [pdf, other]

Strategyproofing Peer Assessment via Partitioning: The Price in Terms of Evaluators' Expertise

Authors: Komal Dhull, Steven Jecmen, Pravesh Kothari, Nihar B. Shah

Abstract: Strategic behavior is a fundamental problem in a variety of real-world applications that require some form of peer assessment, such as peer grading of homeworks, grant proposal review, conference peer review of scientific papers, and peer assessment of employees in organizations. Since an individual's own work is in competition with the submissions they are evaluating, they may provide dishonest e… ▽ More Strategic behavior is a fundamental problem in a variety of real-world applications that require some form of peer assessment, such as peer grading of homeworks, grant proposal review, conference peer review of scientific papers, and peer assessment of employees in organizations. Since an individual's own work is in competition with the submissions they are evaluating, they may provide dishonest evaluations to increase the relative standing of their own submission. This issue is typically addressed by partitioning the individuals and assigning them to evaluate the work of only those from different subsets. Although this method ensures strategyproofness, each submission may require a different type of expertise for effective evaluation. In this paper, we focus on finding an assignment of evaluators to submissions that maximizes assigned evaluators' expertise subject to the constraint of strategyproofness. We analyze the price of strategyproofness: that is, the amount of compromise on the assigned evaluators' expertise required in order to get strategyproofness. We establish several polynomial-time algorithms for strategyproof assignment along with assignment-quality guarantees. Finally, we evaluate the methods on a dataset from conference peer review. △ Less

Submitted 28 August, 2022; v1 submitted 25 January, 2022; originally announced January 2022.

arXiv:2110.13935 [pdf, other]

doi 10.1145/3475724.3483610

Frequency Centric Defense Mechanisms against Adversarial Examples

Authors: Sanket B. Shah, Param Raval, Harin Khakhi, Mehul S. Raval

Abstract: Adversarial example (AE) aims at fooling a Convolution Neural Network by introducing small perturbations in the input image.The proposed work uses the magnitude and phase of the Fourier Spectrum and the entropy of the image to defend against AE. We demonstrate the defense in two ways: by training an adversarial detector and denoising the adversarial effect. Experiments were conducted on the low-re… ▽ More Adversarial example (AE) aims at fooling a Convolution Neural Network by introducing small perturbations in the input image.The proposed work uses the magnitude and phase of the Fourier Spectrum and the entropy of the image to defend against AE. We demonstrate the defense in two ways: by training an adversarial detector and denoising the adversarial effect. Experiments were conducted on the low-resolution CIFAR-10 and high-resolution ImageNet datasets. The adversarial detector has 99% accuracy for FGSM and PGD attacks on the CIFAR-10 dataset. However, the detection accuracy falls to 50% for sophisticated DeepFool and Carlini & Wagner attacks on ImageNet. We overcome the limitation by using autoencoder and show that 70% of AEs are correctly classified after denoising. △ Less

Submitted 26 October, 2021; originally announced October 2021.

Comments: AdvM '21: Proceedings of the 1st International Workshop on Adversarial Learning for Multimedia, at ACM Multimedia '21

arXiv:2108.06371 [pdf, other]

Near-Optimal Reviewer Splitting in Two-Phase Paper Reviewing and Conference Experiment Design

Authors: Steven Jecmen, Hanrui Zhang, Ryan Liu, Fei Fang, Vincent Conitzer, Nihar B. Shah

Abstract: Many scientific conferences employ a two-phase paper review process, where some papers are assigned additional reviewers after the initial reviews are submitted. Many conferences also design and run experiments on their paper review process, where some papers are assigned reviewers who provide reviews under an experimental condition. In this paper, we consider the question: how should reviewers be… ▽ More Many scientific conferences employ a two-phase paper review process, where some papers are assigned additional reviewers after the initial reviews are submitted. Many conferences also design and run experiments on their paper review process, where some papers are assigned reviewers who provide reviews under an experimental condition. In this paper, we consider the question: how should reviewers be divided between phases or conditions in order to maximize total assignment similarity? We make several contributions towards answering this question. First, we prove that when the set of papers requiring additional review is unknown, a simplified variant of this problem is NP-hard. Second, we empirically show that across several datasets pertaining to real conference data, dividing reviewers between phases/conditions uniformly at random allows an assignment that is nearly as good as the oracle optimal assignment. This uniformly random choice is practical for both the two-phase and conference experiment design settings. Third, we provide explanations of this phenomenon by providing theoretical bounds on the suboptimality of this random strategy under certain natural conditions. From these easily-interpretable conditions, we provide actionable insights to conference program chairs about whether a random reviewer split is suitable for their conference. △ Less

Submitted 13 August, 2021; originally announced August 2021.

arXiv:2108.05523 [pdf, other]

Fair Decision-Making for Food Inspections

Authors: Shubham Singh, Bhuvni Shah, Chris Kanich, Ian A. Kash

Abstract: Data and algorithms are essential and complementary parts of a large-scale decision-making process. However, their injudicious use can lead to unforeseen consequences, as has been observed by researchers and activists alike in the recent past. In this paper, we revisit the application of predictive models by the Chicago Department of Public Health to schedule restaurant inspections and prioritize… ▽ More Data and algorithms are essential and complementary parts of a large-scale decision-making process. However, their injudicious use can lead to unforeseen consequences, as has been observed by researchers and activists alike in the recent past. In this paper, we revisit the application of predictive models by the Chicago Department of Public Health to schedule restaurant inspections and prioritize the detection of critical food code violations. We perform the first analysis of the model's fairness to the population served by the restaurants in terms of average time to find a critical violation. We find that the model treats inspections unequally based on the sanitarian who conducted the inspection and that, in turn, there are geographic disparities in the benefits of the model. We examine four alternate methods of model training and two alternative ways of scheduling using the model and find that the latter generate more desirable results. The challenges from this application point to important directions for future work around fairness with collective entities rather than individuals, the use of critical violations as a proxy, and the disconnect between fair classification and fairness in the dynamic scheduling system. △ Less

Submitted 9 March, 2022; v1 submitted 12 August, 2021; originally announced August 2021.

Comments: 28 pages, 16 figures

arXiv:2107.06173 [pdf, other]

doi 10.1109/TSP.2020.2971936

Orthogonal and Non-Orthogonal Signal Representations Using New Transformation Matrices Having NPM Structure

Authors: Shaik Basheeruddin Shah, Vijay Kumar Chakka, Arikatla Satyanarayana Reddy

Abstract: In this paper, we introduce two types of real-valued sums known as Complex Conjugate Pair Sums (CCPSs) denoted as CCPS$^{(1)}$ and CCPS$^{(2)}$, and discuss a few of their properties. Using each type of CCPSs and their circular shifts, we construct two non-orthogonal Nested Periodic Matrices (NPMs). As NPMs are non-singular, this introduces two non-orthogonal transforms known as Complex Conjugate… ▽ More In this paper, we introduce two types of real-valued sums known as Complex Conjugate Pair Sums (CCPSs) denoted as CCPS$^{(1)}$ and CCPS$^{(2)}$, and discuss a few of their properties. Using each type of CCPSs and their circular shifts, we construct two non-orthogonal Nested Periodic Matrices (NPMs). As NPMs are non-singular, this introduces two non-orthogonal transforms known as Complex Conjugate Periodic Transforms (CCPTs) denoted as CCPT$^{(1)}$ and CCPT$^{(2)}$. We propose another NPM, which uses both types of CCPSs such that its columns are mutually orthogonal, this transform is known as Orthogonal CCPT (OCCPT). After a brief study of a few OCCPT properties like periodicity, circular shift, etc., we present two different interpretations of it. Further, we propose a Decimation-In-Time (DIT) based fast computation algorithm for OCCPT (termed as FOCCPT), whenever the length of the signal is equal to $2^v,\ v{\in} \mathbb{N}$. The proposed sums and transforms are inspired by Ramanujan sums and Ramanujan Period Transform (RPT). Finally, we show that the period (both divisor and non-divisor) and frequency information of a signal can be estimated using the proposed transforms with a significant reduction in the computational complexity over Discrete Fourier Transform (DFT). △ Less

Submitted 20 June, 2021; originally announced July 2021.

Comments: 13 pages, 5 figures,

Journal ref: IEEE Transactions on Signal Processing, 06 February 2020

arXiv:2105.05946 [pdf, other]

Composing Modeling and Simulation with Machine Learning in Julia

Authors: Chris Rackauckas, Ranjan Anantharaman, Alan Edelman, Shashi Gowda, Maja Gwozdz, Anand Jain, Chris Laughman, Yingbo Ma, Francesco Martinuzzi, Avik Pal, Utkarsh Rajput, Elliot Saba, Viral B. Shah

Abstract: In this paper we introduce JuliaSim, a high-performance programming environment designed to blend traditional modeling and simulation with machine learning. JuliaSim can build accelerated surrogates from component-based models, such as those conforming to the FMI standard, using continuous-time echo state networks (CTESN). The foundation of this environment, ModelingToolkit.jl, is an acausal model… ▽ More In this paper we introduce JuliaSim, a high-performance programming environment designed to blend traditional modeling and simulation with machine learning. JuliaSim can build accelerated surrogates from component-based models, such as those conforming to the FMI standard, using continuous-time echo state networks (CTESN). The foundation of this environment, ModelingToolkit.jl, is an acausal modeling language which can compose the trained surrogates as components within its staged compilation process. As a complementary factor we present the JuliaSim model library, a standard library with differential-algebraic equations and pre-trained surrogates, which can be composed using the modeling system for design, optimization, and control. We demonstrate the effectiveness of the surrogate-accelerated modeling and simulation approach on HVAC dynamics by showing that the CTESN surrogates accurately capture the dynamics of a HVAC cycle at less than 4\% error while accelerating its simulation by 340x. We illustrate the use of surrogate acceleration in the design process via global optimization of simulation parameters using the embedded surrogate, yielding a speedup of two orders of magnitude to find the optimum. We showcase the surrogate deployed in a co-simulation loop, as a drop-in replacement for one of the coupled FMUs, allowing engineers to effectively explore the design space of a coupled system. Together this demonstrates a workflow for automating the integration of machine learning techniques into traditional modeling and simulation processes. △ Less

Submitted 12 May, 2021; originally announced May 2021.

arXiv:2105.03949 [pdf, other]

High-performance symbolic-numerics via multiple dispatch

Authors: Shashi Gowda, Yingbo Ma, Alessandro Cheli, Maja Gwozdz, Viral B. Shah, Alan Edelman, Christopher Rackauckas

Abstract: As mathematical computing becomes more democratized in high-level languages, high-performance symbolic-numeric systems are necessary for domain scientists and engineers to get the best performance out of their machine without deep knowledge of code optimization. Naturally, users need different term types either to have different algebraic properties for them, or to use efficient data structures. T… ▽ More As mathematical computing becomes more democratized in high-level languages, high-performance symbolic-numeric systems are necessary for domain scientists and engineers to get the best performance out of their machine without deep knowledge of code optimization. Naturally, users need different term types either to have different algebraic properties for them, or to use efficient data structures. To this end, we developed Symbolics.jl, an extendable symbolic system which uses dynamic multiple dispatch to change behavior depending on the domain needs. In this work we detail an underlying abstract term interface which allows for speed without sacrificing generality. We show that by formalizing a generic API on actions independent of implementation, we can retroactively add optimized data structures to our system without changing the pre-existing term rewriters. We showcase how this can be used to optimize term construction and give a 113x acceleration on general symbolic transformations. Further, we show that such a generic API allows for complementary term-rewriting implementations. We demonstrate the ability to swap between classical term-rewriting simplifiers and e-graph-based term-rewriting simplifiers. We showcase an e-graph ruleset which minimizes the number of CPU cycles during expression evaluation, and demonstrate how it simplifies a real-world reaction-network simulation to halve the runtime. Additionally, we show a reaction-diffusion partial differential equation solver which is able to be automatically converted into symbolic expressions via multiple dispatch tracing, which is subsequently accelerated and parallelized to give a 157x simulation speedup. Together, this presents Symbolics.jl as a next-generation symbolic-numeric computing environment geared towards modeling and simulation. △ Less

Submitted 5 February, 2022; v1 submitted 9 May, 2021; originally announced May 2021.

ACM Class: D.3.3; I.1.1; I.1.3

arXiv:2012.00714 [pdf, other]

Debiasing Evaluations That are Biased by Evaluations

Authors: **gyan Wang, Ivan Stelmakh, Yuting Wei, Nihar B. Shah

Abstract: It is common to evaluate a set of items by soliciting people to rate them. For example, universities ask students to rate the teaching quality of their instructors, and conference organizers ask authors of submissions to evaluate the quality of the reviews. However, in these applications, students often give a higher rating to a course if they receive higher grades in a course, and authors often g… ▽ More It is common to evaluate a set of items by soliciting people to rate them. For example, universities ask students to rate the teaching quality of their instructors, and conference organizers ask authors of submissions to evaluate the quality of the reviews. However, in these applications, students often give a higher rating to a course if they receive higher grades in a course, and authors often give a higher rating to the reviews if their papers are accepted to the conference. In this work, we call these external factors the "outcome" experienced by people, and consider the problem of mitigating these outcome-induced biases in the given ratings when some information about the outcome is available. We formulate the information about the outcome as a known partial ordering on the bias. We propose a debiasing method by solving a regularized optimization problem under this ordering constraint, and also provide a carefully designed cross-validation method that adaptively chooses the appropriate amount of regularization. We provide theoretical guarantees on the performance of our algorithm, as well as experimental evaluations. △ Less

Submitted 1 December, 2020; originally announced December 2020.

arXiv:2011.15083 [pdf, other]

A Large Scale Randomized Controlled Trial on Herding in Peer-Review Discussions

Authors: Ivan Stelmakh, Charvi Rastogi, Nihar B. Shah, Aarti Singh, Hal Daumé III

Abstract: Peer review is the backbone of academia and humans constitute a cornerstone of this process, being responsible for reviewing papers and making the final acceptance/rejection decisions. Given that human decision making is known to be susceptible to various cognitive biases, it is important to understand which (if any) biases are present in the peer-review process and design the pipeline such that t… ▽ More Peer review is the backbone of academia and humans constitute a cornerstone of this process, being responsible for reviewing papers and making the final acceptance/rejection decisions. Given that human decision making is known to be susceptible to various cognitive biases, it is important to understand which (if any) biases are present in the peer-review process and design the pipeline such that the impact of these biases is minimized. In this work, we focus on the dynamics of between-reviewers discussions and investigate the presence of herding behaviour therein. In that, we aim to understand whether reviewers and more senior decision makers get disproportionately influenced by the first argument presented in the discussion when (in case of reviewers) they form an independent opinion about the paper before discussing it with others. Specifically, in conjunction with the review process of ICML 2020 -- a large, top tier machine learning conference -- we design and execute a randomized controlled trial with the goal of testing for the conditional causal effect of the discussion initiator's opinion on the outcome of a paper. △ Less

Submitted 30 November, 2020; originally announced November 2020.

arXiv:2011.15050 [pdf, other]

A Novice-Reviewer Experiment to Address Scarcity of Qualified Reviewers in Large Conferences

Authors: Ivan Stelmakh, Nihar B. Shah, Aarti Singh, Hal Daumé III

Abstract: Conference peer review constitutes a human-computation process whose importance cannot be overstated: not only it identifies the best submissions for acceptance, but, ultimately, it impacts the future of the whole research area by promoting some ideas and restraining others. A surge in the number of submissions received by leading AI conferences has challenged the sustainability of the review proc… ▽ More Conference peer review constitutes a human-computation process whose importance cannot be overstated: not only it identifies the best submissions for acceptance, but, ultimately, it impacts the future of the whole research area by promoting some ideas and restraining others. A surge in the number of submissions received by leading AI conferences has challenged the sustainability of the review process by increasing the burden on the pool of qualified reviewers which is growing at a much slower rate. In this work, we consider the problem of reviewer recruiting with a focus on the scarcity of qualified reviewers in large conferences. Specifically, we design a procedure for (i) recruiting reviewers from the population not typically covered by major conferences and (ii) guiding them through the reviewing pipeline. In conjunction with ICML 2020 -- a large, top-tier machine learning conference -- we recruit a small set of reviewers through our procedure and compare their performance with the general population of ICML reviewers. Our experiment reveals that a combination of the recruiting and guiding mechanisms allows for a principled enhancement of the reviewer pool and results in reviews of superior quality compared to the conventional pool of reviews as evaluated by senior members of the program committee (meta-reviewers). △ Less

Submitted 30 November, 2020; originally announced November 2020.

arXiv:2011.14646 [pdf, other]

Prior and Prejudice: The Novice Reviewers' Bias against Resubmissions in Conference Peer Review

Authors: Ivan Stelmakh, Nihar B. Shah, Aarti Singh, Hal Daumé III

Abstract: Modern machine learning and computer science conferences are experiencing a surge in the number of submissions that challenges the quality of peer review as the number of competent reviewers is growing at a much slower rate. To curb this trend and reduce the burden on reviewers, several conferences have started encouraging or even requiring authors to declare the previous submission history of the… ▽ More Modern machine learning and computer science conferences are experiencing a surge in the number of submissions that challenges the quality of peer review as the number of competent reviewers is growing at a much slower rate. To curb this trend and reduce the burden on reviewers, several conferences have started encouraging or even requiring authors to declare the previous submission history of their papers. Such initiatives have been met with skepticism among authors, who raise the concern about a potential bias in reviewers' recommendations induced by this information. In this work, we investigate whether reviewers exhibit a bias caused by the knowledge that the submission under review was previously rejected at a similar venue, focusing on a population of novice reviewers who constitute a large fraction of the reviewer pool in leading machine learning and computer science conferences. We design and conduct a randomized controlled trial closely replicating the relevant components of the peer-review pipeline with $133$ reviewers (master's, junior PhD students, and recent graduates of top US universities) writing reviews for $19$ papers. The analysis reveals that reviewers indeed become negatively biased when they receive a signal about paper being a resubmission, giving almost 1 point lower overall score on a 10-point Likert item ($Δ= -0.78, \ 95\% \ \text{CI} = [-1.30, -0.24]$) than reviewers who do not receive such a signal. Looking at specific criteria scores (originality, quality, clarity and significance), we observe that novice reviewers tend to underrate quality the most. △ Less

Submitted 30 November, 2020; originally announced November 2020.

arXiv:2010.15300 [pdf, other]

Uncovering Latent Biases in Text: Method and Application to Peer Review

Authors: Emaad Manzoor, Nihar B. Shah

Abstract: Quantifying systematic disparities in numerical quantities such as employment rates and wages between population subgroups provides compelling evidence for the existence of societal biases. However, biases in the text written for members of different subgroups (such as in recommendation letters for male and non-male candidates), though widely reported anecdotally, remain challenging to quantify. I… ▽ More Quantifying systematic disparities in numerical quantities such as employment rates and wages between population subgroups provides compelling evidence for the existence of societal biases. However, biases in the text written for members of different subgroups (such as in recommendation letters for male and non-male candidates), though widely reported anecdotally, remain challenging to quantify. In this work, we introduce a novel framework to quantify bias in text caused by the visibility of subgroup membership indicators. We develop a nonparametric estimation and inference procedure to estimate this bias. We then formalize an identification strategy to causally link the estimated bias to the visibility of subgroup membership indicators, provided observations from time periods both before and after an identity-hiding policy change. We identify an application wherein "ground truth" bias can be inferred to evaluate our framework, instead of relying on synthetic or secondary data. Specifically, we apply our framework to quantify biases in the text of peer reviews from a reputed machine learning conference before and after the conference adopted a double-blind reviewing policy. We show evidence of biases in the review ratings that serves as "ground truth", and show that our proposed framework accurately detects these biases from the review text without having access to the review ratings. △ Less

Submitted 28 October, 2020; originally announced October 2020.

arXiv:2010.04899 [pdf, other]

Human-Supervised Semi-Autonomous Mobile Manipulators for Safely and Efficiently Executing Machine Tending Tasks

Authors: Sarah Al-Hussaini, Shantanu Thakar, Hyojeong Kim, Pradeep Rajendran, Brual C. Shah, Jeremy A. Marvel, Satyandra K. Gupta

Abstract: Mobile manipulators can be used for machine tending and material handling tasks in small volume manufacturing applications. These applications usually have semi-structured work environment. The use of a fully autonomous mobile manipulator for such applications can be risky, as an inaccurate model of the workspace may result in damage to expensive equipment. On the other hand, the use of a fully te… ▽ More Mobile manipulators can be used for machine tending and material handling tasks in small volume manufacturing applications. These applications usually have semi-structured work environment. The use of a fully autonomous mobile manipulator for such applications can be risky, as an inaccurate model of the workspace may result in damage to expensive equipment. On the other hand, the use of a fully teleoperated mobile manipulator may require a significant amount of operator time. In this paper, a semi-autonomous mobile manipulator is developed for safely and efficiently carrying out machine tending tasks under human supervision. The robot is capable of generating motion plans from the high-level task description and presenting simulation results to the human for approval. The human operator can authorize the robot to execute the automatically generated plan or provide additional input to the planner to refine the plan. If the level of uncertainty in some parts of the workspace model is high, then the human can decide to perform teleoperation to safely execute the task. Our preliminary user trials show that non-expert operators can quickly learn to use the system and perform machine tending tasks. △ Less

Submitted 16 October, 2020; v1 submitted 10 October, 2020; originally announced October 2020.

arXiv:2010.04041 [pdf, other]

Catch Me if I Can: Detecting Strategic Behaviour in Peer Assessment

Authors: Ivan Stelmakh, Nihar B. Shah, Aarti Singh

Abstract: We consider the issue of strategic behaviour in various peer-assessment tasks, including peer grading of exams or homeworks and peer review in hiring or promotions. When a peer-assessment task is competitive (e.g., when students are graded on a curve), agents may be incentivized to misreport evaluations in order to improve their own final standing. Our focus is on designing methods for detection o… ▽ More We consider the issue of strategic behaviour in various peer-assessment tasks, including peer grading of exams or homeworks and peer review in hiring or promotions. When a peer-assessment task is competitive (e.g., when students are graded on a curve), agents may be incentivized to misreport evaluations in order to improve their own final standing. Our focus is on designing methods for detection of such manipulations. Specifically, we consider a setting in which agents evaluate a subset of their peers and output rankings that are later aggregated to form a final ordering. In this paper, we investigate a statistical framework for this problem and design a principled test for detecting strategic behaviour. We prove that our test has strong false alarm guarantees and evaluate its detection ability in practical settings. For this, we design and execute an experiment that elicits strategic behaviour from subjects and release a dataset of patterns of strategic behaviour that may be of independent interest. We then use the collected data to conduct a series of real and semi-synthetic evaluations that demonstrate a strong detection power of our test. △ Less

Submitted 8 October, 2020; originally announced October 2020.

arXiv:2007.07079 [pdf, other]

A SUPER* Algorithm to Optimize Paper Bidding in Peer Review

Authors: Tanner Fiez, Nihar B. Shah, Lillian Ratliff

Abstract: A number of applications involve sequential arrival of users, and require showing each user an ordering of items. A prime example (which forms the focus of this paper) is the bidding process in conference peer review where reviewers enter the system sequentially, each reviewer needs to be shown the list of submitted papers, and the reviewer then "bids" to review some papers. The order of the paper… ▽ More A number of applications involve sequential arrival of users, and require showing each user an ordering of items. A prime example (which forms the focus of this paper) is the bidding process in conference peer review where reviewers enter the system sequentially, each reviewer needs to be shown the list of submitted papers, and the reviewer then "bids" to review some papers. The order of the papers shown has a significant impact on the bids due to primacy effects. In deciding on the ordering of papers to show, there are two competing goals: (i) obtaining sufficiently many bids for each paper, and (ii) satisfying reviewers by showing them relevant items. In this paper, we begin by develo** a framework to study this problem in a principled manner. We present an algorithm called SUPER*, inspired by the A* algorithm, for this goal. Theoretically, we show a local optimality guarantee of our algorithm and prove that popular baselines are considerably suboptimal. Moreover, under a community model for the similarities, we prove that SUPER* is near-optimal whereas the popular baselines are considerably suboptimal. In experiments on real data from ICLR 2018 and synthetic data, we find that SUPER* considerably outperforms baselines deployed in existing systems, consistently reducing the number of papers with fewer than requisite bids by 50-75% or more, and is also robust to various real world complexities. △ Less

Submitted 31 July, 2020; v1 submitted 27 June, 2020; originally announced July 2020.

arXiv:2006.16437 [pdf, other]

Mitigating Manipulation in Peer Review via Randomized Reviewer Assignments

Authors: Steven Jecmen, Hanrui Zhang, Ryan Liu, Nihar B. Shah, Vincent Conitzer, Fei Fang

Abstract: We consider three important challenges in conference peer review: (i) reviewers maliciously attempting to get assigned to certain papers to provide positive reviews, possibly as part of quid-pro-quo arrangements with the authors; (ii) "torpedo reviewing," where reviewers deliberately attempt to get assigned to certain papers that they dislike in order to reject them; (iii) reviewer de-anonymizatio… ▽ More We consider three important challenges in conference peer review: (i) reviewers maliciously attempting to get assigned to certain papers to provide positive reviews, possibly as part of quid-pro-quo arrangements with the authors; (ii) "torpedo reviewing," where reviewers deliberately attempt to get assigned to certain papers that they dislike in order to reject them; (iii) reviewer de-anonymization on release of the similarities and the reviewer-assignment code. On the conceptual front, we identify connections between these three problems and present a framework that brings all these challenges under a common umbrella. We then present a (randomized) algorithm for reviewer assignment that can optimally solve the reviewer-assignment problem under any given constraints on the probability of assignment for any reviewer-paper pair. We further consider the problem of restricting the joint probability that certain suspect pairs of reviewers are assigned to certain papers, and show that this problem is NP-hard for arbitrary constraints on these joint probabilities but efficiently solvable for a practical special case. Finally, we experimentally evaluate our algorithms on datasets from past conferences, where we observe that they can limit the chance that any malicious reviewer gets assigned to their desired paper to 50% while producing assignments with over 90% of the total optimal similarity. Our algorithms still achieve this similarity while also preventing reviewers with close associations from being assigned to the same paper. △ Less

Submitted 23 October, 2020; v1 submitted 29 June, 2020; originally announced June 2020.

arXiv:2006.16385 [pdf, other]

On the Privacy-Utility Tradeoff in Peer-Review Data Analysis

Authors: Wenxin Ding, Nihar B. Shah, Weina Wang

Abstract: A major impediment to research on improving peer review is the unavailability of peer-review data, since any release of such data must grapple with the sensitivity of the peer review data in terms of protecting identities of reviewers from authors. We posit the need to develop techniques to release peer-review data in a privacy-preserving manner. Identifying this problem, in this paper we propose… ▽ More A major impediment to research on improving peer review is the unavailability of peer-review data, since any release of such data must grapple with the sensitivity of the peer review data in terms of protecting identities of reviewers from authors. We posit the need to develop techniques to release peer-review data in a privacy-preserving manner. Identifying this problem, in this paper we propose a framework for privacy-preserving release of certain conference peer-review data -- distributions of ratings, miscalibration, and subjectivity -- with an emphasis on the accuracy (or utility) of the released data. The crux of the framework lies in recognizing that a part of the data pertaining to the reviews is already available in public, and we use this information to post-process the data released by any privacy mechanism in a manner that improves the accuracy (utility) of the data while retaining the privacy guarantees. Our framework works with any privacy-preserving mechanism that operates via releasing perturbed data. We present several positive and negative theoretical results, including a polynomial-time algorithm for improving on the privacy-utility tradeoff. △ Less

Submitted 29 June, 2020; originally announced June 2020.

Showing 1–50 of 92 results for author: Shah, B