Search | arXiv e-print repository

Towards Enhanced Local Explainability of Random Forests: a Proximity-Based Approach

Authors: Joshua Rosaler, Dhruv Desai, Bhaskarjit Sarmah, Dimitrios Vamvourellis, Deran Onay, Dhagash Mehta, Stefano Pasquali

Abstract: We initiate a novel approach to explain the out of sample performance of random forest (RF) models by exploiting the fact that any RF can be formulated as an adaptive weighted K nearest-neighbors model. Specifically, we use the proximity between points in the feature space learned by the RF to re-write random forest predictions exactly as a weighted average of the target labels of training data po… ▽ More We initiate a novel approach to explain the out of sample performance of random forest (RF) models by exploiting the fact that any RF can be formulated as an adaptive weighted K nearest-neighbors model. Specifically, we use the proximity between points in the feature space learned by the RF to re-write random forest predictions exactly as a weighted average of the target labels of training data points. This linearity facilitates a local notion of explainability of RF predictions that generates attributions for any model prediction across observations in the training set, and thereby complements established methods like SHAP, which instead generates attributions for a model prediction across dimensions of the feature space. We demonstrate this approach in the context of a bond pricing model trained on US corporate bond trades, and compare our approach to various existing approaches to model explainability. △ Less

Submitted 18 October, 2023; originally announced October 2023.

Comments: 5 pages, 6 figures

arXiv:2310.10760 [pdf, other]

Towards reducing hallucination in extracting information from financial reports using Large Language Models

Authors: Bhaskarjit Sarmah, Tianjie Zhu, Dhagash Mehta, Stefano Pasquali

Abstract: For a financial analyst, the question and answer (Q\&A) segment of the company financial report is a crucial piece of information for various analysis and investment decisions. However, extracting valuable insights from the Q\&A section has posed considerable challenges as the conventional methods such as detailed reading and note-taking lack scalability and are susceptible to human errors, and Op… ▽ More For a financial analyst, the question and answer (Q\&A) segment of the company financial report is a crucial piece of information for various analysis and investment decisions. However, extracting valuable insights from the Q\&A section has posed considerable challenges as the conventional methods such as detailed reading and note-taking lack scalability and are susceptible to human errors, and Optical Character Recognition (OCR) and similar techniques encounter difficulties in accurately processing unstructured transcript text, often missing subtle linguistic nuances that drive investor decisions. Here, we demonstrate the utilization of Large Language Models (LLMs) to efficiently and rapidly extract information from earnings report transcripts while ensuring high accuracy transforming the extraction process as well as reducing hallucination by combining retrieval-augmented generation technique as well as metadata. We evaluate the outcomes of various LLMs with and without using our proposed approach based on various objective metrics for evaluating Q\&A systems, and empirically demonstrate superiority of our method. △ Less

Submitted 16 October, 2023; originally announced October 2023.

Comments: 4 pages + references. Accepted for publication in Workshop on Generative AI at the 3rd International Conference on AI-ML Systems 2023, Bengaluru, India

arXiv:2308.06882 [pdf, other]

Quantifying Outlierness of Funds from their Categories using Supervised Similarity

Authors: Dhruv Desai, Ashmita Dhiman, Tushar Sharma, Deepika Sharma, Dhagash Mehta, Stefano Pasquali

Abstract: Mutual fund categorization has become a standard tool for the investment management industry and is extensively used by allocators for portfolio construction and manager selection, as well as by fund managers for peer analysis and competitive positioning. As a result, a (unintended) miscategorization or lack of precision can significantly impact allocation decisions and investment fund managers. H… ▽ More Mutual fund categorization has become a standard tool for the investment management industry and is extensively used by allocators for portfolio construction and manager selection, as well as by fund managers for peer analysis and competitive positioning. As a result, a (unintended) miscategorization or lack of precision can significantly impact allocation decisions and investment fund managers. Here, we aim to quantify the effect of miscategorization of funds utilizing a machine learning based approach. We formulate the problem of miscategorization of funds as a distance-based outlier detection problem, where the outliers are the data-points that are far from the rest of the data-points in the given feature space. We implement and employ a Random Forest (RF) based method of distance metric learning, and compute the so-called class-wise outlier measures for each data-point to identify outliers in the data. We test our implementation on various publicly available data sets, and then apply it to mutual fund data. We show that there is a strong relationship between the outlier measures of the funds and their future returns and discuss the implications of our findings. △ Less

Submitted 13 August, 2023; originally announced August 2023.

Comments: 8 pages, 5 tables, 8 figures

arXiv:2305.18703 [pdf, other]

Domain Specialization as the Key to Make Large Language Models Disruptive: A Comprehensive Survey

Authors: Chen Ling, Xujiang Zhao, Jiaying Lu, Chengyuan Deng, Can Zheng, Junxiang Wang, Tanmoy Chowdhury, Yun Li, Hejie Cui, Xuchao Zhang, Tianjiao Zhao, Amit Panalkar, Dhagash Mehta, Stefano Pasquali, Wei Cheng, Haoyu Wang, Yanchi Liu, Zhengzhang Chen, Haifeng Chen, Chris White, Quanquan Gu, Jian Pei, Carl Yang, Liang Zhao

Abstract: Large language models (LLMs) have significantly advanced the field of natural language processing (NLP), providing a highly useful, task-agnostic foundation for a wide range of applications. However, directly applying LLMs to solve sophisticated problems in specific domains meets many hurdles, caused by the heterogeneity of domain data, the sophistication of domain knowledge, the uniqueness of dom… ▽ More Large language models (LLMs) have significantly advanced the field of natural language processing (NLP), providing a highly useful, task-agnostic foundation for a wide range of applications. However, directly applying LLMs to solve sophisticated problems in specific domains meets many hurdles, caused by the heterogeneity of domain data, the sophistication of domain knowledge, the uniqueness of domain objectives, and the diversity of the constraints (e.g., various social norms, cultural conformity, religious beliefs, and ethical standards in the domain applications). Domain specification techniques are key to make large language models disruptive in many applications. Specifically, to solve these hurdles, there has been a notable increase in research and practices conducted in recent years on the domain specialization of LLMs. This emerging field of study, with its substantial potential for impact, necessitates a comprehensive and systematic review to better summarize and guide ongoing work in this area. In this article, we present a comprehensive survey on domain specification techniques for large language models, an emerging direction critical for large language model applications. First, we propose a systematic taxonomy that categorizes the LLM domain-specialization techniques based on the accessibility to LLMs and summarizes the framework for all the subcategories as well as their relations and differences to each other. Second, we present an extensive taxonomy of critical application domains that can benefit dramatically from specialized LLMs, discussing their practical significance and open challenges. Last, we offer our insights into the current research status and future trends in this area. △ Less

Submitted 29 March, 2024; v1 submitted 29 May, 2023; originally announced May 2023.

arXiv:1708.04176 [pdf]

doi 10.1371/journal.pcbi.1005955

10 simple rules to create a serious game, illustrated with examples from structural biology

Authors: Marc Baaden, Olivier Delalande, Nicolas Ferey, Samuela Pasquali, Jérôme Waldispühl, Antoine Taly

Abstract: Serious scientific games are games whose purpose is not only fun. In the field of science, the serious goals include crucial activities for scientists: outreach, teaching and research. The number of serious games is increasing rapidly, in particular citizen science games, games that allow people to produce and/or analyze scientific data. Interestingly, it is possible to build a set of rules provid… ▽ More Serious scientific games are games whose purpose is not only fun. In the field of science, the serious goals include crucial activities for scientists: outreach, teaching and research. The number of serious games is increasing rapidly, in particular citizen science games, games that allow people to produce and/or analyze scientific data. Interestingly, it is possible to build a set of rules providing a guideline to create or improve serious games. We present arguments gathered from our own experience ( Phylo , DocMolecules , HiRE-RNA contest and Pangu) as well as examples from the growing literature on scientific serious games. △ Less

Submitted 9 March, 2018; v1 submitted 14 August, 2017; originally announced August 2017.

Journal ref: Baaden M, Delalande O, Ferey N, Pasquali S, Waldispühl J, Taly A (2018) Ten simple rules to create a serious game, illustrated with examples from structural biology. PLoS Comput Biol 14(3): e1005955

Showing 1–5 of 5 results for author: Pasquali, S