Search | arXiv e-print repository

The Evolution of Multimodal Model Architectures

Authors: Shakti N. Wadekar, Abhishek Chaurasia, Aman Chadha, Eugenio Culurciello

Abstract: This work uniquely identifies and characterizes four prevalent multimodal model architectural patterns in the contemporary multimodal landscape. Systematically categorizing models by architecture type facilitates monitoring of developments in the multimodal domain. Distinct from recent survey papers that present general information on multimodal architectures, this research conducts a comprehensiv… ▽ More This work uniquely identifies and characterizes four prevalent multimodal model architectural patterns in the contemporary multimodal landscape. Systematically categorizing models by architecture type facilitates monitoring of developments in the multimodal domain. Distinct from recent survey papers that present general information on multimodal architectures, this research conducts a comprehensive exploration of architectural details and identifies four specific architectural types. The types are distinguished by their respective methodologies for integrating multimodal inputs into the deep neural network model. The first two types (Type A and B) deeply fuses multimodal inputs within the internal layers of the model, whereas the following two types (Type C and D) facilitate early fusion at the input stage. Type-A employs standard cross-attention, whereas Type-B utilizes custom-designed layers for modality fusion within the internal layers. On the other hand, Type-C utilizes modality-specific encoders, while Type-D leverages tokenizers to process the modalities at the model's input stage. The identified architecture types aid the monitoring of any-to-any multimodal model development. Notably, Type-C and Type-D are currently favored in the construction of any-to-any multimodal models. Type-C, distinguished by its non-tokenizing multimodal model architecture, is emerging as a viable alternative to Type-D, which utilizes input-tokenizing techniques. To assist in model selection, this work highlights the advantages and disadvantages of each architecture type based on data and compute requirements, architecture complexity, scalability, simplification of adding modalities, training objectives, and any-to-any multimodal generation capability. △ Less

Submitted 28 May, 2024; originally announced May 2024.

Comments: 30 pages, 6 tables, 7 figures

arXiv:2402.08656 [pdf, other]

NeuroIDBench: An Open-Source Benchmark Framework for the Standardization of Methodology in Brainwave-based Authentication Research

Authors: Avinash Kumar Chaurasia, Matin Fallahi, Thorsten Strufe, Philipp Terhörst, Patricia Arias Cabarcos

Abstract: Biometric systems based on brain activity have been proposed as an alternative to passwords or to complement current authentication techniques. By leveraging the unique brainwave patterns of individuals, these systems offer the possibility of creating authentication solutions that are resistant to theft, hands-free, accessible, and potentially even revocable. However, despite the growing stream of… ▽ More Biometric systems based on brain activity have been proposed as an alternative to passwords or to complement current authentication techniques. By leveraging the unique brainwave patterns of individuals, these systems offer the possibility of creating authentication solutions that are resistant to theft, hands-free, accessible, and potentially even revocable. However, despite the growing stream of research in this area, faster advance is hindered by reproducibility problems. Issues such as the lack of standard reporting schemes for performance results and system configuration, or the absence of common evaluation benchmarks, make comparability and proper assessment of different biometric solutions challenging. Further, barriers are erected to future work when, as so often, source code is not published open access. To bridge this gap, we introduce NeuroIDBench, a flexible open source tool to benchmark brainwave-based authentication models. It incorporates nine diverse datasets, implements a comprehensive set of pre-processing parameters and machine learning algorithms, enables testing under two common adversary models (known vs unknown attacker), and allows researchers to generate full performance reports and visualizations. We use NeuroIDBench to investigate the shallow classifiers and deep learning-based approaches proposed in the literature, and to test robustness across multiple sessions. We observe a 37.6% reduction in Equal Error Rate (EER) for unknown attacker scenarios (typically not tested in the literature), and we highlight the importance of session variability to brainwave authentication. All in all, our results demonstrate the viability and relevance of NeuroIDBench in streamlining fair comparisons of algorithms, thereby furthering the advancement of brainwave-based authentication through robust methodological practices. △ Less

Submitted 7 May, 2024; v1 submitted 13 February, 2024; originally announced February 2024.

Comments: 21 pages, 5 Figures, 3 tables, Submitted to the Journal of Information Security and Applications

arXiv:2402.05817 [pdf]

Using YOLO v7 to Detect Kidney in Magnetic Resonance Imaging

Authors: Pouria Yazdian Anari, Fiona Obiezu, Nathan Lay, Fatemeh Dehghani Firouzabadi, Aditi Chaurasia, Mahshid Golagha, Shiva Singh, Fatemeh Homayounieh, Aryan Zahergivar, Stephanie Harmon, Evrim Turkbey, Rabindra Gautam, Kevin Ma, Maria Merino, Elizabeth C. Jones, Mark W. Ball, W. Marston Linehan, Baris Turkbey, Ashkan A. Malayeri

Abstract: Introduction This study explores the use of the latest You Only Look Once (YOLO V7) object detection method to enhance kidney detection in medical imaging by training and testing a modified YOLO V7 on medical image formats. Methods Study includes 878 patients with various subtypes of renal cell carcinoma (RCC) and 206 patients with normal kidneys. A total of 5657 MRI scans for 1084 patients were r… ▽ More Introduction This study explores the use of the latest You Only Look Once (YOLO V7) object detection method to enhance kidney detection in medical imaging by training and testing a modified YOLO V7 on medical image formats. Methods Study includes 878 patients with various subtypes of renal cell carcinoma (RCC) and 206 patients with normal kidneys. A total of 5657 MRI scans for 1084 patients were retrieved. 326 patients with 1034 tumors recruited from a retrospective maintained database, and bounding boxes were drawn around their tumors. A primary model was trained on 80% of annotated cases, with 20% saved for testing (primary test set). The best primary model was then used to identify tumors in the remaining 861 patients and bounding box coordinates were generated on their scans using the model. Ten benchmark training sets were created with generated coordinates on not-segmented patients. The final model used to predict the kidney in the primary test set. We reported the positive predictive value (PPV), sensitivity, and mean average precision (mAP). Results The primary training set showed an average PPV of 0.94 +/- 0.01, sensitivity of 0.87 +/- 0.04, and mAP of 0.91 +/- 0.02. The best primary model yielded a PPV of 0.97, sensitivity of 0.92, and mAP of 0.95. The final model demonstrated an average PPV of 0.95 +/- 0.03, sensitivity of 0.98 +/- 0.004, and mAP of 0.95 +/- 0.01. Conclusion Using a semi-supervised approach with a medical image library, we developed a high-performing model for kidney detection. Further external validation is required to assess the model's generalizability. △ Less

Submitted 12 February, 2024; v1 submitted 8 February, 2024; originally announced February 2024.

arXiv:2303.01229 [pdf, other]

Almanac: Retrieval-Augmented Language Models for Clinical Medicine

Authors: Cyril Zakka, Akash Chaurasia, Rohan Shad, Alex R. Dalal, Jennifer L. Kim, Michael Moor, Kevin Alexander, Euan Ashley, Jack Boyd, Kathleen Boyd, Karen Hirsch, Curt Langlotz, Joanna Nelson, William Hiesinger

Abstract: Large-language models have recently demonstrated impressive zero-shot capabilities in a variety of natural language tasks such as summarization, dialogue generation, and question-answering. Despite many promising applications in clinical medicine, adoption of these models in real-world settings has been largely limited by their tendency to generate incorrect and sometimes even toxic statements. In… ▽ More Large-language models have recently demonstrated impressive zero-shot capabilities in a variety of natural language tasks such as summarization, dialogue generation, and question-answering. Despite many promising applications in clinical medicine, adoption of these models in real-world settings has been largely limited by their tendency to generate incorrect and sometimes even toxic statements. In this study, we develop Almanac, a large language model framework augmented with retrieval capabilities for medical guideline and treatment recommendations. Performance on a novel dataset of clinical scenarios (n = 130) evaluated by a panel of 5 board-certified and resident physicians demonstrates significant increases in factuality (mean of 18% at p-value < 0.05) across all specialties, with improvements in completeness and safety. Our results demonstrate the potential for large language models to be effective tools in the clinical decision-making process, while also emphasizing the importance of careful testing and deployment to mitigate their shortcomings. △ Less

Submitted 31 May, 2023; v1 submitted 28 February, 2023; originally announced March 2023.

arXiv:2302.08497 [pdf, other]

Rethinking "Risk" in Algorithmic Systems Through A Computational Narrative Analysis of Casenotes in Child-Welfare

Authors: Devansh Saxena, Erina Seh-Young Moon, Aryan Chaurasia, Yixin Guan, Shion Guha

Abstract: Risk assessment algorithms are being adopted by public sector agencies to make high-stakes decisions about human lives. Algorithms model "risk" based on individual client characteristics to identify clients most in need. However, this understanding of risk is primarily based on easily quantifiable risk factors that present an incomplete and biased perspective of clients. We conducted a computation… ▽ More Risk assessment algorithms are being adopted by public sector agencies to make high-stakes decisions about human lives. Algorithms model "risk" based on individual client characteristics to identify clients most in need. However, this understanding of risk is primarily based on easily quantifiable risk factors that present an incomplete and biased perspective of clients. We conducted a computational narrative analysis of child-welfare casenotes and draw attention to deeper systemic risk factors that are hard to quantify but directly impact families and street-level decision-making. We found that beyond individual risk factors, the system itself poses a significant amount of risk where parents are over-surveilled by caseworkers and lack agency in decision-making processes. We also problematize the notion of risk as a static construct by highlighting the temporality and mediating effects of different risk, protective, systemic, and procedural factors. Finally, we draw caution against using casenotes in NLP-based systems by unpacking their limitations and biases embedded within them. △ Less

Submitted 16 February, 2023; originally announced February 2023.

arXiv:2209.15159 [pdf, other]

MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features

Authors: Shakti N. Wadekar, Abhishek Chaurasia

Abstract: MobileViT (MobileViTv1) combines convolutional neural networks (CNNs) and vision transformers (ViTs) to create light-weight models for mobile vision tasks. Though the main MobileViTv1-block helps to achieve competitive state-of-the-art results, the fusion block inside MobileViTv1-block, creates scaling challenges and has a complex learning task. We propose changes to the fusion block that are simp… ▽ More MobileViT (MobileViTv1) combines convolutional neural networks (CNNs) and vision transformers (ViTs) to create light-weight models for mobile vision tasks. Though the main MobileViTv1-block helps to achieve competitive state-of-the-art results, the fusion block inside MobileViTv1-block, creates scaling challenges and has a complex learning task. We propose changes to the fusion block that are simple and effective to create MobileViTv3-block, which addresses the scaling and simplifies the learning task. Our proposed MobileViTv3-block used to create MobileViTv3-XXS, XS and S models outperform MobileViTv1 on ImageNet-1k, ADE20K, COCO and PascalVOC2012 datasets. On ImageNet-1K, MobileViTv3-XXS and MobileViTv3-XS surpasses MobileViTv1-XXS and MobileViTv1-XS by 2% and 1.9% respectively. Recently published MobileViTv2 architecture removes fusion block and uses linear complexity transformers to perform better than MobileViTv1. We add our proposed fusion block to MobileViTv2 to create MobileViTv3-0.5, 0.75 and 1.0 models. These new models give better accuracy numbers on ImageNet-1k, ADE20K, COCO and PascalVOC2012 datasets as compared to MobileViTv2. MobileViTv3-0.5 and MobileViTv3-0.75 outperforms MobileViTv2-0.5 and MobileViTv2-0.75 by 2.1% and 1.0% respectively on ImageNet-1K dataset. For segmentation task, MobileViTv3-1.0 achieves 2.07% and 1.1% better mIOU compared to MobileViTv2-1.0 on ADE20K dataset and PascalVOC2012 dataset respectively. Our code and the trained models are available at: https://github.com/micronDLA/MobileViTv3 △ Less

Submitted 6 October, 2022; v1 submitted 29 September, 2022; originally announced September 2022.

Comments: 20 pages, 7 figures

arXiv:2205.13675 [pdf, other]

Reinforcement Learning Approach for Map** Applications to Dataflow-Based Coarse-Grained Reconfigurable Array

Authors: Andre Xian Ming Chang, Parth Khopkar, Bashar Romanous, Abhishek Chaurasia, Patrick Estep, Skyler Windh, Doug Vanesko, Sheik Dawood Beer Mohideen, Eugenio Culurciello

Abstract: The Streaming Engine (SE) is a Coarse-Grained Reconfigurable Array which provides programming flexibility and high-performance with energy efficiency. An application program to be executed on the SE is represented as a combination of Synchronous Data Flow (SDF) graphs, where every instruction is represented as a node. Each node needs to be mapped to the right slot and array in the SE to ensure the… ▽ More The Streaming Engine (SE) is a Coarse-Grained Reconfigurable Array which provides programming flexibility and high-performance with energy efficiency. An application program to be executed on the SE is represented as a combination of Synchronous Data Flow (SDF) graphs, where every instruction is represented as a node. Each node needs to be mapped to the right slot and array in the SE to ensure the correct execution of the program. This creates an optimization problem with a vast and sparse search space for which finding a map** manually is impractical because it requires expertise and knowledge of the SE micro-architecture. In this work we propose a Reinforcement Learning framework with Global Graph Attention (GGA) module and output masking of invalid placements to find and optimize instruction schedules. We use Proximal Policy Optimization in order to train a model which places operations into the SE tiles based on a reward function that models the SE device and its constraints. The GGA module consists of a graph neural network and an attention module. The graph neural network creates embeddings of the SDFs and the attention block is used to model sequential operation placement. We show results on how certain workloads are mapped to the SE and the factors affecting map** quality. We find that the addition of GGA, on average, finds 10% better instruction schedules in terms of total clock cycles taken and masking improves reward obtained by 20%. △ Less

Submitted 26 May, 2022; originally announced May 2022.

Comments: 10 pages, 12 figures

arXiv:2009.03878 [pdf, other]

Convolution Neural Networks for diagnosing colon and lung cancer histopathological images

Authors: Sanidhya Mangal, Aanchal Chaurasia, Ayush Khajanchi

Abstract: Lung and Colon cancer are one of the leading causes of mortality and morbidity in adults. Histopathological diagnosis is one of the key components to discern cancer type. The aim of the present research is to propose a computer aided diagnosis system for diagnosing squamous cell carcinomas and adenocarcinomas of lung as well as adenocarcinomas of colon using convolutional neural networks by evalua… ▽ More Lung and Colon cancer are one of the leading causes of mortality and morbidity in adults. Histopathological diagnosis is one of the key components to discern cancer type. The aim of the present research is to propose a computer aided diagnosis system for diagnosing squamous cell carcinomas and adenocarcinomas of lung as well as adenocarcinomas of colon using convolutional neural networks by evaluating the digital pathology images for these cancers. Hereby, rendering artificial intelligence as useful technology in the near future. A total of 2500 digital images were acquired from LC25000 dataset containing 5000 images for each class. A shallow neural network architecture was used classify the histopathological slides into squamous cell carcinomas, adenocarcinomas and benign for the lung. Similar model was used to classify adenocarcinomas and benign for colon. The diagnostic accuracy of more than 97% and 96% was recorded for lung and colon respectively. △ Less

Submitted 8 September, 2020; originally announced September 2020.

Comments: 10 pages, 4 figures, 2 tables

arXiv:1707.03718 [pdf, other]

doi 10.1109/VCIP.2017.8305148

LinkNet: Exploiting Encoder Representations for Efficient Semantic Segmentation

Authors: Abhishek Chaurasia, Eugenio Culurciello

Abstract: Pixel-wise semantic segmentation for visual scene understanding not only needs to be accurate, but also efficient in order to find any use in real-time application. Existing algorithms even though are accurate but they do not focus on utilizing the parameters of neural network efficiently. As a result they are huge in terms of parameters and number of operations; hence slow too. In this paper, we… ▽ More Pixel-wise semantic segmentation for visual scene understanding not only needs to be accurate, but also efficient in order to find any use in real-time application. Existing algorithms even though are accurate but they do not focus on utilizing the parameters of neural network efficiently. As a result they are huge in terms of parameters and number of operations; hence slow too. In this paper, we propose a novel deep neural network architecture which allows it to learn without any significant increase in number of parameters. Our network uses only 11.5 million parameters and 21.2 GFLOPs for processing an image of resolution 3x640x360. It gives state-of-the-art performance on CamVid and comparable results on Cityscapes dataset. We also compare our networks processing time on NVIDIA GPU and embedded system device with existing state-of-the-art architectures for different image resolutions. △ Less

Submitted 14 June, 2017; originally announced July 2017.

Comments: 5 pages, 5 figures, GitHub: https://github.com/e-lab/LinkNet

arXiv:1606.02147 [pdf, other]

ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation

Authors: Adam Paszke, Abhishek Chaurasia, Sangpil Kim, Eugenio Culurciello

Abstract: The ability to perform pixel-wise semantic segmentation in real-time is of paramount importance in mobile applications. Recent deep neural networks aimed at this task have the disadvantage of requiring a large number of floating point operations and have long run-times that hinder their usability. In this paper, we propose a novel deep neural network architecture named ENet (efficient neural netwo… ▽ More The ability to perform pixel-wise semantic segmentation in real-time is of paramount importance in mobile applications. Recent deep neural networks aimed at this task have the disadvantage of requiring a large number of floating point operations and have long run-times that hinder their usability. In this paper, we propose a novel deep neural network architecture named ENet (efficient neural network), created specifically for tasks requiring low latency operation. ENet is up to 18$\times$ faster, requires 75$\times$ less FLOPs, has 79$\times$ less parameters, and provides similar or better accuracy to existing models. We have tested it on CamVid, Cityscapes and SUN datasets and report on comparisons with existing state-of-the-art methods, and the trade-offs between accuracy and processing time of a network. We present performance measurements of the proposed architecture on embedded systems and suggest possible software improvements that could make ENet even faster. △ Less

Submitted 7 June, 2016; originally announced June 2016.

arXiv:1203.3660 [pdf]

Arabic Interface Analysis Based on Cultural Markers

Authors: Mohammadi Akheela Khanum, Shameem Fatima, Mousmi A. Chaurasia

Abstract: This study examines the Arabic interface design elements that are largely influenced by the cultural values. Cultural markers are examined in websites from educational, business, and media. Cultural values analysis is based on Geert Hofstede's cultural dimensions. The findings show that there are cultural markers which are largely influenced by the culture and that the Hofstede's score for Arab co… ▽ More This study examines the Arabic interface design elements that are largely influenced by the cultural values. Cultural markers are examined in websites from educational, business, and media. Cultural values analysis is based on Geert Hofstede's cultural dimensions. The findings show that there are cultural markers which are largely influenced by the culture and that the Hofstede's score for Arab countries is partially supported by the website design components examined in this study. Moderate support was also found for the long term orientation, for which Hoftsede has no score. △ Less

Submitted 16 March, 2012; originally announced March 2012.

Comments: 8 pages, 9 figures, IJCSI International Journal of Computer Science Issues

Journal ref: IJCSI International Journal of Computer Science Issues, Vol. 9, IJCSI Issue 1, No 2, January 2012, pp.255-262

arXiv:0908.2744 [pdf, ps, other]

XTile: An Error-Correction Package for DNA Self-Assembly

Authors: Anshul Chaurasia, Sudhanshu Dwivedi, Prateek Jain, Manish K. Gupta

Abstract: Self assembly is a process by which supramolecular species form spontaneously from their components. This process is ubiquitous throughout the life chemistry and is central to biological information processing. It has been predicted that in future self assembly will become an important engineering discipline by combining the fields of bio molecular computation, nano technology and medicine. Howe… ▽ More Self assembly is a process by which supramolecular species form spontaneously from their components. This process is ubiquitous throughout the life chemistry and is central to biological information processing. It has been predicted that in future self assembly will become an important engineering discipline by combining the fields of bio molecular computation, nano technology and medicine. However error control is a key challenge in realizing the potential of self assembly. Recently many authors have proposed several combinatorial error correction schemes to control errors which have a close analogy with the coding theory such as Winfree s proofreading scheme and its generalizations by Chen and Goel and compact scheme of Reif, Sahu and Yin. In this work, we present an error correction computational tool XTile that can be used to create input files to the Xgrow simulator of Winfree by providing the design logic of the tiles and it also allows the user to apply proofreading, snake and compact error correction schemes. △ Less

Submitted 19 August, 2009; originally announced August 2009.

Comments: 5 pages, FNANO 2009 conference paper, The tool XTile is available for download and use at http://www.guptalab.org/xtile

Journal ref: Proceedings of 6th Annual Conference on Foundations of Nanoscience (FNANO 09): Self-Assembled Architectures and Devices, Salt Lake City, Utah, U.S.A., 20th-24th April 2009, pp. 225 - 229

Showing 1–12 of 12 results for author: Chaurasia, A