-
The Evolution of Multimodal Model Architectures
Authors:
Shakti N. Wadekar,
Abhishek Chaurasia,
Aman Chadha,
Eugenio Culurciello
Abstract:
This work uniquely identifies and characterizes four prevalent multimodal model architectural patterns in the contemporary multimodal landscape. Systematically categorizing models by architecture type facilitates monitoring of developments in the multimodal domain. Distinct from recent survey papers that present general information on multimodal architectures, this research conducts a comprehensiv…
▽ More
This work uniquely identifies and characterizes four prevalent multimodal model architectural patterns in the contemporary multimodal landscape. Systematically categorizing models by architecture type facilitates monitoring of developments in the multimodal domain. Distinct from recent survey papers that present general information on multimodal architectures, this research conducts a comprehensive exploration of architectural details and identifies four specific architectural types. The types are distinguished by their respective methodologies for integrating multimodal inputs into the deep neural network model. The first two types (Type A and B) deeply fuses multimodal inputs within the internal layers of the model, whereas the following two types (Type C and D) facilitate early fusion at the input stage. Type-A employs standard cross-attention, whereas Type-B utilizes custom-designed layers for modality fusion within the internal layers. On the other hand, Type-C utilizes modality-specific encoders, while Type-D leverages tokenizers to process the modalities at the model's input stage. The identified architecture types aid the monitoring of any-to-any multimodal model development. Notably, Type-C and Type-D are currently favored in the construction of any-to-any multimodal models. Type-C, distinguished by its non-tokenizing multimodal model architecture, is emerging as a viable alternative to Type-D, which utilizes input-tokenizing techniques. To assist in model selection, this work highlights the advantages and disadvantages of each architecture type based on data and compute requirements, architecture complexity, scalability, simplification of adding modalities, training objectives, and any-to-any multimodal generation capability.
△ Less
Submitted 28 May, 2024;
originally announced May 2024.
-
NeuroIDBench: An Open-Source Benchmark Framework for the Standardization of Methodology in Brainwave-based Authentication Research
Authors:
Avinash Kumar Chaurasia,
Matin Fallahi,
Thorsten Strufe,
Philipp Terhörst,
Patricia Arias Cabarcos
Abstract:
Biometric systems based on brain activity have been proposed as an alternative to passwords or to complement current authentication techniques. By leveraging the unique brainwave patterns of individuals, these systems offer the possibility of creating authentication solutions that are resistant to theft, hands-free, accessible, and potentially even revocable. However, despite the growing stream of…
▽ More
Biometric systems based on brain activity have been proposed as an alternative to passwords or to complement current authentication techniques. By leveraging the unique brainwave patterns of individuals, these systems offer the possibility of creating authentication solutions that are resistant to theft, hands-free, accessible, and potentially even revocable. However, despite the growing stream of research in this area, faster advance is hindered by reproducibility problems. Issues such as the lack of standard reporting schemes for performance results and system configuration, or the absence of common evaluation benchmarks, make comparability and proper assessment of different biometric solutions challenging. Further, barriers are erected to future work when, as so often, source code is not published open access. To bridge this gap, we introduce NeuroIDBench, a flexible open source tool to benchmark brainwave-based authentication models. It incorporates nine diverse datasets, implements a comprehensive set of pre-processing parameters and machine learning algorithms, enables testing under two common adversary models (known vs unknown attacker), and allows researchers to generate full performance reports and visualizations. We use NeuroIDBench to investigate the shallow classifiers and deep learning-based approaches proposed in the literature, and to test robustness across multiple sessions. We observe a 37.6% reduction in Equal Error Rate (EER) for unknown attacker scenarios (typically not tested in the literature), and we highlight the importance of session variability to brainwave authentication. All in all, our results demonstrate the viability and relevance of NeuroIDBench in streamlining fair comparisons of algorithms, thereby furthering the advancement of brainwave-based authentication through robust methodological practices.
△ Less
Submitted 7 May, 2024; v1 submitted 13 February, 2024;
originally announced February 2024.
-
Using YOLO v7 to Detect Kidney in Magnetic Resonance Imaging
Authors:
Pouria Yazdian Anari,
Fiona Obiezu,
Nathan Lay,
Fatemeh Dehghani Firouzabadi,
Aditi Chaurasia,
Mahshid Golagha,
Shiva Singh,
Fatemeh Homayounieh,
Aryan Zahergivar,
Stephanie Harmon,
Evrim Turkbey,
Rabindra Gautam,
Kevin Ma,
Maria Merino,
Elizabeth C. Jones,
Mark W. Ball,
W. Marston Linehan,
Baris Turkbey,
Ashkan A. Malayeri
Abstract:
Introduction This study explores the use of the latest You Only Look Once (YOLO V7) object detection method to enhance kidney detection in medical imaging by training and testing a modified YOLO V7 on medical image formats. Methods Study includes 878 patients with various subtypes of renal cell carcinoma (RCC) and 206 patients with normal kidneys. A total of 5657 MRI scans for 1084 patients were r…
▽ More
Introduction This study explores the use of the latest You Only Look Once (YOLO V7) object detection method to enhance kidney detection in medical imaging by training and testing a modified YOLO V7 on medical image formats. Methods Study includes 878 patients with various subtypes of renal cell carcinoma (RCC) and 206 patients with normal kidneys. A total of 5657 MRI scans for 1084 patients were retrieved. 326 patients with 1034 tumors recruited from a retrospective maintained database, and bounding boxes were drawn around their tumors. A primary model was trained on 80% of annotated cases, with 20% saved for testing (primary test set). The best primary model was then used to identify tumors in the remaining 861 patients and bounding box coordinates were generated on their scans using the model. Ten benchmark training sets were created with generated coordinates on not-segmented patients. The final model used to predict the kidney in the primary test set. We reported the positive predictive value (PPV), sensitivity, and mean average precision (mAP). Results The primary training set showed an average PPV of 0.94 +/- 0.01, sensitivity of 0.87 +/- 0.04, and mAP of 0.91 +/- 0.02. The best primary model yielded a PPV of 0.97, sensitivity of 0.92, and mAP of 0.95. The final model demonstrated an average PPV of 0.95 +/- 0.03, sensitivity of 0.98 +/- 0.004, and mAP of 0.95 +/- 0.01. Conclusion Using a semi-supervised approach with a medical image library, we developed a high-performing model for kidney detection. Further external validation is required to assess the model's generalizability.
△ Less
Submitted 12 February, 2024; v1 submitted 8 February, 2024;
originally announced February 2024.
-
Almanac: Retrieval-Augmented Language Models for Clinical Medicine
Authors:
Cyril Zakka,
Akash Chaurasia,
Rohan Shad,
Alex R. Dalal,
Jennifer L. Kim,
Michael Moor,
Kevin Alexander,
Euan Ashley,
Jack Boyd,
Kathleen Boyd,
Karen Hirsch,
Curt Langlotz,
Joanna Nelson,
William Hiesinger
Abstract:
Large-language models have recently demonstrated impressive zero-shot capabilities in a variety of natural language tasks such as summarization, dialogue generation, and question-answering. Despite many promising applications in clinical medicine, adoption of these models in real-world settings has been largely limited by their tendency to generate incorrect and sometimes even toxic statements. In…
▽ More
Large-language models have recently demonstrated impressive zero-shot capabilities in a variety of natural language tasks such as summarization, dialogue generation, and question-answering. Despite many promising applications in clinical medicine, adoption of these models in real-world settings has been largely limited by their tendency to generate incorrect and sometimes even toxic statements. In this study, we develop Almanac, a large language model framework augmented with retrieval capabilities for medical guideline and treatment recommendations. Performance on a novel dataset of clinical scenarios (n = 130) evaluated by a panel of 5 board-certified and resident physicians demonstrates significant increases in factuality (mean of 18% at p-value < 0.05) across all specialties, with improvements in completeness and safety. Our results demonstrate the potential for large language models to be effective tools in the clinical decision-making process, while also emphasizing the importance of careful testing and deployment to mitigate their shortcomings.
△ Less
Submitted 31 May, 2023; v1 submitted 28 February, 2023;
originally announced March 2023.
-
Rethinking "Risk" in Algorithmic Systems Through A Computational Narrative Analysis of Casenotes in Child-Welfare
Authors:
Devansh Saxena,
Erina Seh-Young Moon,
Aryan Chaurasia,
Yixin Guan,
Shion Guha
Abstract:
Risk assessment algorithms are being adopted by public sector agencies to make high-stakes decisions about human lives. Algorithms model "risk" based on individual client characteristics to identify clients most in need. However, this understanding of risk is primarily based on easily quantifiable risk factors that present an incomplete and biased perspective of clients. We conducted a computation…
▽ More
Risk assessment algorithms are being adopted by public sector agencies to make high-stakes decisions about human lives. Algorithms model "risk" based on individual client characteristics to identify clients most in need. However, this understanding of risk is primarily based on easily quantifiable risk factors that present an incomplete and biased perspective of clients. We conducted a computational narrative analysis of child-welfare casenotes and draw attention to deeper systemic risk factors that are hard to quantify but directly impact families and street-level decision-making. We found that beyond individual risk factors, the system itself poses a significant amount of risk where parents are over-surveilled by caseworkers and lack agency in decision-making processes. We also problematize the notion of risk as a static construct by highlighting the temporality and mediating effects of different risk, protective, systemic, and procedural factors. Finally, we draw caution against using casenotes in NLP-based systems by unpacking their limitations and biases embedded within them.
△ Less
Submitted 16 February, 2023;
originally announced February 2023.
-
MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features
Authors:
Shakti N. Wadekar,
Abhishek Chaurasia
Abstract:
MobileViT (MobileViTv1) combines convolutional neural networks (CNNs) and vision transformers (ViTs) to create light-weight models for mobile vision tasks. Though the main MobileViTv1-block helps to achieve competitive state-of-the-art results, the fusion block inside MobileViTv1-block, creates scaling challenges and has a complex learning task. We propose changes to the fusion block that are simp…
▽ More
MobileViT (MobileViTv1) combines convolutional neural networks (CNNs) and vision transformers (ViTs) to create light-weight models for mobile vision tasks. Though the main MobileViTv1-block helps to achieve competitive state-of-the-art results, the fusion block inside MobileViTv1-block, creates scaling challenges and has a complex learning task. We propose changes to the fusion block that are simple and effective to create MobileViTv3-block, which addresses the scaling and simplifies the learning task. Our proposed MobileViTv3-block used to create MobileViTv3-XXS, XS and S models outperform MobileViTv1 on ImageNet-1k, ADE20K, COCO and PascalVOC2012 datasets. On ImageNet-1K, MobileViTv3-XXS and MobileViTv3-XS surpasses MobileViTv1-XXS and MobileViTv1-XS by 2% and 1.9% respectively. Recently published MobileViTv2 architecture removes fusion block and uses linear complexity transformers to perform better than MobileViTv1. We add our proposed fusion block to MobileViTv2 to create MobileViTv3-0.5, 0.75 and 1.0 models. These new models give better accuracy numbers on ImageNet-1k, ADE20K, COCO and PascalVOC2012 datasets as compared to MobileViTv2. MobileViTv3-0.5 and MobileViTv3-0.75 outperforms MobileViTv2-0.5 and MobileViTv2-0.75 by 2.1% and 1.0% respectively on ImageNet-1K dataset. For segmentation task, MobileViTv3-1.0 achieves 2.07% and 1.1% better mIOU compared to MobileViTv2-1.0 on ADE20K dataset and PascalVOC2012 dataset respectively. Our code and the trained models are available at: https://github.com/micronDLA/MobileViTv3
△ Less
Submitted 6 October, 2022; v1 submitted 29 September, 2022;
originally announced September 2022.
-
Reinforcement Learning Approach for Map** Applications to Dataflow-Based Coarse-Grained Reconfigurable Array
Authors:
Andre Xian Ming Chang,
Parth Khopkar,
Bashar Romanous,
Abhishek Chaurasia,
Patrick Estep,
Skyler Windh,
Doug Vanesko,
Sheik Dawood Beer Mohideen,
Eugenio Culurciello
Abstract:
The Streaming Engine (SE) is a Coarse-Grained Reconfigurable Array which provides programming flexibility and high-performance with energy efficiency. An application program to be executed on the SE is represented as a combination of Synchronous Data Flow (SDF) graphs, where every instruction is represented as a node. Each node needs to be mapped to the right slot and array in the SE to ensure the…
▽ More
The Streaming Engine (SE) is a Coarse-Grained Reconfigurable Array which provides programming flexibility and high-performance with energy efficiency. An application program to be executed on the SE is represented as a combination of Synchronous Data Flow (SDF) graphs, where every instruction is represented as a node. Each node needs to be mapped to the right slot and array in the SE to ensure the correct execution of the program. This creates an optimization problem with a vast and sparse search space for which finding a map** manually is impractical because it requires expertise and knowledge of the SE micro-architecture. In this work we propose a Reinforcement Learning framework with Global Graph Attention (GGA) module and output masking of invalid placements to find and optimize instruction schedules. We use Proximal Policy Optimization in order to train a model which places operations into the SE tiles based on a reward function that models the SE device and its constraints. The GGA module consists of a graph neural network and an attention module. The graph neural network creates embeddings of the SDFs and the attention block is used to model sequential operation placement. We show results on how certain workloads are mapped to the SE and the factors affecting map** quality. We find that the addition of GGA, on average, finds 10% better instruction schedules in terms of total clock cycles taken and masking improves reward obtained by 20%.
△ Less
Submitted 26 May, 2022;
originally announced May 2022.
-
Convolution Neural Networks for diagnosing colon and lung cancer histopathological images
Authors:
Sanidhya Mangal,
Aanchal Chaurasia,
Ayush Khajanchi
Abstract:
Lung and Colon cancer are one of the leading causes of mortality and morbidity in adults. Histopathological diagnosis is one of the key components to discern cancer type. The aim of the present research is to propose a computer aided diagnosis system for diagnosing squamous cell carcinomas and adenocarcinomas of lung as well as adenocarcinomas of colon using convolutional neural networks by evalua…
▽ More
Lung and Colon cancer are one of the leading causes of mortality and morbidity in adults. Histopathological diagnosis is one of the key components to discern cancer type. The aim of the present research is to propose a computer aided diagnosis system for diagnosing squamous cell carcinomas and adenocarcinomas of lung as well as adenocarcinomas of colon using convolutional neural networks by evaluating the digital pathology images for these cancers. Hereby, rendering artificial intelligence as useful technology in the near future. A total of 2500 digital images were acquired from LC25000 dataset containing 5000 images for each class. A shallow neural network architecture was used classify the histopathological slides into squamous cell carcinomas, adenocarcinomas and benign for the lung. Similar model was used to classify adenocarcinomas and benign for colon. The diagnostic accuracy of more than 97% and 96% was recorded for lung and colon respectively.
△ Less
Submitted 8 September, 2020;
originally announced September 2020.
-
LinkNet: Exploiting Encoder Representations for Efficient Semantic Segmentation
Authors:
Abhishek Chaurasia,
Eugenio Culurciello
Abstract:
Pixel-wise semantic segmentation for visual scene understanding not only needs to be accurate, but also efficient in order to find any use in real-time application. Existing algorithms even though are accurate but they do not focus on utilizing the parameters of neural network efficiently. As a result they are huge in terms of parameters and number of operations; hence slow too. In this paper, we…
▽ More
Pixel-wise semantic segmentation for visual scene understanding not only needs to be accurate, but also efficient in order to find any use in real-time application. Existing algorithms even though are accurate but they do not focus on utilizing the parameters of neural network efficiently. As a result they are huge in terms of parameters and number of operations; hence slow too. In this paper, we propose a novel deep neural network architecture which allows it to learn without any significant increase in number of parameters. Our network uses only 11.5 million parameters and 21.2 GFLOPs for processing an image of resolution 3x640x360. It gives state-of-the-art performance on CamVid and comparable results on Cityscapes dataset. We also compare our networks processing time on NVIDIA GPU and embedded system device with existing state-of-the-art architectures for different image resolutions.
△ Less
Submitted 14 June, 2017;
originally announced July 2017.
-
ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation
Authors:
Adam Paszke,
Abhishek Chaurasia,
Sangpil Kim,
Eugenio Culurciello
Abstract:
The ability to perform pixel-wise semantic segmentation in real-time is of paramount importance in mobile applications. Recent deep neural networks aimed at this task have the disadvantage of requiring a large number of floating point operations and have long run-times that hinder their usability. In this paper, we propose a novel deep neural network architecture named ENet (efficient neural netwo…
▽ More
The ability to perform pixel-wise semantic segmentation in real-time is of paramount importance in mobile applications. Recent deep neural networks aimed at this task have the disadvantage of requiring a large number of floating point operations and have long run-times that hinder their usability. In this paper, we propose a novel deep neural network architecture named ENet (efficient neural network), created specifically for tasks requiring low latency operation. ENet is up to 18$\times$ faster, requires 75$\times$ less FLOPs, has 79$\times$ less parameters, and provides similar or better accuracy to existing models. We have tested it on CamVid, Cityscapes and SUN datasets and report on comparisons with existing state-of-the-art methods, and the trade-offs between accuracy and processing time of a network. We present performance measurements of the proposed architecture on embedded systems and suggest possible software improvements that could make ENet even faster.
△ Less
Submitted 7 June, 2016;
originally announced June 2016.
-
Arabic Interface Analysis Based on Cultural Markers
Authors:
Mohammadi Akheela Khanum,
Shameem Fatima,
Mousmi A. Chaurasia
Abstract:
This study examines the Arabic interface design elements that are largely influenced by the cultural values. Cultural markers are examined in websites from educational, business, and media. Cultural values analysis is based on Geert Hofstede's cultural dimensions. The findings show that there are cultural markers which are largely influenced by the culture and that the Hofstede's score for Arab co…
▽ More
This study examines the Arabic interface design elements that are largely influenced by the cultural values. Cultural markers are examined in websites from educational, business, and media. Cultural values analysis is based on Geert Hofstede's cultural dimensions. The findings show that there are cultural markers which are largely influenced by the culture and that the Hofstede's score for Arab countries is partially supported by the website design components examined in this study. Moderate support was also found for the long term orientation, for which Hoftsede has no score.
△ Less
Submitted 16 March, 2012;
originally announced March 2012.
-
XTile: An Error-Correction Package for DNA Self-Assembly
Authors:
Anshul Chaurasia,
Sudhanshu Dwivedi,
Prateek Jain,
Manish K. Gupta
Abstract:
Self assembly is a process by which supramolecular species form spontaneously from their components. This process is ubiquitous throughout the life chemistry and is central to biological information processing. It has been predicted that in future self assembly will become an important engineering discipline by combining the fields of bio molecular computation, nano technology and medicine. Howe…
▽ More
Self assembly is a process by which supramolecular species form spontaneously from their components. This process is ubiquitous throughout the life chemistry and is central to biological information processing. It has been predicted that in future self assembly will become an important engineering discipline by combining the fields of bio molecular computation, nano technology and medicine. However error control is a key challenge in realizing the potential of self assembly. Recently many authors have proposed several combinatorial error correction schemes to control errors which have a close analogy with the coding theory such as Winfree s proofreading scheme and its generalizations by Chen and Goel and compact scheme of Reif, Sahu and Yin. In this work, we present an error correction computational tool XTile that can be used to create input files to the Xgrow simulator of Winfree by providing the design logic of the tiles and it also allows the user to apply proofreading, snake and compact error correction schemes.
△ Less
Submitted 19 August, 2009;
originally announced August 2009.