Search | arXiv e-print repository

StableSemantics: A Synthetic Language-Vision Dataset of Semantic Representations in Naturalistic Images

Authors: Rushikesh Zawar, Shaurya Dewan, Andrew F. Luo, Margaret M. Henderson, Michael J. Tarr, Leila Wehbe

Abstract: Understanding the semantics of visual scenes is a fundamental challenge in computer vision. A key aspect of this challenge is that objects sharing similar semantic meanings or functions can exhibit striking visual differences, making accurate identification and categorization difficult. Recent advancements in text-to-image frameworks have led to models that implicitly capture natural scene statistics. These frameworks account for the visual variability of objects, as well as complex object co-occurrences and sources of noise such as diverse lighting conditions. By leveraging large-scale datasets and cross-attention conditioning, these models generate detailed and contextually rich scene representations. This capability opens new avenues for improving object recognition and scene understanding in varied and challenging environments. Our work presents StableSemantics, a dataset comprising 224 thousand human-curated prompts, processed natural language captions, over 2 million synthetic images, and 10 million attention maps corresponding to individual noun chunks. We explicitly leverage human-generated prompts that correspond to visually interesting stable diffusion generations, provide 10 generations per phrase, and extract cross-attention maps for each image. We explore the semantic distribution of generated images, examine the distribution of objects within images, and benchmark captioning and open vocabulary segmentation methods on our data. To the best of our knowledge, we are the first to release a diffusion dataset with semantic attributions. We expect our proposed dataset to catalyze advances in visual semantic understanding and provide a foundation for developing more sophisticated and effective visual models. Website: https://stablesemantics.github.io/StableSemantics
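The per-noun-chunk attention maps described in this abstract come from cross-attention layers, where spatial image latents act as queries and prompt tokens act as keys, so each token receives one weight per spatial position. A minimal stdlib sketch of reading one map per noun chunk off such a layer, assuming toy random vectors in place of real query/key projections; the sizes, the function names (`cross_attention_probs`, `noun_chunk_map`), and averaging over a chunk's tokens are hypothetical choices, not the dataset's actual extraction code:

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_probs(queries, keys):
    """Attention probabilities with shape [num_positions][num_tokens].

    queries: one projected vector per spatial latent position (image side).
    keys:    one projected vector per prompt token (text side).
    Row p is softmax(q_p . k_t / sqrt(d)) taken over the prompt tokens t.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    return [
        softmax([scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys])
        for q in queries
    ]


def noun_chunk_map(probs, token_ids, height, width):
    """Average the columns of the chunk's tokens and reshape to a height x width map.

    Averaging over the chunk's tokens is an illustrative assumption.
    """
    flat = [sum(row[t] for t in token_ids) / len(token_ids) for row in probs]
    return [flat[r * width:(r + 1) * width] for r in range(height)]


# Toy setup (hypothetical sizes): 4x4 latent grid, 5 prompt tokens, 8-dim projections.
random.seed(0)
H, W, T, D = 4, 4, 5, 8
queries = [[random.gauss(0.0, 1.0) for _ in range(D)] for _ in range(H * W)]
keys = [[random.gauss(0.0, 1.0) for _ in range(D)] for _ in range(T)]

probs = cross_attention_probs(queries, keys)
chunk = noun_chunk_map(probs, token_ids=[2, 3], height=H, width=W)  # a two-token noun chunk
for row in chunk:
    print(" ".join(f"{v:.2f}" for v in row))
```

In practice such maps are usually upsampled to image resolution and aggregated across attention heads, layers, and denoising steps; the abstract does not say how StableSemantics performs that aggregation.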

Submitted 19 June, 2024; originally announced June 2024.

Comments: Dataset website: https://stablesemantics.github.io/StableSemantics

arXiv:2406.05191

DiffusionPID: Interpreting Diffusion via Partial Information Decomposition

Authors: Shaurya Dewan, Rushikesh Zawar, Prakanshul Saxena, Yingshan Chang, Andrew Luo, Yonatan Bisk

Abstract: Text-to-image diffusion models have made significant progress in generating naturalistic images from textual inputs, and demonstrate the capacity to learn and represent complex visual-semantic relationships. While these diffusion models have achieved remarkable success, the underlying mechanisms driving their performance are not yet fully accounted for, with many unanswered questions surrounding what they learn, how they represent visual-semantic relationships, and why they sometimes fail to generalize. Our work presents Diffusion Partial Information Decomposition (DiffusionPID), a novel technique that applies information-theoretic principles to decompose the input text prompt into its elementary components, enabling a detailed examination of how individual tokens and their interactions shape the generated image. We introduce a formal approach to analyze the uniqueness, redundancy, and synergy terms by applying PID to the denoising model at both the image and pixel level. This approach enables us to characterize how individual tokens and their interactions affect the model output. We first present a fine-grained analysis of characteristics utilized by the model to uniquely localize specific concepts. We then apply our approach to bias analysis and show that it can recover gender and ethnicity biases. Finally, we use our method to visually characterize word ambiguity and similarity from the model's perspective and illustrate the efficacy of our method for prompt intervention. Our results show that PID is a potent tool for evaluating and diagnosing text-to-image diffusion models.
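The uniqueness, redundancy, and synergy terms follow the standard two-source PID bookkeeping. As a sketch, take Y to be the generated image (or a single pixel) and X_1, X_2 to be two prompt tokens; this mapping and the notation are assumptions, since the abstract does not give DiffusionPID's estimator:

```latex
\begin{aligned}
I(Y; X_1, X_2) &= R + U_1 + U_2 + S,\\
I(Y; X_1) &= R + U_1,\\
I(Y; X_2) &= R + U_2,\\
\Rightarrow\quad S - R &= I(Y; X_1, X_2) - I(Y; X_1) - I(Y; X_2).
\end{aligned}
```

Three equations constrain four unknowns, so PID additionally needs a definition of the redundancy R; once R is fixed, U_1, U_2, and S follow directly.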

Submitted 12 June, 2024; v1 submitted 7 June, 2024; originally announced June 2024.

arXiv:2401.04198

Curiosity & Entropy Driven Unsupervised RL in Multiple Environments

Authors: Shaurya Dewan, Anisha Jain, Zoe LaLena, Lifan Yu

Abstract: The authors of 'Unsupervised Reinforcement Learning in Multiple Environments' propose a method, alpha-MEPOL, to tackle unsupervised RL across multiple environments. They pre-train a task-agnostic exploration policy using interactions from an entire class of environments and then fine-tune this policy for various tasks using supervision. We expanded upon this work with the goal of improving performance. We propose and experiment with five modifications to the original work: sampling trajectories using an entropy-based probability distribution, dynamic alpha, a higher KL-divergence threshold, curiosity-driven exploration, and alpha-percentile sampling on curiosity. Dynamic alpha and the higher KL-divergence threshold both provided a significant improvement over the baseline from the earlier work. PDF-sampling (the entropy-based trajectory sampling) provided no improvement, because it is approximately equivalent to the baseline method when the sample space is small. In high-dimensional environments, adding curiosity-driven exploration enhances learning by encouraging the agent to seek diverse experiences and explore unfamiliar states. However, its benefits are limited in low-dimensional, simpler environments, where exploration possibilities are constrained and little is truly unknown to the agent. Overall, some of our experiments improved performance over the baseline, and a few directions seem promising for further research.
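The alpha-percentile idea in alpha-MEPOL can be read as a conditional value-at-risk (CVaR) objective: instead of averaging the exploration entropy over all sampled environments, the update focuses on the worst alpha fraction of them. A minimal sketch of that selection step, assuming lower entropy counts as worse and that per-environment entropy estimates are already available; the function names and the numbers are hypothetical and not taken from either paper:

```python
import math


def alpha_percentile_tail(scores, alpha):
    """Indices of the ceil(alpha * n) lowest scores, i.e. the CVaR_alpha tail."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    k = max(1, math.ceil(alpha * len(scores)))
    return sorted(range(len(scores)), key=lambda i: scores[i])[:k]


def cvar(scores, alpha):
    """Empirical CVaR_alpha: mean of the lowest alpha fraction of the scores."""
    tail = alpha_percentile_tail(scores, alpha)
    return sum(scores[i] for i in tail) / len(tail)


# Hypothetical entropy estimates, one per sampled environment/trajectory batch.
entropies = [2.1, 0.4, 1.7, 0.9, 2.5, 1.2, 0.6, 1.9, 2.2, 1.0]
tail = alpha_percentile_tail(entropies, alpha=0.2)
print(tail, cvar(entropies, alpha=0.2))  # prints [1, 6] 0.5
```

A policy-gradient update would then use only the trajectories indexed by `tail`; with alpha = 1 the selection keeps everything and CVaR reduces to the plain mean over environments.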

Submitted 8 January, 2024; originally announced January 2024.

arXiv:2212.02493

Canonical Fields: Self-Supervised Learning of Pose-Canonicalized Neural Fields

Authors: Rohith Agaram, Shaurya Dewan, Rahul Sajnani, Adrien Poulenard, Madhava Krishna, Srinath Sridhar

Abstract: Coordinate-based implicit neural networks, or neural fields, have emerged as useful representations of shape and appearance in 3D computer vision. Despite advances, however, it remains challenging to build neural fields for categories of objects without datasets like ShapeNet that provide "canonicalized" object instances that are consistently aligned for their 3D position and orientation (pose). We present Canonical Field Network (CaFi-Net), a self-supervised method to canonicalize the 3D pose of instances from an object category represented as neural fields, specifically neural radiance fields (NeRFs). CaFi-Net directly learns from continuous and noisy radiance fields using a Siamese network architecture that is designed to extract equivariant field features for category-level canonicalization. During inference, our method takes pre-trained neural radiance fields of novel object instances at arbitrary 3D pose and estimates a canonical field with consistent 3D pose across the entire category. Extensive experiments on a new dataset of 1300 NeRF models across 13 object categories show that our method matches or exceeds the performance of 3D point cloud-based methods.
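Equivariant features are what make category-level canonicalization possible. A minimal derivation, restricted to rotations (translation omitted for brevity) and assuming a network that predicts a frame \hat{R}(\sigma) \in SO(3) from a radiance field's density \sigma; the notation is hypothetical, and the abstract does not state CaFi-Net's exact formulation:

```latex
\begin{aligned}
(R\cdot\sigma)(\mathbf{x}) &= \sigma\!\left(R^{-1}\mathbf{x}\right), \qquad R \in SO(3),\\
\hat{R}(R\cdot\sigma) &= R\,\hat{R}(\sigma) \quad \text{(equivariant frame prediction)},\\
\hat{R}(R\cdot\sigma)^{-1}\cdot(R\cdot\sigma)
  &= \hat{R}(\sigma)^{-1} R^{-1} R\cdot\sigma
   = \hat{R}(\sigma)^{-1}\cdot\sigma .
\end{aligned}
```

The canonicalized field on the last line does not depend on the input rotation R, so every posed copy of an instance maps to the same canonical frame.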

Submitted 17 May, 2023; v1 submitted 5 December, 2022; originally announced December 2022.

arXiv:2204.14078

DOI: 10.5194/gc-5-119-2022

GC Insights: Space sector careers resources in the UK need a greater diversity of roles

Authors: Martin O. Archer, Cara L. Water, Shafiat Dewan, Simon Foster, Antonio Portas

Abstract: Educational research highlights that improved careers education is needed to increase participation in science, technology, engineering, and mathematics (STEM). Current UK careers resources concerning the space sector, however, appear not to fully reflect the diversity of roles present and may in fact perpetuate misconceptions about the usefulness of science. We therefore compile a more diverse set of space-related jobs, which will be used in the development of a new space careers resource.

Submitted 29 April, 2022; originally announced April 2022.

Journal ref: Geosci. Commun., 5, 119-123

arXiv:1812.07197

DOI: 10.1063/1.5088532

High-Temperature Photocurrent Mechanism of β-Ga2O3 Based MSM Solar-Blind Photodetectors

Authors: B. R. Tak, Manjari Garg, Sheetal Dewan, Carlos G. Torres-Castanedo, Kuang-Hui Li, Vinay Gupta, Xiaohang Li, R. Singh

Abstract: High-temperature operation of metal-semiconductor-metal (MSM) UV photodetectors fabricated on pulsed-laser-deposited β-Ga2O3 thin films has been investigated. These photodetectors were operated at temperatures up to 250 °C under 255 nm illumination. A photo-to-dark current ratio (PDCR) of about 7100 was observed at room temperature (RT), while a value of 2.3 was obtained at 250 °C, at an applied bias of 10 V. The photocurrent decreased from RT to 150 °C and then increased with temperature up to 250 °C. Suppression of the blue band was also observed from 150 °C onward, indicating that self-trapped holes in Ga2O3 became unstable. Temperature-dependent rise and decay times of carriers were analyzed to understand the photocurrent mechanism and the persistent photocurrent at high temperatures. Coupled electron-phonon interaction with holes was found to influence the photoresponse of the devices. These results are encouraging and significant for high-temperature applications of β-Ga2O3 MSM deep-UV photodetectors.
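For reference, the figures of merit in this abstract are commonly defined as follows in the β-Ga2O3 photodetector literature; the abstract does not state the exact definitions or fit model the authors use, so treat these as assumptions:

```latex
\begin{aligned}
\mathrm{PDCR} &= \frac{I_{\mathrm{light}} - I_{\mathrm{dark}}}{I_{\mathrm{dark}}},\\
I(t) &= I_0 + A\,e^{-t/\tau_1} + B\,e^{-t/\tau_2}
  \quad \text{(bi-exponential decay; rise fitted analogously)}.
\end{aligned}
```

Under this definition, PDCR ≈ 7100 means the illuminated current is roughly 7100 times the dark current, whereas PDCR = 2.3 means I_light ≈ 3.3 I_dark. If PDCR is instead defined as I_light/I_dark, the room-temperature value barely changes but the 250 °C value does.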

Submitted 18 December, 2018; originally announced December 2018.

arXiv:1811.01874

Chiral standing waves and its trapping force on chiral particles

Authors: Tianhang Zhang, M. R. C. Mahdy, Shadman Sajid Dewan, Md. Nayem Hossain, Hamim Mahmud Rivy, Nabila Masud, Ziaur Rahman Jony

Abstract: In the optical manipulation literature to date, the optical force due to chirality usually coexists with the non-chiral force and makes up only a very small portion of the total force. In this work, we investigate a case where the optical force exerted on an object is purely due to chirality, with zero force on a non-chiral object. We find that a trapping force arises on chiral particles when they are placed in a field consisting of two orthogonally polarized counter-propagating plane waves. We reveal the underlying physics of this force by modeling the particle as a chiral dipole and analytically studying the optical force. We find that, besides chirality, the trapping force is also closely related to the dual electric-magnetic symmetry of the field and the dual asymmetry of the material. We also demonstrate that the proposed idea is not restricted to dipolar chiral objects: chiral Mie objects can also be trapped using the technique proposed in this article. Notably, such chiral trapping forces were found to be robust when several parameters were varied throughout the investigation. This trapping force may find applications in identifying an object's chirality and in selectively trapping chiral objects.
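A short worked example of the two orthogonally polarized counter-propagating plane waves, assuming equal amplitudes E_0, an x-polarized wave traveling along +z, and a y-polarized wave traveling along −z (the paper's exact geometry may differ):

```latex
\begin{aligned}
\mathbf{E}(z) &= E_0\left(\hat{\mathbf{x}}\,e^{ikz} + \hat{\mathbf{y}}\,e^{-ikz}\right),
  \qquad |\mathbf{E}(z)|^2 = 2|E_0|^2 \ \text{(independent of } z\text{)},\\
kz = 0:&\quad \mathbf{E} = E_0(\hat{\mathbf{x}} + \hat{\mathbf{y}}) \quad \text{(linear)},\\
kz = \pi/4:&\quad \mathbf{E} = E_0\,e^{i\pi/4}(\hat{\mathbf{x}} - i\hat{\mathbf{y}}) \quad \text{(circular)},\\
kz = \pi/2:&\quad \mathbf{E} = i E_0(\hat{\mathbf{x}} - \hat{\mathbf{y}}) \quad \text{(linear, orthogonal to } kz = 0\text{)}.
\end{aligned}
```

Because the electric energy density is uniform, there is no intensity gradient to exert a conventional gradient force, while the local polarization state cycles along z with period λ/2.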

Submitted 5 November, 2018; originally announced November 2018.

Showing 1–7 of 7 results for author: Dewan, S