KAFA: Rethinking Image Ad Understanding with Knowledge-Augmented Feature Adaptation of Vision-Language Models

Authors: Zhiwei Jia, Pradyumna Narayana, Arjun R. Akula, Garima Pruthi, Hao Su, Sugato Basu, Varun Jampani

Abstract: Image ad understanding is a crucial task with wide real-world applications. Although highly challenging with the involvement of diverse atypical scenes, real-world entities, and reasoning over scene-texts, how to interpret image ads is relatively under-explored, especially in the era of foundational vision-language models (VLMs) featuring impressive generalizability and adaptability. In this paper, we perform the first empirical study of image ad understanding through the lens of pre-trained VLMs. We benchmark and reveal practical challenges in adapting these VLMs to image ad understanding. We propose a simple feature adaptation strategy to effectively fuse multimodal information for image ads and further empower it with knowledge of real-world entities. We hope our study draws more attention to image ad understanding which is broadly relevant to the advertising industry.

Submitted 28 May, 2023; originally announced May 2023.

Comments: ACL 2023

arXiv:2212.09898

MetaCLUE: Towards Comprehensive Visual Metaphors Research

Authors: Arjun R. Akula, Brendan Driscoll, Pradyumna Narayana, Soravit Changpinyo, Zhiwei Jia, Suyash Damle, Garima Pruthi, Sugato Basu, Leonidas Guibas, William T. Freeman, Yuanzhen Li, Varun Jampani

Abstract: Creativity is an indispensable part of human cognition and also an inherent part of how we make sense of the world. Metaphorical abstraction is fundamental in communicating creative ideas through nuanced relationships between abstract concepts such as feelings. While computer vision benchmarks and approaches predominantly focus on understanding and generating literal interpretations of images, metaphorical comprehension of images remains relatively unexplored. Towards this goal, we introduce MetaCLUE, a set of vision tasks on visual metaphor. We also collect high-quality and rich metaphor annotations (abstract objects, concepts, relationships along with their corresponding object boxes) as there do not exist any datasets that facilitate the evaluation of these tasks. We perform a comprehensive analysis of state-of-the-art models in vision and language based on our annotations, highlighting strengths and weaknesses of current approaches in visual metaphor Classification, Localization, Understanding (retrieval, question answering, captioning) and gEneration (text-to-image synthesis) tasks. We hope this work provides a concrete step towards developing AI systems with human-like creative capabilities.

Submitted 2 June, 2023; v1 submitted 19 December, 2022; originally announced December 2022.

Comments: Accepted in CVPR 2023. Project page: https://metaclue.github.io/ , Video summary: https://youtu.be/V3TmeNETL-o

arXiv:2201.11194

Attention cannot be an Explanation

Authors: Arjun R Akula, Song-Chun Zhu

Abstract: Attention based explanations (viz. saliency maps), by providing interpretability to black box models such as deep neural networks, are assumed to improve human trust and reliance in the underlying models. Recently, it has been shown that attention weights are frequently uncorrelated with gradient-based measures of feature importance. Motivated by this, we ask a follow-up question: "Assuming that we only consider the tasks where attention weights correlate well with feature importance, how effective are these attention based explanations in increasing human trust and reliance in the underlying models?". In other words, can we use attention as an explanation? We perform extensive human study experiments that aim to qualitatively and quantitatively assess the degree to which attention based explanations are suitable in increasing human trust and reliance. Our experiment results show that attention cannot be used as an explanation.

Submitted 26 January, 2022; originally announced January 2022.

Comments: arXiv admin note: substantial text overlap with arXiv:2109.01401, arXiv:1909.06907

arXiv:2201.09639

Question Generation for Evaluating Cross-Dataset Shifts in Multi-modal Grounding

Authors: Arjun R. Akula

Abstract: Visual question answering (VQA) is the multi-modal task of answering natural language questions about an input image. Through cross-dataset adaptation methods, it is possible to transfer knowledge from a source dataset with a larger number of training samples to a target dataset where the training set is limited. If a VQA model trained on one dataset's training set fails to adapt to another, it is hard to identify the underlying cause of the domain mismatch, as there could exist a multitude of reasons such as image distribution mismatch and question distribution mismatch. At UCLA, we are working on a VQG module that automatically generates OOD shifts that aid in systematically evaluating the cross-dataset adaptation capabilities of VQA models.

Submitted 24 January, 2022; originally announced January 2022.

arXiv:2201.06207

Discourse Analysis for Evaluating Coherence in Video Paragraph Captions

Authors: Arjun R Akula, Song-Chun Zhu

Abstract: Video paragraph captioning is the task of automatically generating a coherent paragraph description of the actions in a video. Previous linguistic studies have demonstrated that coherence of a natural language text is reflected by its discourse structure and relations. However, existing video captioning methods evaluate the coherence of generated paragraphs by comparing them merely against human paragraph annotations and fail to reason about the underlying discourse structure. At UCLA, we are currently exploring a novel discourse based framework to evaluate the coherence of video paragraphs. Central to our approach is the discourse representation of videos, which helps in modeling coherence of paragraphs conditioned on coherence of videos. We also introduce DisNet, a novel dataset containing the proposed visual discourse annotations of 3000 videos and their paragraphs. Our experiment results have shown that the proposed framework evaluates coherence of video paragraphs significantly better than all the baseline methods. We believe that many other multi-discipline Artificial Intelligence problems such as Visual Dialog and Visual Storytelling would also greatly benefit from the proposed visual discourse framework and the DisNet dataset.

Submitted 16 January, 2022; originally announced January 2022.

arXiv:2109.01401

CX-ToM: Counterfactual Explanations with Theory-of-Mind for Enhancing Human Trust in Image Recognition Models

Authors: Arjun R. Akula, Keze Wang, Changsong Liu, Sari Saba-Sadiya, Hongjing Lu, Sinisa Todorovic, Joyce Chai, Song-Chun Zhu

Abstract: We propose CX-ToM, short for counterfactual explanations with theory-of-mind, a new explainable AI (XAI) framework for explaining decisions made by a deep convolutional neural network (CNN). In contrast to the current methods in XAI that generate explanations as a single shot response, we pose explanation as an iterative communication process, i.e. dialog, between the machine and human user. More concretely, our CX-ToM framework generates a sequence of explanations in a dialog by mediating the differences between the minds of machine and human user. To do this, we use Theory of Mind (ToM) which helps us in explicitly modeling human's intention, machine's mind as inferred by the human as well as human's mind as inferred by the machine. Moreover, most state-of-the-art XAI frameworks provide attention (or heat map) based explanations. In our work, we show that these attention based explanations are not sufficient for increasing human trust in the underlying CNN model. In CX-ToM, we instead use counterfactual explanations called fault-lines which we define as follows: given an input image I for which a CNN classification model M predicts class c_pred, a fault-line identifies the minimal semantic-level features (e.g., stripes on zebra, pointed ears of dog), referred to as explainable concepts, that need to be added to or deleted from I in order to alter the classification category of I by M to another specified class c_alt. We argue that, due to the iterative, conceptual and counterfactual nature of CX-ToM explanations, our framework is practical and more natural for both expert and non-expert users to understand the internal workings of complex deep learning models. Extensive quantitative and qualitative experiments verify our hypotheses, demonstrating that our CX-ToM significantly outperforms the state-of-the-art explainable AI models.

Submitted 2 December, 2021; v1 submitted 3 September, 2021; originally announced September 2021.

Comments: Accepted by iScience Cell Press Journal 2021. arXiv admin note: text overlap with arXiv:1909.06907

arXiv:2107.14046

Audit and Assurance of AI Algorithms: A framework to ensure ethical algorithmic practices in Artificial Intelligence

Authors: Ramya Akula, Ivan Garibay

Abstract: Algorithms are becoming more widely used in business, and businesses are becoming increasingly concerned that their algorithms will cause significant reputational or financial damage. We should emphasize that any of these damages stem from situations in which the United States lacks strict legislative prohibitions or specified protocols for measuring damages. As a result, governments are enacting legislation and enforcing prohibitions, regulators are fining businesses, and the judiciary is debating whether or not to make artificially intelligent computer models as the decision-makers in the eyes of the law. From autonomous vehicles and banking to medical care, housing, and legal decisions, there will soon be enormous amounts of algorithms that make decisions with limited human interference. Governments, businesses, and society would have an algorithm audit, which would have systematic verification that algorithms are lawful, ethical, and secure, similar to financial audits. A modern market, auditing, and assurance of algorithms developed to professionalize and industrialize AI, machine learning, and related algorithms. Stakeholders of this emerging field include policymakers and regulators, along with industry experts and entrepreneurs. In addition, we foresee audit thresholds and frameworks providing valuable information to all who are concerned with governance and standardization. This paper aims to review the critical areas required for auditing and assurance and spark discussion in this novel field of study and practice.

Submitted 14 July, 2021; originally announced July 2021.

Journal ref: International Conference on Human-Computer Interaction 2021

arXiv:2107.14044

Ethical AI for Social Good

Authors: Ramya Akula, Ivan Garibay

Abstract: The concept of AI for Social Good (AI4SG) is gaining momentum in both information societies and the AI community. Through all the advancement of AI-based solutions, it can solve societal issues effectively. To date, however, there is only a rudimentary grasp of what makes AI socially beneficial in principle, what constitutes AI4SG in reality, and what policies and regulations are needed to ensure it. This paper fills the vacuum by addressing the ethical aspects that are critical for future AI4SG efforts. Some of these characteristics are new to AI, while others have greater importance due to its usage.

Submitted 14 July, 2021; originally announced July 2021.

Journal ref: International Conference on Human-Computer Interaction, 2021

arXiv:2101.05875

doi 10.3390/e23040394

Interpretable Multi-Head Self-Attention model for Sarcasm Detection in social media

Authors: Ramya Akula, Ivan Garibay

Abstract: Sarcasm is a linguistic expression often used to communicate the opposite of what is said, usually something that is very unpleasant with an intention to insult or ridicule. Inherent ambiguity in sarcastic expressions makes sarcasm detection very difficult. In this work, we focus on detecting sarcasm in textual conversations from various social networking platforms and online media. To this end, we develop an interpretable deep learning model using multi-head self-attention and gated recurrent units. The multi-head self-attention module aids in identifying crucial sarcastic cue-words from the input, and the recurrent units learn long-range dependencies between these cue-words to better classify the input text. We show the effectiveness of our approach by achieving state-of-the-art results on multiple datasets from social networking platforms and online media. Models trained using our proposed approach are easily interpretable and enable identifying sarcastic cues in the input text which contribute to the final classification score. We visualize the learned attention weights on a few sample input texts to showcase the effectiveness and interpretability of our model.

Submitted 14 January, 2021; originally announced January 2021.

arXiv:2005.01655

Words aren't enough, their order matters: On the Robustness of Grounding Visual Referring Expressions

Authors: Arjun R Akula, Spandana Gella, Yaser Al-Onaizan, Song-Chun Zhu, Siva Reddy

Abstract: Visual referring expression recognition is a challenging task that requires natural language understanding in the context of an image. We critically examine RefCOCOg, a standard benchmark for this task, using a human study and show that 83.7% of test instances do not require reasoning on linguistic structure, i.e., words are enough to identify the target object, the word order doesn't matter. To measure the true progress of existing models, we split the test set into two sets, one which requires reasoning on linguistic structure and the other which doesn't. Additionally, we create an out-of-distribution dataset Ref-Adv by asking crowdworkers to perturb in-domain examples such that the target object changes. Using these datasets, we empirically show that existing methods fail to exploit linguistic structure and are 12% to 23% lower in performance than the established progress for this task. We also propose two methods, one based on contrastive learning and the other based on multi-task learning, to increase the robustness of ViLBERT, the current state-of-the-art model for this task. Our datasets are publicly available at https://github.com/aws/aws-refcocog-adv

Submitted 4 May, 2020; originally announced May 2020.

Comments: ACL 2020

arXiv:1911.11642

System Performance with varying L1 Instruction and Data Cache Sizes: An Empirical Analysis

Authors: Ramya Akula, Kartik Jain, Deep Jigar Kotecha

Abstract: In this project, we investigate the fluctuations in performance caused by changing the Instruction (I-cache) size and the Data (D-cache) size in the L1 cache. We employ the Gem5 framework to simulate a system with varying specifications on a single host machine. We utilize the FreqMine benchmark available under the PARSEC suite as the workload program to benchmark our simulated system. The out-of-order CPU (O3) with the Ruby memory model was simulated in a Full-System X86 environment with Linux OS. The chosen metrics deal with Hit Rate, Misses, Memory Latency, Instruction Rate, and Bus Traffic within the system. Performance observed by varying L1 size within a certain range of values was used to compute Confidence Interval based statistics for relevant metrics. Our expectations, corresponding experimental observations, and discrepancies are also discussed in this report.

Submitted 26 November, 2019; originally announced November 2019.

Comments: 5 Figures and 3 Tables

arXiv:1910.12589

Forecasting the Success of Television Series using Machine Learning

Authors: Ramya Akula, Zachary Wieselthier, Laura Martin, Ivan Garibay

Abstract: Television is an ever-evolving multi-billion-dollar industry. The success of a television show in an increasingly technological society is a vast multi-variable formula. The art of success is not just something that happens, but is studied, replicated, and applied. Hollywood can be unpredictable regarding success, as many movies and sitcoms that are hyped up and promise to be a hit end up being box office failures and complete disappointments. In current studies, linguistic exploration is being performed on the relationship between television series and the target community of viewers. A decision support system that can display sound and predictable results would be needed to build confidence in the investment in a new TV series. The models presented in this study use data to study and determine what makes a sitcom successful. In this paper, we use descriptive and predictive modeling techniques to assess the continuing success of television comedies: The Office, Big Bang Theory, Arrested Development, Scrubs, and South Park. The factors that are tested for statistical significance on episode ratings are character presence, director, and writer. These statistics show that while characters are indeed crucial to the shows themselves, the creation and direction of the shows have implications for the ratings and therefore the success of the shows. We use machine learning based forecasting models to accurately predict the success of shows. The models represent a baseline for understanding the success of a television show and how producers can increase the success of current television shows or utilize this data in the creation of future shows. Due to the many factors that go into a series, the empirical analysis in this work shows that there is no one-size-fits-all model to forecast the rating or success of a television show.

Submitted 17 October, 2019; originally announced October 2019.

Comments: 9 Pages, 10 Figures and 2 Tables

arXiv:1910.09356

Supervised Machine Learning based Ensemble Model for Accurate Prediction of Type 2 Diabetes

Authors: Ramya Akula, Ni Nguyen, Ivan Garibay

Abstract: According to the American Diabetes Association (ADA), 30.3 million people in the United States have diabetes, of whom 7.2 million may be undiagnosed and unaware of their condition. Type 2 diabetes is usually diagnosed for most patients later on in life whereas the less common Type 1 diabetes is diagnosed early on in life. People can live healthy and happy lives while living with diabetes, but early detection produces a better overall outcome on most patients' health. Thus, to test the accurate prediction of Type 2 diabetes, we use the patients' information from an electronic health records company called Practice Fusion, which has about 10,000 patient records from 2009 to 2012. This data contains individual key biometrics, including age, diastolic and systolic blood pressure, gender, height, and weight. We use this data on popular machine learning algorithms and for each algorithm, we evaluate the performance of every model based on their classification accuracy, precision, sensitivity (recall), specificity, negative predictive value, and F1 score. In our study, we find that all algorithms other than Naive Bayes suffered from very low precision. Hence, we take a step further and incorporate all the algorithms into a weighted average or soft voting ensemble model where each algorithm counts towards a majority vote on whether a patient has diabetes or not. The accuracy of the ensemble model on Practice Fusion is 85%, and our ensemble approach is new in this space. We firmly believe that the weighted average ensemble model not only performed well in overall metrics but also helped to recover wrong predictions and aid in accurate prediction of Type 2 diabetes. Our accurate novel model can be used as an alert for the patients to seek medical evaluation in time.

Submitted 17 October, 2019; originally announced October 2019.

Comments: 9 Pages, # Tables and 8 Figures

arXiv:1910.07999

DeepFork: Supervised Prediction of Information Diffusion in GitHub

Authors: Ramya Akula, Niloofar Yousefi, Ivan Garibay

Abstract: Information spreads on complex social networks extremely fast, in other words, a piece of information can go viral within no time. Often it is hard to barricade this diffusion prior to the significant occurrence of chaos, be it a social media or an online coding platform. GitHub is one such trending online focal point for any business to reach their potential contributors and customers, simultaneously. By exploiting such software development paradigm, millions of free software emerged lately in diverse communities. To understand human influence, information spread and evolution of transmitted information among assorted users in GitHub, we developed a deep neural network model: DeepFork, a supervised machine learning based approach that aims to predict information diffusion in complex social networks; considering node as well as topological features. In our empirical studies, we observed that information diffusion can be detected by link prediction using supervised learning. DeepFork outperforms other machine learning models as it better learns the discriminative patterns from the input features. DeepFork aids in understanding information spread and evolution through a bipartite network of users and repositories i.e., information flow from a user to repository to user.

Submitted 17 October, 2019; originally announced October 2019.

Comments: 12 Pages, 7 Figures, 2 Tables

arXiv:1909.06907

X-ToM: Explaining with Theory-of-Mind for Gaining Justified Human Trust

Authors: Arjun R. Akula, Changsong Liu, Sari Saba-Sadiya, Hongjing Lu, Sinisa Todorovic, Joyce Y. Chai, Song-Chun Zhu

Abstract: We present a new explainable AI (XAI) framework aimed at increasing justified human trust and reliance in the AI machine through explanations. We pose explanation as an iterative communication process, i.e. dialog, between the machine and human user. More concretely, the machine generates a sequence of explanations in a dialog which takes into account three important aspects at each dialog turn: (a) human's intention (or curiosity); (b) human's understanding of the machine; and (c) machine's understanding of the human user. To do this, we use Theory of Mind (ToM) which helps us in explicitly modeling human's intention, machine's mind as inferred by the human as well as human's mind as inferred by the machine. In other words, these explicit mental representations in ToM are incorporated to learn an optimal explanation policy that takes into account human's perception and beliefs. Furthermore, we also show that ToM facilitates quantitatively measuring justified human trust in the machine by comparing all the three mental representations. We applied our framework to three visual recognition tasks, namely, image classification, action recognition, and human body pose estimation. We argue that our ToM based explanations are practical and more natural for both expert and non-expert users to understand the internal workings of complex machine learning models. To the best of our knowledge, this is the first work to derive explanations using ToM. Extensive human study experiments verify our hypotheses, showing that the proposed explanations significantly outperform the state-of-the-art XAI methods in terms of all the standard quantitative and qualitative XAI evaluation metrics including human trust, reliance, and explanation satisfaction.

Submitted 15 September, 2019; originally announced September 2019.

Comments: A short version of this was presented at CVPR 2019 Workshop on Explainable AI

arXiv:1903.05720

Natural Language Interaction with Explainable AI Models

Authors: Arjun R Akula, Sinisa Todorovic, Joyce Y Chai, Song-Chun Zhu

Abstract: This paper presents an explainable AI (XAI) system that provides explanations for its predictions. The system consists of two key components -- namely, the prediction And-Or graph (AOG) model for recognizing and localizing concepts of interest in input data, and the XAI model for providing explanations to the user about the AOG's predictions. In this work, we focus on the XAI model specified to interact with the user in natural language, whereas the AOG's predictions are considered given and represented by the corresponding parse graphs (pg's) of the AOG. Our XAI model takes pg's as input and provides answers to the user's questions using the following types of reasoning: direct evidence (e.g., detection scores), part-based inference (e.g., detected parts provide evidence for the concept asked), and other evidence from spatio-temporal context (e.g., constraints from the spatio-temporal surround). We identify several correlations between the user's questions and the XAI answers using the YouTube Action dataset.

Submitted 7 July, 2019; v1 submitted 13 March, 2019; originally announced March 2019.

Journal ref: CVPR 2019 Workshop on Explainable AI

arXiv:1903.02252

Discourse Parsing in Videos: A Multi-modal Approach

Authors: Arjun R. Akula, Song-Chun Zhu

Abstract: Text-level discourse parsing aims to unmask how two sentences in the text are related to each other. We propose the task of Visual Discourse Parsing, which requires understanding discourse relations among scenes in a video. Here we use the term scene to refer to a subset of video frames that can better summarize the video. In order to collect a dataset for learning discourse cues from videos, one needs to manually identify the scenes from a large pool of video frames and then annotate the discourse relations between them. This is clearly a time consuming, expensive and tedious task. In this work, we propose an approach to identify discourse cues from the videos without the need to explicitly identify and annotate the scenes. We also present a novel dataset containing 310 videos and the corresponding discourse cues to evaluate our approach. We believe that many of the multi-discipline AI problems such as Visual Dialog and Visual Storytelling would greatly benefit from the use of visual discourse cues.

Submitted 22 January, 2022; v1 submitted 6 March, 2019; originally announced March 2019.

Comments: Accepted in CVPR 2019 Workshop on Language and Vision (Oral Presentation)

Journal ref: CVPR 2019 Workshop on Language and Vision (Oral Presentation)

arXiv:1902.04003 [pdf, other]

Stabilized MorteX method for mesh tying along embedded interfaces

Authors: Basava Raju Akula, Julien Vignollet, Vladislav A. Yastrebov

Abstract: We present a unified framework to tie overlapping meshes in solid mechanics applications. This framework is a combination of the X-FEM method and the mortar method, which uses Lagrange multipliers to fulfill the tying constraints. As known, mixed formulations are prone to mesh locking which manifests itself by the emergence of spurious oscillations in the vicinity of the tying interface. To overcome this inherent difficulty, we suggest a new coarse-grained interpolation of Lagrange multipliers. This technique consists in selective assignment of Lagrange multipliers on nodes of the mortar side and in non-local interpolation of the associated traction field. The optimal choice of the coarse-graining spacing is guided solely by the mesh-density contrast between the mesh of the mortar side and the number of blending elements of the host mesh. The method is tested on two patch tests (compression and bending) for different interpolations and element types as well as for different material and mesh contrasts. The optimal mesh convergence and removal of spurious oscillations is also demonstrated on the Eshelby inclusion problem for high contrasts of inclusion/matrix materials. Few additional examples confirm the performance of the elaborated framework.

Submitted 3 February, 2019; originally announced February 2019.

Comments: 32 pages, 36 figures, 64 references

arXiv:1902.04000 [pdf, other]

MorteX method for contact along real and embedded surfaces: coupling X-FEM with the Mortar method

Authors: Basava Raju Akula, Julien Vignollet, Vladislav A. Yastrebov

Abstract: A method to treat frictional contact problems along embedded surfaces in the finite element framework is developed. Arbitrarily shaped embedded surfaces, cutting through finite element meshes, are handled by the X-FEM. The frictional contact problem is solved using the monolithic augmented Lagrangian method within the mortar framework which was adapted for handling embedded surfaces. We report that the resulting mixed formulation is prone to mesh locking in case of high elastic and mesh density contrasts across the contact interface. The mesh locking manifests itself in spurious stress oscillations in the vicinity of the contact interface. We demonstrate that in the classical patch test, these oscillations can be removed simply by using triangular blending elements. In a more general case, the triangulation is shown inefficient, therefore stabilization of the problem is achieved by adopting a recently proposed coarse-graining interpolation of Lagrange multipliers. Moreover, we demonstrate that the coarse-graining is also beneficial for the classical mortar method to avoid spurious oscillations for contact interfaces with high elastic contrast. The performance of this novel method, called MorteX, is demonstrated on several examples which show as accurate treatment of frictional contact along embedded surfaces as the classical mortar method along boundary fitted surfaces.

Submitted 3 February, 2019; originally announced February 2019.

Comments: 30 pages, 28 figures, 58 references

Showing 1–19 of 19 results for author: Akula, R