Search | arXiv e-print repository

arXiv:2404.01299 [pdf, other]

CausalChaos! Dataset for Comprehensive Causal Action Question Answering Over Longer Causal Chains Grounded in Dynamic Visual Scenes

Authors: Paritosh Parmar, Eric Peh, Ruirui Chen, Ting En Lam, Yuhan Chen, Elston Tan, Basura Fernando

Abstract: Causal video question answering (QA) has garnered increasing interest, yet existing datasets often lack depth in causal reasoning. To address this gap, we capitalize on the unique properties of cartoons and construct CausalChaos!, a novel, challenging causal Why-QA dataset built upon the iconic "Tom and Jerry" cartoon series. Cartoons use the principles of animation that allow animators to create… ▽ More Causal video question answering (QA) has garnered increasing interest, yet existing datasets often lack depth in causal reasoning. To address this gap, we capitalize on the unique properties of cartoons and construct CausalChaos!, a novel, challenging causal Why-QA dataset built upon the iconic "Tom and Jerry" cartoon series. Cartoons use the principles of animation that allow animators to create expressive, unambiguous causal relationships between events to form a coherent storyline. Utilizing these properties, along with thought-provoking questions and multi-level answers (answer and detailed causal explanation), our questions involve causal chains that interconnect multiple dynamic interactions between characters and visual scenes. These factors demand models to solve more challenging, yet well-defined causal relationships. We also introduce hard incorrect answer mining, including a causally confusing version that is even more challenging. While models perform well, there is much room for improvement, especially, on open-ended answers. We identify more advanced/explicit causal relationship modeling & joint modeling of vision and language as the immediate areas for future efforts to focus upon. Along with the other complementary datasets, our new challenging dataset will pave the way for these developments in the field. △ Less

Submitted 14 June, 2024; v1 submitted 1 April, 2024; originally announced April 2024.

Comments: Project Page: https://github.com/LUNAProject22/CausalChaos

arXiv:2403.13798 [pdf, other]

Hierarchical NeuroSymbolic Approach for Comprehensive and Explainable Action Quality Assessment

Authors: Lauren Okamoto, Paritosh Parmar

Abstract: Action quality assessment (AQA) applies computer vision to quantitatively assess the performance or execution of a human action. Current AQA approaches are end-to-end neural models, which lack transparency and tend to be biased because they are trained on subjective human judgements as ground-truth. To address these issues, we introduce a neuro-symbolic paradigm for AQA, which uses neural networks… ▽ More Action quality assessment (AQA) applies computer vision to quantitatively assess the performance or execution of a human action. Current AQA approaches are end-to-end neural models, which lack transparency and tend to be biased because they are trained on subjective human judgements as ground-truth. To address these issues, we introduce a neuro-symbolic paradigm for AQA, which uses neural networks to abstract interpretable symbols from video data and makes quality assessments by applying rules to those symbols. We take diving as the case study. We found that domain experts prefer our system and find it more informative than purely neural approaches to AQA in diving. Our system also achieves state-of-the-art action recognition and temporal segmentation, and automatically generates a detailed report that breaks the dive down into its elements and provides objective scoring with visual evidence. As verified by a group of domain experts, this report may be used to assist judges in scoring, help train judges, and provide feedback to divers. Annotated training data and code: https://github.com/laurenok24/NSAQA. △ Less

Submitted 24 May, 2024; v1 submitted 20 March, 2024; originally announced March 2024.

Comments: CVPR 2024 CVSports (Oral Presentation; 3/3 Strong Accepts) + Selected for CVPR 2024 Demos

arXiv:2401.10805 [pdf, other]

Learning to Visually Connect Actions and their Effects

Authors: Eric Peh, Paritosh Parmar, Basura Fernando

Abstract: In this work, we introduce the novel concept of visually Connecting Actions and Their Effects (CATE) in video understanding. CATE can have applications in areas like task planning and learning from demonstration. We identify and explore two different aspects of the concept of CATE: Action Selection and Effect-Affinity Assessment, where video understanding models connect actions and effects at sema… ▽ More In this work, we introduce the novel concept of visually Connecting Actions and Their Effects (CATE) in video understanding. CATE can have applications in areas like task planning and learning from demonstration. We identify and explore two different aspects of the concept of CATE: Action Selection and Effect-Affinity Assessment, where video understanding models connect actions and effects at semantic and fine-grained levels, respectively. We observe that different formulations produce representations capturing intuitive action properties. We also design various baseline models for Action Selection and Effect-Affinity Assessment. Despite the intuitive nature of the task, we observe that models struggle, and humans outperform them by a large margin. The study aims to establish a foundation for future efforts, showcasing the flexibility and versatility of connecting actions and effects in video understanding, with the hope of inspiring advanced formulations and models. △ Less

Submitted 26 April, 2024; v1 submitted 19 January, 2024; originally announced January 2024.

arXiv:2205.04841 [pdf, other]

Object Detection in Indian Food Platters using Transfer Learning with YOLOv4

Authors: Deepanshu Pandey, Purva Parmar, Gauri Toshniwal, Mansi Goel, Vishesh Agrawal, Shivangi Dhiman, Lavanya Gupta, Ganesh Bagler

Abstract: Object detection is a well-known problem in computer vision. Despite this, its usage and pervasiveness in the traditional Indian food dishes has been limited. Particularly, recognizing Indian food dishes present in a single photo is challenging due to three reasons: 1. Lack of annotated Indian food datasets 2. Non-distinct boundaries between the dishes 3. High intra-class variation. We solve these… ▽ More Object detection is a well-known problem in computer vision. Despite this, its usage and pervasiveness in the traditional Indian food dishes has been limited. Particularly, recognizing Indian food dishes present in a single photo is challenging due to three reasons: 1. Lack of annotated Indian food datasets 2. Non-distinct boundaries between the dishes 3. High intra-class variation. We solve these issues by providing a comprehensively labelled Indian food dataset- IndianFood10, which contains 10 food classes that appear frequently in a staple Indian meal and using transfer learning with YOLOv4 object detector model. Our model is able to achieve an overall mAP score of 91.8% and f1-score of 0.90 for our 10 class dataset. We also provide an extension of our 10 class dataset- IndianFood20, which contains 10 more traditional Indian food classes. △ Less

Submitted 10 May, 2022; originally announced May 2022.

Comments: 6 pages, 7 figures, 38th IEEE International Conference on Data Engineering, 2022, DECOR Workshop

arXiv:2202.14019 [pdf, other]

Domain Knowledge-Informed Self-Supervised Representations for Workout Form Assessment

Authors: Paritosh Parmar, Amol Gharat, Helge Rhodin

Abstract: Maintaining proper form while exercising is important for preventing injuries and maximizing muscle mass gains. Detecting errors in workout form naturally requires estimating human's body pose. However, off-the-shelf pose estimators struggle to perform well on the videos recorded in gym scenarios due to factors such as camera angles, occlusion from gym equipment, illumination, and clothing. To agg… ▽ More Maintaining proper form while exercising is important for preventing injuries and maximizing muscle mass gains. Detecting errors in workout form naturally requires estimating human's body pose. However, off-the-shelf pose estimators struggle to perform well on the videos recorded in gym scenarios due to factors such as camera angles, occlusion from gym equipment, illumination, and clothing. To aggravate the problem, the errors to be detected in the workouts are very subtle. To that end, we propose to learn exercise-oriented image and video representations from unlabeled samples such that a small dataset annotated by experts suffices for supervised error detection. In particular, our domain knowledge-informed self-supervised approaches (pose contrastive learning and motion disentangling) exploit the harmonic motion of the exercise actions, and capitalize on the large variances in camera angles, clothes, and illumination to learn powerful representations. To facilitate our self-supervised pretraining, and supervised finetuning, we curated a new exercise dataset, \emph{Fitness-AQA} (\url{https://github.com/ParitoshParmar/Fitness-AQA}), comprising of three exercises: BackSquat, BarbellRow, and OverheadPress. It has been annotated by expert trainers for multiple crucial and typically occurring exercise errors. Experimental results show that our self-supervised representations outperform off-the-shelf 2D- and 3D-pose estimators and several other baselines. We also show that our approaches can be applied to other domains/tasks such as pose estimation and dive quality assessment. △ Less

Submitted 21 October, 2022; v1 submitted 28 February, 2022; originally announced February 2022.

arXiv:2102.07355 [pdf, other]

Win-Fail Action Recognition

Authors: Paritosh Parmar, Brendan Morris

Abstract: Current video/action understanding systems have demonstrated impressive performance on large recognition tasks. However, they might be limiting themselves to learning to recognize spatiotemporal patterns, rather than attempting to thoroughly understand the actions. To spur progress in the direction of a truer, deeper understanding of videos, we introduce the task of win-fail action recognition --… ▽ More Current video/action understanding systems have demonstrated impressive performance on large recognition tasks. However, they might be limiting themselves to learning to recognize spatiotemporal patterns, rather than attempting to thoroughly understand the actions. To spur progress in the direction of a truer, deeper understanding of videos, we introduce the task of win-fail action recognition -- differentiating between successful and failed attempts at various activities. We introduce a first of its kind paired win-fail action understanding dataset with samples from the following domains: "General Stunts," "Internet Wins-Fails," "Trick Shots," and "Party Games." Unlike existing action recognition datasets, intra-class variation is high making the task challenging, yet feasible. We systematically analyze the characteristics of the win-fail task/dataset with prototypical action recognition networks and a novel video retrieval task. While current action recognition methods work well on our task/dataset, they still leave a large gap to achieve high performance. We hope to motivate more work towards the true understanding of actions/videos. Dataset will be available from https://github.com/ParitoshParmar/Win-Fail-Action-Recognition. △ Less

Submitted 15 February, 2021; originally announced February 2021.

arXiv:2101.04884 [pdf, other]

Piano Skills Assessment

Authors: Paritosh Parmar, Jaiden Reddy, Brendan Morris

Abstract: Can a computer determine a piano player's skill level? Is it preferable to base this assessment on visual analysis of the player's performance or should we trust our ears over our eyes? Since current CNNs have difficulty processing long video videos, how can shorter clips be sampled to best reflect the players skill level? In this work, we collect and release a first-of-its-kind dataset for multim… ▽ More Can a computer determine a piano player's skill level? Is it preferable to base this assessment on visual analysis of the player's performance or should we trust our ears over our eyes? Since current CNNs have difficulty processing long video videos, how can shorter clips be sampled to best reflect the players skill level? In this work, we collect and release a first-of-its-kind dataset for multimodal skill assessment focusing on assessing piano player's skill level, answer the asked questions, initiate work in automated evaluation of piano playing skills and provide baselines for future work. Dataset is available from: https://github.com/ParitoshParmar/Piano-Skills-Assessment. △ Less

Submitted 20 June, 2021; v1 submitted 13 January, 2021; originally announced January 2021.

Comments: Dataset is available from: https://github.com/ParitoshParmar/Piano-Skills-Assessment

arXiv:2007.10021 [pdf, other]

Voice@SRIB at SemEval-2020 Task 9 and 12: Stacked Ensembling method for Sentiment and Offensiveness detection in Social Media

Authors: Abhishek Singh, Surya Pratap Singh Parmar

Abstract: In social-media platforms such as Twitter, Facebook, and Reddit, people prefer to use code-mixed language such as Spanish-English, Hindi-English to express their opinions. In this paper, we describe different models we used, using the external dataset to train embeddings, ensembling methods for Sentimix, and OffensEval tasks. The use of pre-trained embeddings usually helps in multiple tasks such a… ▽ More In social-media platforms such as Twitter, Facebook, and Reddit, people prefer to use code-mixed language such as Spanish-English, Hindi-English to express their opinions. In this paper, we describe different models we used, using the external dataset to train embeddings, ensembling methods for Sentimix, and OffensEval tasks. The use of pre-trained embeddings usually helps in multiple tasks such as sentence classification, and machine translation. In this experiment, we haveused our trained code-mixed embeddings and twitter pre-trained embeddings to SemEval tasks. We evaluate our models on macro F1-score, precision, accuracy, and recall on the datasets. We intend to show that hyper-parameter tuning and data pre-processing steps help a lot in improving the scores. In our experiments, we are able to achieve 0.886 F1-Macro on OffenEval Greek language subtask post-evaluation, whereas the highest is 0.852 during the Evaluation Period. We stood third in Spanglish competition with our best F1-score of 0.756. Codalab username is asking28. △ Less

Submitted 11 October, 2020; v1 submitted 20 July, 2020; originally announced July 2020.

Comments: Changed title and few more changes. This version will be published in SemEval2020. Added code Link

arXiv:1912.04430 [pdf, other]

HalluciNet-ing Spatiotemporal Representations Using a 2D-CNN

Authors: Paritosh Parmar, Brendan Morris

Abstract: Spatiotemporal representations learned using 3D convolutional neural networks (CNN) are currently used in state-of-the-art approaches for action related tasks. However, 3D-CNN are notorious for being memory and compute resource intensive as compared with more simple 2D-CNN architectures. We propose to hallucinate spatiotemporal representations from a 3D-CNN teacher with a 2D-CNN student. By requir… ▽ More Spatiotemporal representations learned using 3D convolutional neural networks (CNN) are currently used in state-of-the-art approaches for action related tasks. However, 3D-CNN are notorious for being memory and compute resource intensive as compared with more simple 2D-CNN architectures. We propose to hallucinate spatiotemporal representations from a 3D-CNN teacher with a 2D-CNN student. By requiring the 2D-CNN to predict the future and intuit upcoming activity, it is encouraged to gain a deeper understanding of actions and how they evolve. The hallucination task is treated as an auxiliary task, which can be used with any other action related task in a multitask learning setting. Thorough experimental evaluation shows that the hallucination task indeed helps improve performance on action recognition, action quality assessment, and dynamic scene recognition tasks. From a practical standpoint, being able to hallucinate spatiotemporal representations without an actual 3D-CNN can enable deployment in resource-constrained scenarios, such as with limited computing power and/or lower bandwidth. Codebase is available here: https://github.com/ParitoshParmar/HalluciNet. △ Less

Submitted 21 October, 2020; v1 submitted 9 December, 2019; originally announced December 2019.

Comments: Codebase: https://github.com/ParitoshParmar/HalluciNet

arXiv:1904.04346 [pdf, other]

What and How Well You Performed? A Multitask Learning Approach to Action Quality Assessment

Authors: Paritosh Parmar, Brendan Tran Morris

Abstract: Can performance on the task of action quality assessment (AQA) be improved by exploiting a description of the action and its quality? Current AQA and skills assessment approaches propose to learn features that serve only one task - estimating the final score. In this paper, we propose to learn spatio-temporal features that explain three related tasks - fine-grained action recognition, commentary g… ▽ More Can performance on the task of action quality assessment (AQA) be improved by exploiting a description of the action and its quality? Current AQA and skills assessment approaches propose to learn features that serve only one task - estimating the final score. In this paper, we propose to learn spatio-temporal features that explain three related tasks - fine-grained action recognition, commentary generation, and estimating the AQA score. A new multitask-AQA dataset, the largest to date, comprising of 1412 diving samples was collected to evaluate our approach (https://github.com/ParitoshParmar/MTL-AQA). We show that our MTL approach outperforms STL approach using two different kinds of architectures: C3D-AVG and MSCADC. The C3D-AVG-MTL approach achieves the new state-of-the-art performance with a rank correlation of 90.44%. Detailed experiments were performed to show that MTL offers better generalization than STL, and representations from action recognition models are not sufficient for the AQA task and instead should be learned. △ Less

Submitted 14 June, 2019; v1 submitted 8 April, 2019; originally announced April 2019.

Comments: CVPR 2019. Dataset temporarily made available at https://github.com/ParitoshParmar/MTL-AQA

arXiv:1812.06367 [pdf, other]

Action Quality Assessment Across Multiple Actions

Authors: Paritosh Parmar, Brendan Tran Morris

Abstract: Can learning to measure the quality of an action help in measuring the quality of other actions? If so, can consolidated samples from multiple actions help improve the performance of current approaches? In this paper, we carry out experiments to see if knowledge transfer is possible in the action quality assessment (AQA) setting. Experiments are carried out on our newly released AQA dataset (http:… ▽ More Can learning to measure the quality of an action help in measuring the quality of other actions? If so, can consolidated samples from multiple actions help improve the performance of current approaches? In this paper, we carry out experiments to see if knowledge transfer is possible in the action quality assessment (AQA) setting. Experiments are carried out on our newly released AQA dataset (http://rtis.oit.unlv.edu/datasets.html) consisting of 1106 action samples from seven actions with quality scores as measured by expert human judges. Our experimental results show that there is utility in learning a single model across multiple actions. △ Less

Submitted 8 April, 2019; v1 submitted 15 December, 2018; originally announced December 2018.

Comments: WACV 2019

arXiv:1611.05125 [pdf, other]

Learning To Score Olympic Events

Authors: Paritosh Parmar, Brendan Tran Morris

Abstract: Estimating action quality, the process of assigning a "score" to the execution of an action, is crucial in areas such as sports and health care. Unlike action recognition, which has millions of examples to learn from, the action quality datasets that are currently available are small -- typically comprised of only a few hundred samples. This work presents three frameworks for evaluating Olympic sp… ▽ More Estimating action quality, the process of assigning a "score" to the execution of an action, is crucial in areas such as sports and health care. Unlike action recognition, which has millions of examples to learn from, the action quality datasets that are currently available are small -- typically comprised of only a few hundred samples. This work presents three frameworks for evaluating Olympic sports which utilize spatiotemporal features learned using 3D convolutional neural networks (C3D) and perform score regression with i) SVR, ii) LSTM, and iii) LSTM followed by SVR. An efficient training mechanism for the limited data scenarios is presented for clip-based training with LSTM. The proposed systems show significant improvement over existing quality assessment approaches on the task of predicting scores of Olympic events {diving, vault, figure skating}. While the SVR-based frameworks yield better results, LSTM-based frameworks are more natural for describing an action and can be used for improvement feedback. △ Less

Submitted 18 May, 2017; v1 submitted 15 November, 2016; originally announced November 2016.

Comments: CVPR 2017 - CVSports Workshop

arXiv:1608.09005 [pdf, other]

Measuring the Quality of Exercises

Authors: Paritosh Parmar, Brendan Tran Morris

Abstract: This work explores the problem of exercise quality measurement since it is essential for effective management of diseases like cerebral palsy (CP). This work examines the assessment of quality of large amplitude movement (LAM) exercises designed to treat CP in an automated fashion. Exercise data was collected by trained participants to generate ideal examples to use as a positive samples for machi… ▽ More This work explores the problem of exercise quality measurement since it is essential for effective management of diseases like cerebral palsy (CP). This work examines the assessment of quality of large amplitude movement (LAM) exercises designed to treat CP in an automated fashion. Exercise data was collected by trained participants to generate ideal examples to use as a positive samples for machine learning. Following that, subjects were asked to deliberately make subtle errors during the exercise, such as restricting movements, as is commonly seen in cases of patients suffering from CP. The quality measurement problem was then posed as a classification to determine whether an example exercise was either "good" or "bad". Popular machine learning techniques for classification, including support vector machines (SVM), single and doublelayered neural networks (NN), boosted decision trees, and dynamic time war** (DTW), were compared. The AdaBoosted tree performed best with an accuracy of 94.68% demonstrating the feasibility of assessing exercise quality. △ Less

Submitted 31 August, 2016; originally announced August 2016.

Comments: EMBC'16 (The 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society)

arXiv:1405.4802 [pdf]

doi 10.1109/ICIIP.2013.6707551

Use of Computer Vision to Detect Tangles in Tangled Objects

Authors: Paritosh Parmar

Abstract: Untangling of structures like ropes and wires by autonomous robots can be useful in areas such as personal robotics, industries and electrical wiring & repairing by robots. This problem can be tackled by using computer vision system in robot. This paper proposes a computer vision based method for analyzing visual data acquired from camera for perceiving the overlap of wires, ropes, hoses i.e. dete… ▽ More Untangling of structures like ropes and wires by autonomous robots can be useful in areas such as personal robotics, industries and electrical wiring & repairing by robots. This problem can be tackled by using computer vision system in robot. This paper proposes a computer vision based method for analyzing visual data acquired from camera for perceiving the overlap of wires, ropes, hoses i.e. detecting tangles. Information obtained after processing image according to the proposed method comprises of position of tangles in tangled object and which wire passes over which wire. This information can then be used to guide robot to untangle wire/s. Given an image, preprocessing is done to remove noise. Then edges of wire are detected. After that, the image is divided into smaller blocks and each block is checked for wire overlap/s and finding other relevant information. TANGLED-100 dataset was introduced, which consists of images of tangled linear deformable objects. Method discussed in here was tested on the TANGLED-100 dataset. Accuracy achieved during experiments was found to be 74.9%. Robotic simulations were carried out to demonstrate the use of the proposed method in applications of robot. Proposed method is a general method that can be used by robots working in different situations. △ Less

Submitted 11 October, 2014; v1 submitted 19 May, 2014; originally announced May 2014.

Comments: IEEE International Conference on Image Information Processing; untangle; untangling; computer vision; robotic vision; untangling by robot; Tangled-100 dataset; tangled linear deformable objects; personal robotics; image processing

Showing 1–14 of 14 results for author: Parmar, P