
A Perspective on Deep Vision Performance with Standard Image and Video Codecs

Authors: Christoph Reich, Oliver Hahn, Daniel Cremers, Stefan Roth, Biplob Debnath

Abstract: Resource-constrained hardware, such as edge devices or cell phones, often relies on cloud servers to provide the required computational resources for inference in deep vision models. However, transferring image and video data from an edge or mobile device to a cloud server requires coding to deal with network constraints. The use of standardized codecs, such as JPEG or H.264, is prevalent and required to ensure interoperability. This paper aims to examine the implications of employing standardized codecs within deep vision pipelines. We find that using JPEG and H.264 coding significantly deteriorates the accuracy across a broad range of vision tasks and models. For instance, strong compression rates reduce semantic segmentation accuracy by more than 80% in mIoU. In contrast to previous findings, our analysis extends beyond image and action classification to localization and dense prediction tasks, thus providing a more comprehensive perspective.

Submitted 18 April, 2024; originally announced April 2024.

Comments: Accepted at CVPR 2024 Workshop on AI for Streaming (AIS)
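
The abstract above reports its segmentation result in mIoU (mean intersection over union). As a reminder of how that metric is commonly computed, here is a minimal sketch in plain Python. It assumes flat lists of integer class labels and skips classes absent from both prediction and ground truth; the function name `mean_iou` and these conventions are illustrative assumptions, not the paper's evaluation code.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes for flat label lists.

    Classes that appear in neither `pred` nor `target` are skipped, which is
    one common convention (an assumption here, not taken from the paper).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        # Pixels labelled c in both prediction and ground truth.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels labelled c in at least one of the two.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")
```

For example, `mean_iou([0, 0, 1, 1], [0, 1, 1, 1], 2)` averages a per-class IoU of 1/2 for class 0 and 2/3 for class 1.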

arXiv:2404.12309

iRAG: An Incremental Retrieval Augmented Generation System for Videos

Authors: Md Adnan Arefeen, Biplob Debnath, Md Yusuf Sarwar Uddin, Srimat Chakradhar

Abstract: Retrieval augmented generation (RAG) systems combine the strengths of language generation and information retrieval to power many real-world applications like chatbots. Use of RAG for combined understanding of multimodal data such as text, images and videos is appealing but two critical limitations exist: one-time, upfront capture of all content in large multimodal data as text descriptions entails high processing times, and not all information in the rich multimodal data is typically in the text descriptions. Since the user queries are not known a priori, developing a system for multimodal to text conversion and interactive querying of multimodal data is challenging. To address these limitations, we propose iRAG, which augments RAG with a novel incremental workflow to enable interactive querying of a large corpus of multimodal data. Unlike traditional RAG, iRAG quickly indexes large repositories of multimodal data, and in the incremental workflow, it uses the index to opportunistically extract more details from select portions of the multimodal data to retrieve context relevant to an interactive user query. Such an incremental workflow avoids long multimodal to text conversion times, overcomes information loss issues by doing on-demand query-specific extraction of details in multimodal data, and ensures high quality of responses to interactive user queries that are often not known a priori. To the best of our knowledge, iRAG is the first system to augment RAG with an incremental workflow to support efficient interactive querying of large, real-world multimodal data. Experimental results on real-world long videos demonstrate 23x to 25x faster video to text ingestion, while ensuring that quality of responses to interactive user queries is comparable to responses from a traditional RAG where all video data is converted to text upfront before any querying.

Submitted 18 April, 2024; originally announced April 2024.

arXiv:2309.06978

doi 10.1109/WACV57701.2024.00408

Differentiable JPEG: The Devil is in the Details

Authors: Christoph Reich, Biplob Debnath, Deep Patel, Srimat Chakradhar

Abstract: JPEG remains one of the most widespread lossy image coding methods. However, the non-differentiable nature of JPEG restricts the application in deep learning pipelines. Several differentiable approximations of JPEG have recently been proposed to address this issue. This paper conducts a comprehensive review of existing diff. JPEG approaches and identifies critical details that have been missed by previous methods. To this end, we propose a novel diff. JPEG approach, overcoming previous limitations. Our approach is differentiable w.r.t. the input image, the JPEG quality, the quantization tables, and the color conversion parameters. We evaluate the forward and backward performance of our diff. JPEG approach against existing methods. Additionally, extensive ablations are performed to evaluate crucial design choices. Our proposed diff. JPEG resembles the (non-diff.) reference implementation best, significantly surpassing the recent-best diff. approach by $3.47$ dB (PSNR) on average. For strong compression rates, we can even improve PSNR by $9.51$ dB. Strong adversarial attack results are yielded by our diff. JPEG, demonstrating the effective gradient approximation. Our code is available at https://github.com/necla-ml/Diff-JPEG.

Submitted 22 December, 2023; v1 submitted 13 September, 2023; originally announced September 2023.

Comments: Accepted at WACV 2024. Project page: https://christophreich1996.github.io/differentiable_jpeg/ WACV paper: https://openaccess.thecvf.com/content/WACV2024/html/Reich_Differentiable_JPEG_The_Devil_Is_in_the_Details_WACV_2024_paper.html
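
The Differentiable JPEG abstract quantifies how closely an approximation matches the reference codec in PSNR (dB). As a reminder of what that number measures, the sketch below computes PSNR from the mean squared error between two pixel sequences. It assumes 8-bit data (peak value 255) and flat Python lists; the function name `psnr` and its parameters are illustrative and not taken from the authors' repository.

```python
import math


def psnr(reference, distorted, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences.

    `max_value` is the largest representable pixel value (255 for 8-bit data,
    an assumption made for this sketch).
    """
    if len(reference) != len(distorted) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because PSNR is logarithmic in the mean squared error, a difference of 3.47 dB at a fixed peak value corresponds to an MSE ratio of 10^0.347, roughly 2.2.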

arXiv:2309.00841

LeanContext: Cost-Efficient Domain-Specific Question Answering Using LLMs

Authors: Md Adnan Arefeen, Biplob Debnath, Srimat Chakradhar

Abstract: Question-answering (QA) is a significant application of Large Language Models (LLMs), shaping chatbot capabilities across healthcare, education, and customer service. However, widespread LLM integration presents a challenge for small businesses due to the high expenses of LLM API usage. Costs rise rapidly when domain-specific data (context) is used alongside queries for accurate domain-specific LLM responses. One option is to summarize the context by using LLMs and reduce the context. However, this can also filter out useful information that is necessary to answer some domain-specific queries. In this paper, we shift from human-oriented summarizers to AI model-friendly summaries. Our approach, LeanContext, efficiently extracts $k$ key sentences from the context that are closely aligned with the query. The choice of $k$ is neither static nor random; we introduce a reinforcement learning technique that dynamically determines $k$ based on the query and context. The rest of the less important sentences are reduced using a free open source text reduction method. We evaluate LeanContext against several recent query-aware and query-unaware context reduction approaches on prominent datasets (arxiv papers and BBC news articles). Despite cost reductions of $37.29\%$ to $67.81\%$, LeanContext's ROUGE-1 score decreases only by $1.41\%$ to $2.65\%$ compared to a baseline that retains the entire context (no summarization). Additionally, if free pretrained LLM-based summarizers are used to reduce context (into human consumable summaries), LeanContext can further modify the reduced context to enhance the accuracy (ROUGE-1 score) by $13.22\%$ to $24.61\%$.

Submitted 2 September, 2023; originally announced September 2023.

Comments: The paper is under review
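
The LeanContext abstract describes keeping the $k$ context sentences most closely aligned with the query. The sketch below illustrates only that selection step, under loud assumptions: similarity is a bag-of-words cosine score, sentences are split with a simple regular expression, and `k` is passed in by the caller. The paper instead determines $k$ with reinforcement learning and does not state its similarity measure in the abstract, so `top_k_sentences` and its helpers are hypothetical names for illustration, not the authors' method.

```python
import math
import re
from collections import Counter


def _bow(text):
    """Lower-cased bag-of-words counts (a stand-in for a learned embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_sentences(context, query, k):
    """Return the k sentences most similar to the query, in document order."""
    # Naive sentence split on whitespace that follows ., ! or ?
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", context) if s.strip()]
    q = _bow(query)
    # Rank sentence indices by similarity to the query, best first.
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: _cosine(_bow(sentences[i]), q),
        reverse=True,
    )
    # Restore original order so the reduced context stays readable.
    return [sentences[i] for i in sorted(ranked[:k])]
```

A caller would pass the reduced context, rather than the full document, to the LLM alongside the query; the cost saving comes from the shorter prompt.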

arXiv:2308.16215

Deep Video Codec Control for Vision Models

Authors: Christoph Reich, Biplob Debnath, Deep Patel, Tim Prangemeier, Daniel Cremers, Srimat Chakradhar

Abstract: Standardized lossy video coding is at the core of almost all real-world video processing pipelines. Rate control is used to enable standard codecs to adapt to different network bandwidth conditions or storage constraints. However, standard video codecs (e.g., H.264) and their rate control modules aim to minimize video distortion w.r.t. human quality assessment. We demonstrate empirically that standard-coded videos vastly deteriorate the performance of deep vision models. To overcome the deterioration of vision performance, this paper presents the first end-to-end learnable deep video codec control that considers both bandwidth constraints and downstream deep vision performance, while adhering to existing standardization. We demonstrate that our approach better preserves downstream deep vision performance than traditional standard video coding.

Submitted 16 April, 2024; v1 submitted 30 August, 2023; originally announced August 2023.

Comments: Accepted at CVPR 2024 Workshop on AI for Streaming (AIS)

arXiv:2304.09617

Towards Autonomous Selective Harvesting: A Review of Robot Perception, Robot Design, Motion Planning and Control

Authors: Vishnu Rajendran S, Bappaditya Debnath, Sariah Mghames, Willow Mandil, Soran Parsa, Simon Parsons, Amir Ghalamzan-E

Abstract: This paper provides an overview of the current state-of-the-art in selective harvesting robots (SHRs) and their potential for addressing the challenges of global food production. SHRs have the potential to increase productivity, reduce labour costs, and minimise food waste by selectively harvesting only ripe fruits and vegetables. The paper discusses the main components of SHRs, including perception, grasping, cutting, motion planning, and control. It also highlights the challenges in developing SHR technologies, particularly in the areas of robot design, motion planning and control. The paper also discusses the potential benefits of integrating AI, soft robots, and data-driven methods to enhance the performance and robustness of SHR systems. Finally, the paper identifies several open research questions in the field and highlights the need for further research and development efforts to advance SHR technologies to meet the challenges of global food production. Overall, this paper provides a starting point for researchers and practitioners interested in developing SHRs and highlights the need for more research in this field.

Submitted 19 April, 2023; originally announced April 2023.

Comments: Preprint: to appear in the Journal of Field Robotics

arXiv:2301.03947

Autonomous Strawberry Picking Robotic System (Robofruit)

Authors: Soran Parsa, Bappaditya Debnath, Muhammad Arshad Khan, Amir Ghalamzan E.

Abstract: Challenges in strawberry picking made selective harvesting robotic technology demanding. However, selective harvesting of strawberries is complicated, raising several scientific research questions. Most available solutions only deal with a specific picking scenario, e.g., picking only a single variety of fruit in isolation. Nonetheless, most economically viable (e.g. high-yielding and/or disease-resistant) varieties of strawberry are grown in dense clusters. The current perception technology in such use cases is inefficient. In this work, we developed a novel system capable of harvesting strawberries with several unique features. The features allow the system to deal with very complex picking scenarios, e.g. dense clusters. Our concept of a modular system makes our system reconfigurable to adapt to different picking scenarios. We designed, manufactured, and tested a picking head with 2.5 DOF (2 independent mechanisms and 1 dependent cutting system) capable of removing possible occlusions and harvesting targeted strawberries without contacting fruit flesh to avoid damage and bruising. In addition, we developed a novel perception system to localise strawberries, detect their key points and picking points, and determine their ripeness. For this purpose, we introduced two new datasets. Finally, we tested the system in a commercial strawberry growing field and our research farm with three different strawberry varieties. The results show the effectiveness and reliability of the proposed system. The designed picking head was able to remove occlusions and harvest strawberries effectively. The perception system was able to detect and determine the ripeness of strawberries with 95% accuracy. In total, the system was able to harvest 87% of all detected strawberries with a success rate of 83% for all pluckable fruits. We also discuss a series of open research questions in the discussion section.

Submitted 10 January, 2023; originally announced January 2023.

Comments: To appear in the Journal of Field Robotics (Accepted) Please watch the video at https://www.youtube.com/watch?v=v8gGAvsISXU

arXiv:2208.09074

dPMP-Deep Probabilistic Motion Planning: A use case in Strawberry Picking Robot

Authors: Alessandra Tafuro, Bappaditya Debnath, Andrea M. Zanchettin, Amir Ghalamzan E

Abstract: This paper presents a novel probabilistic approach to deep robot learning from demonstrations (LfD). Deep movement primitives (DMPs) are a deterministic LfD model that maps visual information directly into a robot trajectory. This paper extends DMPs and presents a deep probabilistic model that maps the visual information into a distribution of effective robot trajectories. The architecture that leads to the highest level of trajectory accuracy is presented and compared with the existing methods. Moreover, this paper introduces a novel training method for learning domain-specific latent features. We show the superiority of the proposed probabilistic approach and novel latent space learning in the lab's real-robot task of strawberry harvesting. The experimental results demonstrate that latent space learning can significantly improve model prediction performance. The proposed approach allows sampling trajectories from the distribution and optimising the robot trajectory to meet a secondary objective, e.g. collision avoidance.

Submitted 18 August, 2022; originally announced August 2022.

Comments: To appear in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2022

arXiv:2109.01733

doi 10.1109/SMARTCOMP52413.2021.00060

F3S: Free Flow Fever Screening

Authors: Kunal Rao, Giuseppe Coviello, Min Feng, Biplob Debnath, Wang-Pin Hsiung, Murugan Sankaradas, Yi Yang, Oliver Po, Utsav Drolia, Srimat Chakradhar

Abstract: Identification of people with elevated body temperature can reduce or dramatically slow down the spread of infectious diseases like COVID-19. We present a novel fever-screening system, F3S, that uses edge machine learning techniques to accurately measure core body temperatures of multiple individuals in a free-flow setting. F3S performs real-time sensor fusion of visual camera and thermal camera data streams to detect elevated body temperature, and it has several unique features: (a) visual and thermal streams represent very different modalities, and we dynamically associate semantically-equivalent regions across visual and thermal frames by using a new, dynamic alignment technique that analyzes content and context in real-time, (b) we track people through occlusions, identify the eye (inner canthus), forehead, face and head regions where possible, and provide an accurate temperature reading by using a prioritized refinement algorithm, and (c) we robustly detect elevated body temperature even in the presence of personal protective equipment like masks, or sunglasses or hats, all of which can be affected by hot weather and lead to spurious temperature readings. F3S has been deployed at over a dozen large commercial establishments, providing contact-less, free-flow, real-time fever screening for thousands of employees and customers in indoor and outdoor settings.

Submitted 3 September, 2021; originally announced September 2021.

arXiv:2009.14326

Attention-Driven Body Pose Encoding for Human Activity Recognition

Authors: B Debnath, M O'brien, S Kumar, A Behera

Abstract: This article proposes a novel attention-based body pose encoding for human activity recognition that presents an enriched representation of body-pose that is learned. The enriched data complements the 3D body joint position data and improves model performance. In this paper, we propose a novel approach that learns enhanced feature representations from a given sequence of 3D body joints. To achieve this encoding, the approach exploits 1) a spatial stream which encodes the spatial relationship between various body joints at each time point to learn spatial structure involving the spatial distribution of different body joints, and 2) a temporal stream that learns the temporal variation of individual body joints over the entire sequence duration to present a temporally enhanced representation. Afterwards, these two pose streams are fused with a multi-head attention mechanism. We also capture the contextual information from the RGB video stream using an Inception-ResNet-V2 model combined with a multi-head attention and a bidirectional Long Short-Term Memory (LSTM) network. Finally, the RGB video stream is combined with the fused body pose stream to give a novel end-to-end deep model for effective human activity recognition.

Submitted 2 October, 2020; v1 submitted 29 September, 2020; originally announced September 2020.

Comments: This paper has been accepted for publication at the IAPR IEEE/Computer Society International Conference on Pattern Recognition (ICPR), Milan, 2021

Journal ref: IAPR IEEE/Computer Society International Conference on Pattern Recognition (ICPR), Milan, 2021

Showing 1–10 of 10 results for author: Debnath, B