
Showing 1–6 of 6 results for author: Sardari, F

  1. arXiv:2406.06499  [pdf, other]

    cs.CV cs.HC

    NarrativeBridge: Enhancing Video Captioning with Causal-Temporal Narrative

    Authors: Asmar Nadeem, Faegheh Sardari, Robert Dawes, Syed Sameed Husain, Adrian Hilton, Armin Mustafa

    Abstract: Existing video captioning benchmarks and models lack coherent representations of causal-temporal narrative, which is sequences of events linked through cause and effect, unfolding over time and driven by characters or agents. This lack of narrative restricts models' ability to generate text descriptions that capture the causal and temporal dynamics inherent in video content. To address this gap, w…

    Submitted 10 June, 2024; originally announced June 2024.

  2. arXiv:2406.06187  [pdf, other]

    cs.CV

    An Effective-Efficient Approach for Dense Multi-Label Action Detection

    Authors: Faegheh Sardari, Armin Mustafa, Philip J. B. Jackson, Adrian Hilton

    Abstract: Unlike the sparse label action detection task, where a single action occurs in each timestamp of a video, in a dense multi-label scenario, actions can overlap. To address this challenging task, it is necessary to simultaneously learn (i) temporal dependencies and (ii) co-occurrence action relationships. Recent approaches model temporal information by extracting multi-scale features through hierarc…

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: 14 pages. arXiv admin note: substantial text overlap with arXiv:2308.05051

  3. arXiv:2405.10690  [pdf, other]

    cs.CV

    CoLeaF: A Contrastive-Collaborative Learning Framework for Weakly Supervised Audio-Visual Video Parsing

    Authors: Faegheh Sardari, Armin Mustafa, Philip J. B. Jackson, Adrian Hilton

    Abstract: Weakly supervised audio-visual video parsing (AVVP) methods aim to detect audible-only, visible-only, and audible-visible events using only video-level labels. Existing approaches tackle this by leveraging unimodal and cross-modal contexts. However, we argue that while cross-modal learning is beneficial for detecting audible-visible events, in the weakly supervised scenario, it negatively impacts…

    Submitted 7 July, 2024; v1 submitted 17 May, 2024; originally announced May 2024.

    Comments: Accepted at ECCV 2024

  4. arXiv:2308.05051  [pdf, other]

    cs.CV

    PAT: Position-Aware Transformer for Dense Multi-Label Action Detection

    Authors: Faegheh Sardari, Armin Mustafa, Philip J. B. Jackson, Adrian Hilton

    Abstract: We present PAT, a transformer-based network that learns complex temporal co-occurrence action dependencies in a video by exploiting multi-scale temporal features. In existing methods, the self-attention mechanism in transformers loses the temporal positional information, which is essential for robust action detection. To address this issue, we (i) embed relative positional encoding in the self-att…

    Submitted 9 August, 2023; originally announced August 2023.

  5. arXiv:2109.08730  [pdf, ps, other]

    cs.CV

    Unsupervised View-Invariant Human Posture Representation

    Authors: Faegheh Sardari, Björn Ommer, Majid Mirmehdi

    Abstract: Most recent view-invariant action recognition and performance assessment approaches rely on a large amount of annotated 3D skeleton data to extract view-invariant features. However, acquiring 3D skeleton data can be cumbersome, if not impractical, in in-the-wild scenarios. To overcome this problem, we present a novel unsupervised approach that learns to extract view-invariant 3D human pose represe…

    Submitted 8 July, 2024; v1 submitted 17 September, 2021; originally announced September 2021.

    Comments: Accepted at BMVC 2021

  6. arXiv:2008.04999  [pdf, ps, other]

    cs.CV

    VI-Net: View-Invariant Quality of Human Movement Assessment

    Authors: Faegheh Sardari, Adeline Paiement, Sion Hannuna, Majid Mirmehdi

    Abstract: We propose a view-invariant method towards the assessment of the quality of human movements which does not rely on skeleton data. Our end-to-end convolutional neural network consists of two stages, where at first a view-invariant trajectory descriptor for each body joint is generated from RGB images, and then the collection of trajectories for all joints are processed by an adapted, pre-trained 2D…

    Submitted 11 August, 2020; originally announced August 2020.

    Comments: 13 pages, 6 figures, 7 tables